Neo Technology, the company behind the graph NoSQL database Neo4j, recently released version 2.3 of the database. The company also announced the openCypher initiative to help create a standard graph query language.
Version 2.3 brings an in-memory page cache component for better performance, Docker tooling, and an enhanced Cypher query planner. Philip Rathle, VP of Products at Neo Technology, wrote about the new features in the latest release of Neo4j.
openCypher aims to accelerate the adoption of graph processing and analysis by making it easier for any graph data management platform to support Cypher. It consists of four key artifacts:
- Language specification
- Reference implementation
- Technology compatibility kit
- Cypher reference documentation
InfoQ spoke with Philip Rathle about the features in the latest release of Neo4j and the openCypher announcement.
InfoQ: Can you discuss the new in-memory page cache component, which is a fully off-heap cache? How does it work? What was the rationale to build a native cache rather than reusing an existing caching framework?
Philip Rathle: Moving the caching of disk content off-heap gives much greater control over how memory is managed within the entire database. Neo4j is based on Java, which provides excellent memory management and performance for a typical application, yet falls short when large amounts of data are cycled through the database cache as a result of ongoing query operations.
The function of the on-heap cache in Neo4j has been to provide ultra-fast performance. This has worked well... up until the point at which very large caches combined with very large queries begin to stress Java's generational garbage collectors, which turns the JVM heap into a limiting factor for scalability. The new off-heap cache was designed to be as fast as (in fact faster than) the on-heap cache, while far exceeding the levels of graph query performance possible when the cache is limited by the JVM heap. The new cache is also far simpler to manage, providing tuning flexibility with fewer knobs.
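For readers who want to see what that looks like in practice, here is a minimal sketch of the single most important knob, assuming the `dbms.pagecache.memory` setting used by the 2.2/2.3-era page cache (the value shown is illustrative, not a recommendation):

```properties
# conf/neo4j.properties (Neo4j 2.3) - illustrative value only
# Sizes the off-heap page cache that holds store files in memory.
# If left unset, Neo4j chooses a default based on available system memory.
dbms.pagecache.memory=4g
```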
InfoQ: How is the concurrency managed with the in-memory page cache?
Rathle: Concurrency is managed using a variety of mechanisms. One is to provide low-level data structures that are sufficiently granular not to get in the way of each other, while at the same time sufficiently large to provide a useful level of cache affinity. Another is very fast locking combined with smart locking strategies to ensure that threads accessing the cache work independently in the majority of situations. A third is checkpointing, which we also added in Neo4j 2.3, together with the right knobs and defaults for flushing the cache on an ongoing basis. This allows queries to continue operating seamlessly during log switches by spreading the load over time.
InfoQ: How does the enhanced Cypher query planner work? Are there any special use cases where this is more efficient than other use cases?
Rathle: The Cypher query planner in Neo4j 2.3 explores many more plans before deciding which execution plan to use for a query. Previously, it would aggressively prune plans from consideration to avoid excessive memory use - which was great for large queries. Unfortunately, this often led to sub-optimal plans, as the planner would get stuck in a local optimum. The updated Cypher query planner has a better algorithm that lets it explore many more plans, whilst also keeping the memory requirements low.
Neo4j 2.3 also has new operators that it can use to solve your query - for example, the new TriadicSelection operator makes the queries where it applies many times faster than before.
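To make this concrete, here is a hedged Cypher sketch (the :Person label and KNOWS relationship are hypothetical): a "people my friends know whom I don't know yet" query is the kind of triangle-shaped pattern an operator like TriadicSelection targets, and prefixing it with PROFILE shows which operators the planner actually chose:

```cypher
// Hypothetical social graph: (:Person)-[:KNOWS]->(:Person)
// Suggest people known by Alice's contacts whom Alice does not know yet.
PROFILE
MATCH (me:Person {name: 'Alice'})-[:KNOWS]->(friend)-[:KNOWS]->(candidate)
WHERE candidate <> me
  AND NOT (me)-[:KNOWS]->(candidate)
RETURN candidate.name AS suggestion, count(*) AS commonFriends
ORDER BY commonFriends DESC;
```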
InfoQ: Can you discuss some examples of the string-enhanced graph search feature in the new release? What is this new feature built on?
Rathle: Neo4j 2.3 adds new operators to Cypher - “STARTS WITH”, “ENDS WITH” and “CONTAINS”. These operators make it much easier to use Cypher for searching string properties in your graph. In addition, the STARTS WITH operator can use your normal indexes to quickly find matching nodes.
Similarly to STARTS WITH, the Cypher planner in Neo4j 2.3 also recognizes range-based number queries, and will use your normal indexes when searching nodes in your graph based on numeric properties. The key for all of this is an optimizer that’s smart enough to combine graph pattern matching with the text-based indexes and optimize hybrid graph-text search.
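As a hedged illustration of both points, using a hypothetical :Person label indexed on name and age (created with the 2.3-era syntax CREATE INDEX ON :Person(name)): the STARTS WITH and numeric range predicates below can be answered from the index, while CONTAINS and ENDS WITH are evaluated as filters:

```cypher
// Prefix search: can be served by the index on :Person(name)
MATCH (p:Person)
WHERE p.name STARTS WITH 'Phil'
RETURN p.name;

// Substring and suffix search: applied as filters
MATCH (p:Person)
WHERE p.name CONTAINS 'ath' OR p.name ENDS WITH 'le'
RETURN p.name;

// Numeric range: can be served by the index on :Person(age)
MATCH (p:Person)
WHERE p.age >= 21 AND p.age < 30
RETURN p.name, p.age;
```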
InfoQ: Regarding Docker support, what tools are available for developers to start using Neo4j in Docker container?
Rathle: We have built a Neo4j Docker image which is part of the official Docker image library. This can be used to run Neo4j containers using the standard Docker tools. To learn more about Neo4j on Docker and the available tooling, see the general Neo4j documentation.
InfoQ: Is Docker support production ready? What are the deployment considerations when using Neo4j with Docker?
Rathle: Yes. As of now, Neo has an official Docker repository that we officially support for our customers. As far as best practices go: perhaps the most important tuning parameter is memory. The underlying hardware must provide sufficient memory for the containers running on it; the Neo4j image allows memory usage to be configured as appropriate. And the Enterprise Edition of Neo4j, which in contrast to the Community Edition is a commercial offering, has quite a few additional operational features, including clustering.
Docker containers are essentially ephemeral, but Neo4j needs durable storage for its data. The underlying hardware must include a disk which is mounted into the container for this purpose. Docker containers are isolated from one another by default. When running a Neo4j cluster the containers must be carefully configured to ensure that they can all communicate with one another.
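A hedged sketch of those considerations with the official image (the image tag, host path, and port are illustrative): the host directory mounted at /data gives the database durable storage that outlives the container, and the published port exposes the HTTP interface:

```bash
docker run --detach \
  --name neo4j \
  --publish 7474:7474 \
  --volume $HOME/neo4j/data:/data \
  neo4j:2.3
```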
InfoQ: openCypher has support from Databricks and Oracle. Can you elaborate on how the Spark GraphX graph data processing library can benefit from the openCypher initiative?
Rathle: Ultimately it will lead to Cypher running on Spark… or at least the graph capable part of Spark. Right now to do graph querying on Spark, your main option is Scala, which is great: Scala is super powerful. (Actually we’ve implemented Cypher in Scala.) But there’s an opportunity to make the power of graph analysis available in GraphX more broadly accessible to data analysts who are used to working with SQL. We’re very excited about this because Spark very naturally complements Neo4j. Spark is primarily a set of tools for data scientists to arrive at insights. Neo4j is primarily a platform that enables graph analysis in real time by applications. Insight and action: it’s a great complementarity.
That’s our vision and motivation. We’ve formed a good friendship with Databricks, and were happy for Databricks Co-Founder and CEO Ion Stoica to lend his public support to openCypher. In his words: “Graph processing is becoming an indispensable part of the modern big data stack. Neo4j’s Cypher query language has greatly accelerated graph database adoption. We look forward to bringing Cypher’s graph pattern matching capabilities into the Spark stack, making graph querying more accessible to the masses.”
About the Interviewee
Philip Rathle is the VP of Products at Neo Technology. He has a passion for building great products that help users solve tomorrow's challenges. He spent the first decade of his career building information solutions for some of the world's largest companies: first with Accenture, and then at Tanning Technology, one of the world's top database consultancies of the time, as a solution architect focusing on data warehousing and BI strategy.