Today at GraphConnect Europe 2016, Neo Technology announced the release of their namesake Neo4j 3.0 database, focusing on scale and developer productivity. In the keynote, Emil Eifrem, CEO and co-creator of Neo4j, announced that the new version is far more scalable than previous releases, with a view to the future of the internet of things, when billions of devices will be connected. Neo4j aims to be able to represent those connections, and the storage system has been rearchitected to massively raise the hard limits on the number of nodes (previously limited to 32 billion). Emil observed that today there are around 50 billion devices connected to the internet, and that by the end of this decade the number of connected devices will rival the number of neurons in a human brain.
As well as allowing the database to store far more data, the release introduces a new binary networking protocol (called Bolt) and a standardised set of drivers. The initial supported languages include Java, JavaScript, .NET and Python, and community-created drivers for PHP and C are around the corner. Each driver provides a way of executing Cypher queries and returns the result set through an API that follows the idioms of the platform. For example, the JavaScript driver accepts a callback which is invoked as each result record is returned, and the Java driver provides a Stream instance for processing the data objects. The drivers unpack data from the returned result automatically, and the protocol has a significant over-the-wire benefit: for large result sets the keys aren't repeated for every record (as they are in JSON, XML or other human-readable formats). The reduced data size is one of the reasons for the speed improvements in applications, although applications will need to be amended to use the Bolt driver instead of the JSON response format.
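The saving from not repeating keys can be illustrated with a rough sketch. The Python below does not reproduce Bolt's actual binary format; it only compares, using JSON for both sides, a layout where every record carries its own field names against a layout that sends the field names once followed by bare rows of values. The sample records and function names are invented for the illustration.

```python
import json

def encode_with_repeated_keys(records):
    """Each record carries its own field names, as in a typical JSON response."""
    return json.dumps(records, separators=(",", ":"))

def encode_with_single_header(records):
    """Field names are sent once; each record is then just a list of values."""
    fields = list(records[0].keys()) if records else []
    rows = [[record[f] for f in fields] for record in records]
    return json.dumps({"fields": fields, "rows": rows}, separators=(",", ":"))

def decode_with_single_header(payload):
    """Rebuild the per-record dictionaries from the header-plus-rows layout."""
    data = json.loads(payload)
    return [dict(zip(data["fields"], row)) for row in data["rows"]]

# Hypothetical result set: 1,000 records sharing the same three fields.
records = [
    {"name": f"device-{i}", "kind": "sensor", "links": i % 7}
    for i in range(1000)
]

repeated = encode_with_repeated_keys(records)
compact = encode_with_single_header(records)

# Print both payload sizes in characters for comparison.
print(len(repeated), len(compact))
```

The header-plus-rows payload decodes back to the same records while omitting the field names from every row, which is the effect described above; a binary encoding would shrink the values further.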
InfoQ spoke to Will Lyon, a developer relations engineer at Neo Technology, and started by asking what the bolt protocol was:
Lyon: Bolt is the new binary protocol for Neo4j. Previously all requests went over HTTP with some overhead; so Bolt enables us to reduce the overhead of HTTP and therefore we can stream data out of Neo4j much faster. It also allows us to connect with other technologies, such as a Spark connector that Michael Hunger has been working on, which uses the standard Java connector and allows data to be streamed out of Neo4j into Spark, calculations to be performed in Spark, and the results to be inserted directly back into Neo4j much faster.
InfoQ: Given that Bolt is a binary protocol, how can you use it with web applications?
Lyon: We have two JavaScript drivers; a Node version and a JavaScript/WebSockets version. This allows you to use embedded visualisations within a JavaScript application, and use Bolt over WebSockets to connect to Neo4j to populate visualisations. We are announcing four officially supported drivers: Java, JavaScript, Python and .NET, and other community drivers are available as well. Those drivers allow us to integrate with other technologies.
InfoQ: How does that compare with the stored procedures?
Lyon: Stored procedures are Java code that’s deployed to the database and is accessible with Cypher. Previously in Neo4j there were managed extensions, but these extended the Neo4j REST API, which meant that they couldn’t be mixed with Cypher. So now we have a library - APOC - which has over 100 stored procedures for all kinds of things: inspecting your data model, exploring the graph, connecting to other databases. It’s even possible to connect to a JDBC driver to import data and use Cypher to direct how the graph is built.
InfoQ: How are stored procedures installed?
Lyon: There’s a template project which can be used to build a JAR file, which can be deployed to the server. It’s essentially copying a file into the directory. For the 3.0 release uploads of stored procedures require a server restart.
InfoQ: What other performance changes have been made with Neo4j 3.0?
Lyon: Previously the address space in the file store had a hard limit of billions of nodes and relationships. The file store has been rewritten with a new system that raises the hard limits by many orders of magnitude. Furthermore the Bolt protocol uses a binary encoding for transmitting data and does not require parsing, so the information is compactly transferred between systems and is faster to access. Finally there have also been improvements to the Cypher query planner, which now uses a cost-based planner for both reads and writes. Previously there was a read planner, but there wasn’t a cost-based one for writes. The cost-based planner uses a mechanism to estimate how many database hits a query will involve, taking into account not just the structure of the query but the actual volume of data in the database. This allows the database to avoid doing an index scan over a million records if there’s a more performant way of processing a smaller set of data, and to modify how the data is read accordingly.
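The idea of choosing a plan from estimated database hits can be sketched in a few lines of Python. This is a toy model and not Neo4j's planner: the statistics fields and cost formulas are assumptions made up for the example, and the plan names are used only as labels for three ways of locating nodes.

```python
from dataclasses import dataclass

@dataclass
class Statistics:
    """Hypothetical counts a planner might keep about the stored data."""
    total_nodes: int
    nodes_with_label: int
    index_selectivity: float  # assumed fraction of labelled nodes matching a value

def estimate_costs(stats):
    """Return an estimated number of database hits for each candidate plan."""
    return {
        "AllNodesScan": stats.total_nodes,
        "NodeByLabelScan": stats.nodes_with_label,
        "NodeIndexSeek": max(1, round(stats.nodes_with_label * stats.index_selectivity)),
    }

def choose_plan(stats, index_available=True):
    """Pick the candidate with the fewest estimated hits."""
    costs = estimate_costs(stats)
    if not index_available:
        del costs["NodeIndexSeek"]  # cannot seek without an index
    plan = min(costs, key=costs.get)
    return plan, costs[plan]

# Illustrative numbers: one million nodes, 50,000 of them carrying the label.
stats = Statistics(total_nodes=1_000_000, nodes_with_label=50_000, index_selectivity=0.001)
print(choose_plan(stats))
print(choose_plan(stats, index_available=False))
```

The point of the sketch is that the choice depends on the data volumes in `stats`, not only on the shape of the query: the same query would get a different plan if the counts changed.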
InfoQ: What about importing data into the database?
Lyon: There’s a Neo4j import tool, which has been present for some time and writes directly into the file store without transactions. There’s also an improved mechanism for loading millions of documents using LOAD CSV, which can perform batched transactional updates; this allowed us to import 10m Stack Overflow questions into a Neo4j graph in a little over three minutes.
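Batched transactional loading can be pictured with a small, database-free Python sketch. It is not how LOAD CSV is implemented; it only shows the general pattern of reading rows and committing them in fixed-size groups rather than one at a time or all at once. The `commit` callback, the batch size and the sample data are hypothetical stand-ins for a real transaction.

```python
import csv
import io
from itertools import islice

def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def load_in_batches(csv_text, batch_size, commit):
    """Parse CSV text and hand each batch to `commit`, one call per batch."""
    reader = csv.DictReader(io.StringIO(csv_text))
    committed = 0
    for batch in batches(reader, batch_size):
        commit(batch)  # a real import would write this batch in one transaction
        committed += len(batch)
    return committed

# Hypothetical data: 2,500 question rows.
csv_text = "id,title\n" + "\n".join(f"{i},Question {i}" for i in range(2500))

batch_sizes = []
total = load_in_batches(csv_text, 1000, lambda batch: batch_sizes.append(len(batch)))
print(total, batch_sizes)
```

Committing per batch keeps each transaction bounded in size while avoiding the overhead of a transaction per row.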
The GraphConnect conference closed with Dr Jim Webber looking forward to the future.
Just before the final keynote, a presentation by Mar Cabra of the International Consortium of Investigative Journalists (the team behind the Panama Papers) confirmed that they would be releasing a subset of data on May 9 at 18:00 UTC, which will be hosted by Neo Technology and remotely queryable with Cypher.
The 2.6TB of data from the leak was processed with a parallelised set of OCR tools (to convert images and PDFs into readable and searchable text) and then imported into a Neo4j database. The ICIJ had previously used Neo4j for the Swiss Leaks investigation, but Neo4j themselves weren't aware of the Panama use case until it hit the papers. Mar announced that a subset of the data will be made available for researchers to investigate, and Neo4j will be hosting the graph database on which the data is running for others to use. This will be an extension of the Power Players, an interactive visualisation that uses some of the data in the Neo4j database to show relationships between parties and their holdings. Mar Cabra wrote yesterday:
We had the data in a relational database format in SQL, and thanks to ETL (Extract, Transform, and Load) software Talend, we were able to easily transform the data from SQL to Neo4j (the graph-database format we used). Once the data was transformed, it was just a matter of plugging it into Linkurious, and in a couple of minutes, you have it visualized in a networked way, so anyone can log in from anywhere in the world. That was another reason we really liked Linkurious and Neo4j — they’re very quick when representing graph data, and the visualizations were easy to understand for everybody. The not-very-tech-savvy reporter could expand the docs like magic, and more technically expert reporters and programmers could use the Neo4j query language, Cypher, to do more complex queries, like show me everybody within two degrees of separation of this person, or show me all the connected dots…
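To make the "two degrees of separation" query concrete, here is a minimal Python sketch of the underlying idea: a breadth-first search that stops after a fixed number of hops. It is not the ICIJ's code or a Cypher query, and the tiny graph of people and companies is entirely made up.

```python
from collections import deque

def within_degrees(graph, start, max_degrees):
    """Return {node: distance} for every node within max_degrees hops of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_degrees:
            continue  # do not expand beyond the requested number of hops
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    del distances[start]  # the starting node is not part of the answer
    return distances

# Hypothetical undirected graph stored as adjacency lists.
graph = {
    "Person A": ["Company X", "Company Y"],
    "Company X": ["Person A", "Person B"],
    "Company Y": ["Person A"],
    "Person B": ["Company X", "Company Z"],
    "Company Z": ["Person B", "Person C"],
    "Person C": ["Company Z"],
}

result = within_degrees(graph, "Person A", 2)
print(result)
```

In this made-up graph, Person B is reached through Company X in two hops, while Company Z and Person C lie further away and are left out.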
Neo4j is available for immediate download and the documentation and tutorials are already available.