1. Michael, could you please tell us a little bit about yourself and your role at Neo Technology?
Yes. I’m Michael Hunger, I’m with Neo Technology and I’m working mainly on the Spring Data Neo4j project as project lead and in the Neo4j community team, but I am also engaged in lots of other areas at Neo Technology, kind of connecting the different parts of the company, which is great.
2. How have data access technologies in the Java space evolved since the days of JDBC?
That’s actually an interesting question. Until a few years ago, not so much had happened. We saw object-relational mappers coming on, like Hibernate and then later JPA, but except for that there was not so much. We also saw the emergence of some SQL DSLs and libraries like MyBatis. That has changed a bit in the last few years with all the new NoSQL databases coming up. The NoSQL database vendors provided their own Java drivers for their different database servers, and those drivers have been of varying quality with different APIs, so the usage patterns were quite different from one driver to the next. So that’s kind of what has happened since JDBC.
3. For Java application developers, why is now the right time for NoSQL technologies?
A number of challenges have appeared since the 2000s. First of all, big data: data volume growth is one of the important factors. Another important factor is that data is getting more and more semi-structured, so we can’t rely that much on table-based, rigidly structured information.
Another really important factor is the connectedness of the data: you can see richer and richer data models appearing, and companies trying to connect data from different sources and different domains into one bigger data model. Another aspect is that the architecture of Java enterprise applications has changed quite a lot, from monolithic, simple applications dealing with just one database to more and more loosely coupled services which each have their own data storage and interact through an API of some kind.
So all these trends led to needs that were not addressed by relational databases and so the big companies came up with their own solutions, like Amazon with DynamoDB, Google with Bigtable, Facebook with Cassandra. And out of these solutions from the big companies - some of which were even open sourced – emerged a large number of alternative data storage solutions. Now you have a wide range of possibilities to choose from and can determine which solution is the right one based on the shape of your data, the size of your data and also based on your use cases.
4. What is the Spring Data project and how did it come about?
Yes, the Spring Data project is what I am working on. It came about when our CEO, Emil Eifrem, sat together with Rod Johnson and started on the first version of Spring Data Neo4j. The aim of this project was to take the convenient data access that developers were used to from the Spring Framework and bring it to NoSQL solutions as well. So Spring Data Neo4j was actually the first Spring Data project, the incubator, so to speak. Since then the Spring Data project has evolved quite a bit to cover not only NoSQL databases like MongoDB and Neo4j but also data processing solutions like Hadoop, relational databases, etc. So there’s interesting support for JPA in Spring Data and also some additions for JDBC which make working with relational databases much easier.
The Spring Data project is actually the way the Spring Framework evolved to address these requirements, and it is right now a strategic project within VMware and SpringSource. Quite a lot of infrastructure was added there to support, for instance, tools for mapping objects to relational databases, and also an interesting new repository approach that allows you to create the typical DAO objects that you use in an application with almost no code, just by writing some interfaces. The main effort in the Spring Data project has really been in providing this infrastructure and implementing it for the different data stores and NoSQL solutions.
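As a rough illustration of the "just write an interface" idea, the following Java sketch backs a repository interface with a JDK dynamic proxy that derives the property to filter on from the method name. Everything here is hypothetical (`RepositorySketch`, `Person`, `PersonRepository`, the in-memory list used as a store); it is not Spring Data's API or implementation, which is far more capable.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an interface-only repository; not Spring Data code. */
final class RepositorySketch {

    /** A plain domain object (assumed for this example). */
    static final class Person {
        final String name;
        final String city;

        Person(String name, String city) {
            this.name = name;
            this.city = city;
        }
    }

    /** The only thing the application author writes: an interface, no implementation. */
    interface PersonRepository {
        List<Person> findByCity(String city);

        List<Person> findByName(String name);
    }

    /** Creates an implementation at runtime over an in-memory list. */
    static PersonRepository create(List<Person> store) {
        return (PersonRepository) Proxy.newProxyInstance(
                PersonRepository.class.getClassLoader(),
                new Class<?>[] {PersonRepository.class},
                (proxy, method, args) -> {
                    // toString/equals/hashCode are simply delegated to the backing list.
                    if (method.getDeclaringClass() == Object.class) {
                        return method.invoke(store, args);
                    }
                    String methodName = method.getName();
                    if (!methodName.startsWith("findBy")) {
                        throw new UnsupportedOperationException(methodName);
                    }
                    // "findByCity" -> "city": the method name says which field to match.
                    String property = Character.toLowerCase(methodName.charAt(6))
                            + methodName.substring(7);
                    Field field = Person.class.getDeclaredField(property);
                    List<Person> matches = new ArrayList<>();
                    for (Person person : store) {
                        if (args[0].equals(field.get(person))) {
                            matches.add(person);
                        }
                    }
                    return matches;
                });
    }

    public static void main(String[] args) {
        List<Person> store = List.of(
                new Person("Anna", "Berlin"),
                new Person("Ben", "London"),
                new Person("Cleo", "Berlin"));
        PersonRepository repository = create(store);
        for (Person person : repository.findByCity("Berlin")) {
            System.out.println(person.name); // prints Anna, then Cleo
        }
    }
}
```

The point of the sketch is only that a finder method can be answered without hand-written DAO code, because the method signature already carries the query.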
There are different approaches in this area. On the one hand, the ‘aggregate-oriented’ databases like document databases and key-value stores have addressed this by mapping aggregates from the domain-driven design approach to a composite data structure, so it’s very easy to take one of the domain-driven design aggregates and map it to a document in a document database.
On the other hand, graph databases have a very natural way of storing data by focusing on elements and their relationships, which can be mapped one-to-one to objects and their references in the object model, so there is almost no impedance mismatch when storing an object model in a graph database. The graph database has even more features than the object model: it is schema free and can be traversed in all directions, not just in the outgoing direction like in the object model. So there’s been quite a lot of development, especially in the Spring Data project, which provides a really interesting and useful way of storing annotated objects in non-relational databases.
The primary difference is the number of relationships that you have in your object model. In a relational database you try to minimize the number of foreign key relationships, because each foreign key relationship means a JOIN when you retrieve the data. So you try to reduce the number of JOINs, because a high number of JOINs puts a high load on the relational database.
The graph database retrieves relationships which have actually been pre-materialized at insertion time, which is a constant-time operation, so it’s really fast and actually very cheap. Therefore, in a graph database we tend to normalize much more than in a relational database, for instance putting attributes of an object like addresses, hair color, or gender into separate entities, which can then be used for really interesting recommendation algorithms or collaborative filtering, or to retrieve really interesting answers out of your data model.
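The idea of relationships being materialized at insertion time, and of attributes becoming shared entities, can be sketched in a few lines of Java. This is a hypothetical in-memory model (`GraphSketch`, `Node`, `Rel` and the `LIVES_IN` relationship type are assumptions made for illustration), not Neo4j's storage format or API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory sketch; not Neo4j's storage format or API. */
final class GraphSketch {

    /** A node holds direct references to its relationships in both directions. */
    static final class Node {
        final String name;
        final List<Rel> outgoing = new ArrayList<>();
        final List<Rel> incoming = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        /** The link is wired into both nodes once, when it is created. */
        Rel connect(String type, Node target) {
            Rel rel = new Rel(type, this, target);
            outgoing.add(rel);
            target.incoming.add(rel);
            return rel;
        }
    }

    /** A typed, directed relationship between two nodes. */
    static final class Rel {
        final String type;
        final Node start;
        final Node end;

        Rel(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Names of all nodes pointing at the shared attribute node with the given type. */
    static List<String> sharing(Node attribute, String type) {
        List<String> names = new ArrayList<>();
        // Follows stored references backwards; no lookup by key and no join.
        for (Rel rel : attribute.incoming) {
            if (rel.type.equals(type)) {
                names.add(rel.start.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Node berlin = new Node("Berlin"); // the attribute modelled as its own entity
        Node anna = new Node("Anna");
        Node ben = new Node("Ben");
        anna.connect("LIVES_IN", berlin);
        ben.connect("LIVES_IN", berlin);
        System.out.println(sharing(berlin, "LIVES_IN")); // prints [Anna, Ben]
    }
}
```

Because the city is a node rather than a column value, "who else lives here" is answered by walking the city's incoming relationships, which is the kind of question a recommendation feature asks.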
8. What is Cypher and how is it different from SQL?
Cypher is a query language that Neo Technology developed to address the needs of querying graph databases, not with a programmatic API but with a purely declarative query language. The main difference from SQL is that Cypher has the ability to express paths, patterns, and graphs in a very concise way. We do it by using ASCII art, so we use ASCII-art arrows, circles, parentheses and so on as elements to describe patterns.
Another important aspect of Cypher is that it allows you to chain queries. There’s a statement that allows you to chain queries much like the Unix pipe operator, which allows you, for instance, to express a HAVING clause without having a special clause for it, just by chaining two query parts. Cypher also excels at the many collection functions that are integrated in the query language itself. And of course a Cypher query executed on a graph database like Neo4j is a very fast way to retrieve really deep connected structures out of the database without putting as much stress on CPU and memory as in a relational database.
So a typical use case is that I want to find arbitrary-length patterns in a graph. For instance, you and I are connected in a social graph, but I don’t know how many steps we have to go to find the connection. In Cypher I can say, “Please start at the ‘Nitin’ node and at the ‘Michael’ node and then try to find the path between the two; I don’t know how many steps there are between them”, and then Cypher will go ahead and rewrite the query to match the Neo4j APIs and execute it on the graph database, which will be very fast because it is just a few direct relationship traversals.
In SQL you have to have quite a large number of JOINs in your query because you don’t know how far out you have to go. You either have to have n outer JOINs, which might be null or not, or you have to have a cascade of one, two, three, up to seven JOINs to retrieve the same information. So the demands on a relational database are much higher in this case.
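What "find the path, however many steps it takes" amounts to can be sketched as a breadth-first search over direct neighbour references. The Java below is a minimal illustration under stated assumptions: `PathSketch`, the undirected string-keyed adjacency map, and the intermediate names Alice and Bob are all hypothetical, and this is not how Neo4j or Cypher execute a query internally.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an unknown-length path search; not Neo4j code. */
final class PathSketch {

    // Undirected adjacency map: each name points directly at its neighbours.
    private final Map<String, List<String>> neighbours = new HashMap<>();

    void connect(String a, String b) {
        neighbours.computeIfAbsent(a, key -> new ArrayList<>()).add(b);
        neighbours.computeIfAbsent(b, key -> new ArrayList<>()).add(a);
    }

    /**
     * Breadth-first search from start to end. Roughly what a Cypher pattern such as
     * shortestPath((a)-[*]-(b)) asks for; exact syntax varies by Neo4j version.
     * Returns the shortest path, or an empty list if the two are not connected.
     */
    List<String> shortestPath(String start, String end) {
        Map<String, String> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(start, start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(end)) {
                // Walk the recorded predecessors back to the start.
                List<String> path = new ArrayList<>();
                for (String step = end; !step.equals(start); step = cameFrom.get(step)) {
                    path.add(step);
                }
                path.add(start);
                Collections.reverse(path);
                return path;
            }
            for (String next : neighbours.getOrDefault(current, Collections.emptyList())) {
                // putIfAbsent returns null only the first time a node is seen.
                if (cameFrom.putIfAbsent(next, current) == null) {
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        PathSketch graph = new PathSketch();
        graph.connect("Nitin", "Alice");
        graph.connect("Alice", "Bob");
        graph.connect("Bob", "Michael");
        System.out.println(graph.shortestPath("Nitin", "Michael"));
        // prints [Nitin, Alice, Bob, Michael]
    }
}
```

The number of hops never appears in the caller's code, whereas the SQL formulation has to spell out one JOIN per possible hop.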
10. Have you conducted any performance benchmarks comparing Cypher and SQL ?
The performance comparison we’ve done is with the Java API, not Cypher. Up to Neo4j 1.8 we were mostly focused on the Cypher language itself, on its expressiveness, ease of use and readability, and starting with Neo4j 1.9 we are now optimizing Cypher performance to be as close to the core Java API as possible. That’s why these performance comparisons were done with the core Java API of Neo4j.
So in Spring Data Neo4j 2.1, which was released at SpringOne 2012, the most important feature was support for unique entities, which allows you to use primary keys from another database or business primary keys to create unique entities in the graph. We also added quite a lot of support for the handling of relationships, so you now have much more fine-grained control over how relationships are retrieved, and of course we updated to the latest version of Neo4j. This release was actually a minor release, just a dot release; the next version will be a much bigger change in Spring Data Neo4j.
The Spring Data book by O’Reilly was written by all the Spring Data project authors and it offers a really deep insight into the Spring Data project and a comprehensive intro to how to get started with Spring Data in general.
13. How can people get involved in the Spring Data and Neo4j projects?
There are quite a number of ways to do that. First of all, the Spring Data and Neo4j projects are open source projects and the source code is on GitHub. The comprehensive documentation on Spring Data Neo4j is online with lots of example projects, but it was also published as an InfoQ minibook that you can download for free at InfoQ or purchase online.
Nitin: Michael, on behalf of the InfoQ community, thank you for your time today.
Thank you, it was a pleasure.