
Ian Robinson on Neo4j's History, Data Structure and Use Cases


1. My name is Charles Humble and I am here at QCon New York 2014 with Ian Robinson. Ian, can you introduce yourself to the InfoQ community?

Hello, I am Ian Robinson. I am an engineer at Neo Technology, based in London, and I work on the Neo4j graph database.


2. What is a graph database?

A graph database? It’s a database, it’s an OLTP system, the kind of database that you could insert between a web request and a web response, but it allows you to model, store and query your data as a graph or as a network structure. So instead of modeling your data as a tabular structure and querying it with set-based operations, as you do in a relational database, with a graph database you are modeling your data as a network structure, and storing it as a network structure and then querying it as a network structure. So that structure comprises nodes and relationships and properties.
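
To make that concrete, here is a minimal sketch in Cypher, Neo4j's query language; the entities, labels and properties are invented for illustration:

    // Model entities as nodes with key-value properties, connected by a
    // named, directed relationship that also carries a property.
    CREATE (ian:Person {name: 'Ian'}),
           (neo:Database {name: 'Neo4j'}),
           (ian)-[:WORKS_ON {since: 2011}]->(neo)

    // Query the data as a network structure rather than with set-based joins.
    MATCH (p:Person)-[:WORKS_ON]->(d:Database)
    RETURN p.name, d.name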


3. I guess social networks have done a lot to popularise the idea of graphs; I think everyone’s heard of the Facebook social graph for instance. So what is it about social data that lends itself naturally to being treated as a graph?

Social data or social behaviors are intrinsically based upon relationships. People behave within groups and they have friends and there are friends of friends, and so on, and there is an enormous body of research that demonstrates that there are emergent behaviors within these group structures, or these network structures, and that there are properties that we can identify and effectively predict certain behaviors or the transmission of certain behaviors within those social groups. But those groups are always structured as a network, or as a graph-like structure, and therefore I think ... There have been many, many years of analysis about these kinds of things, but it’s really within the last five years or so where it’s become prominent in the public’s mind because of things like Facebook, which have really given us, everybody, the ability to make concrete all of those connections and to record them and take advantage of them over the internet.


4. So presumably, you can exploit that for things like recommendation engines and that sort of thing?

Yes, for recommendation engines, both in terms of being able to recommend other people within that network whom you might want to connect with - there are emergent properties within that network that say if we have friends in common, we're more likely to get to know one another than if we are completely disconnected within a network, and we can use those kinds of properties within a network to suggest or recommend people to connect with. And similarly, if you start introducing other elements into the network, things that people buy or actions that they perform or sports and hobbies and so on, again we can begin to identify or predict likely paths that will be pursued in the future and use that to effectively create recommendations.
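
A friends-of-friends recommendation of that kind might look something like this in Cypher; the names and the FRIEND relationship type are assumptions for the sake of the sketch:

    // Suggest people two hops away whom we don't already know, ranked by
    // the number of friends we have in common.
    MATCH (me:Person {name: 'Alice'})-[:FRIEND]-(friend)-[:FRIEND]-(foaf)
    WHERE foaf <> me AND NOT (me)-[:FRIEND]-(foaf)
    RETURN foaf.name AS suggestion, count(friend) AS mutualFriends
    ORDER BY mutualFriends DESC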


5. Tell me a bit about the history of Neo4j. When did you start development on it?

As a product, it's more than ten years old; I guess it's about 12 years old now. It started life as a database backing a content management system for the Swedish military, I believe, and it's still being used there in that capacity. But several years later it was broken out as a separate product and then open-sourced, so it's been an open source graph database for nearly ten years now, with lots of contributions from the community. Then, about five or six years ago, Neo Technology was founded to sponsor that development. Neo employs a lot of the developers who work on the kernel, and we still accept a lot of contributions from the community, but Neo sponsors the development of the kernel and ensures that the integrity and the quality of the product are as they should be moving forward.

Charles: And it’s JVM based, I understand.

Yes, it’s JVM based and hence the 4j, a wonderful piece of foresight in naming the product all those years ago. But, yes, it’s 4j, it’s JVM based, it’s written almost entirely in Java, although some of the query language or the parsing for the query language is written in Scala.

Charles: So, tell me a bit about the data structure. It’s some kind of property graph I guess.

Yes. I think it's probably fair to say that Neo invented the term property graph and our original data model was those three fundamental elements: nodes, relationships, and properties. We use nodes to represent the things in your domain that you are interested in, things with identity, the entities in your domain; and then you can attach properties to those nodes, simple key value pairs to represent the attributes of those entities. So at that point you have a very simple record-like structure. But then you can begin to introduce even more structure by connecting those nodes with relationships; so the relationships are first class elements in the data model. Every relationship has a name and a direction, so that lends semantic clarity. We are kind of baking in the semantics of the ways in which things are connected at the level of the data model. And then we can also attach properties to these relationships: so we can qualify the relationships with their strength or their weight or their quality. That's really useful when you've got a network and you want to be able to calculate the shortest route between two parts of that network: you can take advantage of the properties on the relationships in order to calculate the shortest path.
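
As a sketch of that last point, over an invented transport model: Cypher's built-in shortestPath finds the path with the fewest hops, while weighted shortest-path algorithms such as Dijkstra, which use a relationship property like the distance below, were available through Neo4j's traversal and graph algorithm APIs at the time:

    // A relationship has a name, a direction, and optional properties.
    CREATE (kc:Station {name: 'Kings Cross'})
           -[:CONNECTED {distance: 2}]->
           (an:Station {name: 'Angel'})

    // Fewest-hops path between two stations.
    MATCH (a:Station {name: 'Kings Cross'}), (b:Station {name: 'Angel'})
    MATCH p = shortestPath((a)-[:CONNECTED*]-(b))
    RETURN p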

So they're the three fundamental parts of the model that we've had for years: nodes, relationships and properties. And then more recently we revised the data model, at the end of last year, introducing a fourth primitive which we call labels. Labels you can attach to nodes; you can effectively label or tag nodes; and we use those labels to help you identify the role that a node plays within your dataset: this node represents a person or this node represents a product. So that gives us a very simple grouping semantic where you can say to the database, “Find me all of the nodes that are labeled product”, for example. But we also then use the labels to associate certain behaviors with those nodes, things such as indexing and constraints. So we can associate indexes and constraints with labels and we can say to the database, “For every node that’s labeled ‘book’, I want you to index it based on its ISBN property” or “Every node that’s labeled ‘book’, I want you to ensure that its ISBN property is unique within the context of this dataset”. So the label is a very, very simple way of tagging nodes, of helping express the role that they play in the dataset, but they also give us those additional capabilities of indexing and optional constraints that we can layer onto the structure.
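
Those indexing and uniqueness behaviors correspond directly to Cypher statements in the Neo4j 2.x releases of the time:

    // Index every node labeled Book on its isbn property.
    CREATE INDEX ON :Book(isbn)

    // Ensure no two Book nodes share the same isbn.
    CREATE CONSTRAINT ON (b:Book) ASSERT b.isbn IS UNIQUE

    // The simple grouping semantic: find all nodes labeled Product.
    MATCH (p:Product)
    RETURN p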


6. What kind of constraints can you do with that?

Today we have two kinds of behaviors, indexing and unique constraints - this is what we released at the end of last year - and we hang them both off labels. But effectively the labels will be where we attach additional constraints in the future. So, I think you can expect within the next year or so to see additional constraints, such as: every node that is labeled ‘person’ must have a first name and a last name property; or every node that is labeled ‘author’ can only be connected by way of outgoing ‘wrote’ relationships to nodes labeled ‘book’ - all different kinds of constraints. We don’t have those today - they are things that you have to build in the application - but they are some of the kinds of constraints that we have in mind for the future.
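
Until such constraints arrive, an application can approximate them with its own checks. A rough sketch, with illustrative labels and an assumed WROTE relationship type:

    // Person nodes that violate the intended "must have names" rule.
    MATCH (p:Person)
    WHERE p.firstName IS NULL OR p.lastName IS NULL
    RETURN p

    // Authors whose outgoing 'wrote' relationships point at non-book nodes.
    MATCH (a:Author)-[:WROTE]->(x)
    WHERE NOT x:Book
    RETURN a, x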


7. Neo4j is quite unusual in NoSQL in that it supports ACID transactions. Is it the data structure that informs that design choice?

Yes. Well, the data structure, plus the fact that we know that enterprises still care very much about their data, and care about the transactional characteristics when they are dealing with that data. Traditionally Neo4j has been seen as part of that larger family of NoSQL databases, but other NoSQL databases can surrender certain ACID properties in order to gain some other benefits - benefits around horizontal scalability or high availability. They can do that very often because they are dealing with what Martin Fowler calls aggregates: simple key-value pairs, or discrete documents, and so on.

We are slightly different. When you are working with a graph you are working with an intrinsically connected structure, and very often what you want to do is to create a sub-graph in its entirety: lots of nodes, lots of relationships that connect them, and all the properties attached to them. And when you persist that, you want to be confident that the entire structure - not just the discrete elements - has been made durable on disk. Hence we have transactions; every change to the database has to take place in the context of a transaction. That way we can guarantee that when control returns to the client, that entire sub-graph has been made durable on disk. And the advantage there is that we slot very easily into enterprises that require of us all of those guarantees around the quality and the durability of their data.
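
One way to picture that: a single Cypher statement executes within one transaction, so a sub-graph created by it becomes durable, or is rolled back, as a unit (an illustrative sketch):

    // The two nodes, the relationship and all their properties either all
    // become durable on disk together, or none of them do.
    CREATE (a:Person {name: 'Ada'}),
           (b:Person {name: 'Alan'}),
           (a)-[:KNOWS {since: 1936}]->(b)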

Charles: Presumably there are trade-offs though for having ACID compliance.

Yes. Again, one of the benefits of relaxing some of those ACID properties and dealing with simple, aggregate-oriented data - as you get in a key-value store or a column-oriented database - is that it makes it very, very easy to distribute that data and provide for very, very large horizontal scalability. With Neo4j today, it’s not so easy to distribute the graph data. We cluster, so Neo4j is a clustered database product, but every instance in that cluster has a full copy of the dataset. Actually distributing the data across machine boundaries and querying across machine boundaries is a really, really tricky problem to solve in the general case. We know for sure that by far the best performance when you are creating and querying a graph structure is going to be achieved when you can work within the context of a single machine. The ACID requirements, and the fact that we are dealing with an intrinsically connected structure, make it more difficult for us to easily distribute that data across different machines. At the same time, we get the benefits of working locally within a single instance when we are satisfying a request, and that can give us orders of magnitude improved performance over the equivalent kind of query in a relational database.

Charles: And the data replication itself is master-slave, I think.

Yes. So, we have a cluster of machines and one of those machines at any point in time will be acting as the master and coordinating all of the writes to the system, and then the other machines will be serving as slaves, polling at frequent intervals, pulling across recent transactions and applying them locally. This means that if you write directly to the master, by the time control returns to the client, that’s immediately consistent with regard to that particular interaction; but the overall system is eventually consistent, in the order of milliseconds. You can configure that polling interval, but there is that window of eventual consistency where the slaves are pulling across the recent transactions and catching up.


8. Do you have any upper limit on the number of slaves that you can have?

I think we are probably talking here in terms of clusters in the single digits, perhaps just touching the double digits. I’ve spun up clusters of 30-40 instances, but I only know of customers who have clusters of around 11-12 instances. But we are talking those kinds of numbers.


9. We talked earlier on in the conversation about Facebook as an example of a use case. I’d like to look at some other examples. So, route finding and logistics I guess is one for you, is it?

Route finding goes back to the roots of graph theory itself: discovering routes, and shortest routes, through a system; so being able to calculate routes on a transport network, for example, is one of the ways that graph databases, and Neo4j, are being employed today. In fact, the largest parcel network in Europe is using Neo4j today to route all of the parcels through that network; there are 2,000 - at peak time 3,000 - parcels per second entering the network at post offices and other delivery centers. And at that point in time they have a window of up to a second to calculate the most effective or efficient route through the parcel network to the ultimate destination, because the parcel is actually travelling down a chute and at the bottom of the chute it’s going to get switched one way or the other depending on where we think it should be headed.

So, we’ve got that little window of time in which to calculate the route through the entire parcel network; and you are talking about millions of locations all connected by way of different delivery routes. And then the problem is made even more difficult because those delivery routes are time-sensitive. You’ve got more trucks on the road at Christmas time, when there are more parcels in the network, than you do during the summer, so when you are calculating the route you have to be aware of which routes are available today and in the next few days looking forwards. But as I say, Neo4j there is being used to effectively calculate the routes for 2,000 - 3,000 parcels per second and it has to do that within milliseconds; before that parcel has reached the bottom of the chute.
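
A time-sensitive route query of roughly that shape might be sketched in Cypher as below; the labels, properties and parameter are invented here, and a production system at that scale would more likely use Neo4j's traversal framework with a Dijkstra-style algorithm:

    // Consider only routes valid on the shipping date, then pick the
    // cheapest candidate path (capped at 10 legs for illustration).
    MATCH p = (o:Centre {code: 'AMS'})-[legs:ROUTE*..10]->(d:Centre {code: 'MUC'})
    WHERE all(r IN legs WHERE r.validFrom <= {shipDate} AND r.validTo >= {shipDate})
    RETURN p, reduce(cost = 0, r IN legs | cost + r.cost) AS totalCost
    ORDER BY totalCost
    LIMIT 1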


10. And do you have customers using Neo4j for network impact analysis?

Yes, there are several different telcos employing it in one form or another for doing network impact analysis. The nice thing here is that in the past a lot of those telcos would know about the different parts of their network through having several different inventory systems; you’ve actually got a very heterogeneous set of elements in that network, from the customers themselves, through the applications and the services, the machines, the virtual machines and the data centers, and then all those lower level network elements: the fibers and the switches and the routers and so on. These companies are now using Neo4j to model the entirety of that network in the database, and having done that they can then use it for impact analysis.

You can do two kinds of impact analysis. You can do top-down: given this important customer, which parts of the network do they actually depend upon - which applications and services and machines and data centers and so on; do we have redundancy throughout the network on behalf of these people, and if we haven’t, how can we introduce it? The bottom-up analysis is: given this particular network element, this switch or this router, who is it going to impact if it fails or if we have to replace or repair it? So companies are using it to plan a lot of their repair programs so that they don’t knock out a path through the network that would impact their customers. And having done all of that, having modeled that in Neo4j, it’s helping reduce problem resolution time. In the past it would have taken hours or days just to query all of those inventory systems to identify the likely cause of the problem. Now you can identify the routes in seconds or milliseconds, and that naturally improves the response times.
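
Both directions of that analysis map naturally onto variable-length path queries. A hypothetical sketch, over an invented model with a single DEPENDS_ON relationship type:

    // Top-down: everything this customer depends upon, at any depth.
    MATCH (c:Customer {name: 'BigCo'})-[:DEPENDS_ON*]->(element)
    RETURN DISTINCT element

    // Bottom-up: which customers are impacted if this router fails?
    MATCH (r:Router {id: 'r-42'})<-[:DEPENDS_ON*]-(c:Customer)
    RETURN DISTINCT c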

Another pattern I am seeing with some of those customers: they’ll have the model of the network, which they will refresh at least daily, probably more often; but then they may have separate systems - complex event processors, perhaps - dealing with the stream of very low level network events, hundreds of thousands of events per second coming off the network. If those complex event processors identify a significant business event - having looked at that window of low level events they suddenly discover that it looks like there is an outage in this particular region - they can then interrogate the graph to discover the likely impact and to anticipate what kind of mitigation they need to take.


11. Do you have financial services companies maybe doing fraud detection work with Neo, as well?

Yes. In fact, I think we’ve written recently something on the Neo4j blog about using Neo for fraud detection. One of the use cases there is something called first party fraud, where you have a group of people who work together to defraud financial institutions. You’ve got a fraud ring - a number of people constitute a fraud ring - and they pool identity information: things like social security numbers, addresses, telephone numbers and so on. And those people will open multiple lines of credit - multiple accounts, credit cards and so on - picking different bits of identity information from that pool in order to support their application for that credit card or whatever. No two applications need share exactly the same set of identity information, but there is a lot of overlap: you’ve got a large number of people using the pool of identity information to open multiple lines of credit. And then for some period of time they will probably act like good citizens: they’ll pay their bills, they’ll pay off the credit card and so on. But at some point they’ll begin to extend those lines of credit, max out their credit cards, and then all of a sudden the entire ring will just close out, disappear off the radar and take all that money with them.

So, the graph there is being used to connect all those different things - the people, the accounts and the little bits of identity information - so that you can begin to identify the likelihood that you are dealing with a fraud ring. And you can apply this stuff at the point in time when someone is applying for a new account: you can take their current details, the identity details that they are submitting, see whether there are existing pieces in the graph, connect them to those existing pieces, introduce the new bits of identity where they are talking about something new - a property move, say - and so on. And then you can look at the graph that surrounds these bits of identity information and see how broad the span is. You can imagine if you‘ve got a family living at a particular address, a couple of the people there may have credit cards, bank accounts and so on; so you’d expect to see a small amount of shared information between several different accounts and several different people. But if the span is a lot broader, if you’ve got tens of people sharing that identity information, then the risk is that you are dealing with a fraud ring. So, the graph is being used there to help identify exposure and risk, and again, to help qualify applications at the point in time when someone is submitting an application online.
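
As a rough sketch of that breadth-of-span check, over a hypothetical model in which people link to shared identity details:

    // Identity details - addresses, phone numbers, SSNs - shared by
    // suspiciously many account holders.
    MATCH (detail:Identity)<-[:HAS_IDENTITY]-(p:Person)
    WITH detail, count(DISTINCT p) AS holders
    WHERE holders > 5
    RETURN detail, holders
    ORDER BY holders DESC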

Charles: That’s very interesting. Ian, thank you very much.

Aug 25, 2014
