Graph databases are NoSQL database systems that use a graph data model to store and process data.
Matt Aslett of the 451 Group notes that graph databases are now emerging from the general NoSQL umbrella as a category in their own right. The last 12-18 months have seen a shift in the graph space, with growth across the category for all things graph.
Data modeling with a graph database follows a different paradigm from the way you usually model data stored in relational databases or in other NoSQL stores such as document databases, key-value stores, and column-family databases. Graph data models can represent rich, highly connected data that reflects real-world use cases and applications.
InfoQ spoke with Jim Webber and Ian Robinson of the Neo Technology team (co-authors of the book Graph Databases) about data modeling efforts and best practices when using graph databases for data management and analytics.
InfoQ: What type of data is not suitable for storing in a Relational Database but is a good candidate to store in a Graph Database?
Jim & Ian: That’s pretty straightforward to answer: anything that’s interconnected either immediately, because the coding and schema design is complicated, or eventually, because of the join bomb problem [http://blog.neo4j.org/2013/01/demining-join-bomb-with-graph-queries.html] inherent in any practical application of the relational model.
Relational databases are fine things, even for large data sets, up to the point where you have to join. And in every relational database use case that we’ve seen, there’s always a join — and in extreme cases, when an ORM has written and hidden particularly poor SQL, many indiscriminate joins.
The problem with a join is that you never know what intermediate set will be produced, meaning you never quite know the memory use or latency of a query with a join. Multiplying that out with several joins means you have enormous potential for queries to run slowly while consuming lots of (scarce) resources.
InfoQ: What are the advantages of using a Graph Database over a relational database?
Jim & Ian: As an industry we've become rather creative at forcing all kinds of data into a relational database (and we've become philosophical about the consequences!). Relational databases are truly the golden hammer of computer systems development, to the point where many of us are reluctant to drop RDBMS from our tool chain in favour of something better, simply because of familiarity.
However, there are often two drivers underlying a move from a relational database to Neo4j. The first is the observation that your domain is a connected data structure (e.g. social network, healthcare, rights management, real-time logistics, recommendations...) and then understanding that such domains are easy and pleasant to store and query in Neo4j, but difficult and unpleasant in a relational database. Typically these cases are driven by technologists who understand — at least to some degree — that they are dealing with a graph problem and are prepared to use Neo4j to solve that graph problem elegantly and quickly — they don't want to be stuck in a miasma of sparse tables and the über-join table.
The second driver is performance. Join pain in relational databases is debilitating for systems that use them. Perhaps your first join is performant; if you're lucky, maybe even your second too. But as data set sizes grow, confidence in those joins diminishes as query times get longer and longer. Join-intensive models usually come about because they're trying to solve some kind of connection or path problem, but the maths underlying relational databases simply isn't well suited to emulating path operations. Neo4j has no such penalties for path operations: query times scale linearly with the amount of data you choose to explore as part of your query, not with the overall size of the dataset (which can be enormous). So if you have join pain, that's another indicator that a Neo4j graph will be a superior solution to a complex data model in a relational database.
InfoQ: What are the advantages of using a Graph Database over other NoSQL databases?
Jim & Ian: Fowler and Sadalage answer this, we think, very clearly in their book NoSQL Distilled. They point out that of the four types of NoSQL store (graph, key-value, column, and document) three of these, key-value, column and document, are what they term “aggregate stores.” Aggregate stores work well when the pattern of storage and retrieval is symmetric. Store a shopping basket by key, retrieve it by key; store a customer document and retrieve it, and so on.
But when you want to analyse the data across aggregates things get trickier. How do you find the popular products for different customer demographics? How do you do this in real-time, as you’re serving a customer on your system right now?
These activities, though basic in domain terms, are tricky to solve with aggregate stores. As a result, developers using these stores are forced to compute rather than query to get an answer. This is why aggregate stores are so keen on map-reduce style interactions.
Neo4j’s data model is far more expressive than aggregate stores or relational databases. Importantly, the model stresses connectivity as a first class concept. It is connectivity in the graph between customers, products, demographics and trends that yields the answers to these kinds of real-time analytics problems. Neo4j provides answers by traversing (querying) the graph rather than resorting to latent map-reduce computations.
In a graph you bring together arbitrary dimensions (different relationship types) at query time to answer sophisticated questions with ease and excellent performance. In non-native graph databases (which includes the other kinds of NoSQL stores) traversals are faked: they happen at the application level in code you have to write and maintain. They're also over the network and are orders of magnitude slower than the graph-native, highly optimised graph query engine that underpins Neo4j.
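To illustrate, here is a hedged Cypher sketch of the "popular products for a demographic" question from above; the labels, relationship types and property names are assumptions for illustration, not a prescribed schema:

// top products bought by customers in a given demographic,
// computed by traversal at query time rather than by map-reduce
MATCH (:Demographic {name: '18-25'})<-[:IN_DEMOGRAPHIC]-(customer:Customer)-[:BOUGHT]->(product:Product)
RETURN product.name AS product, count(customer) AS purchases
ORDER BY purchases DESC
LIMIT 10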
InfoQ: Can you discuss the typical data modelling process when using a Graph database?
Jim & Ian: Neo4j uses a graph model called the "labelled property graph." This is a pragmatic model that eschews some of the more esoteric mathematical bits of graph theory in favour of ease-of-understanding and design.
The labelled property graph consists of nodes (which we typically use to represent entities) connected by relationships. Nodes can be tagged with one or more labels to indicate the role each node plays within our dataset. Every relationship has a name and a direction, which together provide semantic context for the nodes connected by the relationship. Both nodes and relationships can contain one or more properties. We typically use node properties to represent the attributes of an entity, and relationship properties to specify the weight, strength or quality of a particular relationship.
These graph primitives provide us with a simple, compact, and easy to reason about modelling kit. For example, using Neo4j's Cypher query language, we can easily describe that Alice loves cookies:
(:Person {name: 'Alice'})-[:LOVES]->(:Food {type: 'Cookie'})
This path expression says that there's a node representing a Person named Alice who loves a particular food type, cookie. It's easy to imagine how repeating this many times gives a large and interesting graph of people and their food preferences (or allergies, or dislikes, etc).
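As a minimal sketch, the same pattern can be created and then queried back in Cypher (reusing the labels and properties shown above):

// record the fact that Alice loves cookies
CREATE (:Person {name: 'Alice'})-[:LOVES]->(:Food {type: 'Cookie'})

// later: find everyone who loves cookies
MATCH (person:Person)-[:LOVES]->(:Food {type: 'Cookie'})
RETURN person.name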
Data modelling consists of using the property graph primitives — nodes, relationships, properties and labels — to build an application-specific graph data model that allows us to easily express the questions we want to ask of that application's domain.
When building an application with Neo4j, we typically employ a multi-step process, which starts with a description of the problem we're trying to solve and ends with the queries we want to execute against an application-specific graph data model. This process can be applied in an iterative and incremental manner to build a data model that evolves in step with the iterative and incremental development of the rest of the application.
Step 1. Describe the client or end-user goals that motivate our model

What's the problem we're trying to solve? We've found that agile user stories are great for providing concise, natural-language descriptions of the problems our model is intended to address, but pretty much any description of a requirement can serve as the basis of our modelling efforts.
Step 2. Rewrite those goals as questions we would have to ask of our domain

An agile user story describes what we're trying to achieve with our application. By rewriting each application goal in terms of the questions the domain would have to answer to achieve that goal, we take a step towards identifying how we might go about implementing that application. We're still dealing with an informal description of the solution at this stage, but the domain-specific questions we describe now provide rich input for the next step of the process.
Step 3. Identify the entities and the relationships between them that appear in these questions

Language itself is a structuring of logical relationships. By attending closely to the language we use to describe our domain and the questions we want to ask of it, we can readily identify a graph structure that represents this logical structuring in terms of nodes, relationships, properties and labels. Common nouns — "person" or "company", for example — tend to refer to groups or classes of things, or the roles that such things perform: these common nouns become candidate label names. Verbs that take an object indicate how things are connected: these then become candidate relationship names. Proper nouns — a person or company's name, for example — tend to refer to an instance of a thing, which we model as a node and its properties.
Step 4. Translate these entities and relationships into Cypher path expressions

In this step we formalize the description of our candidate nodes, relationships, properties and labels using Cypher path expressions. These path expressions form the basis of our application graph data model in that they describe the paths, and hence the structure, that we would expect to find in a graph dedicated to addressing our application's needs.
Step 5. Express the questions we want to ask of our domain as graph patterns using path expressions similar to the ones we used to model the domain
Having derived the basic graph structure from the questions we would want to ask of our domain, we're now in a position to express the questions themselves as graph queries that target this same structure. At the heart of most Cypher queries is a MATCH clause containing one or more path expressions that describe the kind of graph structure we want either to find or to create inside our dataset. If we've been diligent in allowing our natural-language descriptions of the domain to guide the basic graph structure, we will find that many of the queries we execute against our data use similar path expressions to the ones we used to structure the graph. The key point here is that the resulting structure is an expression of the questions we want to ask of the domain: the model is isomorphic to the queries we want to execute against it.
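As a small worked example of these five steps, consider the food-preference domain from earlier; the user story and the FRIEND relationship are illustrative assumptions:

// Steps 1-2: "As a host, I want to know which foods my friends love, so that I can plan a menu."
//            Question: which foods do the friends of a given person love?
// Steps 3-4: nouns and verbs become candidate labels and relationships, expressed as paths:
//            (:Person)-[:FRIEND]->(:Person) and (:Person)-[:LOVES]->(:Food)
// Step 5:    the question itself, as a query over the same structure
MATCH (:Person {name: 'Alice'})-[:FRIEND]->(friend:Person)-[:LOVES]->(food:Food)
RETURN friend.name, food.type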
InfoQ: Where should the modelling happen, in the database or application layer?
Jim & Ian: As will be evident from our description of the modelling process, much of the modelling takes place in the database. The labelled property graph primitives allow us to create extremely expressive, semantically rich graph models, with little or no accidental complexity — there are no foreign keys or join tables, for example, to obscure our modelling intent.
That said, there are several characteristics of the domain that are best implemented in the application. Neo4j doesn't store behaviour. Nor does it have any strict concept of what Domain-Driven Design calls an aggregate — a composite structure bounded by a root entity that controls access to and manages the lifecycle of the whole.
That the graph has no notion of an aggregate boundary (beyond the boundary imposed by a node's record-like structure) is not necessarily a drawback. The graph model, after all, emphasizes interconnectedness. Some of the most valuable insights we can generate from our data require us to take account of connections between things that in any other context would be considered discrete entities. Many predictive analysis and forensic analysis techniques depend on our being able to infer or identify new composites, new boundary-breaking connected structures that don’t appear in our initial conception of the domain. Cypher’s rich path expression syntax, with its support for variable-length paths and optional subgraph structures, effectively allows us to identify and materialize these new composite structures at query time.
InfoQ: What are "Hyperedges" and how should they be modelled?a
Jim & Ian: Hyperedges come from a different graph model known as "hypergraphs." A hyperedge is a special kind of relationship that connects more than two nodes. You can imagine a (somewhat contrived) example where you 'like' something on Facebook and your friend 'likes' that like. While they're beloved of theoreticians and some other graph databases, hyperedges are not a first class citizen in Neo4j. Our experience is that they’re only useful in a relatively small number of use cases. Their narrow utility is offset by their additional complexity and intellectual cost.
You can, of course, model hyperedges in Neo4j simply by adding another node (an intermediate node) into the graph. For example, we could recast the original like, (alice)-[:LIKES]->(post), as (alice)-[:CREATED]->(like)-[:FOR]->(post). Now that we have the (like) node, it's easy to like it in turn: (bob)-[:LIKES]->(like). This gives you hyperedge-equivalent functionality when you need it, while avoiding those complexities when you don't (which is most of the time).
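A minimal Cypher sketch of this intermediate-node pattern (the Like and Post labels and the id property are illustrative assumptions):

// recast the binary like as an intermediate node...
CREATE (alice:Person {name: 'Alice'})-[:CREATED]->(like:Like)-[:FOR]->(post:Post {id: 1})
// ...so that the like itself can now be liked
CREATE (bob:Person {name: 'Bob'})-[:LIKES]->(like)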
InfoQ: What are the best practices when modelling the graph data?
Jim Webber & Ian Robinson:
- Derive your relationship names from your use cases
Doing so creates paths in your model that align easily with the patterns you want to find in your data. This ensures that the queries you derive from your use cases only see these paths, thereby eliminating irrelevant parts of the surrounding graph from consideration. As new use cases emerge, you can reuse existing relationships or introduce new ones, as needs dictate.
- Use intermediate nodes to connect multiple dimensions
Intermediate nodes represent domain-meaningful hubs that connect an arbitrary number of entities. A job node, for example, can connect a person, a role and a company to represent a time-bounded instance of employment. As we learn more about the context of that job (the location, for example, where the individual worked), we can enrich our initial structure with additional nodes (see the sketch after this list).
- Connect nodes in linked lists to represent logical or temporal ordering
Using different relationships, we can even interleave linked lists. The episodes of a TV programme, for example, can be connected in one linked list to represent the order in which they were broadcast (using, say, NEXT_BROADCAST relationships), while at the same time being connected in another linked list to represent the order in which they were made (using NEXT_IN_PRODUCTION relationships). Same nodes, same entities, but two different structures.
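A hedged sketch of the last two practices follows; NEXT_BROADCAST and NEXT_IN_PRODUCTION are as described above, while the remaining labels, relationship names and properties are illustrative assumptions:

// an intermediate job node connecting a person, a role and a company
CREATE (:Person {name: 'Alice'})-[:HAS_JOB]->(job:Job)-[:AT]->(:Company {name: 'Acme'}),
       (job)-[:AS_ROLE]->(:Role {title: 'Engineer'})

// two interleaved linked lists over the same episode nodes
CREATE (e1:Episode {title: 'Episode 1'}), (e2:Episode {title: 'Episode 2'}),
       (e1)-[:NEXT_BROADCAST]->(e2),      // broadcast order
       (e2)-[:NEXT_IN_PRODUCTION]->(e1)   // production order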
InfoQ: Can you discuss the design considerations for the graph traversal requirements?
Jim & Ian: Unlike RDBMS, query latency in a graph is proportional to how much of the graph you choose to search. That means as a query designer you should try to minimize the amount of graph you search. You should search just enough to answer your query.
In Neo4j that means adding as many constraints into your query as you can. Constraints will prevent the database from searching paths that you already know will be fruitless for your result. For example, in a large social graph if you know that you’re only interested in other people who are single and share a compatible sexual orientation/hobbies/interests, you should constrain your search accordingly. That way you’ll avoid visiting parts of that social network with incompatible sexual orientation/boring hobbies/bizarre interests, and you’ll get an answer much more quickly.
In Cypher, you can express a poor match very loosely, like this:
(me)-[*]->(other)
This query matches all relationship types, at any depth (that’s the ‘*’), to any kind of thing. As a result, it would likely visit the whole graph many times over, thereby impacting performance. In contrast, because we know some domain invariants, we can instead cast this query as:
(me)-[:FRIEND]->()-[:FRIEND]->(other)
Here we’ve constrained both the relationship type and the depth to find only friends-of-friends (because dating friends is yucky). This is better, but we can go even further, adding additional constraints such as gender, sexual preference, and interests:
(me)-[:FRIEND]->()-[:FRIEND]->(other:Heterosexual:Female)-[:LIKES]->(:TV_SHOW {title: 'Doctor Who'})

Now we’ll only match against heterosexual females (the two colon-separated labels on the other node) who also like the TV show entitled Doctor Who. Neo4j can now aggressively prune any parts of the graph that don’t match, significantly reducing the amount of work that needs to be done, and thereby keeping latency very low (typically small milliseconds even for large data sets).
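Wrapped in a complete query (with an assumed Person label and name property to anchor the starting node), the sketch might read:

// friends-of-friends who satisfy all the constraints
MATCH (me:Person {name: 'Jim'})-[:FRIEND]->()-[:FRIEND]->(other:Heterosexual:Female),
      (other)-[:LIKES]->(:TV_SHOW {title: 'Doctor Who'})
RETURN DISTINCT other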
InfoQ: Are there any anti-patterns when working with graph data?
Jim & Ian: There certainly are anti-patterns that can catch out the unwary, or people who are new to graph modelling. For example, in the O’Reilly book “Graph Databases” we discuss a case of email forensics (in Chapter 3). In that example, we’re looking for potentially illegal activities by individuals in an organization swapping information for purposes like insider trading (think: Enron).
When we design graph models, we sanity check the graph by reading it. For example, if we had the structure (Alice)-[:EMAILED]->(Bob) then we might think we’ve built a sound model since it reads well left-to-right (Alice emailed Bob) and still makes sense the other way round (Bob was emailed by Alice).
Initially we went with this model, but it soon became apparent that it was lossy. When we came to query an actual email that could violate communications policy, we found the email didn’t actually exist – quite a problem! Furthermore, where we expected to see several emails confirming the corrupt activity, all we saw was that Alice and Bob had emailed each other several times. Because of our imprecise use of English we’d accidentally encoded a core domain entity — the email itself — into a relationship when it should have been a node.
Fortunately, once we’d understood that we’d created a lossy model, it was straightforward to correct it using an intermediate node: (Alice)-[:SENT]->(email)-[:TO]->(Bob). Now that we have an intermediate node representing the email, we know who sent it (Alice) and who it was addressed to (Bob). It’s also easy to extend the model so that we can capture who was CC’d or BCC’d, like so: (Charlie)<-[:CC]-(email)-[:BCC]->(Daisy). From here it’s easy to see how we’d construct a large-scale graph of all email exchanged and map out patterns that would catch anyone violating the rules — patterns we’d have missed if we hadn’t thought carefully about nodes and relationships.
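As a hedged sketch, the corrected model can be built and queried like this (the id property is an illustrative assumption):

CREATE (:Person {name: 'Alice'})-[:SENT]->(email:Email {id: 1}),
       (email)-[:TO]->(:Person {name: 'Bob'}),
       (:Person {name: 'Charlie'})<-[:CC]-(email),
       (email)-[:BCC]->(:Person {name: 'Daisy'})

// every email Alice sent to Bob now exists as a queryable node
MATCH (:Person {name: 'Alice'})-[:SENT]->(e:Email)-[:TO]->(:Person {name: 'Bob'})
RETURN e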
If there was just one piece of advice for people coming fresh to graph modelling, it would be: Don’t (accidentally) encode entities as relationships.
InfoQ: Can you talk about any gotchas or limitations of graph databases?
Jim & Ian: The biggest gotcha we both found when moving from RDBMS to graphs some years back (when we were users of early Neo4j versions, way before we were developers on the product) was the years of entrenched learning of relational modelling. We found it difficult to leave behind the design process of creating a normalized model and then denormalizing it for performance — we didn’t feel like we’d worked hard enough with the graph to be successful.
It was agonizing in a way. You’d finish up some piece of work quickly and the model would be performant and high fidelity, then you’d spend ages worrying that you’d missed something.
Fortunately today there are materials to help get people off the ground far more quickly. There are good books, blog posts, Google groups and a healthy meetup community all focussed around graphs and Neo4j. That sense of not working hard enough is quickly dispelled once you’ve gotten involved with graphs and you get on with the real work of modelling and querying your domain — the data model gets out of the way, which is as it should be.
InfoQ: What is the current status of standards in the graph data management space in the areas of data queries, traversal, analytics, etc.?
Jim & Ian: As with most NOSQL stores, it’s a little early to think about standards (though the folks at Oracle would seemingly disagree). At last year’s NOSQL Now! conference the general opinion was that operational interfaces should be standardized so that users can plug their databases into a standard monitoring and alerting infrastructure. Indeed that’s now happening.
In terms of standards specific to the graph space, it’s still premature. For example, Neo4j recently introduced labels for nodes, something that didn’t exist in the previous ten-year history of the database. Now that graph databases are taking off, this kind of rapid innovation in the model and in the implementation of graph databases means it’s too early to try to standardize: our industry is still too immature, and the lack of standards is definitely not inhibiting take-up of graph databases, which are now the fastest-growing segment of the whole data market (RDBMS and NOSQL).
In the meantime, since Neo4j accounts for in excess of 90% of the graph database market, it acts as a de facto standard itself, with a plethora of third-party connectors and frameworks to plug it into your application or monitoring stack.
InfoQ: What is the future road map of graph databases in general and Neo4j in particular?
Jim & Ian: Graphs are a very general-purpose data model. Much as we have seen RDBMS become a technology that has been widely deployed in many domains in the past, we expect graph databases to be even more widely deployed across many domains in the future. That is, we expect graph databases to be the default choice and the first model than you think of when you hear the word “database.”
To do that, there are a few things that we need to solve. Firstly, graph databases need to be even easier to use — the out-of-the-box experience has to be painless or even pleasurable. Although the model is always going to take a little more learning than a document store (it is, after all, a richer model) there are certain mechanical things we can do just to make that learning curve easier.
You’ve seen this with the recent release of Neo4j 2.0 where we introduced labels to the graph model, provided declarative indexing based on those labels, and built a fabulous new UI with excellent visualization and a new version of Cypher with a productive REPL. But our efforts don’t end there: soon we’ll be releasing a point version of Neo4j that takes all the pain out of bulk imports to the database (something all users do at least once) and then we’ll keep refining our ease-of-use.
The other research and development thread that we’re following is performance and scale. For example, we’ve got some great ideas to vertically scale ACID transactional writes in Neo4j by orders of magnitude using a write window to batch IO. From the user’s perspective Neo4j remains ACID compliant, but under the covers we amortize the cost of IO across many writes. This is only possible in native graph databases like Neo4j because we own the code all the way down to the disk and can optimize the whole stack accordingly. Non-native graph databases simply don’t have this option.
For performance in the large, we’re excited by the work by Bailis et al on Highly Available Transactions, which provide non-blocking transactional agreement across a cluster, and by RAMP transactions, which maintain ACID constraints in a database cluster while allowing non-contending transactions to execute in parallel. These kinds of ideas form the mechanical bedrock for our work on highly distributed graph databases, on which you’ll hear more from us in the coming months.
About the Interviewees
Jim Webber is Chief Scientist at Neo Technology and a distributed systems specialist working on very large-scale graph data technology.
Ian Robinson is an engineer at Neo Technology. He has also served as Neo's Director of Customer Success, working with customers to design and develop mission-critical graph database solutions.