Full Stack Web Development Using Neo4j

When building a full stack web application, you have many choices for the database at the bottom of the stack. As the source of truth, it must certainly be dependable, but it should also let you model your data well. In this article, I’ll discuss why Neo4j is a good choice as your web application stack’s foundation if your data model contains a lot of connected data and relationships.

What is Neo4j?

Figure 1. Neo4j Web Console

Neo4j is a graph database, which means, simply, that rather than being stored in tables or collections, data is stored as nodes and relationships between nodes. In Neo4j, both nodes and relationships can contain properties with values. In addition:

  • Nodes can have zero or more labels (like Author or Book)
  • Relationships have exactly one type (like WROTE or FRIEND_OF)
  • Relationships are always directed from one node to another (but can be queried regardless of direction)
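
As a minimal sketch of these rules, the following Cypher creates two labelled nodes joined by one typed, directed relationship, and then matches that relationship without specifying a direction. The property names (name, title, year) and their values are assumptions invented for illustration, not part of any real dataset. The two statements are meant to be run one after the other.

```cypher
// Statement 1: create two labelled nodes and a directed, typed relationship.
// All property names and values here are illustrative assumptions.
CREATE (author:Author {name: "Jane Doe"})
CREATE (book:Book {title: "An Example Book"})
CREATE (author)-[:WROTE {year: 2015}]->(book);

// Statement 2: match the same relationship while ignoring its direction
// (note that the pattern has no arrowhead).
MATCH (book:Book)-[wrote:WROTE]-(author:Author)
RETURN author.name, wrote.year, book.title;
```

The relationship is stored as pointing from the Author to the Book, but the second statement finds it anyway because the pattern leaves the direction open.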

Why Neo4j?

To start thinking about choosing a database for a web application we should consider what it is that we want. Top criteria include:

  • Is it easy to use?
  • Will it let you easily respond to changes in requirements?
  • Is it capable of high performance queries?
  • Does it allow for easy data modelling?
  • Is it transactional?
  • Does it scale?
  • Is it fun (sadly an often overlooked quality in a database)?

In this respect Neo4j fits the bill nicely. Neo4j...

  • Has its own easy-to-learn query language (called Cypher)
  • Is schema-less, which allows it to be whatever you want it to be
  • Can perform queries on highly related data (graph data) much faster than traditional databases
  • Has an entity and relationship structure which naturally fits human intuition
  • Supports ACID-compliant transactions
  • Has a High Availability mode for query throughput scaling, backups, data locality, and redundancy
  • Has a visual query console which is hard to get tired of

When not to use Neo4j?

While Neo4j, as a graph NoSQL database, has a lot to offer, no solution can be perfect. Some use cases where Neo4j isn’t as good a fit:

  • Recording large amounts of event-based data (such as log entries or sensor data)
  • Large scale distributed data processing like with Hadoop
  • Binary data storage
  • Structured data that’s a good candidate to be stored in a relational database

In the example above you can see a graph of Authors, Cities, Books, and Categories as well as the relationships that tie them together. If you wanted to use Cypher to show that result in the Neo4j web console you could execute the following:

 MATCH
   (city:City)<-[:LIVES_IN]-(:Author)-[:WROTE]->
   (book:Book)-[:HAS_CATEGORY]->(category:Category)
 WHERE city.name = "Chicago"
 RETURN *

Note the ASCII-art syntax showing nodes surrounded by parentheses and an arrow representing the relationship pointing from one node to the other. This is Cypher’s way of allowing you to match a given subgraph pattern.
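
As a further sketch of the same pattern syntax, giving the Author node a variable lets the query return values from it rather than the whole subgraph. The author.name and book.title properties are assumptions for illustration; the queries in this article only show a name property on City and Category nodes.

```cypher
// The Author node is bound to the variable "author" so it can be used in RETURN.
// author.name and book.title are assumed property names.
MATCH (city:City)<-[:LIVES_IN]-(author:Author)-[:WROTE]->(book:Book)
WHERE city.name = "Chicago"
RETURN author.name, book.title
```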

Of course, Neo4j isn’t just about showing pretty graphs. If you wanted to count books per category by the location (city) of the author, you could use the same MATCH pattern and just return a different set of columns, like so:

 MATCH
   (city:City)<-[:LIVES_IN]-(:Author)-[:WROTE]->
   (book:Book)-[:HAS_CATEGORY]->(category:Category)
 RETURN city.name, category.name, COUNT(book)

That would return the following:

 city.name    category.name    COUNT(book)
 Chicago      Fantasy          1
 Chicago      Non-Fiction      2

While Neo4j can handle "big data", it isn't Hadoop, HBase or Cassandra, and you won't typically be crunching massive (petabyte-scale) analytics directly in your Neo4j database. But when you are interested in serving up information about an entity and its data neighborhood (as you would when generating a web page or an API result), it is a great choice, whether you need simple CRUD access or a complicated, deeply nested view of a resource.

Which stack should you use with Neo4j?

All major programming languages have support for Neo4j via the HTTP API, either via a basic HTTP library or via a number of native libraries which offer higher level abstractions. Also, since Neo4j is written in Java, all languages which have a JVM interface can take advantage of the high-performance APIs in Neo4j.

Neo4j also has its own “stack” to allow you to choose different access methods ranging from easy access to raw performance. It offers:

  • An HTTP API for making Cypher queries and retrieving results in JSON
  • An "unmanaged extension" facility in which you can write your own endpoints for your Neo4j database
  • A Java API for specifying traversals of nodes and relationships at a higher level
  • A low level batch-loading API for massive initial data ingestion
  • A core Java API for direct access to nodes and relationships for maximum performance

An application example

Recently I had the opportunity to take on a project to expand a Neo4j-based application. The application (which you can see at graphgist.neo4j.com) is a portal for GraphGists. A GraphGist is an interactively rendered (in your browser) document based on a simple text file (AsciiDoctor) which contains prose and images describing the data model, setup and use-case queries that are executed and visualized live. It's much like an IPython notebook or an interactive whitepaper. GraphGists also allow readers to write their own queries to explore the dataset from the browser.

Neo Technology, the creators of Neo4j, wanted to provide a showcase for GraphGists created by the community-at-large. Neo4j was used as the back-end, of course, but for the rest of the stack I used:

All of the code is open sourced and can be viewed on GitHub.

The GraphGist portal is a simple app conceptually, providing a list of GraphGists and allowing users to view details about each as well as the GraphGist itself. The data domain consists of Gists, Keywords/Domains/Use Cases (as Gist categories) and People (as authors):

Now that you’re familiar with the model, I’d like to give you a quick intro to the Cypher query language before we dig in deeper. For example, if we wanted to return all Gists and their keywords, we could do the following:

MATCH (gist:Gist)-[:HAS_KEYWORD]->(keyword:Keyword)
RETURN gist.title, keyword.name

This would give a table with one row for every Gist and Keyword combination, just like an SQL join. To go a bit deeper, if we wanted to find all of the Domains that a given Person has written Gists for, we could perform the following query:

MATCH (person:Person)-[:WRITER_OF]->(gist:Gist)-[:HAS_DOMAIN]->(domain:Domain)
WHERE person.name = "John Doe"
RETURN domain.name, COUNT(gist)

This would return another table of results. Each row of the table would have the name of the Domain accompanied by the number of Gists that the Person has written for that Domain. No need for a GROUP BY clause because when we use an aggregate function like COUNT() Neo4j automatically groups by the other columns in the RETURN clause.
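
To illustrate the implicit grouping, here is a sketch of the same query without the WHERE clause and with person.name added to the RETURN. Because person.name and domain.name are the non-aggregated columns, together they form the grouping key, so the result should contain one row per Person and Domain combination. The gist_count alias is just a name chosen for this example.

```cypher
MATCH (person:Person)-[:WRITER_OF]->(gist:Gist)-[:HAS_DOMAIN]->(domain:Domain)
// person.name and domain.name are not aggregated, so they act as the grouping key
RETURN person.name, domain.name, COUNT(gist) AS gist_count
ORDER BY gist_count DESC
```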

Now that you’ve gotten a feel for Cypher, let’s look at a real-world query from our app. When building the portal, it was useful to be able to make just one request to the database and get back all of the data we need, in almost exactly the structure we want.

Let’s build the query which is used by the portal’s API (and can be viewed on GitHub). First, we need to match the Gist in question by its title property and match any related Gist nodes:

 // Match Gists based on title
 MATCH (gist:Gist) WHERE gist.title =~ {search_query}
 // Optionally match Gists with the same keyword
 // and pass on these related Gists with the
 // most common keywords first
 OPTIONAL MATCH (gist)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(related_gist)

There are a couple of things to note here. Firstly, the WHERE clause matches the title using a regular expression (that’s the =~ operator) and a parameter. Parameters are a Neo4j feature which separates the query from the data that the query uses. Using parameters lets Neo4j cache queries and query plans, and it also protects you from query injection attacks, because the parameter value is never parsed as part of the query. Secondly, we’re using an OPTIONAL MATCH clause, which simply means that we still want to return the Gists we originally matched, even if they have no related Gists.
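
One detail worth keeping in mind: =~ uses Java regular expression syntax and has to match the whole string, so a "contains" style search needs wildcards on both sides. Here is a minimal sketch, assuming the application passes a value such as "(?i).*movie.*" for search_query (a hypothetical value; the (?i) prefix makes the match case-insensitive). The {search_query} form is the parameter syntax used throughout this article; newer Neo4j releases write parameters as $search_query instead.

```cypher
// search_query is supplied by the application alongside the query text,
// for example "(?i).*movie.*" (a hypothetical value)
MATCH (gist:Gist)
WHERE gist.title =~ {search_query}
RETURN gist.title
```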

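Now let’s take that part of the query and expand on it with a WITH clause, which passes each Gist and related Gist on to the rest of the query together with the number of keywords they share, followed by a RETURN clause: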

MATCH (gist:Gist) WHERE gist.title =~ {search_query}
 OPTIONAL MATCH (gist)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(related_gist)
 WITH gist, related_gist, COUNT(DISTINCT keyword.name) AS keyword_count
 ORDER BY keyword_count DESC

 RETURN
   gist,
   COLLECT(DISTINCT {related: { id: related_gist.id, title: related_gist.title, poster_image: related_gist.poster_image, url: related_gist.url }, weight: keyword_count }) AS related

The COLLECT() in the RETURN serves to transform a result with pairs of gist and related_gist nodes into a result where each row has the gist only once along with an array of related_gist nodes. Inside of the COLLECT() we specify only the data from the related gists that we need in order to reduce the size of our response.
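
To see what COLLECT() does in isolation, here is a smaller sketch based on the earlier Gist and Keyword query. Instead of one row per Gist and Keyword combination, it should return one row per distinct Gist title with an array of keyword names. The keywords alias is a name chosen for this example.

```cypher
MATCH (gist:Gist)-[:HAS_KEYWORD]->(keyword:Keyword)
// gist.title is the grouping key; COLLECT() gathers the keyword names of each group into an array
RETURN gist.title, COLLECT(keyword.name) AS keywords
```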

Lastly, we’ll take the query so far and use WITH one last time:

 MATCH (gist:Gist) WHERE gist.title =~ {search_query}
 OPTIONAL MATCH (gist)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(related_gist)
 WITH gist, related_gist, COUNT(DISTINCT keyword.name) AS keyword_count
 ORDER BY keyword_count DESC

 WITH
   gist,
   COLLECT(DISTINCT {related: { id: related_gist.id, title: related_gist.title, poster_image: related_gist.poster_image, url: related_gist.url }, weight: keyword_count }) AS related

 // Optionally match domains, use cases, writers, and keywords for each Gist
 OPTIONAL MATCH (gist)-[:HAS_DOMAIN]->(domain:Domain)
 OPTIONAL MATCH (gist)-[:HAS_USECASE]->(usecase:UseCase)
 OPTIONAL MATCH (gist)<-[:WRITER_OF]-(writer:Person)
 OPTIONAL MATCH (gist)-[:HAS_KEYWORD]->(keyword:Keyword)

 // Return one Gist per row with arrays of domains, use cases, writers, and keywords
 RETURN
   gist,
   related,
   COLLECT(DISTINCT domain.name) AS domains,
   COLLECT(DISTINCT usecase.name) AS usecases,
   COLLECT(DISTINCT keyword.name) AS keywords,
   COLLECT(DISTINCT writer.name) AS writers
 ORDER BY gist.title

In this last part we optionally match all associated Domains, Use Cases, Keywords, and Person nodes and collect them together just like we did with related Gists. Rather than having a flat, denormalized result we can now return a list of Gists with arrays of associated “has many” relationships without any duplication. Pretty cool!
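
The DISTINCT inside each COLLECT() matters because consecutive OPTIONAL MATCH clauses multiply rows: a Gist with two Domains and three Keywords produces six intermediate rows. The following sketch, cut down to two of the relationships, makes that visible by counting the rows next to the collected arrays. The intermediate_rows alias is a name invented for this example. COLLECT() also skips null values, so a Gist with no Domain should simply get an empty array.

```cypher
MATCH (gist:Gist)
OPTIONAL MATCH (gist)-[:HAS_DOMAIN]->(domain:Domain)
OPTIONAL MATCH (gist)-[:HAS_KEYWORD]->(keyword:Keyword)
RETURN
  gist.title,
  // Number of rows produced for this title before aggregation collapses them
  COUNT(*) AS intermediate_rows,
  COLLECT(DISTINCT domain.name) AS domains,
  COLLECT(DISTINCT keyword.name) AS keywords
ORDER BY gist.title
```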

In addition, if tables of data are too old school for you, then Cypher can return objects as well:

 RETURN
   {
     gist: gist,
     domains: collect(DISTINCT domain.name),
     usecases: collect(DISTINCT usecase.name),
     writers: collect(DISTINCT writer.name),
     keywords: collect(DISTINCT keyword.name),
     related_gists: related
   }
 ORDER BY gist.title

Traditionally, in a decently sized web application, a number of database calls are needed to populate the HTTP response. Even if you can execute queries in parallel, it is often necessary to get the results of one query before you can make a second to get related data. In SQL you can write joins to get results from many tables in one query, but anybody who has done more than a couple of SQL joins in the same query knows how quickly that can get complicated and expensive. On top of that, the database still needs to do table or index scans to get the associated data. In Neo4j, retrieving entities via relationships uses pointers directly to the related nodes, so the server can traverse right to where it needs to go.

That said, there are a couple of downsides to this approach. While it’s possible to retrieve all of the data required in one query, the query is quite long. I haven’t yet found a way to modularize it for reuse. Along the same lines: we might want to use this same endpoint in another place but show some more information about the related Gists. We could modify the query to return that data but then it would be returning unnecessary data for the original use case.

We are fortunate today to have many excellent database choices. While relational databases are still the best choice for storing structured data, NoSQL databases are a good choice for managing semi-structured, unstructured, and graph data. If you have a data model with a lot of connected data and want a database which is intuitive, fun, and fast, you should get to know Neo4j.

This article was authored by Brian Underwood with contributions from Michael Hunger.

About the Authors

Brian Underwood is a Software Engineer and lover of all things data. As a Developer Advocate for Neo4j and co-maintainer of the neo4j ruby gem, Brian regularly gives talks and writes on his blog about the power and simplicity of graph databases. Brian is also currently traveling the world with his wife and son. Follow Brian on Twitter or join him on LinkedIn.

 
