Introduction
Graph database management systems store data in a network of related entities. This article explains how to manage and query the network to obtain result sets that would be almost impossible to achieve by other means.
Graph Databases
Graph databases are ideal for storing related data, the sort of data that would require lots of JOIN statements if it were stored in conventional data tables. They are not constrained by a rigid schema, and their efficiency depends upon the length of the pathways that are searched rather than the overall size of the graph. Their structure is very simple: a network of nodes connected to each other by way of relationship objects, as shown in the diagram.

The node with the label Actor is connected to the node labelled Movie through the relationship ACTED_IN. The direction of the arrow defines the direction of the relationship; in this case, Tom Hanks acted in Cast Away. Nodes can have properties as well as labels. In this example, the Actor node has a single property, Name, with the value Tom Hanks, and the Movie node has the property Title set to the value Cast Away. Relationships can also have properties, but their use should be limited because relationships serve as conduits for node traversal when searching rather than as data storage repositories. Searching is carried out by using label-based indexes and following node and relationship pathways. Returning all the movies acted in by Tom Hanks would involve looking in the Actor index for the name Tom Hanks, in order to locate his node, and then following all the outgoing ACTED_IN relationships to find all of the Movie nodes. Relationships can be traversed in either direction, so finding the cast of Cast Away involves a similar process: first locate the Cast Away node, and then follow all incoming ACTED_IN relationships.
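The lookup-then-follow process described above can be sketched in a few lines of C#. This is a toy in-memory model, not a picture of how Neo4j stores its data: the TinyGraph class, its fields and its method names are all hypothetical and exist only to show an index lookup followed by traversal in either direction.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy in-memory graph; every name here is hypothetical and for illustration only.
public class TinyGraph
{
    private sealed class Node
    {
        public string Name = "";
        // Each relationship is recorded at both ends so it can be followed either way.
        public readonly List<(string Type, int Other)> Outgoing = new List<(string Type, int Other)>();
        public readonly List<(string Type, int Other)> Incoming = new List<(string Type, int Other)>();
    }

    private readonly List<Node> nodes = new List<Node>();

    // Label-based index: (label, name) -> node id, used only to find the starting node.
    private readonly Dictionary<(string Label, string Name), int> index =
        new Dictionary<(string Label, string Name), int>();

    public int AddNode(string label, string name)
    {
        nodes.Add(new Node { Name = name });
        int id = nodes.Count - 1;
        index[(label, name)] = id;
        return id;
    }

    public void Relate(int from, string type, int to)
    {
        nodes[from].Outgoing.Add((type, to));
        nodes[to].Incoming.Add((type, from));
    }

    public int Find(string label, string name) => index[(label, name)];

    public List<string> FollowOutgoing(int start, string type) =>
        nodes[start].Outgoing.Where(r => r.Type == type).Select(r => nodes[r.Other].Name).ToList();

    public List<string> FollowIncoming(int start, string type) =>
        nodes[start].Incoming.Where(r => r.Type == type).Select(r => nodes[r.Other].Name).ToList();
}

public static class TraversalDemo
{
    public static void Main()
    {
        var graph = new TinyGraph();
        int tom = graph.AddNode("Actor", "Tom Hanks");
        int castAway = graph.AddNode("Movie", "Cast Away");
        graph.Relate(tom, "ACTED_IN", castAway);

        // Movies for an actor: index lookup, then outgoing ACTED_IN relationships.
        Console.WriteLine(string.Join(", ",
            graph.FollowOutgoing(graph.Find("Actor", "Tom Hanks"), "ACTED_IN")));

        // Cast of a movie: index lookup, then incoming ACTED_IN relationships.
        Console.WriteLine(string.Join(", ",
            graph.FollowIncoming(graph.Find("Movie", "Cast Away"), "ACTED_IN")));
    }
}
```

Note that neither traversal scans the whole graph. After the index lookup, only the relationships attached to the starting node are examined, which is why the cost depends on the pathway rather than on the size of the graph.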
Getting Started with Neo4j
The graph database management system illustrated in this article is Neo4j; the community edition can be downloaded here. It’s open source, free and fully ACID compliant. Neo4j employs a REST service interface and provides an admin console hosted in a web browser. To get started, just run neo4j-community.exe and follow the prompts. Once the service has started, you can access the console by clicking on the link provided in the popup window. The console provides a series of simple steps to build some example graphs. The Movie graph is a good one to start with. When you have built it, you will see a stunning visual representation of the graph displayed in the browser.
Cypher
Cypher is the query language used by Neo4j. It employs a neat way of expressing nodes and their relationships. The relationship shown in the first diagram would be expressed as
(:Actor {name:'Tom Hanks'})-[:ACTED_IN]->(:Movie {title:'Cast Away'})
There are many excellent examples given in the console app, and it’s instructive to run them all and observe the results. All the commands are very well explained, but I’d like to highlight the use of the MERGE command as it is very easy to trip up with it. The MERGE command tries to match the whole of the given pattern and, if the whole pattern cannot be matched, creates all of it afresh, including any nodes that already exist. So, to avoid duplicating existing nodes,
MERGE (:User {name: "Bob"})-[:KNOWS]->(:User {name: "Alice"})
is best written as:
MERGE (bob:User {name: "Bob"})
MERGE (alice:User {name: "Alice"})
MERGE (bob)-[:KNOWS]->(alice)
The identifiers bob and alice, bound in the first two statements, refer to those same nodes in the last statement.
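The difference between merging a whole pattern and merging it piece by piece can be illustrated with a toy model. The sketch below does not talk to Neo4j and is not how MERGE is implemented; the MergeModel class and its methods are hypothetical, and nodes are reduced to bare names so that the duplication is easy to see.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model of MERGE semantics; all names are hypothetical.
public class MergeModel
{
    // A node's id is its position in this list.
    public List<string> Nodes { get; } = new List<string>();
    public List<(int From, string Type, int To)> Rels { get; } =
        new List<(int From, string Type, int To)>();

    // Models merging a complete pattern with no bound identifiers:
    // if the whole pattern is not found, all of it is created.
    public void MergeWholePattern(string from, string type, string to)
    {
        bool exists = Rels.Any(r => Nodes[r.From] == from && r.Type == type && Nodes[r.To] == to);
        if (exists) return;
        Nodes.Add(from);
        Nodes.Add(to);
        Rels.Add((Nodes.Count - 2, type, Nodes.Count - 1));
    }

    // Models merging a single node: reuse it if present, otherwise create it.
    public int MergeNode(string name)
    {
        int id = Nodes.IndexOf(name);
        if (id >= 0) return id;
        Nodes.Add(name);
        return Nodes.Count - 1;
    }

    // Models merging a relationship between two already-bound nodes.
    public void MergeRelationship(int from, string type, int to)
    {
        if (!Rels.Contains((from, type, to))) Rels.Add((from, type, to));
    }
}

public static class MergeDemo
{
    public static void Main()
    {
        var whole = new MergeModel();
        whole.MergeNode("Bob");                             // Bob already exists
        whole.MergeWholePattern("Bob", "KNOWS", "Alice");   // creates a second Bob
        Console.WriteLine(whole.Nodes.Count(n => n == "Bob"));

        var stepwise = new MergeModel();
        stepwise.MergeNode("Bob");                          // Bob already exists
        int bob = stepwise.MergeNode("Bob");                // reuses the existing Bob
        int alice = stepwise.MergeNode("Alice");
        stepwise.MergeRelationship(bob, "KNOWS", alice);
        Console.WriteLine(stepwise.Nodes.Count(n => n == "Bob"));
    }
}
```

In the first run the existing Bob is ignored because the full pattern was not found, so the model ends up with two nodes named Bob. In the second run each part is matched on its own and the existing node is reused.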
Database Management
Creating Indexes
Indexes are used to find the starting node for a query. They are label-based.
CREATE INDEX ON :Person(name)
This creates an index based on the Person label and the name property. The index is updated automatically, and Neo4j is smart enough to know which index to use, so there is no need to specify any particular index when searching. To remove an index, use
DROP INDEX ON :Person(name)
Constraints
Constraints specify that every node with a given label must have a unique value for a given property.
CREATE CONSTRAINT ON (person:Person) ASSERT person.name IS UNIQUE
This statement will create a new index based on the Person.name property and enforce the constraint on it. Dropping an index that has a constraint will throw an error. To drop a constrained index, just drop the constraint:
DROP CONSTRAINT ON (person:Person) ASSERT person.name IS UNIQUE
Schema
To view the schema, enter the following command:
:SCHEMA
You will get back a list of the graph’s indexes and constraints.
Backing up the Graph
There is no inbuilt backup facility in the community edition, but data can be backed up by stopping the service and using something like Windows Zip or 7-Zip to copy the database folder (default.graphdb) to a compressed folder.
Deleting the Graph
The cleanest way to start with an empty graph is to delete all the contents and sub-directories of the database folder. But you can, if you wish, delete all nodes and relationships using the command:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE r,n
This does not, however, remove any indexes that you may have created. They need to be removed using the DROP INDEX command.
Building the Graph by Importing Data
One of the best methods to load external data is to use the LOAD CSV command from the management console. Details of how to use this are here. Creating indexes before loading the data speeds things up considerably. A good way to ensure that the data is correctly formatted is to load the data into Excel and save it as a CSV file. Aberrant commas inside addresses are something to look out for.
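If the CSV file is generated by code rather than saved from Excel, the comma problem can be avoided by quoting fields in the usual CSV manner. The following is a minimal sketch of that convention; the CsvSketch class and its method names are made up for this example, and it is worth checking a sample of the output against your own loader before relying on it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal CSV writer sketch; the class and method names are hypothetical.
public static class CsvSketch
{
    // Wrap a field in double quotes if it contains a comma, a quote or a line break,
    // and double any quotes inside it.
    public static string Escape(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    public static string ToLine(IEnumerable<string> fields) =>
        string.Join(",", fields.Select(Escape));

    public static void Main()
    {
        // The address contains a comma, so it is written as a single quoted field.
        Console.WriteLine(ToLine(new[] { "Bob", "1 High Street, Anytown" }));
    }
}
```

With the address quoted, a reader that follows the same convention sees two columns rather than three.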
Some Graph Design Considerations
Anyone can design a graph database. All you need is a white board, a felt-tipped pen and the ability to draw ellipses, arrows and square brackets, but there are a couple of things to bear in mind before starting on the artwork. Relationships are optimised for rapid traversal when searching across them. They are fixed-length objects containing pointers to other nodes and relationships, and a record’s index number is multiplied by the record length in order to calculate its position quickly. They are not optimised for examining their properties, as the properties are stored elsewhere and require a ‘lookup’ to locate them. So, ideally, searches should only examine properties when they have reached their destination. You don’t want to have to stop at every station, get out and read the time-table.
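The arithmetic behind fixed-length records is simple enough to show directly. In this sketch the record length is an assumed figure chosen for illustration, and RecordStoreSketch is a hypothetical name; the real length depends on the store format in use.

```csharp
using System;

// Sketch of fixed-length record addressing; names and sizes are hypothetical.
public static class RecordStoreSketch
{
    // Assumed record length in bytes, for illustration only.
    public const int RecordLength = 34;

    // No searching is needed: the position is the index number times the record length.
    public static long OffsetOf(long recordId) => recordId * RecordLength;

    public static void Main()
    {
        Console.WriteLine(OffsetOf(0));
        Console.WriteLine(OffsetOf(1000));
    }
}
```

Because the position is computed rather than searched for, following a pointer from one record to the next costs the same however large the store becomes.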
Careful consideration needs to be given to the type of data that’s stored within relationships as that data cannot be linked to other data. Here’s an example from the Movie graph. The graph has the following relationship:
Creating an intermediate node between Person and Movie nodes would allow the domain to be expanded more easily, something like:
The number of relationships has increased substantially but that’s OK, as only a small part of the graph is being traversed when searching. With this arrangement, we are free to add as many relationships to the Role node as needed, and we could get the cast of Cloud Atlas with a simple query:
MATCH (:Movie {title:"Cloud Atlas"})<-[:CAST_IN]-(role:Role)<-[:GOT_ROLE]-(person:Person)
RETURN person,role
Querying data stored within relationships can get a bit messy as you need to know, in advance, the Type of each relationship. This query, using the Movie example graph, lists the people involved in Cloud Atlas along with their role. It uses the pipe character | to select alternative matches.
MATCH (movie:Movie {title: "Cloud Atlas"})
OPTIONAL MATCH (person)-[r:ACTED_IN|:DIRECTED|:PRODUCED|:WROTE]->(movie)
RETURN person.name,type(r)
It’s a matter of judgement as to which properties are stored together within nodes. In the extreme case, all nodes could have just one property. The Graphgist Project has an interesting collection of graph designs and models for use in a wide range of domains from bank fraud detection to Scotch Whisky retailing.
Using Neo4jClient
Neo4jClient is an excellent C# .NET client for the Neo4j server. It can be downloaded as the Neo4jClient package on NuGet. The documentation is here. I don’t want to duplicate the examples given, but I’d like to mention a couple of things that I’ve found useful with reference to the Movie graph example. Let’s start by connecting to the database and finding Tom Hanks.
Connecting to the Database Service
var graphClient = new GraphClient(new Uri("http://localhost:7474/db/data"),
"neo4j", "myPassword");
graphClient.Connect();
Simple Searching
var tomHanks =
graphClient.Cypher.Match("(person:Person)")
.Where((Person person) => person.name == "Tom Hanks")
.Return(person => person.As<Person>())
.Results
.Single();
As far as Neo4j is concerned, every entity is a JSON object, that is, a collection of key-value pairs. It’s up to you to tell it how to deserialise the object. In this case, we use person.As<Person>() to return a Person entity. If you want to inspect the key-value pairs, you can use person.As<Dictionary<string, string>>(). Person is defined as:
public class Person
{
public int born { get; set; }
public string name { get; set; }
}
The best-match approach is used. So, if there is no match for the key born but there is one for name, just the name will be returned and born will be set to its default value. This leaves you free to expand the Person class at a later date. Each entity is also given an Id property, but I would advise against using it as it’s reserved for the server, which manages its value internally and may change it. Neo4j follows the Java convention of using lower case for the first character of a property name, but that’s not sacrosanct and you are free to do your own thing.
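The best-match behaviour can be tried out without a database. The sketch below uses System.Text.Json purely as a stand-in for the client’s own serialiser, so it shows the principle rather than Neo4jClient’s actual code path; BestMatchSketch is a hypothetical name and the Person class simply mirrors the one above.

```csharp
using System;
using System.Text.Json;

public class Person
{
    public int born { get; set; }
    public string name { get; set; } = "";
}

// Stand-in demonstration of best-match deserialisation; the class name is hypothetical.
public static class BestMatchSketch
{
    public static Person Parse(string json) =>
        JsonSerializer.Deserialize<Person>(json) ?? new Person();

    public static void Main()
    {
        // No "born" key is present, and "lastName" has no matching property.
        Person tom = Parse("{\"name\":\"Tom Hanks\",\"lastName\":\"Hanks\"}");

        Console.WriteLine(tom.name);
        Console.WriteLine(tom.born);   // left at the default value for int
    }
}
```

Keys with no matching property are skipped and properties with no matching key keep their defaults, which is what allows the class and the stored data to evolve independently.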
Adding Indexes and Labels
graphClient.Cypher.Create("INDEX ON :Person(name)").ExecuteWithoutResults();
graphClient.Cypher.Create("INDEX ON :Movie(title)").ExecuteWithoutResults();
graphClient.Cypher
.Match("(person:Person)-[:ACTED_IN]->(m)")
.Set("person :Actor")
.ExecuteWithoutResults();
Updating
This example illustrates how parameters can be passed by using an anonymous class.
var updatedTom = graphClient.Cypher.Match("(person:Person)")
.Where((Person person) => person.name == "Tom Hanks")
.Set("person.born = {year},person.lastName={lastName}")
.WithParams(new { year = 1066, lastName = "Hanks" })
.Return(person => person.As<Person>())
.Results
.Single();
Searching
Graphs are excellent for finding relationships between entities that would be difficult to reveal with a conventional data table structure. Here, we are looking for actors that would be good candidates for working with Tom Hanks and who have not worked with him before. So we are looking for people who have been cast with people that have worked with Tom but they themselves have not appeared with Tom. The Cypher command for this query is given in the console application. Here is its equivalent using Neo4jClient.
string actorsName = "Tom Hanks";
List<string> cocoActors =
graphClient.Cypher
.Match("(tom:Person {name:{nameParam}})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActor:Person), " +
"(coActor)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(cocoActor:Person)")
.WithParam("nameParam", actorsName)
.Where("NOT (tom)-[:ACTED_IN]->(m2)")
.ReturnDistinct(cocoActor => cocoActor.As<Person>().name)
.Results.ToList();
Building the Graph
There is a serious omission in the sample Movie database: Daniel Craig is not in it. So let’s correct that.
var danielCraig = new Person { born = 1968, name = "Daniel Craig" };
var skyfall = new Movie { released = 2012, title = "Skyfall" };
var actedIn = new ActedIn { roles = new List<string> { "James Bond" } };
graphClient.Cypher.Merge("(person:Person:Actor { name: {name}, born:{born} })")
.OnCreate()
.Set("person = {danielCraig}")
.WithParams(new { danielCraig.name, danielCraig.born, danielCraig })
.Merge("(movie:Movie { title: {title}, released:{released} })")
.OnCreate()
.Set("movie = {skyfall}")
.WithParams(new { skyfall.title, skyfall.released, skyfall })
.Merge("(person)-[rs:ACTED_IN ]->(movie)")
.OnCreate()
.Set("rs = {actedIn}")
.WithParam("actedIn", actedIn )
.ExecuteWithoutResults();
Using Transaction Scope
Placing transactions within a transaction scope enables multiple transactions to be committed to the database as if they were a single transaction. Either they all succeed or they all fail so you don’t end up with a half-built graph if something goes wrong. There is a substantial improvement in efficiency over committing transactions individually so it’s well worth doing. Here’s an example that adds multiple labels to the graph.
using (var scope = new TransactionScope())
{
graphClient.Cypher.Match("(person:Person)-[:ACTED_IN]->(m)")
.Set("person :Actor")
.ExecuteWithoutResults();
graphClient.Cypher.Match("(person:Person)-[:DIRECTED]->(m)")
.Set("person :Director")
.ExecuteWithoutResults();
scope.Complete();
}
Asynchronous Transactions
To run async transactions, end the query with ResultsAsync or ExecuteWithoutResultsAsync(). You can do something like this:
public async Task<IEnumerable<Crew>> GetCrewOfMovieAsync(string movieTitle)
{
var movieCrew = await this.graphClient.Cypher.Match
("(movie:Movie {title: {titleParam}})")
.OptionalMatch("(person:Person)-[r:DIRECTED|:PRODUCED|:WROTE]->(movie:Movie)")
.WithParam("titleParam", movieTitle)
.Return((person, r) => new Crew
{
Name = person.As<Person>().name,
Role = r.Type()
})
.OrderBy("person.name")
.ResultsAsync;
return movieCrew;
}
It’s as well to keep in mind when running asynchronous transactions that operations on relationships can write-lock the relationship and both the nodes that are connected to it. So care is needed to avoid the situation where transaction A is waiting for transaction B to unlock, B is waiting for A to unlock, and deadlock ensues. Personally, I’ve not found that running queries asynchronously is advantageous. They often take much longer to run than the synchronous versions, but this may not be the case with leviathan-sized graphs.
Debugging
There was an in-memory graph database utility that was useful for running unit tests with Neo4jClient, but Neo4j’s latest upgrade appears to have broken it. An alternative approach is to run integration tests against a test graph, but this is rarely necessary. It is sufficient to check that Neo4jClient is receiving the expected parameters for any given method, as the method itself has already been extensively tested. Here’s a snippet to illustrate the sort of approach I take.
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Neo4jClient;
using Neo4jClient.Cypher;
using Neo4jClient.SchemaManager;
using NSubstitute;
[TestClass]
public class IndexHelperUnitTests
{
#region Constants and Fields
private static IGraphClient graphClientSub;
#endregion
#region Public Methods and Operators
[TestInitialize]
public void TestInitialise()
{
graphClientSub = Substitute.For<IGraphClient>();
var cypher = Substitute.For<ICypherFluentQuery>();
graphClientSub.Cypher.Returns(cypher);
}
[TestMethod]
public void DropIndexCallsCypherDropWithCorrectParams()
{
var indexMetadataFactory = Substitute.For<IIndexMetadataFactory>();
var schemaReader = Substitute.For<ISchemaReader>();
var indexHelper = new IndexHelper(schemaReader, indexMetadataFactory);
string expected = "INDEX ON :Person(name)";
indexHelper.DropIndex(graphClientSub, "Person", "name");
graphClientSub.Cypher.Received().Drop(expected);
}
#endregion
}
Profiling Queries
The recommended technique for building queries with Neo4jClient is to test the query as a Cypher command using the console and then convert the command to the equivalent in Neo4jClient. If you preface the Cypher statement with the word PROFILE, you will get a breakdown of the number of database hits that took place when the statement was executed. This is the key metric and is nearly always higher than expected. I’d take the execution times given with a pinch of salt. Neo4j is smart and goes in for preemptive caching. The following example illustrates this.
Query: PROFILE MATCH (tom:Person {name: "Tom Hanks"}) RETURN tom
Response: COST. 267 total db hits in 679 ms.
Query: CREATE INDEX ON :Person(name)
Response: Added 1 index, statement executed in 159 ms.
Query: PROFILE MATCH (tom:Person {name: "Tom Hanks"}) RETURN tom
Response: COST. 2 total db hits in 166 ms.
Query: PROFILE MATCH (tom:Person {name: "Tom Hanks"}) RETURN tom
Response: COST. 2 total db hits in 14 ms.
Using the Visual Display in the Console Application
The visual display is very useful to confirm that the graph is being built as expected and that queries return the correct data. To display the whole graph, enter the command MATCH (n) RETURN n
You can untangle overlapping relationships by dragging the nodes into less crowded regions of the display, and you can display the contents of nodes and relationships by running the cursor over them. It’s a real corker of an application that never fails to blow my socks off.
Demonstration Project
The demonstration project contains a selection of database management examples and queries that, hopefully, are a useful basis for further study. The project builds its own graph, so it’s best to start with an empty database directory to avoid trashing valuable data.
Conclusion
If you are used to rigid schema structures and type-safe entities, Neo4j’s API will be a bit of an eyebrow raiser for you, but stick with it and you may be able to see how graph databases can be used to your advantage. They help Google and they help Facebook and they may help you.
Acknowledgements
I’m grateful to the authors of Neo4j and Neo4jClient for making such great applications open source. My special thanks go to Tatham Oddie. The support he gives to users of Neo4jClient is exemplary.