Introduction

Graph database management systems store data in a network of related entities. This article explains how to manage and query the network to obtain result sets that would be almost impossible to achieve by other means.

Graph Databases

Graph databases are ideal for storing related data, the sort of data that would require lots of JOIN statements if it were stored in conventional data tables. They are not constrained by a rigid schema, and their efficiency depends upon the length of the pathways that are searched rather than the overall size of the graph. Their structure is very simple: a network of nodes connected to each other by relationship objects, as shown in the diagram.

The node with the label Actor is connected to the node labelled Movie through the relationship ACTED_IN. The direction of the arrow defines the direction of the relationship; in this case, it is Tom Hanks acted in Cast Away. Nodes can have properties as well as labels. In this example, the Actor node has a single property, name, with the value Tom Hanks, and the Movie node has the property title set to the value Cast Away. Relationships can also have properties, but their use should be limited, as relationships serve as conduits for node traversal when searching rather than as data storage repositories.

Searching is carried out by using label-based indexes and following node and relationship pathways. Returning all the movies acted in by Tom Hanks would involve looking in the Actor index for the name Tom Hanks, in order to locate his node, and then following all the outgoing ACTED_IN relationships to find all of the Movie nodes. Relationships can be traversed in either direction, so finding the cast of Cast Away involves a similar process. First, locate the Cast Away node and then follow all incoming ACTED_IN relationships.
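To make the index-then-traverse idea concrete, here is a minimal in-memory sketch in Python. It is not how Neo4j stores anything; the Graph class, its method names and the sample data are assumptions made up for illustration. It only shows that one lookup finds the starting node and that the same relationship can then be followed in either direction.

```python
from collections import defaultdict


class Graph:
    """A toy property graph; an illustration only, not Neo4j's storage model."""

    def __init__(self):
        self.nodes = {}                     # node id -> {"label": str, "props": dict}
        self.outgoing = defaultdict(list)   # node id -> [(relationship type, target id)]
        self.incoming = defaultdict(list)   # node id -> [(relationship type, source id)]
        self.index = {}                     # (label, key, value) -> node id

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}
        for key, value in props.items():
            self.index[(label, key, value)] = node_id
        return node_id

    def relate(self, source, rel_type, target):
        # Store the relationship on both ends so it can be followed either way.
        self.outgoing[source].append((rel_type, target))
        self.incoming[target].append((rel_type, source))

    def find(self, label, key, value):
        # Label-based index lookup to locate the starting node.
        return self.index[(label, key, value)]

    def follow(self, node_id, rel_type, direction="out"):
        edges = self.outgoing if direction == "out" else self.incoming
        return [self.nodes[other]["props"] for t, other in edges[node_id] if t == rel_type]


graph = Graph()
tom = graph.add_node(1, "Actor", name="Tom Hanks")
helen = graph.add_node(2, "Actor", name="Helen Hunt")
cast_away = graph.add_node(3, "Movie", title="Cast Away")
apollo = graph.add_node(4, "Movie", title="Apollo 13")
graph.relate(tom, "ACTED_IN", cast_away)
graph.relate(helen, "ACTED_IN", cast_away)
graph.relate(tom, "ACTED_IN", apollo)

# Outgoing ACTED_IN relationships from the Tom Hanks node.
movies_of_tom = [m["title"] for m in
                 graph.follow(graph.find("Actor", "name", "Tom Hanks"), "ACTED_IN")]
# Incoming ACTED_IN relationships to the Cast Away node.
cast_of_cast_away = [a["name"] for a in
                     graph.follow(graph.find("Movie", "title", "Cast Away"), "ACTED_IN", "in")]
print(movies_of_tom)
print(cast_of_cast_away)
```

Neither search touches nodes that are not on the path, which is why the cost depends on the length of the pathway rather than the size of the graph.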

Getting Started with Neo4j

The graph database management system illustrated in this article is Neo4j; the community edition can be downloaded here. It’s open source, free and fully ACID compliant. Neo4j employs a REST service interface and provides an admin console hosted in a web browser. To get started, just run neo4j-community.exe and follow the prompts. Once the service has started, you can open the console by clicking on the link provided in the popup window. The console provides a series of simple steps to build some example graphs. The Movie graph is a good one to start with. When you have built it, you will see a stunning visual representation of the graph displayed in the browser.

Cypher

Cypher is the query language used by Neo4j. It employs a neat way of expressing nodes and their relationships. The relationship shown in the first diagram would be expressed as:

(:Actor {name:'Tom Hanks'})-[:ACTED_IN]->(:Movie {title:'Cast Away'})

There are many excellent examples given in the console app, and it’s instructive to run them all and observe the results. All the commands are very well explained, but I’d like to highlight the use of the MERGE command as it is very easy to trip up with it. The MERGE command will create new nodes and relationships if the whole of the given pattern does not match an existing pattern. So if Bob already exists but Alice does not, merging the whole pattern creates a second Bob. To avoid duplicating existing nodes,

MERGE (:User {name: "Bob"})-[:KNOWS]->(:User {name: "Alice"})

is best written as:

MERGE (bob:User {name: "Bob"})
MERGE (alice:User {name: "Alice"})
MERGE (bob)-[:KNOWS]->(alice)

The variables bob and alice, bound in the first two statements, identify those same nodes in the last statement, so the final MERGE only has to create the relationship if it is missing.
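The difference between the two forms is easy to see in a toy model. The Python sketch below is only a simulation of the match-everything-or-create-everything rule described above; ToyGraph and its methods are hypothetical names, and real MERGE has more to it than this.

```python
class ToyGraph:
    """A toy model of MERGE semantics; an assumption-laden sketch, not Neo4j."""

    def __init__(self):
        self.nodes = []   # list of (label, name)
        self.rels = []    # list of (source position, type, target position)

    def match_nodes(self, label, name):
        return [i for i, node in enumerate(self.nodes) if node == (label, name)]

    def create_node(self, label, name):
        self.nodes.append((label, name))
        return len(self.nodes) - 1

    def merge_node(self, label, name):
        # Single-node MERGE: reuse the node if it exists, otherwise create it.
        found = self.match_nodes(label, name)
        return found[0] if found else self.create_node(label, name)

    def merge_rel(self, source, rel_type, target):
        # MERGE between two already-bound nodes: only the relationship is at stake.
        if (source, rel_type, target) not in self.rels:
            self.rels.append((source, rel_type, target))

    def merge_pattern(self, src, rel_type, dst):
        # Whole-pattern MERGE: if the full pattern is not found, create all of it.
        for s in self.match_nodes(*src):
            for t in self.match_nodes(*dst):
                if (s, rel_type, t) in self.rels:
                    return
        s = self.create_node(*src)
        t = self.create_node(*dst)
        self.rels.append((s, rel_type, t))


# Bob already exists, Alice does not.
whole = ToyGraph()
whole.create_node("User", "Bob")
whole.merge_pattern(("User", "Bob"), "KNOWS", ("User", "Alice"))
bobs_after_pattern_merge = len(whole.match_nodes("User", "Bob"))

stepwise = ToyGraph()
stepwise.create_node("User", "Bob")
bob = stepwise.merge_node("User", "Bob")
alice = stepwise.merge_node("User", "Alice")
stepwise.merge_rel(bob, "KNOWS", alice)
bobs_after_stepwise_merge = len(stepwise.match_nodes("User", "Bob"))

print(bobs_after_pattern_merge, bobs_after_stepwise_merge)
```

The whole-pattern version ends up with two Bob nodes, while the stepwise version keeps the original one.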

Database Management

Creating Indexes

Indexes are used to find the starting node for a query. They are label-based.

CREATE INDEX ON :Person(name)

This creates an index based on the Person label and the name property. The index is updated automatically and Neo4j is smart enough to know which index to use, so there is no need to specify any particular index when searching. To remove an index, use:

DROP INDEX ON :Person(name)

Constraints

Constraints specify that a node with a given label and property should have a unique value on that property.

CREATE CONSTRAINT ON (person:Person) ASSERT person.name IS UNIQUE

This statement will create a new index on the Person.name property and enforce the constraint on it. Dropping an index that backs a constraint will throw an error. To drop a constrained index, just drop the constraint:

DROP CONSTRAINT ON (person:Person) ASSERT person.name IS UNIQUE

Schema

To view the schema, enter the following command:

:SCHEMA

You will get back a list of the graph’s indexes and constraints.

Backing up the Graph

There is no inbuilt backup facility in the community edition, but data can be backed up by stopping the service and using something like Windows Zip or 7-Zip to copy the database folder (default.graphdb) to a compressed folder.

Deleting the Graph

The cleanest way to start with an empty graph is to delete all the contents and sub-directories of the database folder. But you can, if you wish, delete all nodes and relationships using the command:

MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE r,n

This does not, however, remove any indexes that you may have created. They need to be removed using the DROP INDEX command.

Building the Graph by Importing Data

One of the best methods to load external data is to use the LOAD CSV command from the management console. Details of how to use this are here. Creating indexes before loading the data speeds things up considerably. A good way to ensure that the data is correctly formatted is to load the data into Excel and save it as a CSV file. Stray commas inside fields such as addresses are something to look out for, because an unquoted comma splits one field into two.
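The comma problem is worth seeing once. This short Python sketch uses the standard csv module to write a file in which a field containing a comma is quoted, and then compares a naive split on commas with a proper CSV parse. The names and addresses are made-up sample data.

```python
import csv
import io

# Hypothetical sample rows; the first address contains a comma.
rows = [
    {"name": "Alice", "address": "1 Main Street, Springfield"},
    {"name": "Bob", "address": "2 High Street"},
]

# The csv writer quotes any field that contains the delimiter.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "address"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# A naive split on commas breaks the address into two pieces.
naive_fields = csv_text.splitlines()[1].split(",")

# A CSV-aware reader honours the quotes and recovers the original rows.
parsed = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

print(len(naive_fields))
print(parsed[0]["address"])
```

If the file is produced by a tool that quotes such fields, as Excel does when saving as CSV, a CSV-aware loader should read it back intact.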

Some Graph Design Considerations

Anyone can design a graph database. All you need is a whiteboard, a felt-tipped pen and the ability to draw ellipses, arrows and square brackets, but there are a couple of things to bear in mind before starting on the artwork. Relationships are optimised for rapid traversal when searching across them. They are fixed-length objects containing pointers to other nodes and relationships, so a relationship’s position in storage can be calculated quickly by multiplying its index number by the object’s length. They are not optimised for examining their properties, as the properties are stored elsewhere and require a ‘lookup’ to locate them. So, ideally, searches should only examine properties when they have reached their destination. You don’t want to have to stop at every station, get out and read the timetable.
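The arithmetic behind fixed-length records can be sketched in a few lines of Python. The record layout below (five 4-byte integers) is purely hypothetical and is not Neo4j’s actual on-disk format; the point is only that a record’s offset is its index number multiplied by the record length, so no searching is needed to find it.

```python
import struct

# Hypothetical layout, for illustration only:
# start node id, end node id, type id, next relationship id, first property id
RECORD = struct.Struct("<iiiii")


def write_store(relationships):
    """Pack every relationship into one contiguous block of fixed-length records."""
    return b"".join(RECORD.pack(*rel) for rel in relationships)


def read_relationship(store, rel_id):
    """Jump straight to a record: offset = index number * record length."""
    offset = rel_id * RECORD.size
    return RECORD.unpack_from(store, offset)


store = write_store([
    (0, 1, 7, 1, -1),
    (0, 2, 7, 2, -1),
    (3, 2, 7, -1, 40),   # property id 40 points elsewhere; reading it is a second lookup
])

third = read_relationship(store, 2)
print(RECORD.size, third)
```

Following the pointers inside a record is cheap; following the property id out to a separate store is the extra ‘lookup’ mentioned above.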

Careful consideration needs to be given to the type of data that’s stored within relationships as that data cannot be linked to other data. Here’s an example from the Movie graph. The graph has the following relationship:

Creating an intermediate node between Person and Movie nodes would allow the domain to be expanded more easily, something like:

The number of relationships has increased substantially, but that’s OK as only a small part of the graph is being traversed when searching. With this arrangement, we are free to add as many relationships to the Role node as needed and we could get the cast of Cloud Atlas with a simple query:

MATCH (:Movie {title:"Cloud Atlas"})<-[:CAST_IN]-(role:Role)<-[:GOT_ROLE]-(person:Person)
RETURN person, role

Querying data stored within relationships can get a bit messy as you need to know, in advance, the type of each relationship. This query, using the Movie example graph, lists the people involved in Cloud Atlas along with the type of relationship that links them to the movie. It uses the pipe character | to select alternative matches.

MATCH (movie:Movie {title: "Cloud Atlas"})
OPTIONAL MATCH (person)-[r:ACTED_IN|:DIRECTED|:PRODUCED|:WROTE]->(movie)
RETURN person.name,type(r)

It’s a matter of judgement as to which properties are stored together within nodes. In the extreme case, all nodes could have just one property. The GraphGist project has an interesting collection of graph designs and models for use in a wide range of domains from bank fraud detection to Scotch whisky retailing.

Using Neo4jClient

Neo4jClient is an excellent C# .NET client for the Neo4j server. It can be installed as the Neo4jClient package on NuGet. The documentation is here. I don’t want to duplicate the examples given, but I’d like to mention a couple of things that I’ve found useful with reference to the Movie graph example. Let’s start by connecting to the database and finding Tom Hanks.

Connecting to the Database Service

    //make sure Neo4J service is running before opening the db
    var graphClient = new GraphClient(new Uri("http://localhost:7474/db/data"), 
                      "neo4j", "myPassword");
    graphClient.Connect();

Simple Searching

    // Find the Person named "Tom Hanks"...
    var tomHanks =
        graphClient.Cypher.Match("(person:Person)")
            .Where((Person person) => person.name == "Tom Hanks")
            .Return(person => person.As<Person>())
            .Results
            .Single();

As far as Neo4j is concerned, every entity is a JSON object, that is, a collection of key-value pairs. It’s up to you to tell it how to deserialise the object. In this case, we use person.As<Person>() to return a Person entity. If you want to inspect the key-value pairs, you can use person.As<Dictionary<string, string>>(). Person is defined as:

    public class Person
    {
        //all public fields need to be properties with getters and setters
        public int born { get; set; }

        public string name { get; set; }
    }

The best-match approach is used. So, if there is no match for the key born but there is one for name, just the name will be returned and born will be set to its default value. This leaves you free to expand the Person class at a later date. Each entity is also given an Id property, but I would advise against relying on it, as it is managed by the server, which can reuse Ids after deletions, so it is not a stable identifier. Neo4j follows the Java convention of using lower case for the first character of a property name, but that’s not sacrosanct and you are free to do your own thing.

Adding Indexes and Labels

    //build a couple of indexes
    graphClient.Cypher.Create("INDEX ON :Person(name)").ExecuteWithoutResults();
    graphClient.Cypher.Create("INDEX ON :Movie(title)").ExecuteWithoutResults();

    //Add a label 'Actor' to all the 103 actors
    //You can't use parameters in the Set method for setting labels
    graphClient.Cypher
        .Match("(person:Person)-[:ACTED_IN]->(m)")
        .Set("person :Actor") //to remove a label use .Remove("person :Actor")
        .ExecuteWithoutResults();

Updating

This example illustrates how parameters can be passed by using an anonymous class.

var updatedTom = graphClient.Cypher.Match("(person:Person)")
    .Where((Person person) => person.name == "Tom Hanks")
    .Set("person.born = {year}, person.lastName = {lastName}")
    .WithParams(new { year = 1066, lastName = "Hanks" })
    .Return(person => person.As<Person>())
    .Results
    .Single();

Searching

Graphs are excellent for finding relationships between entities that would be difficult to reveal with a conventional data table structure. Here, we are looking for actors that would be good candidates for working with Tom Hanks and who have not worked with him before. So we are looking for people who have been cast with people that have worked with Tom but they themselves have not appeared with Tom. The Cypher command for this query is given in the console application. Here is its equivalent using Neo4jClient.

string actorsName = "Tom Hanks";
List<string> cocoActors =
    graphClient.Cypher
        // the pattern is split across two concatenated strings to keep the line short
        .Match("(tom:Person {name:{nameParam}})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActor:Person), " +
               "(coActor)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(cocoActor:Person)")
        .WithParam("nameParam", actorsName)
        .Where("NOT (tom)-[:ACTED_IN]->(m2)")
        .ReturnDistinct(cocoActor => cocoActor.As<Person>().name)
        .Results.ToList();
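For readers who want to see the logic of the search without a database, here is a rough Python equivalent over a plain dictionary. The names and movies are invented, and recommend_co_co_actors is a hypothetical helper. Note one assumption: the sketch excludes anyone who has shared any movie with Tom, which follows the wording of the description, whereas the WHERE clause above filters on the movie m2.

```python
def recommend_co_co_actors(acted_in, name):
    """People who worked with the co-actors of `name` but never with `name`."""
    own_movies = acted_in[name]
    # Everyone who shares at least one movie with the starting person.
    co_actors = {person for person, movies in acted_in.items()
                 if person != name and movies & own_movies}
    candidates = set()
    for co_actor in co_actors:
        # Only look at the co-actor's other movies.
        for movie in acted_in[co_actor] - own_movies:
            candidates |= {person for person, movies in acted_in.items()
                           if movie in movies}
    # Remove the starting person and everyone already known to them.
    return sorted(candidates - co_actors - {name})


# Invented data: person -> set of movies they acted in.
acted_in = {
    "Tom": {"M1", "M2"},
    "Ann": {"M1", "M3"},
    "Ben": {"M2", "M4"},
    "Cal": {"M3"},
    "Dee": {"M4", "M1"},
    "Eve": {"M5"},
}

recommended = recommend_co_co_actors(acted_in, "Tom")
print(recommended)
```

Here Cal is reached through Ann, while Dee is dropped because she has already shared a movie with Tom, and Eve is never reached at all.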

Building the Graph

There is a serious omission in the sample Movie database – Daniel Craig is not in it. So let’s correct that.

    var danielCraig = new Person { born = 1968, name = "Daniel Craig" };
    var skyfall = new Movie { released = 2012, title = "Skyfall" };
    var actedIn = new ActedIn { roles = new List<string> { "James Bond" } };
    // create a new node with Person and Actor labels
    // if there is not a match with the name and born parameters
    graphClient.Cypher.Merge("(person:Person:Actor { name: {name}, born: {born} })")
        .OnCreate()
        // set the person node equal to the danielCraig parameter
        .Set("person = {danielCraig}")
        // create an anonymous type parameter object with name,
        // born and danielCraig properties
        // It is these properties that are referenced
        // in the previous Merge and Set clauses
        .WithParams(new { danielCraig.name, danielCraig.born, danielCraig })
        .Merge("(movie:Movie { title: {title}, released: {released} })")
        .OnCreate()
        .Set("movie = {skyfall}")
        .WithParams(new { skyfall.title, skyfall.released, skyfall })
        // create a relationship linking the person and movie
        // nodes, outgoing from person,
        // only if there is no existing relationship of type ACTED_IN between them
        .Merge("(person)-[rs:ACTED_IN]->(movie)")
        .OnCreate()
        .Set("rs = {actedIn}")
        .WithParam("actedIn", actedIn)
        .ExecuteWithoutResults();

Using Transaction Scope

Placing queries within a transaction scope enables several of them to be committed to the database as if they were a single transaction. Either they all succeed or they all fail, so you don’t end up with a half-built graph if something goes wrong. There is a substantial improvement in efficiency over committing each query individually, so it’s well worth doing. Here’s an example that adds multiple labels to the graph.

//need to have a reference to System.Transactions
using (var scope = new TransactionScope())
{
    graphClient.Cypher.Match("(person:Person)-[:ACTED_IN]->(m)")
        .Set("person :Actor")
        .ExecuteWithoutResults();
    graphClient.Cypher.Match("(person:Person)-[:DIRECTED]->(m)")
        .Set("person :Director")
        .ExecuteWithoutResults();

    scope.Complete();
}
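The all-or-nothing behaviour can be illustrated with a small Python sketch. This is not how TransactionScope or Neo4j implement transactions; run_in_transaction and add_label are hypothetical helpers that work on a copy of the data and only publish the copy if every step succeeds.

```python
import copy


def run_in_transaction(graph, operations):
    """Apply every operation to a working copy; publish it only if all succeed."""
    working_copy = copy.deepcopy(graph)
    for operation in operations:
        operation(working_copy)   # an exception here abandons the working copy
    graph.clear()
    graph.update(working_copy)


def add_label(name, label):
    def operation(graph):
        graph[name].add(label)    # raises KeyError if the person does not exist
    return operation


# Toy data: person name -> set of labels.
labels = {"Tom Hanks": {"Person"}, "Ron Howard": {"Person"}}

# Both steps succeed, so both labels are committed.
run_in_transaction(labels, [add_label("Tom Hanks", "Actor"),
                            add_label("Ron Howard", "Director")])

# The second step fails, so the first step's change is discarded as well.
try:
    run_in_transaction(labels, [add_label("Tom Hanks", "Producer"),
                                add_label("Nobody", "Actor")])
except KeyError:
    pass

print(sorted(labels["Tom Hanks"]))
```

After the failed batch, Tom Hanks still has only the Person and Actor labels, which is the half-built graph being avoided.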

Asynchronous Transactions

To run queries asynchronously, end the query with ResultsAsync or ExecuteWithoutResultsAsync(). You can do something like this:

    public async Task<IEnumerable<Crew>> GetCrewOfMovieAsync(string movieTitle)
    {
        var movieCrew = await this.graphClient.Cypher
            .Match("(movie:Movie {title: {titleParam}})")
            .OptionalMatch("(person:Person)-[r:DIRECTED|:PRODUCED|:WROTE]->(movie:Movie)")
            .WithParam("titleParam", movieTitle)
            .Return((person, r) => new Crew
            {
                Name = person.As<Person>().name,
                Role = r.Type()
            })
            .OrderBy("person.name")
            .ResultsAsync;

        return movieCrew;
    }

It’s as well to keep in mind when running asynchronous queries that operations on relationships can write-lock the relationship and both the nodes that are connected to it. Care is needed to avoid the situation where transaction A is waiting for transaction B to unlock while B is waiting for A to unlock, and deadlock ensues. Personally, I’ve not found that running queries asynchronously is advantageous. They often take much longer to run than the synchronous versions, but this may not be the case with leviathan-sized graphs.
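One general way to avoid the A-waits-for-B, B-waits-for-A trap is to always take locks in the same order. The Python sketch below shows that idea with ordinary thread locks. It is a generic technique and an assumption on my part, not a description of how Neo4j manages its own locks; Node and connect are hypothetical names.

```python
import threading


class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.lock = threading.Lock()
        self.degree = 0   # number of relationships attached to this node


def connect(a, b):
    """Lock both end nodes in id order, then record the new relationship.

    Assumes a and b are distinct nodes.
    """
    first, second = sorted((a, b), key=lambda node: node.node_id)
    with first.lock:
        with second.lock:
            a.degree += 1
            b.degree += 1


alice, bob = Node(1), Node(2)

# Half the workers connect alice->bob and half bob->alice. Without the
# consistent ordering, two workers could each hold one lock and wait forever.
workers = [threading.Thread(target=connect, args=pair)
           for pair in [(alice, bob), (bob, alice)] * 50]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(alice.degree, bob.degree)
```

Because every worker asks for the lower-numbered node first, no worker can hold one lock while waiting on a worker that holds the other.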

Debugging

There was an in-memory graph database utility that was useful for running unit tests with Neo4jClient, but Neo4j’s latest upgrade appears to have broken it. An alternative approach is to run integration tests against a test graph, but this is rarely necessary. Neo4jClient’s own methods have already been extensively tested, so it is usually sufficient to check that Neo4jClient receives the expected parameters for any given method call. Here’s a snippet to illustrate the sort of approach I take.

    using Microsoft.VisualStudio.TestTools.UnitTesting;

    using Neo4jClient;
    using Neo4jClient.Cypher;
    using Neo4jClient.SchemaManager;

    using NSubstitute;

    [TestClass]
    public class IndexHelperUnitTests
    {
        #region Constants and Fields

        private static IGraphClient graphClientSub;

        #endregion

        #region Public Methods and Operators
        [TestInitialize]
        public void TestInitialise()
        {
            graphClientSub = Substitute.For<IGraphClient>();
            var cypher = Substitute.For<ICypherFluentQuery>();
            graphClientSub.Cypher.Returns(cypher);
        }     

        [TestMethod]
        public void DropIndexCallsCypherDropWithCorrectParams()
        {
            var indexMetadataFactory = Substitute.For<IIndexMetadataFactory>();
            var schemaReader = Substitute.For<ISchemaReader>();
            var indexHelper = new IndexHelper(schemaReader, indexMetadataFactory);
            string expected = "INDEX ON :Person(name)";
            indexHelper.DropIndex(graphClientSub, "Person", "name");
            //check graphClient.Cypher.Drop() was called with the expected string
            graphClientSub.Cypher.Received().Drop(expected);
        }

        #endregion
    }

Profiling Queries

The recommended technique for building queries with Neo4jClient is to test the query as a Cypher command using the console and then convert the command to its equivalent in Neo4jClient. If you preface the Cypher statement with the word PROFILE, you will get a breakdown of the number of database hits that took place when the statement was executed. This is the key metric and is nearly always higher than expected. I’d take the execution times given with a pinch of salt, as Neo4j is smart and goes in for preemptive caching. The following example illustrates this: adding the index cuts the db hits from 267 to 2, while the time for the identical third query drops again purely because of caching.

Query: PROFILE MATCH (tom:Person {name: "Tom Hanks"}) RETURN tom
Response: COST. 267 total db hits in 679 ms.
Query: CREATE INDEX ON :Person(name)
Response: Added 1 index, statement executed in 159 ms.
Query: PROFILE MATCH (tom:Person {name: "Tom Hanks"}) RETURN tom
Response: COST. 2 total db hits in 166 ms.
Query: PROFILE MATCH (tom:Person {name: "Tom Hanks"}) RETURN tom
Response: COST. 2 total db hits in 14 ms.
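Why an index cuts the hit count so sharply can be shown with a toy store in Python. The way hits are counted here (one per record read, one per index lookup) is my own simplification and will not match Neo4j’s accounting; ToyStore and its sample names are hypothetical.

```python
class ToyStore:
    """Counts record reads to mimic the idea of db hits; not Neo4j's real metric."""

    def __init__(self, names):
        self.records = [{"name": name} for name in names]
        self.name_index = None
        self.db_hits = 0

    def create_index(self):
        self.name_index = {}
        for position, record in enumerate(self.records):
            self.name_index.setdefault(record["name"], []).append(position)

    def find_by_name(self, name):
        self.db_hits = 0
        if self.name_index is None:
            # No index: every record has to be read and compared.
            matches = []
            for record in self.records:
                self.db_hits += 1
                if record["name"] == name:
                    matches.append(record)
            return matches
        # With an index: one lookup, then one read per matching record.
        self.db_hits += 1
        positions = self.name_index.get(name, [])
        self.db_hits += len(positions)
        return [self.records[position] for position in positions]


store = ToyStore([f"Person {i}" for i in range(132)] + ["Tom Hanks"])

store.find_by_name("Tom Hanks")
hits_without_index = store.db_hits

store.create_index()
store.find_by_name("Tom Hanks")
hits_with_index = store.db_hits

print(hits_without_index, hits_with_index)
```

The unindexed search costs one hit per stored record however many match, whereas the indexed search costs the same two hits no matter how large the store grows.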

Using the Visual Display in the Console Application

The visual display is very useful to confirm that the graph is being built as expected and that queries return the correct data. To display the whole graph, enter the command:

MATCH (n) RETURN n

You can untangle overlapping relationships by dragging the nodes into less crowded regions of the display and you can display the contents of nodes and relationships by running the cursor over them. It’s a real corker of an application that never fails to blow my socks off.

Demonstration Project

The demonstration project contains a selection of database management examples and queries that, hopefully, are a useful basis for further study. The project builds its own graph, so it’s best to start with an empty database directory to avoid trashing valuable data.

Conclusion

If you are used to rigid schema structures and type-safe entities, Neo4j’s API will be a bit of an eyebrow raiser for you, but stick with it and you may be able to see how graph databases can be used to your advantage. They help Google and they help Facebook and they may help you.

Acknowledgements

I’m grateful to the authors of Neo4j and Neo4jClient for making such great applications open source. My special thanks go to Tatham Oddie. The support he gives to users of Neo4jClient is exemplary.