Posted 6 Jan 2016
Licenced CPOL

Introduction to Graph Databases using Neo4j and its .NET Client

George Swan, 6 Jan 2016
An introduction to Graph Databases

Introduction

Graph database management systems store data as a network of related entities. This article explains how to manage and query that network to obtain result sets that would be very difficult to obtain by other means.

Graph Databases

Graph databases are ideal for storing related data, the sort of data that would require lots of JOIN statements if it were stored in conventional relational tables. They are not constrained by a rigid schema, and their efficiency depends upon the length of the pathways that are searched rather than the overall size of the graph. Their structure is very simple: a network of nodes connected to each other by relationships, as shown in the diagram.

The node with the label ‘Actor’ is connected to the node labelled ‘Movie’ through the relationship ‘ACTED_IN’. The direction of the arrow defines the direction of the relationship; in this case, Tom Hanks acted in Cast Away. Nodes can have properties as well as labels. In this example, the Actor node has a single property ‘name’ with the value ‘Tom Hanks’ and the Movie node has the property ‘title’ set to the value ‘Cast Away’. Relationships can also have properties, but their use should be limited because relationships serve as conduits for traversal when searching rather than as data stores. Searching is carried out by using label-based indexes and following node and relationship pathways. Returning all the movies Tom Hanks acted in would involve looking in the Actor index for the name ‘Tom Hanks’ in order to locate his node, and then following all the outgoing ACTED_IN relationships to find the Movie nodes. Relationships can be traversed in either direction, so finding the cast of Cast Away involves a similar process: first locate the Cast Away node, then follow all incoming ACTED_IN relationships.
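To make the two directions of traversal concrete, here is a minimal in-memory sketch in C#. It does not use Neo4j at all; the Node and Relationship types, the TinyGraph class and the extra sample data (Helen Hunt, Apollo 13) are illustrative assumptions, and a real graph database would locate the start node through an index rather than by holding a reference to it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative types only: a node has a label and one property,
// a relationship has a start node, a type and an end node.
public record Node(string Label, string PropertyName, string PropertyValue);
public record Relationship(Node Start, string Type, Node End);

public static class TinyGraph
{
    public static readonly Node TomHanks = new("Actor", "name", "Tom Hanks");
    public static readonly Node HelenHunt = new("Actor", "name", "Helen Hunt");
    public static readonly Node CastAway = new("Movie", "title", "Cast Away");
    public static readonly Node Apollo13 = new("Movie", "title", "Apollo 13");

    public static readonly List<Relationship> Relationships = new()
    {
        new(TomHanks, "ACTED_IN", CastAway),
        new(HelenHunt, "ACTED_IN", CastAway),
        new(TomHanks, "ACTED_IN", Apollo13),
    };

    // Follow relationships of the given type in the direction of the arrow.
    public static IEnumerable<Node> Outgoing(Node start, string type) =>
        Relationships.Where(r => r.Start == start && r.Type == type).Select(r => r.End);

    // Follow the same relationships against the arrow.
    public static IEnumerable<Node> Incoming(Node end, string type) =>
        Relationships.Where(r => r.End == end && r.Type == type).Select(r => r.Start);

    public static void Main()
    {
        // Movies Tom Hanks acted in: outgoing ACTED_IN from his node.
        Console.WriteLine(string.Join(", ",
            Outgoing(TomHanks, "ACTED_IN").Select(n => n.PropertyValue)));
        // Cast of Cast Away: incoming ACTED_IN to the movie node.
        Console.WriteLine(string.Join(", ",
            Incoming(CastAway, "ACTED_IN").Select(n => n.PropertyValue)));
    }
}
```

The same list of relationships answers both questions; only the end of the relationship that is compared with the starting node changes.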

Getting started with Neo4j

The graph database management system illustrated in this article is Neo4j; the community edition can be downloaded here. It’s open source, free and fully ACID compliant. Neo4j employs a REST service interface and provides an admin console hosted in a web browser. To get started, run neo4j-community.exe and follow the prompts. Once the service has started, you can open the console by clicking on the link provided in the popup window. The console provides a series of simple steps to build some example graphs. The Movie graph is a good one to start with. When you have built it, you will see a stunning visual representation of the graph displayed in the browser.

Cypher

Cypher is the query language used by Neo4j. It employs a neat way of expressing nodes and their relationships. The relationship shown in the first diagram would be expressed as
(:Actor {name:'Tom Hanks'})-[:ACTED_IN]->(:Movie {title:'Cast Away'})
There are many excellent examples given in the console app; it’s instructive to run them all and observe the results. All the commands are very well explained, but I’d like to highlight the use of the MERGE command as it is very easy to trip up with it. MERGE treats the given pattern as a whole. If the whole pattern does not match an existing pattern, it creates every part of it, including nodes that already exist. So, to avoid duplicating existing nodes,

MERGE (:User {name: "Bob"})-[:KNOWS]->(:User {name: "Alice"})

is best written as

MERGE (bob:User {name: "Bob"})
MERGE (alice:User {name: "Alice"})
MERGE (bob)-[:KNOWS]->(alice)

The identifiers 'bob' and 'alice' bound in the first two statements refer to those same nodes in the last statement.
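The difference between the two forms can be simulated without a database. The sketch below is a toy model in C#, not Neo4j's implementation: the Graph class and the method names are assumptions made for illustration. It starts from a graph in which Bob and Alice already exist but are not yet connected.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MergeSketch
{
    // Toy graph: node i is a User whose name is Users[i];
    // Knows holds (from, to) pairs of node positions.
    public sealed class Graph
    {
        public List<string> Users { get; } = new List<string>();
        public List<(int From, int To)> Knows { get; } = new List<(int, int)>();
    }

    // MERGE on a single node: reuse an existing node with this name or create one.
    public static int MergeUser(Graph g, string name)
    {
        int id = g.Users.IndexOf(name);
        if (id >= 0) return id;
        g.Users.Add(name);
        return g.Users.Count - 1;
    }

    // MERGE on the whole pattern: unless the complete path already exists,
    // every part of the pattern is created, including both nodes.
    public static void MergeWholePattern(Graph g, string from, string to)
    {
        bool exists = g.Knows.Any(k => g.Users[k.From] == from && g.Users[k.To] == to);
        if (exists) return;
        g.Users.Add(from);
        g.Users.Add(to);
        g.Knows.Add((g.Users.Count - 2, g.Users.Count - 1));
    }

    // Three-step form: merge each node, then merge the relationship
    // between the two nodes that are now bound.
    public static void MergeStepwise(Graph g, string from, string to)
    {
        int a = MergeUser(g, from);
        int b = MergeUser(g, to);
        if (!g.Knows.Contains((a, b))) g.Knows.Add((a, b));
    }

    public static Graph BobAndAliceUnconnected()
    {
        var g = new Graph();
        g.Users.Add("Bob");
        g.Users.Add("Alice");
        return g;
    }

    public static void Main()
    {
        var whole = BobAndAliceUnconnected();
        MergeWholePattern(whole, "Bob", "Alice");
        Console.WriteLine(whole.Users.Count); // 4: Bob and Alice were duplicated

        var stepwise = BobAndAliceUnconnected();
        MergeStepwise(stepwise, "Bob", "Alice");
        Console.WriteLine(stepwise.Users.Count); // 2: existing nodes were reused
    }
}
```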

Database Management

Creating Indexes

Indexes are used to find the starting node for a query. They are label-based.
CREATE INDEX ON :Person(name)
This creates an index based on the Person label and the ‘name’ property. The index is updated automatically and Neo4j is smart enough to know which index to use, so there is no need to specify any particular index when searching. To remove an index use
DROP INDEX ON :Person(name)

Constraints

A uniqueness constraint specifies that, among the nodes with a given label, each value of a given property must be unique.
CREATE CONSTRAINT ON (person:Person) ASSERT person.name IS UNIQUE
This statement will create a new index based on the Person.name property and enforce the constraint on it. Dropping an index that has a constraint will throw an error. To drop a constrained index, just drop the constraint
DROP CONSTRAINT ON (person:Person) ASSERT person.name IS UNIQUE

Schema

To view the schema, enter the following command
:SCHEMA
You will get back a list of the graph’s indexes and constraints.

Backing up the Graph

There is no inbuilt backup facility in the community edition, but data can be backed up by stopping the service and using something like Windows Zip or 7-Zip to copy the database folder (default.graphdb) to a compressed folder.

Deleting the Graph

The cleanest way to start with an empty graph is to delete all the contents and sub-directories of the database folder. But you can, if you wish, delete all nodes and relationships using the command
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE r,n
This does not, however, remove any indexes that you may have created. They need to be removed using the DROP INDEX command.

Building the Graph by Importing Data

One of the best methods to load external data is to use the LOAD CSV command from the management console. Details of how to use this are here. Creating indexes before loading the data speeds things up considerably. A good way to ensure that the data is correctly formatted is to load the data into Excel and save it as a CSV file. Stray commas inside addresses are something to look out for.
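If the CSV file is produced by code rather than by Excel, the comma problem can be avoided by quoting fields on the way out. The following C# sketch follows the common convention of wrapping a field in double quotes when it contains a comma, a quote or a line break, and doubling any embedded quotes. The CsvSketch class and the sample name and address are hypothetical, and it is worth checking the LOAD CSV documentation for the escaping rules of your Neo4j version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvSketch
{
    // Quote a field only when it needs it, doubling any embedded quotes.
    public static string Escape(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Join the escaped fields into one CSV line.
    public static string ToLine(IEnumerable<string> fields) =>
        string.Join(",", fields.Select(Escape));

    public static void Main()
    {
        Console.WriteLine(ToLine(new[] { "name", "address" }));
        // The comma inside the address stays within a single quoted field.
        Console.WriteLine(ToLine(new[] { "Alice", "1 High Street, Cardiff" }));
    }
}
```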

Some Graph Design Considerations

Anyone can design a graph database. All you need is a whiteboard, a felt-tipped pen and the ability to draw ellipses, arrows and square brackets, but there are a couple of things to bear in mind before starting on the artwork. Relationships are optimised for rapid traversal when searching across them. They are stored as fixed-length records containing pointers to other nodes and relationships, so a record’s position can be calculated quickly by multiplying its index number by the record length. They are not optimised for examining their properties, as the properties are stored elsewhere and require a lookup to locate them. So, ideally, searches should only examine properties when they have reached their destination. You don’t want to have to stop at every station, get out and read the timetable.
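The arithmetic behind fixed-length records is simple enough to show in a couple of lines of C#. The record size used below is an assumption chosen for illustration; the actual size depends on the Neo4j version and store format.

```csharp
using System;

public static class RecordStoreSketch
{
    // Assumed record length in bytes, for illustration only.
    public const int RecordSizeBytes = 34;

    // With fixed-length records, a record's position in the store file
    // is its index number multiplied by the record length; no search is needed.
    public static long OffsetOf(long recordId) => recordId * RecordSizeBytes;

    public static void Main()
    {
        Console.WriteLine(OffsetOf(0));    // 0
        Console.WriteLine(OffsetOf(1000)); // 34000
    }
}
```

Following a pointer from one record to the next therefore costs one multiplication and one read, however large the store is.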
Careful consideration needs to be given to the type of data that’s stored within relationships, as that data cannot be linked to other data. Here’s an example from the Movie graph. The graph has the following relationship.

Creating an intermediate node between Person and Movie nodes would allow the domain to be expanded more easily, something like

The number of relationships has increased substantially, but that’s OK as only a small part of the graph is traversed when searching. With this arrangement, we are free to add as many relationships to the Role node as needed, and we could get the cast of Cloud Atlas with a simple query
MATCH (:Movie {title:"Cloud Atlas"})<-[:CAST_IN]-(role:Role)<-[:GOT_ROLE]-(person:Person) RETURN person,role
Querying data stored within relationships can get a bit messy as you need to know, in advance, the types of the relationships involved. This query, using the Movie example graph, lists the people involved in Cloud Atlas alongside their role in it. It uses the pipe character ‘|’ to select alternative matches.
MATCH (movie:Movie {title: "Cloud Atlas"})
OPTIONAL MATCH (person)-[r:ACTED_IN|:DIRECTED|:PRODUCED|:WROTE]->(movie)
RETURN person.name,type(r)

It’s a matter of judgement as to which properties are stored together within nodes. In the extreme case all nodes could have just one property. The Graphgist Project has an interesting collection of graph designs and models for use in a wide range of domains from bank fraud detection to Scotch Whisky retailing.

Using Neo4jClient

Neo4jClient is an excellent C# .NET client for the Neo4j server. It can be downloaded as the Neo4jClient package on NuGet. The documentation is here. I don’t want to duplicate the examples given, but I’d like to mention a couple of things that I’ve found useful with reference to the Movie graph example. Let’s start by connecting to the database and finding Tom Hanks.

Connecting to the Database Service

//make sure Neo4J service is running before opening the db
var graphClient = new GraphClient(new Uri("http://localhost:7474/db/data"), "neo4j", "myPassword");
graphClient.Connect();

Simple Searching

// Find the Person named "Tom Hanks"...
var tomHanks = graphClient.Cypher
    .Match("(person:Person)")
    .Where((Person person) => person.name == "Tom Hanks")
    .Return(person => person.As<Person>())
    .Results
    .Single();

As far as Neo4j is concerned, every entity is a JSON object, that is, a collection of key-value pairs. It’s up to you to tell it how to deserialise the object. In this case we use person.As<Person>() to return a Person entity. If you want to inspect the key-value pairs you can use person.As<Dictionary<string, string>>(). Person is defined as

public class Person
{
    //all public fields need to be properties with getters and setters
    public int born { get; set; }

    public string name { get; set; }

}

The best-match approach is used. So, if there is no match for the key ‘born’ but there is one for ‘name’, just the name will be populated and ‘born’ will be left at its default value. This leaves you free to expand the Person class at a later date. Each entity is also given an Id property, but I would advise against relying on it as it’s an internal identifier managed by the server and can be reused after an entity is deleted. Neo4j follows the Java convention of using lower case for the first character of a property name, but that’s not sacrosanct and you are free to do your own thing.
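The best-match behaviour is easy to try out on its own. The sketch below deserialises a JSON object that has no ‘born’ key and carries an extra key that Person does not define. It uses System.Text.Json purely for illustration, which is an assumption on my part and not necessarily the serialiser Neo4jClient uses internally; the ‘nickname’ key is hypothetical.

```csharp
using System;
using System.Text.Json;

public class Person
{
    public int born { get; set; }

    public string name { get; set; }
}

public static class BestMatchSketch
{
    // Map whichever keys match onto Person; unmatched keys are ignored
    // and unmatched properties keep their default values.
    public static Person FromJson(string json) =>
        JsonSerializer.Deserialize<Person>(json);

    public static void Main()
    {
        // No 'born' key, plus a 'nickname' key that Person does not have.
        var person = FromJson("{\"name\":\"Tom Hanks\",\"nickname\":\"ignored\"}");
        Console.WriteLine(person.name); // Tom Hanks
        Console.WriteLine(person.born); // 0, the default for int
    }
}
```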

Adding Indexes and Labels

//build a couple of indexes
graphClient.Cypher.Create("INDEX ON :Person(name)").ExecuteWithoutResults();
graphClient.Cypher.Create("INDEX ON :Movie(title)").ExecuteWithoutResults();

//Add a label 'Actor' to all the 103 actors
//You can't use parameters in the Set method for setting labels
graphClient.Cypher
    .Match("(person:Person)-[:ACTED_IN]->(m)")
    .Set("person :Actor") //to remove a label use .Remove("person :Actor")
    .ExecuteWithoutResults();

Updating

This example illustrates how parameters can be passed by using an anonymous class.

var updatedTom = graphClient.Cypher
    .Match("(person:Person)")
    .Where((Person person) => person.name == "Tom Hanks")
    .Set("person.born = {year}, person.lastName = {lastName}")
    .WithParams(new { year = 1066, lastName = "Hanks" })
    .Return(person => person.As<Person>())
    .Results
    .Single();

Searching

Graphs are excellent for finding relationships between entities that would be difficult to reveal with a conventional data table structure. Here we are looking for actors that would be good candidates for working with Tom Hanks and who have not worked with him before. So we are looking for people who have been cast with people that have worked with Tom but they themselves have not appeared with Tom. The Cypher command for this query is given in the console application. Here is its equivalent using Neo4jClient.

string actorsName = "Tom Hanks";
List<string> cocoActors = graphClient.Cypher
    .Match("(tom:Person {name:{nameParam}})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActor:Person),(coActor)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(cocoActor:Person)")
    .WithParam("nameParam", actorsName)
    .Where("NOT (tom)-[:ACTED_IN]->(m2)")
    .ReturnDistinct(cocoActor => cocoActor.As<Person>().name)
    .Results.ToList();
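It can help to work through the two-hop idea on a small data set held in memory. The C# sketch below applies the rule as stated in the prose: it collects everyone who shares a film with a co-actor and then removes the actor and all of the actor's own co-actors. That is a stricter filter than one that only excludes the films the actor appears in. The class name, film titles and cast names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CoCoActorSketch
{
    // Hypothetical sample data: film title -> cast.
    public static readonly Dictionary<string, string[]> Casts =
        new Dictionary<string, string[]>
        {
            ["Film A"] = new[] { "Tom", "Ann" },
            ["Film B"] = new[] { "Ann", "Bill" },
            ["Film C"] = new[] { "Tom", "Cat" },
            ["Film D"] = new[] { "Cat", "Ann", "Dee" },
        };

    public static List<string> CoCoActors(string actor)
    {
        // First hop: everyone who shares at least one film with the actor.
        var coActors = Casts.Values
            .Where(cast => cast.Contains(actor))
            .SelectMany(cast => cast)
            .Where(name => name != actor)
            .ToHashSet();

        // Second hop: everyone who shares a film with a co-actor,
        // minus the actor and the people the actor has already worked with.
        return Casts.Values
            .Where(cast => cast.Any(coActors.Contains))
            .SelectMany(cast => cast)
            .Where(name => name != actor && !coActors.Contains(name))
            .Distinct()
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CoCoActors("Tom"))); // Bill, Dee
    }
}
```

Ann and Cat both appear in Film D, which Tom is not in, but they are left out of the result because each has already shared a film with Tom.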

Building the Graph

There is a serious omission in the sample Movie database – Daniel Craig is not in it. So let’s correct that.

var danielCraig = new Person { born = 1968, name = "Daniel Craig" };
var skyfall = new Movie { released = 2012, title = "Skyfall" };
var actedIn = new ActedIn { roles = new List<string> { "James Bond" } };

graphClient.Cypher
    // create a new node with Person and Actor labels
    // if there is no match on the name and born parameters
    .Merge("(person:Person:Actor { name: {name}, born: {born} })")
    .OnCreate()
    // set the person node's properties from the danielCraig parameter
    .Set("person = {danielCraig}")
    // create an anonymous type parameter object with name, born and danielCraig properties;
    // these are the properties referenced in the preceding Merge and Set clauses
    .WithParams(new { danielCraig.name, danielCraig.born, danielCraig })
    .Merge("(movie:Movie { title: {title}, released: {released} })")
    .OnCreate()
    .Set("movie = {skyfall}")
    .WithParams(new { skyfall.title, skyfall.released, skyfall })
    // create a relationship from person to movie,
    // only if there is no existing ACTED_IN relationship between them
    .Merge("(person)-[rs:ACTED_IN]->(movie)")
    .OnCreate()
    .Set("rs = {actedIn}")
    .WithParam("actedIn", actedIn)
    .ExecuteWithoutResults();

Using Transaction Scope

Placing queries within a transaction scope enables several of them to be committed to the database as a single transaction. Either they all succeed or they all fail, so you don’t end up with a half-built graph if something goes wrong. There is a substantial improvement in efficiency over committing each query individually, so it’s well worth doing. Here’s an example that adds multiple labels to the graph.

//need to have a reference to System.Transactions
using (var scope = new TransactionScope())
{
    graphClient.Cypher.Match("(person:Person)-[:ACTED_IN]->(m)")
        .Set("person :Actor")
        .ExecuteWithoutResults();
    graphClient.Cypher.Match("(person:Person)-[:DIRECTED]->(m)")
        .Set("person :Director")
        .ExecuteWithoutResults();

    scope.Complete();
}

Asynchronous Transactions

To run a query asynchronously, end it with ResultsAsync or ExecuteWithoutResultsAsync(). You can do something like this.

public async Task<IEnumerable<Crew>> GetCrewOfMovieAsync(string movieTitle)
{
    var movieCrew = await this.graphClient.Cypher
        .Match("(movie:Movie {title: {titleParam}})")
        .OptionalMatch("(person:Person)-[r:DIRECTED|:PRODUCED|:WROTE]->(movie:Movie)")
        .WithParam("titleParam", movieTitle)
        .Return((person, r) => new Crew
        {
            Name = person.As<Person>().name,
            Role = r.Type()
        })
        .OrderBy("person.name")
        .ResultsAsync;

    return movieCrew;
}

It’s as well to keep in mind when running asynchronous transactions that operations on relationships can write-lock the relationship and both of the nodes that are connected to it. So care is needed to avoid the situation where transaction A is waiting for transaction B to unlock, B is waiting for A to unlock, and deadlock ensues. Personally, I’ve not found that running queries asynchronously is advantageous. They often take much longer to run than the synchronous versions, but this may not be the case with leviathan-sized graphs.

Debugging

There was an in-memory graph database utility that was useful for running unit tests with Neo4jClient, but Neo4j’s latest upgrade appears to have broken it. An alternative approach is to run integration tests against a test graph. But this is rarely necessary, as it is usually sufficient to check that Neo4jClient is receiving the expected parameters for any given method, since the method itself has already been extensively tested. Here’s a snippet to illustrate the sort of approach I take.

    using Microsoft.VisualStudio.TestTools.UnitTesting;

    using Neo4jClient;
    using Neo4jClient.Cypher;
    using Neo4jClient.SchemaManager;

    using NSubstitute;

    [TestClass]
    public class IndexHelperUnitTests
    {
        #region Constants and Fields

        private static IGraphClient graphClientSub;

        #endregion

        #region Public Methods and Operators
        [TestInitialize]
        public void TestInitialise()
        {
            graphClientSub = Substitute.For<IGraphClient>();
            var cypher = Substitute.For<ICypherFluentQuery>();
            graphClientSub.Cypher.Returns(cypher);
        }

     

        [TestMethod]
        public void DropIndexCallsCypherDropWithCorrectParams()
        {
            var indexMetadataFactory = Substitute.For<IIndexMetadataFactory>();
            var schemaReader = Substitute.For<ISchemaReader>();
            var indexHelper = new IndexHelper(schemaReader, indexMetadataFactory);
            string expected = "INDEX ON :Person(name)";
            indexHelper.DropIndex(graphClientSub, "Person", "name");
            //check graphClient.Cypher.Drop() was called with the expected string
            graphClientSub.Cypher.Received().Drop(expected);
        }

        #endregion
    }

Profiling Queries

The recommended technique for building queries with Neo4jClient is to test the query as a Cypher command in the console and then convert the command to its equivalent in Neo4jClient. If you preface the Cypher statement with the word PROFILE, you will get a breakdown of the number of database hits that took place when the statement was executed. This is the key metric and is nearly always higher than expected. I’d take the execution times given with a pinch of salt, as Neo4j is smart and goes in for pre-emptive caching. The following example illustrates this.

Query: PROFILE MATCH (tom:Person {name: "Tom Hanks"}) RETURN tom
Response: COST. 267 total db hits in 679 ms.
Query: CREATE INDEX ON :Person(name)
Response: Added 1 index, statement executed in 159 ms.
Query: PROFILE MATCH (tom:Person {name: "Tom Hanks"}) RETURN tom
Response: COST. 2 total db hits in 166 ms.
Query: PROFILE MATCH (tom:Person {name: "Tom Hanks"}) RETURN tom
Response: COST. 2 total db hits in 14 ms.
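The drop in database hits once the index exists can be pictured with a toy comparison in C#. Without an index every node with the label has to be inspected; with one, the start node is found by a single lookup. This is only an analogy built on assumptions: the class, the 100 made-up names and the way hits are counted are all illustrative and do not reproduce how Neo4j counts db hits.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IndexSketch
{
    // No index: inspect every node and count each inspection as a hit.
    public static (int Position, int Hits) Scan(IReadOnlyList<string> names, string target)
    {
        int position = -1, hits = 0;
        for (int i = 0; i < names.Count; i++)
        {
            hits++;
            if (names[i] == target) position = i;
        }
        return (position, hits);
    }

    // With an index: one lookup in a prebuilt name -> position map.
    public static (int Position, int Hits) Seek(IReadOnlyDictionary<string, int> index, string target)
        => index.TryGetValue(target, out int position) ? (position, 1) : (-1, 1);

    // Hypothetical data set of 100 people, one of whom is Tom Hanks.
    public static List<string> SampleNames()
    {
        var names = Enumerable.Range(0, 100).Select(i => "Person " + i).ToList();
        names[42] = "Tom Hanks";
        return names;
    }

    public static void Main()
    {
        var names = SampleNames();
        var index = names.Select((name, i) => (name, i)).ToDictionary(x => x.name, x => x.i);

        Console.WriteLine(Scan(names, "Tom Hanks").Hits); // 100
        Console.WriteLine(Seek(index, "Tom Hanks").Hits); // 1
    }
}
```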

Using the Visual Display in the Console Application

The visual display is very useful to confirm that the graph is being built as expected and that queries return the correct data. To display the whole graph enter the command
MATCH (n) RETURN n

You can untangle overlapping relationships by dragging the nodes into less crowded regions of the display and you can display the contents of nodes and relationships by running the cursor over them. It’s a real corker of an application that never fails to blow my socks off.

Demonstration Project

The demonstration project contains a selection of database management examples and queries that, hopefully, are a useful basis for further study. The project builds its own graph, so it’s best to start with an empty database directory to avoid trashing valuable data.

Conclusion

If you are used to rigid schema structures and type-safe entities, Neo4j’s API will be a bit of an eyebrow raiser for you, but stick with it and you may be able to see how graph databases can be used to your advantage. They help Google and they help Facebook and they may help you.

Acknowledgements

I’m grateful to the authors of Neo4j and Neo4jClient for making such great applications open source. My special thanks go to Tatham Oddie. The support he gives to users of Neo4jClient is exemplary.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

George Swan
Student
Wales
Article Copyright 2016 by George Swan