RaptorDB - the Document Store

Mehdi Gholam

Rate me:

4.96/5 (278 votes)

4 May 2012CPOL86 min read

2.5M

16.3K

652

NoSql, JSON based, Document store database with compiled .net map functions and automatic hybrid bitmap indexing and LINQ query filters

This is an old version of the currently published article.

The Document Store Database Engine

Preface
Introduction
Why?
Why another database?
Possible Uses
How we use data
What is a Document Store database?
- Foul ! the relational purists cry...
- Is it for everybody?
Features
- Limitations
The Competition
Performance Tests
If you want to torture someone, make them write a LINQ Provider!
- The quirks of LINQ parsing
- How the LINQ parser works in RaptorDB
A sample application
How It Works
On the cutting room floor
Closing Remarks
History

Preface

The code for RaptorDB is on CodePlex (http://raptordb.codeplex.com/) under Git source control, I will be actively maintaining and adding features to this project as I believe very strongly in it. I will keep this article and the source code in sync.

Introduction

This article is the natural progression from my previous article about a persisted dictionary to a full blown NoSql document store database. While a key/value store is useful, it's not as useful to everybody as a "real" database with "columns" and "tables". RaptorDB uses the following articles:

fastJSON : JSON object serializer and deserializer found here (http://www.codeproject.com/Articles/159450/fastJSON)
fastBinaryJSON : a binary format JSON serializer found here (http://www.codeproject.com/Articles/345070/fastBinaryJSON)
miniLog4net : a mini log4net replacement found here (http://www.codeproject.com/Articles/136118/Mini-Drop-in-Replacement-for-log4net)
WAHBitArray : a Word Aligned Hybrid Compression technique for BitArrays found here (http://www.codeproject.com/Articles/214997/Word-Aligned-Hybrid-WAH-Compression-for-BitArrays)
hOOt : full text search engine found here (http://www.codeproject.com/Articles/224722/hOOt-full-text-search-engine)
RaptorDB the key value store : using the MGIndex for the indexing found here (http://www.codeproject.com/Articles/316816/RaptorDB-The-Key-Value-Store-V2)

Some advanced R&D (for more than a year) went into RaptorDB, in regards to the hybrid bitmap index. Similar technology is being used by Microsoft's Power Pivot for Excel and US Department of Energy Berkeley labs project called fastBit to track terabytes of information from particle simulations. Only the geeks among us care about this stuff and the normal person just prefer to sit in the Bugatti Veyron and drive, instead of marvel at the technological underpinnings.

To get here was quite a journey for me as I had to create a lot of technology from scratch, hopefully RaptorDB will be a prominent alternative, built on the .net platform to other document databases which are either java or c++ based.

RaptorDB puts the joy back into programming, as you can see in the sample application section.

Why?

The main driving force behind the development of RaptorDB is making the developer's and support jobs easier, developing software products is hard enough without complete requirements which becomes even harder when requirements and minds change as they only do in the real world. The benefits of

RaptorDB

can be summarized as :

Development easier : writing less code means less bugs and less testing.
Making changes faster : no need to edit database schema's with all the hassle of using other tools.
Knowledge requirements lowest : you don't need to know the SQL language, indexing techniques, configuration parameters, just plain old c# (vb.net) will do.
Maintenance simpler : changes are isolated so you are free to make them without worrying about breaking things elsewhere.
Setup time & cost minimal : to get up and running you just need the .net framework and an IDE, no setting up database servers, running scripts, editing config files etc. (even on netbooks).
Very fast execution : all this is done with speeds that put expensive servers to shame on mere laptops and netbooks.

Why another database?

Some people have said why create another database while you can use what exists or just write a driver in .net for the "X" database, to this I answer the following:

I believe that you can do a better job in pure .net like operating at 80% hard disk speed.
Writing drivers and marshaling across process boundaries can have a performance hit.
Implementing fundamental algorithms is an educational process.
We have to push the boundaries of what is possible too see that the only limitation is our imagination and resolve.
Someone will find it useful and who knows it might become one of the "big boys", they all started as "little boys" anyway.

Possible Uses

You can use the Document Store version of RaptorDB in the following scenarios:

The back-end store for your web based :

Forums
Blogs
Wikis
Content management systems
Web sites

Easily build a SharePoint clone.
Stand-alone applications that require storage ( no more installing SQL Server for a phone book app).
Real world business applications (with caveats).

How we use data

Before getting to the heart of document databases, let us examine how we use data in the first place as this will give us a better understanding of were we stand and how we can better utilize non relational technology.

Viewing lists of things (list of customers, products, inventory transaction,...)
Filtering lists of things (customer in country X)
Aggregating lists of things (sum of qty in stock)
Searching for things (much like filtering but may span multiple lists)
Viewing a document (open the invoice number 123)
Pivoting lists or building intelligence reports

Except for reporting all the others are essentially just the following :

Filtering lists
Aggregating lists

What is a Document Store database?

Document databases or stores are a class of storage systems which save the whole object hierarchy to disk and retrieve the same without the use of relational tables. To aid the searching in such databases most Document store databases have a map function which extracts the data needed and saves that as a "view" for later browsing and searching. These databases do away with the notion of transactions and locking mechanism in the traditional sense and offer high data through-put and "eventually consistent" data views. This means that the save pipeline is not blocked for insert operations and reading data will eventually reflect the inserts done ( allowing the mapping functions and indexers time to work).

There is some very appealing consequences for going this way:

Schema less design (just save it mentality) :

You don't need to define tables and columns before hand.
Your application can evolve and expand as needed without schema pains.

Operational speed :

You read the data as it was saved the first time, so you can read in one disk operation and have the whole object hierarchy without multiple reads to tables and patching an object with data retrieved.
Does away with locks and deadlocks, so its much faster and scales better.

Simplicity in application design :

The data access layer for the application is orders of magnitude simpler.
Changes to the application can be made anytime and on-site to the customers requirements.

Lower development costs : Simpler and less code means development and testing is faster, easier with lower knowledge requirements for developers and maintainers.
Historical data : Ability to have history of changes (essential for business applications).
Easy and simple replication : Because the data is already encapsulated (the original document) replication is simple and painless as you just transfer the document and save it at the other end, without the pains of inconsistent tables like relational models.
Operational cost savings : not requiring RDBM server licenses could offer considerable savings especially for web hosted and cloud based applications.

Foul ! the relational purists cry...

Most people who have worked with relational database will scream in horror here, at the thought of data being eventually consistent and not having tables. Most businesses have a lot of flexibility in regards to data validity and not all data items require the same granularity and in which case they have processes in place for "exceptional" cases and work perfectly fine.

For example if you have sold 10 items and go to the warehouse and see that the items were damaged in a the rain last night under a leaky roof, the business tries to find another 10 items otherwise calls the customer and explains their order might be delayed. So having an up to the millisecond record of the inventory is good, but the business can do fine without it.

This mindset takes a bit of getting used to for people who have been under the influence of "relational database" thinking and have not been exposed to actual running businesses, and they will freak out at the thought. Much of what has been drilled into us for the past 40 or so years since the advent of relational model has been a notion of data normalization which forces the breaking up of data in to discrete chunks of the same things. This was primarily done for the reason of shortage of data storage capacity and has stuck ever since, which forces a huge burden on the poor database engines to optimize joins and query plans to get back to what was put in, in the first place. Also the notion of normalization is a misnomer as you are perfectly allowed to have duplicated data as long as you ensure that they are in sync with one another.

Most of the relational model thinking has changed in recent years as the "Database Server" is not stand-alone and is part of the application and is accessed via the application and not called directly. This is the tiered mindset in application development which creates a layer for data access which can be isolated and controlled easily. This change, releases the burden of security settings, normalization requirements etc. from the RDBM server, as all this is done from the application service front end, case in point would be an API for Facebook which abstracts the usage of the site and you don't access tables directly. So much of what was built into RDBM servers goes unused for modern applications, and you can get away with embedding the database within the application as a layer not a separate process.

Is it for everybody?

With all the benefits stated, documents store databases are not for every body and every situation. The main point in divergence is if you require up to the millisecond data consistency across everything ( the cornerstone RDBM systems). If data validity is not an issue or your perfectly fine having the result valid at a certain point in time like the inventory count is valid for today at 9:00am then in my experience and you can get away with it for most web based application and probably 90% of business applications (this does not mean that these databases are consistent after hours but typically seconds) .

If your not willing to sacrifice this level of timed concurrency, then NoSql / document store databases are not for you. In my experience not being willing, is more a psychological barrier of the developers/designers than a technical requirement of the application and users.

Features

RaptorDB has been built with the following features in mind:

Core Features :

Built on the algorithms of persisted dictionary version of RaptorDB (so you get all the benefits of that).
Built on pure .net so there is considerable performance benefits of not marshaling data across process boundaries via drivers.
Tiny size 120 KB (even smaller than the great Sqlite).
Strings are stored in UTF8 or Unicode format for views.
Documents can be stored as ASCII JSON, the JSON standard already encodes unicode in ASCII format or as Binary JSON for speed.
Embedded design.

View Storage :

You don't need to specify column widths for
```
string
```
columns and RaptorDB can handle any size string (indexing is however limited to max 255 bytes for normal string columns).
Views (output from map functions) save data as binary not JSON ( much faster for indexing).
Primary list/View on object type definition for immediate access to the object saved (immediate call to a primary map function)

Document Storage :

Ability to save byte[] data as well so you can save files etc. ( these are saved in a separate storage file).

Indexing Engine :

Support for special hybrid bitmap index for views using Word Aligned Hybrid (WAH) compression.
Automatic indexing of views (self maintained no administration required).
All columns are indexed with the MGIndex.
Fast full text search support on string columns (you choose normal or full text search of strings in your view) .

LINQ Query Processor :

Query Filter parser with nested parenthesis, AND and OR expressions.

Map Function Engine :

Map functions are compiled .net code not JavaScript like competitors ( order of magnitude faster, see the performance tests).
Map functions have access to an API for doing queries and document fetches (you can do complex business logic in them).
Map functions type check the data saved, which makes reading the data easier and everything more consistent.

Limitations

The following limitations are in this release:

Requires at least .net 4 (uses the Task library).
Document and View Item count is limited to 4 billion items or Int32.
This release is not code backward compatible with the key/value store version.
Aggregation on views.
Standalone service version is not available at the moment.
Revision checking on documents is not supported at the moment.
Sharding / Replication is not supported at the moment.
Query filters only work with literal right hand sides (e.g. Name='bob' not
```
StatusColumn >
              LastStatusColumn
```
referencing another column)

The Competition

There are a number competing storage systems, a few of which I have looked at, are below:

MongoDB (c++): Great database which I love and is the main inspiration behind RaptorDB, although I have issues with its 32bit 4Gb database size limit and the memory map file design which could potentially corrupt easily. (Polymorphism has a workaround in mongodb if anyone wants to know).
CouchDB / CouchBase (erlang): Another standard in document databases. The design is elegant.
RavenDB (.net): Work done by Ayende which is a .net document database built on the Windows ESENT storage system.
OrientDB (java) : The performance specs are impressive.

Performance Tests

The test were all done on my notebook which is a AMD K625 1.5Ghz CPU, 4Gb DDR2 Ram, WD 5400rpm HDD, Win7 Home 64bit , Windows rating of 3.9.

Javascript vs .NET compiled map functions

Most of the document databases today use the javascript language as the language to write map functions, in RaptorDB we are using compiled .net code. A simple test of the performance benefits of using compiled code is the following example from the http://www.silverlight.net/content/samples/sl2/silverlightchess/run/default.html site which is a chess application, this test is certainly not comprehensive but it does show a reasonable computation intensive comparison :

As you can see even in Google Chrome (V8 javascript engine) which arguably is the fastest javascript processor currently, is beaten by the .net code (in Silverlight, full .net version could be faster still) by a factor of about 8x.

With this non scientific test it's unreasonable to use anything but compiled code for map functions.

Insert object performance

Depending on the complexity of you documents and the views defined you can expect around 10,000 docs/sec throughput from RaptorDB (based on my test system).

Query Performance

The real test to validate the use of a NoSql database is the query test, you have put the data in, how fast is it getting it out. RaptorDB will output query processing times to the log file, as you can see from the samples below most of the query time is taken by the actual fetching of the data and the query plan is executed in the milliseconds range (the power of the bitmap index).

2012-04-29 12:38:40|DEBUG|1|RaptorDB.Views.ViewHandler|| query bitmap done (ms) : 40.0023
2012-04-29 12:38:40|DEBUG|1|RaptorDB.Views.ViewHandler|| query rows fetched (ms) : 33.0019
2012-04-29 12:38:40|DEBUG|1|RaptorDB.Views.ViewHandler|| query rows count : 300

2012-04-29 12:38:40|DEBUG|1|RaptorDB.Views.ViewHandler|| query bitmap done (ms) : 1.0001
2012-04-29 12:38:40|DEBUG|1|RaptorDB.Views.ViewHandler|| query rows fetched (ms) : 469.0268
2012-04-29 12:38:40|DEBUG|1|RaptorDB.Views.ViewHandler|| query rows count : 25875 

2012-04-29 12:38:45|DEBUG|1|RaptorDB.Views.ViewHandler|| query bitmap done (ms) : 4.0002
2012-04-29 12:38:45|DEBUG|1|RaptorDB.Views.ViewHandler|| query rows fetched (ms) : 6.0003
2012-04-29 12:38:45|DEBUG|1|RaptorDB.Views.ViewHandler|| query rows count : 500 

2012-04-29 12:38:45|DEBUG|1|RaptorDB.Views.ViewHandler|| query bitmap done (ms) : 0
2012-04-29 12:38:45|DEBUG|1|RaptorDB.Views.ViewHandler|| query rows fetched (ms) : 677.0387
2012-04-29 12:38:45|DEBUG|1|RaptorDB.Views.ViewHandler|| query rows count : 50000

If you want to torture someone, make them write a LINQ Provider!

It took me around a month of intense research, debugging and tinkering to get my head around the LINQ provider interface and how it works, while the title of this section is a bit harsh but I hope it conveys the frustration I felt at the time.
To be fair what emerged is very clean, concise and elegant. Admittedly it is only the Expression evaluation part of LINQ and is a fraction of what you have to go through for a full LINQ provider. This was all I needed for RaptorDB so I will try to explain how it was done here for anybody wanting to continue as resources are very rare on this subject.

For RaptorDB we want a "where" clause parser in LINQ which will essentially filter the view data and give us the rows, this is done with the following command :

int j = 1000; 
var result = db.Query(typeof(SalesInvoice),
                 (SalesInvoice s) => (s.Serial > j &&  s.CustomerName == "aaa")
             );

The main part we are focusing on is the line :

(SalesInvoice s) => (s.Serial > j && s.CustomerName == "aaa")

From this we want to parse the expression which reads : given the SalesInvoice type (used for denoting the property/column names, and serves no other purpose) filter where the [serial number is greater than j and the customer name is "aaa"] or the count is greater than zero. From this the query engine must determine the "column names" used and fetch them from the index file and get the associated values from that index and apply logical arithmetic on the results to get what we want.

The quirks of LINQ parsing

There are two quirks in parsing LINQ queries :

Variables (the j in the above example) are replaced with a compiler placeholder and not the value.
All the work is done in the VisitBinary method, for logical expression parsing and clause evaluation, so you have to be able to distinguish and handle them both.

How the LINQ parser works in RaptorDB

In RaptorDB we want to be able to extract and query the index for each clause in the filter expression based on the order and logic of the expression. Because the indexes are built on the WAHBitArray the result will be a WAHBitArray. All this is done in the following very small code (compared to writing a language parser) :

delegate WAHBitArray QueryExpression(string colname, RDBExpression exp, object from);

internal class QueryVisitor : ExpressionVisitor
{
    public QueryVisitor(QueryExpression express)
    {
        qexpression = express;
    }
    public Stack<object> _stack = new Stack<object>();
    public Stack<object> _bitmap = new Stack<object>();
    QueryExpression qexpression;

    protected override Expression VisitBinary(BinaryExpression b)
    {
        this.Visit(b.Left);
        ExpressionType t = b.NodeType;

        if (t == ExpressionType.Equal || t == ExpressionType.LessThan || t == ExpressionType.LessThanOrEqual ||
            t == ExpressionType.GreaterThan || t == ExpressionType.GreaterThanOrEqual)
            _stack.Push(b.NodeType);

        this.Visit(b.Right);
        t = b.NodeType;
        if (t == ExpressionType.Equal || t == ExpressionType.NotEqual ||
            t == ExpressionType.LessThanOrEqual || t == ExpressionType.LessThan ||
            t == ExpressionType.GreaterThanOrEqual || t == ExpressionType.GreaterThan)
        {
            // binary expression
            object lv = _stack.Pop();
            ExpressionType lo = (ExpressionType)_stack.Pop();
            object ln = _stack.Pop();
            RDBExpression exp = RDBExpression.Equal;

            if (lo == ExpressionType.LessThan)
                exp = RDBExpression.Less;
            else if (lo == ExpressionType.LessThanOrEqual)
                exp = RDBExpression.LessEqual;
            else if (lo == ExpressionType.GreaterThan)
                exp = RDBExpression.Greater;
            else if (lo == ExpressionType.GreaterThanOrEqual)
                exp = RDBExpression.GreaterEqual;

            _bitmap.Push(qexpression("" + ln, exp, lv));
        }

        if (t == ExpressionType.And || t == ExpressionType.AndAlso ||
            t == ExpressionType.Or || t == ExpressionType.OrElse)
        {
            // do bitmap operations
            WAHBitArray r = (WAHBitArray)_bitmap.Pop();
            WAHBitArray l = (WAHBitArray)_bitmap.Pop();

            if (t == ExpressionType.And || t == ExpressionType.AndAlso)
                _bitmap.Push(r.And(l));
            if (t == ExpressionType.Or || t == ExpressionType.OrElse)
                _bitmap.Push(r.Or(l));
        }
        return b;
    }

    protected override Expression VisitMethodCall(MethodCallExpression m)
    {
        string s = m.ToString();
        _stack.Push(s.Substring(s.IndexOf('.') + 1));
        return m;
    }

    protected override Expression VisitMember(MemberExpression m)
    {
        var e = base.VisitMember(m);
        var c = m.Expression as ConstantExpression;
        if (c != null)
        {
            Type t = c.Value.GetType();
            var x = t.InvokeMember(m.Member.Name, BindingFlags.GetField, null, c.Value, null);
            _stack.Push(x);
        }
        if (m.Expression != null && m.Expression.NodeType == ExpressionType.Parameter)
        {
            _stack.Push(m.Member.Name);
            return e;
        }
        return e;
    }

    protected override Expression VisitConstant(ConstantExpression c)
    {
        IQueryable q = c.Value as IQueryable;
        if (q != null)
            _stack.Push(q.ElementType.Name);
        else if (c.Value == null)
            _stack.Push(null);
        else
        {
            _stack.Push(c.Value);
            if (Type.GetTypeCode(c.Value.GetType()) == TypeCode.Object)
                _stack.Pop();
        }
        return c;
    }
}

Most of the work is done in the VisitBinary method (for evaluating logical [&& || ] operations and clauses [b>3] ) so to distinguish the two a stack is used store the clause values for further processing. VisitBinary will be called recursively for left and right sides of expressions so a stack of bitmap is also required for aggregating the results of the expression.

The constructor to the class takes two delegates which are supplied by the caller for handles to the underlying indexes which this class calls when a binary clause is completely parsed. The results are push onto the bitmap stack.

The VisitMember method is responsible for replacing the compiler generated code for constant values with the appropriate value ( the j in the above example).

The rest of the code is generally for extracting the "column names" without the prefixes (s.Serial -> Serial etc.).

A sample application

To work with RaptorDB all you need to do is follow these steps :

Define your Entities (plain c# objects) as you would doing domain driven development.
Create a Primary View for your base Entities.
Register your views with RaptorDB.
Save and query your data.
Add new views based on your requirements.

As you will see below this is so easy and simple that it just happens and you don't need to learn anything new or worry about configurations or breaking things at runtime as the compiler will catch your error at compile time.
A great feature is the total absence of anything SQL related, the associated schema pains and having to switch to a database management product to define and check tables and columns as everything is in your source file.

1. Creating Entities

The first thing you should do is define your entities or data classes (referred to as domain driven development), these are plain c# (vb.net) classes or POCO's like the following :

public class LineItem
{
    public decimal QTY { get; set; }
    public string Product { get; set; }
    public decimal Price { get; set; }
    public decimal Discount { get; set; }
}

public class SalesInvoice
{
    public SalesInvoice()
    {
        ID = Guid.NewGuid();
    }
    public Guid ID { get; set; }
    public string CustomerName { get; set; }
    public string Address { get; set; }
    public List<LineItem> Items { get; set; }
    public DateTime Date { get; set; }
    public int Serial { get; set; }
    public byte Status { get; set; }
}

There is nothing special about the above other than the lack of anything extra you need to do like adding attributes etc. (even Serializable) as they are not needed.

2. Creating Views

Next you create your primary view for your entities as follows :

public class SalesInvoiceView : View<SalesInvoice> // create a view for the SalesInvoice type
 {
     public class RowSchema  // define the schema for this view
     {
         public NormalString CustomerName; // CustomerName is a normal string index
         public DateTime InvoiceDate;
         public string Address;
         public int Serial;
         public byte Status;
     }

     public SalesInvoiceView()
     {
         this.Name = "SalesInvoice";
         this.Description = "A primary view for SalesInvoices";
         this.isPrimaryList = true;
         this.isActive = true;
         this.BackgroundIndexing = true;

         this.Schema = typeof(SalesInvoiceView.RowSchema);

         this.AddFireOnTypes(typeof(SalesInvoice));

         this.Mapper = (api, docid, doc) =>
         {
             api.Emit(docid, doc.CustomerName, doc.Date, doc.Address, doc.Serial, doc.Status);
         };
     }
 }

This is pretty straight forward also.

This view is for a SalesInvoice object type.
RowSchema is a class which defines the columns for this view

You can name this class anything as long as you register it with the Schema property
All value types are supported (int, string, decimal etc.)
NormalString is a special type which instructs the indexer to index this column as a string so you will have to specify all the string when querying.
If you specify a string property then RaptorDB will do full text indexing on that column, so you can search for words within that column when querying.

BackgroundIndexing controls how the indexer does it's work on this view (i.e. block saves until each document is indexed when false).
AddFireOnTypes controls when this view is called based on the input document type
Mapper is the map function which will populate this view (i.e. extract information from the input document), you can add logic here if you need it. The order of the items must be the same as the schema you defined.
With api you can Fetch a document, log debug information and Query another view.

3. Registering Views

Registering a view is as simple as :

if (rap.RegisterView(new SalesInvoiceView()).OK == false)
{
    Console.WriteLine("Error registering view");
    return;
}

RaptorDB will do some checks on your view and if everything is fine it will return true, which means you are good to go.

4. Saving and querying data

Now you can use RaptorDB and save documents as follows :

var inv = new SalesInvoice()
{
    Date = FastDateTime.Now,
    Serial = i % 10000,
    CustomerName = "me " + i % 10,
    Status = (byte)(i % 4),
    Address = "df asd sdf asdf asdf"
};
inv.Items = new List<LineItem>();
for (int k = 0; k < 5; k++)
    inv.Items.Add(new LineItem()
         { Product = "prod " + k, Discount = 0, Price = 10 + k, QTY = 1 + k });

rap.Save(inv.ID, inv); // save to RaptorDB

Querying is as simple as writing LINQ predicates like the following:

var q = rap.Query(typeof(SalesInvoice), // call by the view type or the primary document type
                (SalesInvoice s) => (s.Serial < j) && (s.Status == 1 || s.Status == 3));

q = rap.Query("SalesItemRows",  // call by the view name
        (LineItem l) => (l.Product == "prod 1" || l.Product == "prod 3"));

As you can see you can call the query in 2 ways by specifying the type of the view (or type of the document type for primary views) or by calling the string name of the view.

Screen Shots

From the below image you can see the test application doing its work. RaptorDB was configured to do background indexing, so 100,000 documents were inserted in 12 secs and the primary view was populated (the query results for 500 items) and the background indexer is working on populating the other view defined, which after a couple of queries shows the final results of 50,000 items.

How It Works

The diagram above shows a high level view of how RaptorDB works, a document is first inserted into the storage file then immediately a primary map function is called to generate a primary view for the document at which point control returns to the calling method. After a timer has elapsed other map functions are called to generate other views for that document.
There some term which you must be aware of when using RaptorDB document data store, these are the following:

Document : is an object/entity which gets serialized as JSON
DocID : is a Guid which uniquely identifies a document and must be given with the document
View : is a like a standard database table
Map Function : is a function that takes a document and emits values from that document into a view
Index : is used to retrieve information when querying views.

What's a View?

A view is a list of values from a document which is bound to that document via a GUID property from that document. You can imagine a view as a 2 dimensional image of a multi dimensional document object usually created for a specific purpose.

For example if you have a invoice document one view would be a list of those invoices for the purpose of browsing them like:

{ invoice GUID, date, invoice number, status, salesman name }

Another view would be like below for the accounting department :

{ invoice GUID, date, total sales amount, total sales discounts, salesman name, customer name }

What's a map function?

A map function is a piece of code you write to take a document object and "emit" a list of values usually from that document (your are free to emit anything, but most times it will be somehow related to that document) to a "view".

What's a "Primary View" for?

A primary view is a view which will get it's map function called immediately after a document save and there is no delay. This is so you can get a list of those documents immediately and don't have to wait. To show the importance of this take the following example :

You have a sales application and sales persons add their invoices, you would want to see these invoices after a save for the simple reason that you want to start the sales workflow process on that item, so you have to see it and you don't want to wait for a map function to fire "eventually".

Obviously the map function should be minimal and only show what is needed, you can go overboard and emit a lot of data for this list but you would loose performance.

The Main Cast of Characters in RaptorDB

RaptorDB has the following parts:

RaptorDB main interface : what your code sees and the main loader of other parts.
Storage File : a fast and multi-threaded file storage mechanism used for both documents storage and view data.
View : see below
ViewManager : see below
ViewHandler : see below

ViewHandler

The View is the responsible for

Storing view data
Indexing columns based on query use

ViewManager

The View Manager is responsible for the following :

Loading and creating Views
Compiling view map functions.
Keeping track of mappings between the object types and views, or which view to send the object for inserting/updating.
Keeping track of last documents sent to views and the storage record numbers.
Periodic updating the views with new documents coming in

Query Executor

The query executor will do the following :

Parse the LINQ expression.
Check the column names exist in the view schema, validation error will be caught here.
Check to see if indexes exist for the columns used in the filter
Extract the bitmap indexes needed.
Extract the Deleted records bitmap.
Execute the filter and fetch the view rows from the view storage.

RaptorDB Setup

For RaptorDB to function you must follow these steps:

Define a primary view for objects which will allow immediate access to inserted objects via a list, inherited objects are supported for simplicity so you can define one for the base class.

The Save Process

The save process follows the following steps:

GetPrimaryListForType() : will try to get the primary map function for the object type and recurs the hierarchy until found.
SaveData() : saves the data to storage file, creates log entry in the log file and in memory log, the background indexer will process the log.
SavePrimaryView() : will call the primary map function for this type.
SaveInOtherViews() : will save to any other view that is defined.

The process will fail to save if any one of the steps fail.

Views and Map functions

A View is a list of data items from an object structure similar to a collection of rows in the table model. Map functions are responsible for taking an object and generating rows for views.
The premise behind map functions is that you precompute "queries" before-hand and just fetch the results when needed.
The map engine will do the following:

Find the last DocID processed and increment.
FindMapFunctionsForType() : a list of map functions to execute for this object

ExecuteMapFunction() : on the object and retrieve the new view rows with a reference to the DocID.
DeleteFromView(DocID) : flag delete the old values for DocID.
InsertView(newData) : add the new data from the map function to the view.

Find the next DocID and recurs the process.

The power of MGIndexs

Bitmap indexes are a form of index which stores the existence of a value in rows, in a series of bits. For example if you have a million records and you are searching for 'bob' in the Name column, then the b+tree for column Name might look like the picture below, and when you find the 'bob' in the leaf nodes you will get a BitArray representing the existence of 'bob' in that column indexed by record number. So if bit 3 of that BitArray is 1 then 'bob' is in the 3rd row etc. It is easy to see that this is extremely compact and efficient storage and retrieval mechanism.

* The above picture shows a b+tree/bitmap index structure not the MGIndex which uses a dictionary, but the principle is the same.

The real power is shown in the example below where you want to query for the following "Name='bob' and Code=between(1,3)", the query processor will take the filter, parse the values for that filter and generate an execution plan like below:

As you can see from the above diagram, the execution plan is a series of BitArrays and all you have to do to get the result is follow the bit arithmetic logic based on the parsed filter and {AND ,OR, NOT} the BitArrays together. These operations are typically in the sub millisecond range even for millions or rows in your database. The result is an position indexed row finder to the row contents which you read off disk, and send to the caller where 1 indicates read the row and 0 means skip the row contents (i.e. 100001... -> read record numbers :1,6,...)

Full Text Search

For string columns RaptorDB supports full text search for finding words in rows. To do this

RaptorDB

uses the technology built into hOOt the full text search engine.

On the cutting room floor

The following features were cut and might be incorporated in the next version:

Paging of results.
LINQ based aggregation (sum, count, ...).
Compression support in storage.
Usage statistics and monitoring information.
Query caching.
View schema changes.
Revision checking for documents

Closing Remarks

This is a work in progress, I will be happy if anyone wants to join in.

History

Initial release v1.0 : 29^th April 2012
Update v1.1 : 4^th May 2012

fulltext indexing via attribute
string query parser
fix shutdown flusing indexes to disk
rudementary console application
lowercase viewnames for string queries
fulltext search defaults to AND if + - characters not present in query
query now works when suppling the view type
save pauses indexer for better insert performance ~30% faster

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Mehdi Gholam

Architect -

United Kingdom

Mehdi first started programming when he was 8 on BBC+128k machine in 6512 processor language, after various hardware and software changes he eventually came across .net and c# which he has been using since v1.0.
He is formally educated as a system analyst Industrial engineer, but his programming passion continues.

* Mehdi is the 5th person to get 6 out of 7 Platinum's on Code-Project (13th Jan'12)
* Mehdi is the 3rd person to get 7 out of 7 Platinum's on Code-Project (26th Aug'16)

RaptorDB - the Document Store

Preface

Introduction

Why?

Why another database?

Possible Uses

How we use data

What is a Document Store database?

Foul ! the relational purists cry...

Is it for everybody?

Features

Limitations

The Competition

Performance Tests

Javascript vs .NET compiled map functions

Insert object performance

Query Performance

If you want to torture someone, make them write a LINQ Provider!

The quirks of LINQ parsing

How the LINQ parser works in RaptorDB

A sample application

1. Creating Entities

2. Creating Views

3. Registering Views

4. Saving and querying data

Screen Shots

How It Works

What's a View?

What's a map function?

What's a "Primary View" for?

The Main Cast of Characters in RaptorDB

ViewHandler

ViewManager

Query Executor

RaptorDB Setup

The Save Process

Views and Map functions

The power of MGIndexs

Full Text Search

On the cutting room floor

Closing Remarks

History

License

Comments and Discussions