Click here to Skip to main content
Click here to Skip to main content
 

RaptorDB

, 10 Jul 2011 CPOL
Rate this:
Please Sign up or sign in to vote.
Smallest, fastest embedded nosql persisted dictionary using b+tree or MurMur hash indexing.
This is an old version of the currently published article.
output.png

Introduction

RaptorDB is a extremely small size and fast embedded, noSql, persisted dictionary database using b+tree or MurMur hash indexing. It was primarily designed to store JSON data (see my fastJSON implementation), but can store any type of data that you give it. RaptorDB was inspired by the work done by Aaron Watters found at (http://sourceforge.net/projects/bplusdotnet/) although originally I used that code but I found it was too slow, memory consumption was excessive and the code was very hard to follow, so I rewrote the b+tree algorithm from scratch from what I could gather from the internet. Bear in mind that I'm not from a computer science background, so I had to go through a crash course in indexing techniques.

Interestingly, there are very few good resources on basic algorithms on the internet and most are from an academic perspective which does little for you when you want to make a practical application, but that's a different matter and it seems that they are "dark arts" reserved for RDMB vendors.

The name RaptorDB comes from a seeming tradition of naming databases after animals (HamsterDB, RavenDB, ...) but with a more sexy sounding name reflecting the performance attributes.

What Is It?

Here is a brief overview of all the terms used to describe RaptorDB:

  • Embedded: You can use RaptorDB inside your application as you would any other DLL, and you don't need to install services or run external programs.
  • NoSQL: A grass roots movement to replace relational databases with more relevant and specialized storage systems to the application in question. These systems are usually designed for performance.
  • Persisted: Any changes made are stored on hard disk, so you never lose data on power outages or crashes.
  • Dictionary: A key/value storage system much like the implementation in .NET.
  • B+Tree: The cornerstone to all database indexing techniques, see the link for an in depth description on this topic (http://en.wikipedia.org/wiki/B%2B_tree) for my implementation, see the how it works sections. It has a BigO order of O(logN).
  • MurMurHash: A non cryptographic hash function created by Austin Appleby in 2008 (http://en.wikipedia.org/wiki/MurmurHash). It has a BigO order of O(1).

Some Possible Uses...

  • High performance persisted message queue: A message queue with no dependencies and minimal overhead.
  • Document database: Store and retrieve document files.
  • Web site asset storage: Store graphic images and web content in one file.
  • High volume web site data storage: Store user information instantly without storage limitations and added RDMB overhead / installation / running costs.
  • JSON object storage: Basis for a mongodb or couchdb style database.
  • Web site session state storage: Store your web session info in a fast storage systems instead of SQL servers.
  • Single file file system storage: If you have a lot of small files, then RaptorDB could optimize the storage for those files with less file system storage waste and faster access.

Why Another Database?

The main reason for RaptorDB was the fact that there was no pure .NET implementation of a persisted dictionary which met the performance requirements, and regular databases like Microsoft SQL Server, MySql, etc. are dismally slow, and bloated for the job of storing JSON data. I even tried DBF files but most implementation in .NET lacked full memo support. Others like sqlite while having fast insert times lack multi thread support and lock databases on inserts.

RaptorDB was envisioned as a replacement storage system to RDBM storage for my own framework which has had a document centric design since 2003 and was built on RDBM storage at the time. The relational storage systems were extremely slow at storing BLOB data.

Features

RaptorDB has been built with the following features in mind:

  • Blazing fast insert speed ~ 80% hard disk speed (see performance test section)
  • Blazing fast retrieval speed
  • Store huge amounts of data
  • Minimum index files size
  • Background indexing of data
  • Embeddable in larger application
  • No installation required
  • Work on .NET 2.0 and up
  • Multi threaded insert support
  • Small as possible code base ~ 40KB DLL file
  • No dependencies
  • Crash resistant, fail safe and recoverable
  • No shutdown required
  • Add only, so data integrity is maintained and historical/duplicate values are supported
  • Immediate flush of data to disk to ensure data integrity (it is possible to get more throughput if you are willing to sacrifice integrity with buffered output)
  • Count() function implemented for index files.
  • InMemoryIndex property and SaveIndex() function implementation for in memory indexes and ability to save indexes to disk on demand.  
  • EnumerateStorageFile()  using yield to traverse the storage file contents into a KeyValuePair<byte[], byte[]> (key and contents). 
  • Support for duplicate keys via GetDuplicates() and FetchDuplicate().
  • FreeCacheOnCommit property to tweak memory usage on internal commit of the index contents. 
  • Read operations use separate Stream and is now multi-threaded.
  • As of v1.5 index file formats are not backward compatible.
  • Internal storage is based on record numbers of int instead of long file offsets, so you get ~50% memory and index file size reduction.

Limitations   

The following limitations are in this release:  

  • Key size is limited to 255 bytes or equivalent in string length (ASCII vs. Unicode sizes)
  • Session Commit / Rollback: This has been deferred to later because of multi thread issues.
  • Compression: Support is built into the record headers but has not been implemented.
  • Removing items is not supported and the database is append only (much like couchdb).
  • Log file count is hard coded to 1,000,000: which equates to 20 billion inserts in a continuous run (restarting will reset count to 0 for another 20 billion).
  • External recovery program

The Competition

There are a number competing storage systems, a few of which I have looked at are below:

  • HamsterDB: A delightful engine written in C++, which impressed me a lot for its speed while I was using Aarons Watters code for indexing. (RaptorDB eats it alive now... ahem!) It's quite large at 600KB for the 64bit edition.
  • Esent PersistentDictionary: A project on CodePlex which is part of a another project which implements a managed wrapper over the built in Windows esent data storage engine. The dictionary performance goes down exponentially after 40,000 items indexed and the index file just grows on guid keys. Apparently after talks with the project owners, it's a known issue at the moment.
  • Tokyo/Kyoto Cabinet: A C++ implementation of key store which is very fast. Tokyo cabinet is a b+tree indexer while Kyoto cabinet is a MurMur2 hash indexer.
  • 4aTech Dictionary: This is another article on CodeProject which does the same thing, the commercial version at the web site is huge (450KB) and fails dismally performance wise on guid keys after 50,000 items indexed.
  • BerkeleyDB: The grand daddy of all database which is owned by Oracle and comes in 3 flavours, C++ key store, Java key store and XML database.

Performance Tests

myharddisk.png

Tests were performed on my notebook computer with the following specifications: AMD K625 1.5Ghz, 4Gb Ram DDRII, Windows 7 Home Premium 64bit, Win Index 3.9 (the above picture is a screenshot of the data transfer timings for my hard disk type which is a WD 5400 rpm drive):

Tests Insert Time Index Time B+tree Index Time MurMur Hash
First 1 million inserts 21 sec 61 sec 81 sec
Next 1 million inserts 21 sec 159 sec 142 sec

As a general rule of thumb (on my test machine at least), b+tree indexing time increases linearly by 1 second per 600,000 keys indexed. MurMur hashing is consistent for every 1 million items and maxes out at 4 sec mostly because of writing all the buckets to disk which is around 60MB for the default values.

Performance Test v1.1

With the use of in memory indexes you can achieve the following performance statistics at the expense of memory usage for caching the indexes in RAM:
Tests Insert Time Index Time B+tree Index Time MurMur Hash
First 1 million inserts 21 sec

8.1 sec,

SaveIndex() = 0.9 sec

4.3 sec,

SaveIndex() = 1.1 sec

Next 1 million inserts 21 sec

8.2 sec,

SaveIndex() = 1.5 sec

6.8 sec,

SaveIndex() = 1.4 sec


As you can see the in-memory index is very fast in comparison to the disk based index, as everything is done in memory and there is no flushing to disk for every MaxItemsBeforeIndexing items indexed. You still have the ability to save the index to disk and this is very fast also as the whole index is written at once. The down side to all this is the memory overhead for keeping all the index in RAM.

Using the Code

Using the code is straight forward like the following example:

RaptorDB.RaptorDB rap = RaptorDB.RaptorDB.Open("docs\\data.ext", 16,true, INDEXTYPE.HASH);
Guid g = Guid.NewGuid();
rap.Set(g, new byte[]{1,2,3,4,5,6,7,8,9,0});
byte[] bytes=null; 
if(rap.Get(g, out bytes)) 
{ 
    //data found and in bytes array
} 

The Open method takes the path for the data and index files (it will use the filename minus the extension for the data, index and log file names and create the directory if needed), the maximum key size (16 in the example for GUID keys), a Boolean to indicate allowing of duplicate key values, and the indexing method which is BTREE of HASH.

Using the Code v1.1

InMemoryIndex and SaveIndex

RaptorDB.RaptorDB rap = RaptorDB.RaptorDB.Open("docs\\data.ext", 16,true, INDEXTYPE.BTREE);
rap.InMemoryIndex = true;

// do some work

rap.SaveIndex(); // will save the index in memory to disk

How It Works

block_diagram.png

RaptorDB has two main parts; the in memory burst insert handler and the background indexer. When you start inserting values into RaptorDB, the system will instantly write the data to the storage file and to a numbered log file. This ensures that the data is consistent and safe from crashes, the log is kept in memory which is essentially a key and file pointer to the storage file for fast in memory access to data even when the indexer has not started. After MaxItemsBeforeIndexing count of items in the log file, a new log file is created and the old one is given to the background indexer queue.

The background indexer will start indexing the log file contents using the method you specify and when done will free the log file contents from memory. This design ensures maximum performance and a way round concurrency issues regarding index updates which is a bane for b+tree indexers (there is only one thread updating the index).

Searches for keys will first hit the current log file in memory, second the queued log files in memory and lastly the disk based index file.

Tweaking

There are a few parameters which can be tweaked to your individual requirements, these include the following:

  • DEFAULTNODESIZE: This is the numbers of items in each index page, the default is 200.
  • BUCKETCOUNT: This is a prime number which controls the number of buckets for MurMur hashing, the default is 10007.
  • MaxItemsBeforeIndexing: The number of items in each log file and the number of items in memory before swapping to new log file and indexing starts. The default is 20,000.

As a general rule of thumb, if you have less than 1 million items (with the above default values), then go for the b+tree index as it is faster. If you have more than 2 million items, then MurMur hashing is faster and indexing time is constant within every 1 million records indexed.

B+tree

For b+tree indexes, the only parameter to tweak is the DEFAULTNODESIZE, the more the value the less the overall height of the tree from the root to the leaf values which requires fewer seeks to the hard disk and page fetches, this increases the page size on disk and memory usage.

MurMur Hash

For MurMur indexes, you have two parameters to tweak DEFAULTNODESIZE and BUCKETCOUNT. The multiplication of these values gives you the maximum data count that is supported with direct indexing (additional data will overflow into new buckets and would require another disk seek). Again, DEFAULTNODESIZE will determine the page disk size.

Index File Sizes

Index files are proportional to the number of pages in the index. You can compute the page size from the following formula:

Page size = 23 + ( (max key size + 17) * items in page) ~ 23+(33*200) =6,
623 bytes [ for16 byte keys ] 
File header = 19 + ( 8 * bucketcount) ~ 19+(8*10007) = 80,075 bytes

For b+tree indexes, leaf nodes/pages are statistically 70-80% full, and inner leaf nodes are about 30-50% full which averages out to about 75% full overall. So you would require about 25% more nodes/pages than your data count.

For MurMur hashing, the index file is allocated for the max number of data items anyway and overflows into additional buckets when you go over the limit. (e.g. 200*10007 ~ 2million data items limit which equates to 10007 * 6.6kb ~ 67Mb index file).

Memory Usage

Burst inserting 1million records takes about 80Mb of memory for internal log file cache, that's 1million * (16byte key + 8 byte offset) + .NET dictionary overhead.

MurMur hash indexing will take about MaxItemsBeforeIndexing * [page size] + .NET overhead of memory + [overflow pages] * [page size] which for default values will be around 120Mb if the pages are full.

B+tree indexing memory usage will grow as the size of data increases and is hard to quantify but as a worse case MaxItemsBeforeIndexing * [page size]* (1+0.25 +0.1) + .NET overhead.

What About Failures?

There are 4 types of failures:

  1. Non clean shutdown in the middle of indexing (power failure / crash): In the case of b+tree indexing, the system will automatically recover without a problem if the failure was before writing to disk, otherwise the external recovery program is required. For MurMur hashing system crashes have no effect on indexing and the system is automatically recoverable.
  2. Index file corruption/deletion: You lose access to your data but the data exists. Index files can be rebuilt from the data file via an external recovery program which scans the data file for rows and creates log files.
  3. Log file corruption/deletion: You lose access to recently added data for which indexing had not been done on yet. You can extract the log files again via an external recovery program.
  4. Data file corruption: You should have backed up your data! This scenario is extremely rare as data is not overwritten and only appended so at most you might lose the last record inserted but the external recovery program can scan the data file and restore what is valid.

Data is automatically flushed on inserts so the above scenarios are rare, but it does give piece of mind to know that RaptorDB is recoverable.

My Rant

Ever since I started writing articles for CodeProject, I was amazed at how much of the code that we write is bloated and disproportionately large for the task at hand, cases in point are this article, my previous fastJSON article and even the mini log4net replacement. I would push this further and take a swipe at RDBM vendors and ask the question why when I can perform at 80% of my hard disk speed on a managed run-time language, their products have such dismal performance in comparison on native code? It could be argued that they do more, and they certainly do, but the question remains that they can do better, with less code and smaller executable sizes. I understand the position of developers in that they have to get the job done as fast as possible with managers and customers breathing down their necks, and they take the easiest path to ship things out in the least amount of time, to the detriment of quality, speed and size.

I propose that we, as developers, take a little pride in what we do and create / write better and more optimized programs, and in the short and long term better everyone in the process.

vNext, Future Work and Call to Arms

I'm planning to implement the following features in the general directions of a full fledged document database in the realm of mongodb and couchdb.

  • Query-able content via Map/Reduce or similar functionality
  • Column/bitmap orientated indexes for fast queries on huge amounts of data

So if any one is interested in the above and is up to the challenge of writing non bloated and highly optimized code, drop me a line.

Closing Words

Most times, we have to see what is possible to believe, so show your love and vote 5.

History

  • Initial release v1.0: 3rd May, 2011
  • Update v1.1 : 23rd May, 2011
    • Bug fix data row header read
    • Count() function
    • InMemoryIndex and SaveIndex()
    • Restructured project files and folders
    • SafeDictionary bug fix on byte[] keys in logfile.cs -> .net Dictionary can't handle byte[] keys, problem with GetHashCode()
  • Update v1.2 : 29th May, 2011
    • EnumerateStorageFile() : keyvaluepair -> using yield
    • GetDuplicates(key)
    • FetchDuplicate(offset) 
    • FreeCacheOnCommit property
  • Update v1.3 : 1st June 2011
    • Thanks to cmschick for testing the string,string version
    • bug fix CheckHeader() in StorageFile.cs
    • bug fix Get() in LogFile.cs
  • Update v1.4 : 13th June 2011
    • added reverse to short, int, long bytes (big endian) so they sort properly in byte compare (in ascending order)
    • read operation on data file uses another stream 
    • bug fix duplicate page not set in keypointer
  • Update v1.4.1 : 17th June 2011 
    • bug fix a silly typo in helper.compare thanks to eagle eyed Nicolas Santini
  • Update v1.5 : 22nd June 2011
    • changed long to int for smaller index size and lower memory usage, you should get 50% saving in both
    • all internal operations are on record numbers instead of file offsets
    • added a new file (mgrec extension) for mapping record numbers (int) to file offset (long)  
    • ** index files are not backward compatible with pre v1.5 **
    • bug in unsafe bitconvert routines, using ms versions until workaround found, thanks to Dave Dolan for testing
  • Update v1.5.1 : 28th June 2011
    • unsafe bitconvert fixed
    • speed optimized storage write to disk, removed slow file.seek and file.length (amazingly moving from long to int took a ~50% performance hit, now back to ~5% to the original code speed)
  • Update v1.5.2 : 10th July 2011
    • bug fix int seek overflow in index file
    • changed to flush(true) for all files thanks to atlaste
    • bug fix hash duplicate code

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Mehdi Gholam
Architect
United Kingdom United Kingdom
Mehdi first started programming when he was 8 on BBC+128k machine in 6512 processor language, after various hardware and software changes he eventually came across .net and c# which he has been using since v1.0.
He is formally educated as a system analyst Industrial engineer, but his programming passion continues.
 
* Mehdi is the 5th person to get 6 out of 7 Platinums on CodeProject (13th Jan'12)

Comments and Discussions


Discussions posted for the Published version of this article. Posting a message here will take you to the publicly available article in order to continue your conversation in public.
 
GeneralAmazing work! Pinmembertutukakatu3-Jan-12 3:08 
GeneralRe: Amazing work! PinmvpMehdi Gholam19-Jan-12 3:31 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web04 | 2.8.141223.1 | Last Updated 10 Jul 2011
Article Copyright 2011 by Mehdi Gholam
Everything else Copyright © CodeProject, 1999-2014
Layout: fixed | fluid