Persistant Hash Table: Part 2

Dharmateja Challa

0/5 (0 vote)

Oct 28, 2011

CPOL

3 min read

17294

644

This is a continuation article for a key value store.

Introduction

As a follow up article for the first one, I have improved upon the performance but still I have kept the record size and bucket sizes constants, I have implemented through “generalization”,” function objects” and “ shared memory”.

For Whom This Will be Useful…

For those who are new to shared memory and how it increases performance, this will illustrate how performance can be increased by keeping file I/O to a minimum. This will also be useful to those who are new to “function objects” and templates.

How It Is Done this Time…

I have used a simple logic to construct the persistent hash table. If we define number of buckets, their size and record size, it is easy to store and retrieve. First a file with enough storage to store all buckets with its records is created. That file is memory mapped. When we want to write a record or read a record key is passed to the hash function, this hash function is a user supplied function object with the hash value the bucket offset is derived. Then we need to jump to that base pointer of memory mapped file + offset and write the record or read the record. Before doing that, a check is made if a record is already present in that location. If present, then we jump to the next record and then that record is written. After reading and writing records, memory is flushed to the disk file. At a later point, if we need to read, a record db can be opened and then with the key supplied, that record can be retrieved back. Here in this implementation, dependency on “fseek()” is minimized to initial file creation. This is one step towards large file manipulation. This implementation can be improved to map one portion of a file instead of the entire file.

Implementation Details…

“CSharedMemory” class is a thin wrapper around shared memory APIs. That is a simple use of shared memory hence not going into the details. CHash is the generalized class for constructing different type of objects to create different tables.

namespace persistant_hash
{
	template<typename key_type,typename value_type,typename hash,typename size>
	class CHash
	{
	private:
		long m_nbuckets;
		long m_bucketsize;
		long m_recsize;
		CSharedMemory *m_psm;
		CRITICAL_SECTION cs;
		PBYTE m_pb;
		//unsigned long hash(long key);
		hash m_hashobj;
		size m_sizeobj;
		long writeoffset(long offset,void *p,long size);
		long readoffset(long offset,void *p,long size);
		long GetbucketOffset(long buckno);
		public:
                //pass how many number of buckets to construct.
                CHash(long nBuck,long nRecinbucket,
                hash ho,size so):m_nbuckets(nBuck),m_hashobj(ho),m_sizeobj(so) 
		{
			m_recsize = so();
			m_bucketsize = nRecinbucket * m_recsize;
			InitializeCriticalSection(&cs);
		}
		long createdb(wchar_t *p);
		long opendb(wchar_t *p);
		long closedb();
		long writerec(key_type k,value_type* pd);
		long readrec(key_type k,value_type* pd,long &offset);
		long readnextrec(long offset,value_type* pd);
			
	};
}

key_type: is the type of key with which table needs to be created
value_type: is the type of the value which is to be paired up in the key-value store.
hash is the function object type which user needs to supply on which entire table is dependent
size is another function object type which needs to be supplied by the user to maintain record size with which file size is calculated later.

I am explaining a few functions here. createdb creates a file with fixed size and it opens the shared memory and maps the entire file and the base pointer is stored in m_pb.

template<typename key_type,typename value_type,typename hash,typename size>
long CHash<key_type,value_type,hash,size>:: createdb(wchar_t *p)
{
	wstring name = p;
	name = name + L".hdb";
	long filesize = m_bucketsize * m_nbuckets;
	//create that much size file
	FILE *fp = NULL;
	_wfopen_s(&fp,name.c_str(),L"w");
	fseek(fp,filesize,0);
	fprintf(fp,"%d",7);
	fclose(fp);
	m_psm = new CSharedMemory();
	m_psm->createsm(name,&m_pb);
	return 0;
};

Writeoffset writes into the offset location from the basepointer:

template<typename key_type,typename value_type,typename hash,typename size>
long CHash<key_type,value_type,hash,size>::writeoffset(long offset,void *p,long size)
{
	EnterCriticalSection(&cs);
	memcpy((m_pb + offset),p,size);
	LeaveCriticalSection(&cs);
	return 0;
}

Writerec writes record at the specified offset, if already some record is present then it writes in the next record, here it calls the function object to get the hash first before it sends the same to get the bucket offset.

template<typename key_type,typename value_type,typename hash,
typename size>
long CHash<key_type,value_type,hash,size>::
writerec(key_type k,value_type *pd)
{
	long buckoffset = GetbucketOffset(m_hashobj(k));
    void *temp = malloc(m_recsize);
	while(0 != readoffset(buckoffset,temp,m_recsize))
	{
		buckoffset += m_recsize;//go to the next record
	}
	free(temp);
	writeoffset(buckoffset,pd,m_recsize);
	return 0;
}

How to Use this Class…

Need to include both the files, and use the namespace persistant_hash:

#include"CHash.h"
#include"CHash.cpp"
using namespace persistant_hash;

Need to write two function objects, this is where the customization comes, and to include your high performance hashing function. :)

class myhash
{
public:
	long operator()(long key)
	{
		return (key % NUMBUCK);
	}
};

And your sizing function goes here:

class size
{
	public:
	long operator() ()
	{
		return sizeof(fixedrec);
	}
};

How to Instantiate…

myhash hashobj;
size sizeobj;
//total of 200 * 50 records can be stored
CHash<long,fixedrec,myhash,size> hash2(200,50,hashobj,sizeobj); 
hash2.createdb(L"second");

Bye for Now…

I hope this will be of some help to some beginners. I will add on to this series until it becomes production grade…