Click here to Skip to main content
Click here to Skip to main content

Save Key/Value Pairs in a File

, 9 Oct 2009 GPL3
Rate this:
Please Sign up or sign in to vote.
A HashFile class to save key/value pairs in a file

Introduction

Every C# programmer might have used class “hashtable”. A hashtable is a mechanism to store and retrieve key/value pairs. Hashtable is efficient in many ways and data is stored in the RAM that makes it extremely fast. What if the number of key/value pairs is very large? Alternatives are windows registry or a relational database table. Obviously one may not wish to store a large amount of application data in the Windows registry and databases are resource hungry and expensive.

HashFile class is an attempt to find an answer to the problem.

HashFile Class

HashFile is a technique to insert/delete/find key value pairs in a file in the hard disk. It uses fixed length records to store data and index. There are two files “name.dat” and “name.idx” corresponding to a "name". "name" is the identification for key/value pairs, and is given at the time of initialization.

"name.dat

This is the data file. The Data file stores fixed length records of key/value pairs. Name of key/value pair, length of key and value should be specified at the time of initialization. Maximum size of the data file could stretch up to 2^32 bytes, which is the maximum value of unsigned integer (uint32).

name.idx

This is the index file. It stores fixed length records of record pointer and index pointer. Record pointer and index pointer are unsigned integers.

Indexing Technique

It uses Index Block (IB) of 91 x 2 Matrix to store record and index pointers. When Hashfile is initialized for the first time, an empty IB will be created and every element will be filled with max value of data type uint. Why 91 rows? 91 is the number of characters in the ASCII range 32-122. Each row in the IB represents an ASCII character. Columns of the IB are record pointer and index pointer. Data types of columns are unsigned integers. Record pointer represents the starting byte index number of a data record with respect to the origin, similarly index pointer represents starting byte index number of an Index record with respect to the origin.

Pseudo code for InsertKey method is given below:

  1. Take a byte from the new key
  2. If byte is within range 32-122, then jump to the matching index record within an IB that corresponds to distance from the left most byte, otherwise return error
  3. Read record pointer and index pointer
  4. If index pointer is equal to maximum value of data type uint, then, save record pointer and index pointer, save key and value in the data record and return record pointer
  5. Otherwise read data record for the record pointer
  6. Compare new key and existing key
  7. If new key matches existing key in the data record, then go to step 10
  8. If new key is greater than existing key, then go to step 11
  9. Otherwise step 13
  10. If the status of the data record is deleted then, save new value, change deleted flag to false, return record pointer, otherwise return error
  11. If index pointer points to next IB, take next byte and jump to step 2
  12. Otherwise create new IB, save record pointer and index pointer in the new IB, assign start index pointer of new IB to the old IB, save key and value in the data record, return record pointer
  13. If index pointer points to next IB, assign new record pointer to the index record, swap new and existing keys jump to step 2
  14. Otherwise assign new record pointer to the index record, create new IB, assign existing record pointer, new index pointer to new IB, assign start index pointer of new IB to the old IB, save key and value in the data record, return record pointer

Sounds bit fuzzy, isn't it? But the technique seems to work in the test environment. As new keys are added, indexing mechanism will attempt to place record pointer at a minimum distance as possible from the root IB. Binary tree search algorithm eliminates items to be searched by half after each branch, whereas the above described technique eliminates probably about 90/91th of the outstanding records after every IB assuming that population is well distributed. Test results show that average IB reads to find a record is about 6.2, in a population of 528,461 keys, which sounds good. See the test results section for details.

Using the HashFile Class

// Instantiate class
HashFile.HashFile hf = new HashFile.HashFile();
 
// Initialize Key & Value attributes
hf.Initialize("Test1",50,100);
 
// “test1” is the indentification
// 50 is maximum length of key
// 100 is maximum length of Value

// Call  InsertKey
//
String key = “Microsoft”;
String Value = “Software company”;
int success = hf.InsertKey(key,Value,false)
 
// Call FindKey
String key = “Microsoft”;
uint success = hf.FindKey(key)

// Call DeleteKey
String key = “Microsoft”;
uint success = hf.DeleteKey(key)

// Call GetFirstKey
 int Success = hf.GetFirstKey();

// Call GetNextKey
 int Success = hf.GetNextKey();

// Call GetPrevKey
 int Success = hf.GetPrevKey();

Test code is hashfileTest.cs.

A Method to Test HashFile.InsertKey

public static void InsertKeys()
{
    HashFile.HashFile HashFileTest1 = new HashFile.HashFile();
    HashFileTest1.Initialize("Test1",50,50);
 
    //StreamReader StreamReaderTestKeys = File.OpenText
    //(@"D:\Documents and Settings\All Users\Documents\vs\
    //EmailIDTEst\EmailIDTEst\bin\Debug\Keys.txt");
    StreamReader StreamReaderTestKeys = File.OpenText
	(@"D:\Documents and Settings\All Users\Documents\vs\EmailIDTEst\
	EmailIDTEst\bin\Debug\Keys1.txt");
 
    Stopwatch StopWatchtest1 = new Stopwatch();
 
    int CountofTestKeys = 0;
 
    Console.WriteLine("Please Wait...");
 
    string TestKeyRecord = StreamReaderTestKeys.ReadLine();
    while (TestKeyRecord != null)
    {
        string[] TestKeyFields = TestKeyRecord.Split(' ',',');
        for (int i = 0; i < TestKeyFields.Length; i++)
        {
            if (TestKeyFields[i].Contains("@"))
            {
                StopWatchtest1.Start();
                int success = HashFileTest1.InsertKey
                       	(TestKeyFields[i].Trim(),CountofTestKeys.ToString(),
                        		false); // Call to HashFile.InsertKey
                StopWatchtest1.Stop();
 
                if (success < 1) // Failure to insert key
                    Console.WriteLine(TestKeyFields[i].Trim() + "," + success );
 
                CountofTestKeys++;
            }
        }
        TestKeyRecord = StreamReaderTestKeys.ReadLine();
    }
 
    StreamReaderTestKeys.Close();
    Console.WriteLine("Records=" + CountofTestKeys + ",Ticks=" +
            	StopWatchtest1.ElapsedTicks + ",Frequency=" + Stopwatch.Frequency +
            	",Average Write Time(ms)=" + (double)StopWatchtest1.ElapsedTicks /
            	(double)Stopwatch.Frequency * 1000 / (double)CountofTestKeys + ",
            	Avg Reads=" + (double)HashFileTest1.count / (double)CountofTestKeys);
}

Test Results

InsertKey

Keys=528461,Ticks=1022397419386, Frequency=2992560000, Average Write Time(milliseconds)=0.646493162076644, Avg Reads=5.77592859264922

Keys=8125, Ticks=10044699390, Frequency=2992560000, Average Write Time(milliseconds)=0.413114755979444, Avg Reads=5.50141538461538

FindKey

Keys=528461,Ticks=201771307445, Frequency=2992560000, Average Read Time(milliseconds)=0.127586169617676, Avg Reads=6.14659738372368

Keys=8125, Ticks=2945715784, Frequency=2992560000, Average Read Time(milliseconds)=0.121150331139174, Avg Reads=5.48246153846154

Remarks

The version of the class released is just a prototype, which obviously means that this software should not be used in the production environment. Any losses arising out of error or bugs in this version of the software are the sole responsibility of the end user.

Any suggestions to improve the HashFile class is most welcome.

History

  • 4th October, 2009: Initial version

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

Share

About the Author

Bizken
Software Developer Bizken Pty Ltd
Australia Australia
Rajesh Kayakkal is a software developer in Brisbane Australia

Comments and Discussions

 
GeneralHi PinmemberAnil Srivastava24-Jan-10 21:03 
GeneralRe: Looks interesting PinmemberBizken16-Oct-09 2:30 
GeneralLooks Interesting... [modified] PinmemberReelix12-Oct-09 20:25 
From someone who has never dealt with HashTables before, I must say that this looks like a rather interesting method for storing pair values!
 
One thing I must say though, is that your "Test Results" were a bit vague.
 
It could be simplified to be like:
 
InsertKey
 
Keys Added: x
Time Taken To Insert All Keys: y ms
 

FindKey
 
Total Keys: x
Time Taken To Find Last Key: y ms
 

Possibly compare it to SQL SELECT statement speed on a large (10,000 / 100,000 / 1,000,000) amount of records.
 
Anyways, great article!
 
Time to try out some benchmarking of my own Smile | :)
 
-- Edit --
 
With the attached example, you should use relative paths, as not everyone has:
 
D:\Documents and Settings\All Users\Documents\vs\EmailIDTEst\EmailIDTEst\bin\Debug\keys1.txt
 
Poke tongue | ;-P
 
-- Edit 2 --
 
A replace for HashFileTest.cs
 
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using HashFile;
using System.Diagnostics;
 
namespace HashFileTest1
{
    class Program
    {
        public static HashFile.HashFile HashFileTest1 = new HashFile.HashFile();
        static void Main(string[] args)
        {
            InsertKeys();
            FindKeys();
 
            Console.ReadLine();
        }
 
        public static void InsertKeys()
        {
            HashFileTest1.Initialize("Test1", 1, 1); // Not sure what this does, but changing it to 1,1 instead of 50,50 makes adding ALOT faster!
            Stopwatch StopWatchtest1 = new Stopwatch();
            Console.WriteLine("Adding 50000 keys - Please wait...");
            for (int i = 0; i < 50000; i++)
            {
                {
                    StopWatchtest1.Start();
                    HashFileTest1.InsertKey(i.ToString(), (i+5).ToString(), false); // Call to HashFile.InsertKey
                    StopWatchtest1.Stop();
                }
            }
            Console.WriteLine("50000 keys added in " + StopWatchtest1.ElapsedMilliseconds + "ms.");
        }
 
        public static void FindKeys()
        {
            Console.WriteLine("Please Wait...");
            Stopwatch StopWatchtest1 = new Stopwatch();
            string TestKeyRecord = "49995";
            //while (TestKeyRecord != null)
            {
                //for (int i = 0; i < HashFileTest1.Count; i++)
                {
                    //Console.WriteLine("Searching key " + i.ToString() + " of " + HashFileTest1.Count);
                    //Console.ReadLine();
                    StopWatchtest1.Start();
                    uint success = HashFileTest1.FindKey(TestKeyRecord); // Call to HashFile.FindKey
                    StopWatchtest1.Stop();
 
                    if (success == uint.MaxValue) // Failure to find a key
                    {
                        Console.WriteLine("Failed.");
                    }
                    else
                    {
                        Console.WriteLine("Found: " + HashFileTest1.Key + " = " + HashFileTest1.Value + " in " + StopWatchtest1.ElapsedMilliseconds + "ms.");
                      Console.ReadLine();
                    }
                }
            }
 
            Console.ReadLine();
        }
    }
}
 
Output:
 
Adding 50000 keys - Please wait...
50000 keys added in 49ms.
Please Wait...
Found: 49995 = 50000 in 1ms.
 
-- Edit 3 --
 
Using high amounts with my above example produces some weird results:
 
Adding 70000 keys - Please wait...
70000 keys added in 67ms.
Please Wait...
Found: 69995 = 700♣6 in 1ms.
 
Key 69995 should be 70000, not 700♣6 (Not sure where the Spade came from :p)
 
-- Edit 4 --
 
uint can only support up to 65535 :p
Takes too long to mass convert everything to a higher datatype, but, for the record, you can add Approx. 5 million records in 5 seconds Smile | :)
 
-= Reelix =-
modified on Tuesday, October 13, 2009 3:49 AM

GeneralRe: Looks Interesting... Pinmemberdave.dolan18-Sep-11 16:07 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.141030.1 | Last Updated 9 Oct 2009
Article Copyright 2009 by Bizken
Everything else Copyright © CodeProject, 1999-2014
Layout: fixed | fluid