Introduction
After searching the internet for a .NET implementation for WAH compressed BitArray and not finding one, I decided to write and publish an article here so the .NET community are not left out as all the available implementations are in the Java language. This ties into a pretty advanced topic of bitmap indexing techniques for databases and I needed this for my RaptorDB Document data-store database engine.
What Is It?
A BitArray
is a data structure in the .NET library for storing true
/false
or bits of data in a compact form within an array of Int32
objects. A WAH or word-aligned hybrid BitArray
is a special run length compressed version of a BitArray
which saves a lot of space and memory. All the implementations that exist in Java essentially duplicate the functionality of a BitSet
, that is the AND
, OR
, NOT
, and XOR
operations with the compressed internal storage format.
In my implementation, I defer the functionality of the BitArray
to itself and just add compression and decompression routines. This is much faster than the Java way at the expense of memory usage. To overcome this, I have also added a FreeMemory
method to release the BitArray
contents and keep the compressed contents. Arguably, if you are using 100 million bits, then a full implementation is more performant than my implementation, but for most of our Use Cases, we are at most in the 10 millions of bits range.
This original method was invented at the Berkeley Labs of US Department of Energy; it is a project named FastBit and is used for high energy physics department experiments; you can see it here: FastBit site.
Why Should I Care?
So what?, you ask. Well, as mentioned before, BitArray
s are used in an indexing technique called bitmap indexes (Wiki) and column based databases which store data in columns instead of rows. An example which you might know is Microsoft's PowerPivot for Excel which can process millions of rows in seconds. Interestingly, Microsoft has only recently announced the implementation of bitmap indexes in the upcoming SQL Server platform, post 2008 R2. It has long been in use by other RDBM vendors like Oracle.
How it Works
The WAH compression algorithm is as follows:
- Take 31 bits from the array.
- If all zeros, then increment the zero count by 31.
- if ones count > 0, then output 32 bits with bit 32 =1 and bit 31=1 and 0-30 = ones count.
- If all ones, then increment the ones count by 31.
- If zeros count >0, then output 32 bits with bit 32=1 and bit 31=0 and 0-30 = zeros count.
- Else output the 31 bits as 32 bits with bit 32 = 0.
- If zeros or ones count >0, then output as above.
From the above, in the worst case, you will get N/31 more bits encoded or about 3% increase in size to the original.
What You Get
WAHBitArray
is essentially the same as the standard BitArray
in the .NET Framework, with the following additions:
FreeMemory()
: This will first compress the internal BitArray
then free the memory it uses.
GetCompressed()
: This will compress the current BitArray
then return it as an array of uint
.
CountOnes()
, CountZeros()
: will count the respective bits in the array.
GetBitIndexes(bool)
: will return an enumeration using yield
of the respective bit position; for example, if the bit array contains 10001... this will return integers 0,4,... if the bool
parameter was true
, and 1,2,3,... if bool was false
.
Get()
, Set()
: Methods implemented with auto resizing and no exceptions.
DebugPrint()
: generates a string
of 1 and 0 values.
Using the Code
To create a WAHBitArray
, you can do the following:
WAHBitArray ba1 = new WAHBitArray(1000000); WAHBitArray ba2 = new WAHBitArray(new BitArray(2000000));
WAHBitArray ba3 = new WAHBitArray(1000000, new uint[] { });
To perform operations, you can do the following:
WAHBitArray result1 = ba1.And( ba2);
WAHBitArray result2 = ba1.Or( ba3);
WAHBitArray result3 = ba1.Xor( ba2);
long count = result1.CountOnes();
result2.Set(2000000, true); bool b = result2.Get(3000000);
foreach(int pos in result2.GetBitIndexes(true))
{
ReadDatabaseRecordNumber(pos); }
Using the Code v1.3
As of version 1.3, you don't need to specify the size of the BitArray
and all operations will auto resize as needed.
WAHBitArray ba1 = new WAHBitArray(); WAHBitArray ba3 = new WAHBitArray(new uint[] { });
Points of Interest
BitArray
class is sealed by Microsoft so inheriting from it was not possible.
BitArray
throws an exception if the length of two BitArray
s are not equal on bit operations, WAHBitArray
makes them the same as the largest before operations.
BitArray
must be resized in 32 increments, otherwise it mangles the compression bits.
Version 2.0
For extra speed in compressing and uncompressing the bits, and the fact that the .NET Framework implementation does not give access to the internal data structures in the BitArray
, I had to rewrite all the BitArray
functionality in WAHBitArray
.
Using Reflector to see the internal implementation of the BCL BitArray
one can see the following snippets:
for (int i = 0; i < array.Length; i++)
array[i] &= val[i];
for (int i = 0; i < array.Length; i++)
array[i] |= val[i];
for (int i = 0; i < array.Length; i++)
array[i] ^= val[i];
As you can see, the bit operations are done on the int32
values.
Now with access to the internal uint[]
bits, the compression method gets 31 bit blocks of data instead of one by one. This is done in the Take31Bits()
method, which finds the two adjacent uint
values in the _uncompressed
list and does some bit manipulations as follows:
public WAHBitArray And(WAHBitArray op)
{
this.CheckBitArray();
uint[] ints = op.GetUncompressed();
FixSizes(ints, _uncompressed);
for (int i = 0; i < ints.Length; i++)
ints[i] &= _uncompressed[i];
return new WAHBitArray(false, ints); }
The compression and uncompression routines were rewritten to operate in the uint[]
arrays as follows:
private void Compress()
{
_compressed = new List<uint>();
uint zeros = 0;
uint ones = 0;
int count = _uncompressed.Count << 5;
for (int i = 0; i < count; )
{
uint num = Take31Bits(i);
i += 31;
if (num == 0)
{
zeros += 31;
FlushOnes(ref ones);
}
else if (num == 0x7fffffff)
{
ones += 31;
FlushZeros(ref zeros);
}
else
{
FlushOnes(ref ones);
FlushZeros(ref zeros);
_compressed.Add(num);
}
}
FlushOnes(ref ones);
FlushZeros(ref zeros);
}
private void Uncompress()
{
int index = 0;
List<uint> list = new List<uint>();
if (_compressed == null)
return;
foreach (uint ci in _compressed)
{
if ((ci & 0x80000000) == 0) {
this.Write31Bits(list, index, ci & 0x7fffffff);
index += 31;
}
else
{
uint c = ci & 0x3ffffff;
if ((ci & 0x40000000) > 0) this.WriteBits(list, index, c);
index += (int)c;
}
}
this.ResizeAsNeeded(list, index);
_uncompressed = list;
}
Because taking or updating 31 bits of data can overlap two adjacent uint
values, some bit massage is necessary as follows:
private uint Take31Bits(int index)
{
long l1 = 0;
long l2 = 0;
long l = 0;
long ret = 0;
int off = (index % 32);
int pointer = index >> 5;
l1 = _uncompressed[pointer];
pointer++;
if (pointer < _uncompressed.Count)
l2 = _uncompressed[pointer];
l = (l1 << 32) + l2;
ret = (l >> (32 - off)) & 0x07fffffff;
return (uint)ret;
}
Version 2.5
This version is a post back from the work done in RaptorDB
, which is an overhaul of all the routines focusing on multi thread access and concurrency issues. A very important lesson learned was locking at the source of the resource being used as opposed to a higher level, which takes care of concurrency issues at the lowest levels, and saves a lot of headaches and debugging.
Much of the code has been reformatted and optimized. I am very confident in the correctness of this version as it is being tested to it's maximum in RaptorDB
.
History
- Initial release v1.0: 22 June 2011
- Update v1.1: 24 June 2011
- Bit operations now return a
WAHBitArray
instead of BitArray
- A bit operation will take either a
WAHBitArray
or a BitArray
as the input
CountZeros()
, CountOnes()
methods added
GetBitIndexes()
method added
- Update v1.2: 28 June 2011
Get
, set
methods with auto resizing
- Update v1.3: 23 July 2011
- Removed the need to specify the initial size
- Bug fix resize in 32 increments
- Bug fix in
BitArray
arithmetic
DebugPrint()
method implemented
- Update v2.0
- Complete rewrite
- Optimized compression and decompression ~9x speed increase
- All bit operations are done internally without BCL
BitArray
- Update v2.5: 10 May 2012
- Thread safe access to bits
- Optimized Count() method
- Bits set in int[] in high order
- Bug fix getting uncompressed bits
- GetBitIndexes() will return the one bits only
- Update v2.6 : 15^{th} June 2013
- bug fix edge case decompress code (thanks to andrey269 )
- Update v2.6.1 : 22^{nd} June 2013
- bug fix a logic mistake
- added a unit test project
- Update v2.6.2 : 6^{th} October 2013
- bug fix last remaining bits output in WriteOnes()