## Introduction

After searching the internet for a .NET implementation for WAH compressed BitArray and not finding one, I decided to write and publish an article here so the .NET community are not left out as all the available implementations are in the Java language. This ties into a pretty advanced topic of bitmap indexing techniques for databases and I needed this for my RaptorDB Document data-store database engine.

## What Is It?

A `BitArray`

is a data structure in the .NET library for storing `true`

/`false `

or bits of data in a compact form within an array of `Int32`

objects. A WAH or word-aligned hybrid `BitArray`

is a special run length compressed version of a `BitArray`

which saves a lot of space and memory. All the implementations that exist in Java essentially duplicate the functionality of a `BitSet`

, that is the `AND`

, `OR`

, `NOT`

, and `XOR`

operations with the compressed internal storage format.

In my implementation, I defer the functionality of the `BitArray`

to itself and just add compression and decompression routines. This is much faster than the Java way at the expense of memory usage. To overcome this, I have also added a `FreeMemory`

method to release the `BitArray`

contents and keep the compressed contents. Arguably, if you are using 100 million bits, then a full implementation is more performant than my implementation, but for most of our Use Cases, we are at most in the 10 millions of bits range.

This original method was invented at the Berkeley Labs of US Department of Energy; it is a project named FastBit and is used for high energy physics department experiments; you can see it here: FastBit site.

## Why Should I Care?

So what?, you ask. Well, as mentioned before, `BitArray`

s are used in an indexing technique called bitmap indexes (Wiki) and column based databases which store data in columns instead of rows. An example which you might know is Microsoft's PowerPivot for Excel which can process millions of rows in seconds. Interestingly, Microsoft has only recently announced the implementation of bitmap indexes in the upcoming SQL Server platform, post 2008 R2. It has long been in use by other RDBM vendors like Oracle.

## How It Works

The WAH compression algorithm is as follows:

- Take 31 bits from the array.
- If all zeros, then increment the zero count by 31.
- if ones count > 0, then output 32 bits with bit 32 =1 and bit 31=1 and 0-30 = ones count.

- If all ones, then increment the ones count by 31.
- If zeros count >0, then output 32 bits with bit 32=1 and bit 31=0 and 0-30 = zeros count.

- Else output the 31 bits as 32 bits with bit 32 = 0.
- If zeros or ones count >0, then output as above.

From the above, in the worst case, you will get N/31 more bits encoded or about 3% increase in size to the original.

## What You Get

`WAHBitArray`

is essentially the same as the standard `BitArray`

in the .NET Framework, with the following additions:

`FreeMemory()`

: This will first compress the internal `BitArray`

then free the memory it uses.
`GetCompressed()`

: This will compress the current `BitArray`

then return it as an array of `uint`

.
`CountOnes()`

, `CountZeros()`

: will count the respective bits in the array.
`GetBitIndexes(bool)`

: will return an enumeration using `yield`

of the respective bit position; for example, if the bit array contains 10001... this will return integers 0,4,... if the `bool`

parameter was `true`

, and 1,2,3,... if bool was `false`

.
`Get()`

, `Set()`

: Methods implemented with auto resizing and no exceptions.
`DebugPrint()`

: generates a `string`

of 1 and 0 values.

## Using the Code

To create a `WAHBitArray`

, you can do the following:

WAHBitArray ba1 = new WAHBitArray(1000000); WAHBitArray ba2 = new WAHBitArray(new BitArray(2000000));
WAHBitArray ba3 = new WAHBitArray(1000000, new uint[] { });

To perform operations, you can do the following:

WAHBitArray result1 = ba1.And( ba2);
WAHBitArray result2 = ba1.Or( ba3);
WAHBitArray result3 = ba1.Xor( ba2);
long count = result1.CountOnes();
result2.Set(2000000, true); bool b = result2.Get(3000000);
foreach(int pos in result2.GetBitIndexes(true))
{
ReadDatabaseRecordNumber(pos); }

## Using the Code v1.3

As of version 1.3, you don't need to specify the size of the `BitArray`

and all operations will auto resize as needed.

WAHBitArray ba1 = new WAHBitArray(); WAHBitArray ba3 = new WAHBitArray(new uint[] { });

## Points of Interest

`BitArray`

class is sealed by Microsoft so inheriting from it was not possible.
`BitArray`

throws an exception if the length of two `BitArray`

s are not equal on bit operations, `WAHBitArray`

makes them the same as the largest before operations.
`BitArray`

must be resized in 32 increments, otherwise it mangles the compression bits.

## Version 2.0

For extra speed in compressing and uncompressing the bits, and the fact that the .NET Framework implementation does not give access to the internal data structures in the `BitArray`

, I had to rewrite all the `BitArray `

functionality in `WAHBitArray`

.

Using Reflector to see the internal implementation of the BCL `BitArray `

one can see the following snippets:

for (int i = 0; i < array.Length; i++)
array[i] &= val[i];
for (int i = 0; i < array.Length; i++)
array[i] |= val[i];
for (int i = 0; i < array.Length; i++)
array[i] ^= val[i];

As you can see, the bit operations are done on the `int32 `

values.

Now with access to the internal` uint[] `

bits, the compression method gets 31 bit blocks of data instead of one by one. This is done in the `Take31Bits()`

method, which finds the two adjacent `uint`

values in the _`uncompressed `

list and does some bit manipulations as follows:

public WAHBitArray And(WAHBitArray op)
{
this.CheckBitArray();
uint[] ints = op.GetUncompressed();
FixSizes(ints, _uncompressed);
for (int i = 0; i < ints.Length; i++)
ints[i] &= _uncompressed[i];
return new WAHBitArray(false, ints); }

The compression and uncompression routines were rewritten to operate in the `uint[]`

arrays as follows:

private void Compress()
{
_compressed = new List<uint>();
uint zeros = 0;
uint ones = 0;
int count = _uncompressed.Count << 5;
for (int i = 0; i < count; )
{
uint num = Take31Bits(i);
i += 31;
if (num == 0)
{
zeros += 31;
FlushOnes(ref ones);
}
else if (num == 0x7fffffff)
{
ones += 31;
FlushZeros(ref zeros);
}
else
{
FlushOnes(ref ones);
FlushZeros(ref zeros);
_compressed.Add(num);
}
}
FlushOnes(ref ones);
FlushZeros(ref zeros);
}
private void Uncompress()
{
int index = 0;
List<uint> list = new List<uint>();
if (_compressed == null)
return;
foreach (uint ci in _compressed)
{
if ((ci & 0x80000000) == 0) {
this.Write31Bits(list, index, ci & 0x7fffffff);
index += 31;
}
else
{
uint c = ci & 0x3ffffff;
if ((ci & 0x40000000) > 0) this.WriteBits(list, index, c);
index += (int)c;
}
}
this.ResizeAsNeeded(list, index);
_uncompressed = list;
}

Because taking or updating 31 bits of data can overlap 2 adjacent `uint `

values, some bit massage is necessary as follows:

private uint Take31Bits(int index)
{
long l1 = 0;
long l2 = 0;
long l = 0;
long ret = 0;
int off = (index % 32);
int pointer = index >> 5;
l1 = _uncompressed[pointer];
pointer++;
if (pointer < _uncompressed.Count)
l2 = _uncompressed[pointer];
l = (l1 << 32) + l2;
ret = (l >> (32 - off)) & 0x07fffffff;
return (uint)ret;
}

## Version 2.5

This version is a post back from the work done in `RaptorDB `

, which is an overhaul of all the routines focusing on multi thread access and concurrency issues. A very important lesson learned was locking at the source of the resource being used as opposed to a higher level, which takes care of concurrency issues at the lowest levels, and saves a lot of headaches and debugging.

Much of the code has been reformatted and optimized. I am very confident in the correctness of this version as it is being tested to it's maximum in `RaptorDB`

.

## History

**Initial release v1.0**: 22 June 2011
**Update v1.1**: 24 June 2011
- Bit operations now return a
`WAHBitArray`

instead of `BitArray`

- A bit operation will take either a
`WAHBitArray`

or a `BitArray`

as the input
`CountZeros()`

, `CountOnes()`

methods added
`GetBitIndexes()`

method added

**Update v1.2**: 28 June 2011
`Get`

, `set `

methods with auto resizing

**Update v1.3**: 23 July 2011
- Removed the need to specify the initial size
- Bug fix resize in 32 increments
- Bug fix in
`BitArray`

arithmetic
`DebugPrint()`

method implemented

**Update v2.0**
- Complete rewrite
- Optimized compression and decompression ~9x speed increase
- All bit operations are done internally without BCL
`BitArray`

**Update v2.5 : **10 May 2012- Thread safe access to bits
- Optimized Count() method
- Bits set in int[] in high order
- Bug fix getting uncompressed bits
- GetBitIndexes() will return the one bits only