|
|
Comments and Discussions
|
|
 |

|
Hi Mehdi,
Just few simple tests that will not pass the uncompress:
[Test]
public void TestWAHBitArray()
{
var b = new WAHBitArray();
int count = 31;
for (int i = 0; i < count; i++)
{
b.Set(i, true);
}
var expected = b.GetBitIndexes().ToArray();
var b2 = new WAHBitArray(WAHBitArray.TYPE.Compressed_WAH, b.GetCompressed());
var actual = b2.GetBitIndexes().ToArray();
expected.Should(Be.EqualTo(actual));
}
[Test]
public void TestWAHBitArray()
{
var b = new WAHBitArray();
int count = 64;
for (int i = 0; i < 5; i++ )
{
b.Set(i, false);
}
for (int i = 5; i < count + 5; i++)
{
b.Set(i, true);
}
var expected = b.GetBitIndexes().ToArray();
var b2 = new WAHBitArray(WAHBitArray.TYPE.Compressed_WAH, b.GetCompressed());
var actual = b2.GetBitIndexes().ToArray();
expected.Should(Be.EqualTo(actual));
}
|
|
|
|

|
Thanks Andrey!
I will look into it.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
If the final destination is death, then we should enjoy every second of the journey.
|
|
|
|

|
Good one. Thanks for posting!
|
|
|
|

|
Thanks for sharing this. Also the And, Or and Xor methods can be factorized like below.
The doWork function takes two uint sets having the same size and an operation (&, |, ^ ...). The first set of uints is updated with the value of the operation done on its elements and the elements of the uncompressed set.
The code is more readable and maintanble this way. If the calculation has to be updated, modifying the doWork function suffices rather than modifying it in three different places .
Action<uint[], uint[], Func<uint, uint, uint>> doWork = (uints, uncompressed, operation) =>
{
for (int i = 0; i < uints.Length; i++) uints[i] = operation(uints[i], uncompressed[i]);
};
public WAHBitArray And(WAHBitArray op)
{
this.CheckBitArray();
uint[] ints = op.GetUncompressed();
FixSizes(ints, _uncompressed);
doWork(ints, _uncompressed, (a, b) => a & b);
return new WAHBitArray(false, ints); }
public WAHBitArray Or(WAHBitArray op)
{
this.CheckBitArray();
uint[] ints = op.GetUncompressed();
FixSizes(ints, _uncompressed);
doWork(ints, _uncompressed, (a, b) => a | b);
return new WAHBitArray(false, ints); }
public WAHBitArray Xor(WAHBitArray op)
{
this.CheckBitArray();
uint[] ints = op.GetUncompressed();
FixSizes(ints, _uncompressed);
doWork(ints, _uncompressed, (a, b) => a ^ b);
return new WAHBitArray(false, ints); }
|
|
|
|

|
Thanks Akram!
I will have to check the performance difference as it is very important and at the heart of RaptorDB.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
If the final destination is death, then we should enjoy every second of the journey.
|
|
|
|

|
Hi, Yes I am curious to see the performance results.
|
|
|
|

|
sorry,i not sure the follow code whether is correct.
var wahBit = new WAHBitArray();
for (int i = 0; i < 100; i++)
{
wahBit.Set(i, true);
}
var ar = wahBit.GetCompressed();
wahBit = new WAHBitArray(true, ar);
foreach (var s in wahBit.GetBitIndexes(true))
{
Console.WriteLine(s);
}
the output message is :
32,33,34,35....99,125,126,127
that is not my expected result:
1,2,3,4,5 ....100
can you give me some suggestion?thanks
|
|
|
|

|
I will update the source and article with a new improved version which is being used in RaptorDB.
A lot of changes and fixes were done in that version.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
If the final destination is death, then we should enjoy every second of the journey.
|
|
|
|

|
thank you for the reply.i like your hOOt - full text search engine but that also have a above same error.the RaptorDB have fixed this error?thans again.
|
|
|
|

|
Since you mention that Microsoft has only just announced bitmap indexing in SQL Server, I thought I'd point out they are certainly no stranger to the technique. Bitmap indexing has been a part of Microsoft Access since the very first version (over 20 years!), and MSSQL has always used it internally on the fly (JOIN processing, parts of index scans, etc.)
|
|
|
|

|
Thanks, great to know.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
If the final destination is death, then we should enjoy every second of the journey.
|
|
|
|

|
Mehdi
Good Job! I already implemented a bitarray as a bitset class, based on a old work done by someone here or there (don't remember)
This bitset has been used extensively by me, so I am used to it, and I've implemented several performance improvements, let me tell you some of them:
You should implement GetHashCode(), Equals() to allow this class to be used as index in Dictionary or Hashtable
Also ToString() should be overriden (useful for debugging)
Here is my code for this
public override int GetHashCode()
{
int h = 0;
if (_compressed != null)
{
foreach (uint u in _compressed) h ^= (int)u;
}
if (_uncompressed != null)
{
foreach (uint u in _uncompressed) h ^= (int)u;
}
return h;
}
Here is a not-so-lightweight comparison, with a little more work you may compare int-by int if both arrays are compressed or uncompressed, but for a hurry this should work as fine, (and actually does)!
public override bool Equals(object obj)
{
if (obj != null && obj is WAHBitArray)
{
WAHBitArray w = obj as WAHBitArray;
w.CheckBitArray();
CheckBitArray();
return w._uncompressed.Equals(_uncompressed);
}
else return false;
}
Its many times needed to see if the bitarray is empty (ie. after a And() operation), this yields a property needed, called 'Zero' which is good to implement as an extremely fast routine, like this one:
public bool Zero
{
get
{
CheckBitArray();
foreach (uint ui in _uncompressed)
{
if (ui != 0)
return false;
}
return true;
}
}
Even as I did not sniff inside the compression schema, the same may not need to decompress the array and work on the compressed one as well, even faster!
Another useful is the "cardinality" or OnesCount()
You can setup a simple array of 256 bytes, putting inside the number of ones of this byte (counted)
then simply add up all bytes (making at least 4 integer additions for each integer in the _uncompressed array).
The actual method makes Bit-Length additions (loops trough)
Here is the code:
public int BitCount()
{
CheckBitArray();
int c = 0;
int count = _uncompressed.Count << 2;
for (int i = 0; i < count; i++)
{
c += BYTE_COUNTS[(_uncompressed[i >> 2] >> ((i & 0x3)<<3)) & 0xFF];
}
return c;
}
For example the bitcount, is also slow, the string concatenation generates 'Length' strings in memory, with all the allocations overhead.
public string Bits
{
get
{
char[] c = new char[Length];
for (int i = 0; i < Length; i++)
c[i] = internalGet(i) ? '1' : '0';
return new string(c);
}
}
Also a simpler concatenation is not good but even faster than StringBuilder() for short stings
string s = "";
for (int i = 0; i < Length; i++)
s = (internalGet(i) ? "1" : "0") + s;
return s;
Parity is also needed, this is a fast way to calculate it!
public bool Parity
{
get
{
CheckBitArray();
int p = 0;
foreach (int i in _uncompressed)
p ^= i;
p ^= p >> 16;
p ^= p >> 8;
p ^= p >> 4;
p ^= p >> 2;
p ^= p >> 1;
return (p & 0x01) == 0;
}
}
and for the byte ones count, here is the block
#region private static byte[] BYTE_COUNTS
private static byte[] BYTE_COUNTS =
{ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
#endregion
Hope this helps you all, and if you want to include this routines, mention me "Andres Hohendahl" with a co-Author/Contributor legend!
|
|
|
|

|
Thanks a lot, I will try to fit some in when I can.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
very nice 5 from me!
For my app I think I will need some sort of hierarchical access to this have you looked at such a thing?
|
|
|
|

|
Cheers,
What do you have in mind?
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
Sorry I'm not very clear about this at the moment. If you are trying to fine the intersection of several large sets over a huge base set you might need to jump ahead a large distance in one set or the other. That is you might have a large gap in one set but lots of members in the other. I think you would need some sort of hierarchy on top of the bit arrays to make this fast. Or I might be thinking about this in completely the wrong way
Cheers
|
|
|
|

|
Ah I see, the beauty of this implementation is that the actual bit logic is done by the original BitArray. So you don't have to worry about that.
One valid concern is performance, so you might want to break up the dataset into chunks of say 100mil bits for computation memory feasibility and performance.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
Hey Mehdi,
Good stuff. You posted this a couple of days before I finally made the decision to open source my implementation of Word Aligned Hybrid Bit Vectors and a search framework I built on top of them. As an FYI: a major restriction of my implementation of WAH Vectors is that writing to compressed vectors can only occur on the least significant 31 bits. Check it out: http://softwarebotanysun.codeplex.com. I still need to get more code coverage with the unit tests, and documentation and examples are non-existent; however, I will be working on them over the next couple of weeks as well as a series of blog posts.
Take Care,
-Dan
|
|
|
|

|
Nice one.
Are you following the exact fastbit directives?
It seems there is a patent on it.
The rle coding seems fine, I have to check the filing to make sure.
Personaly I was under the impression that this was an offbeat topic, but people are actually using it in very interesting ways.
Have you done any performance tests?
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
Regarding the patent, I just took a look. The patent filing makes it sound like the compression scheme and logical operations are all covered. However, FastBit is licensed under LGPL (and so is Software Botany Sunlight) which states in the preamble:
"Finally, software patents pose a constant threat to the existence of any free program. We wish to make sure that a company cannot effectively restrict the users of a free program by obtaining a restrictive license from a patent holder. Therefore, we insist that any patent license obtained for a version of the library must be consistent with the full freedom of use specified in this license."
... and later in section 11:
"For example, if a patent license would not permit royalty-free redistribution of the Library by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Library."
So yeah, I'm think it's safe to make derivative works as long as they are licensed under the LGPL or the GPL. Regardless, I haven't used Sunlight for anything commercial yet. It is more or less a pet project for enjoyment and opportunity to use C# features like unsafe and dynamic. I should definitely put a warning up on codeplex though for users who may want to use it commercially. I'll also see if I can get a hold of anyone who knows how to read legalese and whatnot.
Take care,
-Dan
|
|
|
|

|
Thanks you saved me a lot of time, I hate reading legal stuff.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
SoftwareBotanyDan wrote: As an FYI: a major restriction of my implementation of WAH Vectors is that writing to compressed vectors can only occur on the least significant 31 bits.
Taking the original data in multiples of 31 bits seems a bit icky. It would seem a relatively easy way to improve efficiency would be to process output data in multiples of 32 words words (128 bytes) formatted as 31x33 bits. The first 31 bits of each group would hold the lower 32 ("data") bits of each logical output word, and the last word would hold the "mode" bit for the preceding 31 words. Using that approach would allow input words that weren't 0x00000000 or 0xFFFFFFFF to be copied directly from the source to the destination without need for bit-shifting or masking.
|
|
|
|

|
Interesting, what about runs of ones and zeros?
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
Mehdi Gholam wrote: Interesting, what about runs of ones and zeros?
If word[5] of the array is supposed to represent 32 bits of literal data, then bit 5 of word[31] would be zero. If word[5] of the array is supposed to represent something else, then bit 5 of word[31] would be set, and the bits of word[5] would indicate what exactly it's supposed to be. A really simple approach would be to use bit 31 to indicate whether the word represents a run one of 0x00000000 or 0xFFFFFFFF, and the bottom 31 bits to indicate how many such words it represents. Since most runs aren't going to be anywhere near 2 billion words long, it may be helpful to allow some bits in the word to say something about data that follows the run. For example, one could use bits 31-30 to select one of four types of run:
- A run of 0 to a billion words of zero
- A run of 0 to a billion words of ones (0xFFFFFFFF)
- A run of 0-255 BITS of zeroes, 0-127 BITS of ones, 0-255 BITS of zeroes, 0-127 BITS of ones, and enough zeroes to pad out any partial word.
- A run of 0-255 BITS of ones, 0-127 BITS of zeroes, 0-255 BITS of ones, 0-127 BITS of zeroes, and enough ones to pad out any partial word.
Note that's just a simple example of how things could be done. Note that any four runs comprising 130 bits or less (and many combinations of four runs totaling 256 bits or more) could be stored in a single word.
|
|
|
|
|

|
Cheers
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
Thanks for the most interesting article. Did you by any chance do some performance testing on it. In particular it would be nice to know the speed difference between the WAHBitArray and the BitArray for AND, OR, NOT and XOR. I have been using BitArrays for representations of ranges of integers (Line segments) and your code might be very useful for compressing them. All in all this has some definite potential.
Thanks again
|
|
|
|

|
Wow again, and I thought this was an off beat topic but it seems people are actually using bitarrays. I never would have guessed for line segments!?
The performance is the same I'm afraid as all the computation is done by the standard bitarray, if you follow the link to the fastbit site, there you will find a research article which did some performance test. The long and short of it is that for low densities of values their way is faster but for high densities it's slower.
Cheers
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
Hi,
Checkout hOOt and tell me what you think.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
I have been using my own compression techniques in my bit array indexes for years but this looks much better. Thanks. Have a 5
|
|
|
|
|

|
Well, I did until:
BitArray result1 = ba1.And( ba2);
doesn't work as ba2 is a WAHBitArray and the And method only takes a BitArray.
I have a database system for fast counts and analysis - screenshot and brief intro of an earlier version here: http://www.divelements.com/net/corporate/casestudies/logicbox.aspx[^]
Later versions are all in-house so no piccies. I have been using simple RLE compression on bit arrays.
|
|
|
|

|
I will put some overloads in for the next version.
Also I will add the set and get methods with automatic resizing as it throws exceptions in original.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
On a different note, I'm interested in how and where you are using bitmap indexes and bitarrays, if you would like to elaborate.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
Hi, I put a description of some of the software in a previous post on this thread.
I am rewriting it all at the moment - I currently have a programming language writen in C++ which knows how to talk to standard SQL databases and create the bitmapped data from them and a GUI and bitmap engine written in C# and winforms which allows the user to query the bitmapped database.
I am replacing the programming language with a XML based declarative language (although it also supports embedded C# routines) for creating the bitmapped database - the GUI I am rewriting in WPF which is a bit of a pain but I am mostly there.
I have had similar systems for over 12 years - this is the 5th iteration.
Somebody explained bitmapped indexes to me at Tech-Ed ages ago over a few beers - I worked it all out myself how to implement it and went from there. It is only in recent years I have come across WAH and BBC etc so it's nice that the standard ways aren't fundamentally different from my own hand rolled stuff. I can do counts on a 100 million row database in 1/10th of a second so I must be doing something right.
|
|
|
|

|
Great job, bitmap indexes are new to me too eversince project Gemini which became powerpivot from Microsoft which intrigued me immensely and thus began the journey to find out. It's going into the document store version of RaptorDB.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
They are interesting to work with and very satisfying because they can be so fast - I like micro-optimising - in most programming tasks micro-optimising isn't worth it but in this sphere it can have some huge benefits. Profilers are your friend.
I have been keeping an eye on your RaptorDB work, it looks very promising indeed.
Are there any patents on WAH? I seem to recall the BBC algorithm being patented.
|
|
|
|

|
Hush hush, actually there is although I haven't read it to see what it covers, the rle or how to operate on the compressed bits.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
If memory serves it was how to do the operations on the compressed bitmaps
|
|
|
|

|
Hi,
Checkout hOOt and tell me what you think.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|

|
Nice work, a bit more detail on how the code works would have been nice, but I get what it does and what you were trying to achieve and why, so I'll check the rest by code !
|
|
|
|

|
Cheers,
It's quite simply really, as opposed to the java and c++ versions which do the computation in compressed form.
The real kick will come in the next version on RaptorDB, when it's being used
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
|
|
|
|
 |
|
|
General News Suggestion Question Bug Answer Joke Rant Admin
|
Word Aligned Hybrid (WAH) compression for BitArrays
| Type | Article |
| Licence | CPOL |
| First Posted | 22 Jun 2011 |
| Views | 30,288 |
| Downloads | 1,300 |
| Bookmarked | 51 times |
|
|