The Lounge - CodeProject

First Prev Next

Re: What flavor of voodoo is this?

Jörgen Andersson12-Sep-19 7:11

Jörgen Andersson

12-Sep-19 7:11

Teaser.

Wrong is evil and must be defeated. - Jeff Ello

Re: What flavor of voodoo is this?

kalberts12-Sep-19 9:04

kalberts

12-Sep-19 9:04

I read through this thread and really cannot understand what's the "problem". Where the voodoo comes in. There is very little that surprises me. Or do I mistunderstand what you are talking about?

Say you've got 4kiByte blocks, 32bit key values, then an index block holds up to 500 downwards pointers (the 96 bytes are for horizontal pointers and other sorts of management info). Root + 1 level provide pointers to a quarter million blocks. If your data records are small enough to fit 40 to the block (i.e. less than 100 bytes), this might be enough for 10 million records: Search the root index (use binary search if you prefer!), follow associade pointer to next index level. Search this root index (again binary search if you like), and there is youv'e got the pointer to the block where your data are located. The data block may have a small index with key/offset pairs to each record; this is so small that a sequential seach probably is as fast as a binary.

If blocks are at minimum filling, you may, for ten million 100 byte records, end up with two index levels below the root. If keys are not integer, but require more space, you may not be able to pack 330-500 key/pointer pairs at the index levels; this may also cause a second level to be established. But still: Three small binary searches to get to the right block isn't much work. If this is a disk structure: The root will definitely remain i cache throughout, and unless you are extremely tight on RAM, all or much of the first level below will as well. Often you use sorted structures because the records are frequently processed in sorted order. So while the second index level is not necessarily all in RAM, the relevant parts of it may be, in those cases where you do a more or less sequential processing. (If the sequential ordering means nothing to your use of the data, then you should use a hash method, not B-tree!)

You quote a single figure for both insert and removal. When you make your timings, you should make a large number of calls, quoting minimum, maximum, and average (or even more statistics-oriented figures, if that's within you field of knowledge). If you enter a million records, most of them will be super-fast; there is still room in that block for accomodating another one (key, at index level, or record, at leaf level). Every now and then, a block fills up and must be split (more often at the leaf level than at the index level, if data itself is kept in the block, not just pointers). How often this happens depends on the key size (for index blocks) and record size (for the data blocks), and the block size.

At rare occasions, the entire tree is full and the number of levels must be added. That could be quite a lot more time consuming; if the tree is to be balanced according to B-tree rules all the time, this might require some shuffeling stuff around.

Remember that B-tree rules doesn't absolutely require you to immediately float a block split up to the index level above. A search may have to skip to the next "horizontal" block, until it finds a key higher than the candidate key. Some implementations are rather lax, leaving split blocks for quite some time before updating the index. This might reduce the average add time, at the expense of (asynchronous) "cleanup" routines and somewhat higher average search times.

On the average, B-trees are quite fast for searching. But one thing that holds for all sorts of index structures: If you've got ten million new records for inserting into the tree, the right way to it is NOT one-by-one, in unsorted order. Any B-tre implementation should provide a "batch insert", sorting those ten million records by an nlogn method before inserting them into the tree, and do this in sequential key order, filling up leaf blocks from one end, one block at a time, without worrying about the higher index levels, linking in new blocks through horizontal pointers as needed. Once the bottom layer is done (without concern for higher level indexes), you go up one level to insert pointers all the new blocks, completing that level before ascending further, level by level, up to the root.

Re: What flavor of voodoo is this?

honey the codewitch12-Sep-19 10:45

honey the codewitch

12-Sep-19 10:45

first of all, it's a collection class, not a DBMS. Yes, B-trees can be used for DBMS systems but if I was going to go through all the trouble of optimizing my algorithm doing batch inserts and what have you, then I'd be doing it on a B Plus tree instead

At least if my goal was to make a dbms.

My concern in my OP was not about the performance of my class(es)

It was about the performance of Microsoft's classes.

Namely, a SortedDictionary (red-black tree) beating out a B-tree after 10,000 entries.

But then i think i already explained that. Maybe i didn't. i don't remember and don't much care.

When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.

well gosh, I feel dumb now.

honey the codewitch11-Sep-19 13:01

honey the codewitch

11-Sep-19 13:01

Apparently I implemented a B+ tree for no reason because it gives you zero benefit unless you're accessing slow I/O like a disk, and it's tied to your nodes, which means implementing some kind of file driver and a file format for storing nodes at the very least.

So basically, a database.

I mean, it's cool code, but it's useless byself. On the shelf it goes.

*headdesk*

At least my regular B Tree works. That's kinda cool.

When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.

Welcome to the Lounge