I've seen the reference implementations for Dictionary and SortedDictionary. I've even fiddled with them in binary form way back when with Reflector. I expect the insert and removal times to be longer - I'd be surprised if they weren't - but the searches? This is just wrong.
(The AVL tree implementation sucks - it will be improved)
Thank you for the polite tone of your response, Peter.
I am aware of Chris' comment on the prior post, and I do not interpret it as being a "blanket" invitation for posts, like this one, which are nothing but code. There is no "discussion" inherent in this post.
I take the privilege of being a "mentor" on CP seriously, and when I see content like this, which could contribute to the quality of CP in the long run ... if posted on the appropriate technical forum ... I will speak out.
«One day it will have to be officially admitted that what we have christened reality is an even greater illusion than the world of dreams.» Salvador Dali
The .NET BCL code has two advantages.
1. It runs in release mode (yours probably runs in debug mode).
2. It has been precompiled ahead of time (i.e. NGEN). NGEN sometimes does a better job, though I'm not sure it does that much better on a library (as opposed to a program).
(Bonus point) Well, you said you checked the reference implementation, so this might not apply, but the .NET BCL is heavily optimised (I have used Reflector sometimes, and I was surprised by the lengths they occasionally go to in order to make common code paths faster).
Looks like a fun way to improve your coding!
Good luck with that, and with the improvements!
Any chance I can give you some homework?
In AvalonEdit, the text model uses a "Rope", which looks like an IList but has fast insert and delete, i.e. roughly O(log n) cost instead of O(n).
In fact, their code is probably quite nice, but after looking at it for only 2 minutes I was confused; it's all mixed up with the text model, if I remember right...
Maybe you would be interested in making a standalone Rope class?!
although mine destroys Microsoft's for searching after about 30,000 items.
This is exactly the reason for using B-trees in a database.
They're not very sensitive to size, which is something you regularly get in a database. While more than 30000 items is something you very seldom need to (or should) handle in memory.
While more than 30000 items is something you very seldom need to (or should) handle in memory.
That's getting less true. What's funny is that, by my tests, .NET is just fine with 3 million entries spread across 3 different dictionary classes.
The heap isn't as big as you'd expect and the performance is really good for both the base Dictionary class (which is basically unsorted, and uses a hash lookup) and for my class, while not being unsurprising for the other class.
Times are changing. An in-memory DB is totally doable even in C#, for smaller DBs.
I was thinking of backing JSON with something like this, or implementing a full B+ tree with backing storage for it.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
I read through this thread and really cannot understand what the "problem" is, or where the voodoo comes in. There is very little that surprises me. Or do I misunderstand what you are talking about?
Say you've got 4 KiB blocks and 32-bit key values; then an index block holds up to 500 downward pointers (the remaining 96 bytes are for horizontal pointers and other sorts of management info). Root + 1 level provide pointers to a quarter million blocks. If your data records are small enough to fit 40 to the block (i.e. less than 100 bytes), this might be enough for 10 million records: Search the root index (use binary search if you prefer!) and follow the associated pointer to the next index level. Search that index block (again, binary search if you like), and there you've got the pointer to the block where your data is located. The data block may have a small index with key/offset pairs for each record; this is so small that a sequential search is probably as fast as a binary one.
If blocks are at minimum filling, you may, for ten million 100-byte records, end up with two index levels below the root. If keys are not integers but require more space, you may not be able to pack 330-500 key/pointer pairs into the index blocks; this may also cause a second level to be established. But still: Three small binary searches to get to the right block isn't much work. If this is a disk structure: The root will definitely remain in cache throughout, and unless you are extremely tight on RAM, all or much of the first level below will as well. Often you use sorted structures because the records are frequently processed in sorted order. So while the second index level is not necessarily all in RAM, the relevant parts of it may be, in those cases where you do more or less sequential processing. (If the sequential ordering means nothing to your use of the data, then you should use a hash method, not a B-tree!)
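The arithmetic above is easy to check with a few lines of Python. This is only a sketch: the function names are mine, and the 4-byte pointer size is an assumption inferred from the "500 pointers plus 96 bytes" figure, not something any particular B-tree implementation dictates.

```python
def index_fanout(block_size=4096, overhead=96, key_size=4, pointer_size=4):
    """Key/pointer pairs that fit in one index block (pointer_size is assumed)."""
    return (block_size - overhead) // (key_size + pointer_size)


def index_levels_needed(n_records, fanout, records_per_block):
    """Number of index levels, counting the root, needed to address every leaf block."""
    leaf_blocks = -(-n_records // records_per_block)  # ceiling division
    levels = 1
    while fanout ** levels < leaf_blocks:
        levels += 1
    return levels


full_fanout = index_fanout()  # 500 pairs per index block
# Completely full blocks: root + 1 level covers ten million records.
full = index_levels_needed(10_000_000, full_fanout, records_per_block=40)
# Half-full blocks everywhere: root + 2 levels.
half = index_levels_needed(10_000_000, full_fanout // 2, records_per_block=20)
print(full, half)
```

With full blocks the root plus one level addresses 500 * 500 = 250,000 leaf blocks; at half fill the same two levels reach only 250 * 250 = 62,500, which is why a further level appears.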
You quote a single figure for both insert and removal. When you make your timings, you should make a large number of calls, quoting minimum, maximum, and average (or even more statistics-oriented figures, if that's within your field of knowledge). If you enter a million records, most of the inserts will be super-fast; there is still room in the block for accommodating another entry (a key, at index level, or a record, at leaf level). Every now and then, a block fills up and must be split (more often at the leaf level than at the index level, if the data itself is kept in the block, not just pointers). How often this happens depends on the key size (for index blocks), the record size (for the data blocks), and the block size.
On rare occasions, the entire tree is full and another level must be added. That could be quite a lot more time consuming; if the tree is to be balanced according to B-tree rules all the time, this might require some shuffling of stuff around.
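A minimal sketch of that kind of measurement, in Python rather than C#: time every call individually and report the spread instead of one total. The `time_calls` helper is a name I made up, and the plain `dict` is only a stand-in for whatever structure is being tested.

```python
import statistics
import time


def time_calls(operation, inputs):
    """Time each call to operation(item) separately and summarize the samples."""
    samples = []
    for item in inputs:
        start = time.perf_counter_ns()
        operation(item)
        samples.append(time.perf_counter_ns() - start)
    if not samples:
        raise ValueError("no calls were timed")
    ordered = sorted(samples)
    return {
        "count": len(samples),
        "min_ns": ordered[0],
        "max_ns": ordered[-1],
        "mean_ns": statistics.mean(samples),
        "median_ns": statistics.median(samples),
        # Rough 99th percentile: shows the occasional expensive call (a split, a resize).
        "p99_ns": ordered[int(0.99 * (len(ordered) - 1))],
    }


table = {}  # stand-in for the dictionary or tree under test
stats = time_calls(lambda key: table.__setitem__(key, key), range(100_000))
print(stats)
```

If the maximum is far above the median, the rare expensive inserts are what a single averaged figure was hiding.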
Remember that B-tree rules don't absolutely require you to immediately propagate a block split up to the index level above. A search may have to skip to the next "horizontal" block, until it finds a key higher than the candidate key. Some implementations are rather lax, leaving split blocks for quite some time before updating the index. This might reduce the average add time, at the expense of (asynchronous) "cleanup" routines and somewhat higher average search times.
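To illustrate the "skip to the next horizontal block" idea, here is a toy sketch in Python. It models only the leaf level; the `Leaf` class and the function names are hypothetical, and the "index" is simply a stale reference to the block that was split.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list
    right: Optional["Leaf"] = None  # horizontal pointer to the next block


def split_leaf(leaf):
    """Move the upper half into a new right sibling; the index above is NOT updated."""
    mid = len(leaf.keys) // 2
    sibling = Leaf(keys=leaf.keys[mid:], right=leaf.right)
    leaf.keys = leaf.keys[:mid]
    leaf.right = sibling
    return sibling


def find_leaf(start, key):
    """Follow horizontal pointers while the next block still starts at or below key."""
    leaf = start
    while leaf.right is not None and leaf.right.keys and leaf.right.keys[0] <= key:
        leaf = leaf.right
    return leaf


def contains(start, key):
    return key in find_leaf(start, key).keys


block = Leaf(keys=[1, 2, 3, 4, 5, 6])
stale_pointer = block  # what an index that was never updated still points at
split_leaf(block)
print(contains(stale_pointer, 5))
```

The search still finds key 5 through the stale pointer; it just pays for one extra hop, which is the "somewhat higher average search times" mentioned above.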
On average, B-trees are quite fast for searching. But one thing holds for all sorts of index structures: If you've got ten million new records to insert into the tree, the right way to do it is NOT one-by-one, in unsorted order. Any B-tree implementation should provide a "batch insert", sorting those ten million records by an O(n log n) method before inserting them into the tree, and doing this in sequential key order, filling up leaf blocks from one end, one block at a time, without worrying about the higher index levels, linking in new blocks through horizontal pointers as needed. Once the bottom layer is done (without concern for higher-level indexes), you go up one level to insert pointers to all the new blocks, completing that level before ascending further, level by level, up to the root.
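A rough Python sketch of such a bottom-up batch load, under simplifying assumptions: unique keys, an empty tree to start with, and "blocks" that are plain lists, so the horizontal links are implied by list order. `bulk_load` and `search` are illustrative names, not any library's API.

```python
from bisect import bisect_right


def bulk_load(records, leaf_capacity, fanout):
    """Build every level bottom-up from (key, value) pairs; returns leaves first, root last."""
    if leaf_capacity < 1 or fanout < 2:
        raise ValueError("leaf_capacity must be >= 1 and fanout >= 2")
    ordered = sorted(records, key=lambda pair: pair[0])  # the O(n log n) step
    # Fill leaf blocks from one end, one block at a time.
    leaves = [ordered[i:i + leaf_capacity] for i in range(0, len(ordered), leaf_capacity)]
    levels = [leaves]
    # Complete each index level before ascending, until a single root block remains.
    while len(levels[-1]) > 1:
        below = levels[-1]
        entries = [(node[0][0], position) for position, node in enumerate(below)]
        levels.append([entries[i:i + fanout] for i in range(0, len(entries), fanout)])
    return levels


def search(levels, key):
    """Descend from the root with one small binary search per level."""
    if not levels[0]:
        return None
    node_index = 0
    for depth in range(len(levels) - 1, 0, -1):
        node = levels[depth][node_index]
        first_keys = [first_key for first_key, _ in node]
        slot = max(bisect_right(first_keys, key) - 1, 0)
        node_index = node[slot][1]
    for leaf_key, value in levels[0][node_index]:
        if leaf_key == key:
            return value
    return None


# 1000 records arriving in scrambled key order.
records = [(key, key * key) for key in ((i * 37) % 1000 for i in range(1000))]
levels = bulk_load(records, leaf_capacity=4, fanout=5)
print([len(level) for level in levels], search(levels, 123))
```

Each level is written exactly once and every block ends up full, with no splits along the way. A real implementation would write blocks to storage and would probably leave some slack in each block for later inserts.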