Algorithms

5-Nov-08 4:40

For many varied reasons, I decided to write my own balanced binary tree library. One feature of my library that I haven't seen in other freely available libraries is "clusters", and a "recluster" function.

The problem that I am attacking is that as elements are randomly added to the tree, the tree rotations that automatically happen to keep the tree balanced tend to spread the edge pointers all over the place. The end result is that in a large tree, a typical traversal can require reading nodes from all over physical memory (or disk pages): a very inefficient use of cache. By clustering several subtree "near the root nodes" together in physical memory, tree traversals can be sped-up significantly by keeping nodes that always get visited near one another within physical memory (or on the same disk page). For example, in a properly clustered tree, with clusters large enough to hold 4 levels of records (1+2+4+8 = 15 records), a tree of height 16 (>45000 records) can be traversed in only 4 cluster reads instead of 16 reads that might be required of a non-clustered tree.

What I don't know is a good way to measure how "out-of-cluster" a tree is. I'd like to be able to determine when a tree needs to be reclustered. I'm sure there is good theory out there for this; and the idea doesn't seem to be that difficult, but I haven't run onto the solution yet. I've spent a little bit of time googling for a method, and probably a couple of hours thinking about it. Nothing yet.

Any ideas or helpful links?

David
---------
Empirical studies indicate that 20% of the people drink 80% of the beer. With C++ developers, the rule is that 80% of the developers understand at most 20% of the language. It is not the same 20% for different people, so don't count on them to understand each other's code.
http://yosefk.com/c++fqa/picture.html#fqa-6.6
---------

modified on Wednesday, November 5, 2008 11:33 AM

Member 41945935-Nov-08 6:28

5-Nov-08 6:28

Have you ever considered B+ trees as implemented by "The C Database Tool Chest", Mix Software. Mix is still on the web - Google "Mix Software". Note that the term "database" is, in fact, just a B+ tree. The database is a tree of memory or file blocks which contain multiple keys and records.

I got this in '85 and have used it ever since. I got the source software (don't know if this is still available). Not Open Source, but any programs that are released can be used without any royalty, only restriction is that it must be linked into an object, no DLLs, and the neither the source nor the library can be released - only the executable with the embedded code. With the source, you can make any modifications you want.

I modified it (originally 16bit - used Borland C to compile) to now use VS2008 thus 32 bit and supports >4GB files. Major rewrite to unwind the INT usage which now is 32 bit - in 16 bit an INT was 16 bit. Many conflicts with structure layout, and old 16 bit databases couldn't be read with the 32 bit version, nor the other way around. I built my version where I could declare which size of INT I wanted so I had only one converter to work with. I created two converters (16 bit and 32 bit) which could read the appropriate data base (in ascending order) and output a flat file with the data it contained, and then read in the flat file and write the other bit sized database.

Dave.

DQNOK5-Nov-08 6:56

5-Nov-08 6:56

Member 4194593 wrote:
Have you ever considered B+ trees as implemented by "The C Database Tool Chest", Mix Software.

No, I hadn't; but I am familiar with Mix, and have used both their Power C compiler and their C utilities toolchest.

I'll have to back-up and reconsider whether this would be a good approach to my end goal (which was NOT to rewrite the balanced binary tree algorithms).

I originally rolled my own balanced binary tree routines because the ones I found on the web didn't support duplicate keys, nor multi-threading. The clusters were an after-thought (actually, they came about from a "Hey, I can use the tree structure instead of yet-another-roll-my-own exotic structure -- IF I can speed them up sufficiently -- perhaps reclustering would do it).

Thanks for the thoughts.

Member 41945935-Nov-08 7:44

5-Nov-08 7:44

David,

Good that you like David. You be David, I be Dave. Good name.

The blocks in the B+ tree do not exactly correspond to the clusters you use, but do tend to group records together. The interesting thing is that the tree is self leveling. All blocks are identified by a block number. All of the full records are on the bottom level of the tree, linked horizontally by their block numbers, and have an array of offsets in each block to the records, thus constitute an "array", making it easy to go from one record to another. Above the "leaf" nodes at the bottom is an interior node or a layer of interior nodes which only contain the keys (not the additional data) and each key is associated with the block number containing the child nodes below. The first block (block 0) is a header block containing linking info (first leaf block, last leaf block, i.e. first record and last record), and the current root node. Whenever a node (either a leaf node or an interior node)) fills up (not when it is found to be full when a new record needs to be added, but whenever it becomes full due to the addition of a record), it attempts to offload some records to the prior same level block or following same level block, and if that is not possible, the block is split with half of the records in one and the second half in the other and these two blocks are linked. This also creates a new record for the parent block pointing to the block that just split, which may also cause the parent to split, on up the tree until maybe the root block needs to split thus creating a new root and updating the pointer of the header block. The blocks are allocated in memory with a buffer pool which is maintained with an LRU list, and when the buffer pool is full and a new block is needed, the LRU block is written (if dirty) to the database file, randomly by block number, or just discarded and reused if not dirty. Blocks can thus be either in-memory, or else on-file. Blocks are not written to file unless they hit LRU status and a new block is needed, or at the end, when the tree is closed. Multiple trees can be opened at one time (256 of then sharing the buffer pool). I build an application that needed to create a massive tree, then discard it and process other records comprised from the record numbers only. This caused the tree to be written out, then discarded. I wrote a new function to purge a tree without writing by just modifying the header record to indicate that it was empty, writing that out and closing the tree, then returning all buffer pool records to the pool, then discarding the file at the application level. Saved a massive amount of time.

Dave.

DQNOK6-Nov-08 4:54

6-Nov-08 4:54

This discussion has sent me looking into B+trees. Some of the little code I've found looks ridiculously simple (by comparison to my BalBinaryTree implementation). I have no idea of speed comparisons, but it makes sense that they would be fast or else RDBMS's wouldn't use them. Plus it seems they might actually be smaller since every record in my tree system carries 12 bytes of overhead.

Yes, the mix source is still available, but it sounds like the right source to have is Mix's (which of course requires a small $ outlay) plus your modifications.

My (junk) e-mail address is davidlqualls@yahoo.com
I don't check it every day; and it fills up with garbage pretty fast, but if you'll put "B+Tree" in the subject line, I'll be looking for it.

Thanks.

David

Member 41945936-Nov-08 7:57

6-Nov-08 7:57

DQNOK wrote:
which of course requires a small $ outlay

The operative word here is "small" not like VS, and the library an be included in an executable royalty free.

Dave.

Member 419459325-Apr-09 4:28

25-Apr-09 4:28

David,

Sorry this has taken so long to answer. It got buried on my email with Christmas correspondence, and you never responded to this thread again, and I got off on 15 other tangents.

Did you ever get the Mix source? I can't legally send you any "fixed" source unless you have the original source.

How is your balanced tree going?

Dave.

DQNOK27-Apr-09 3:13

27-Apr-09 3:13

Thanks for remembering me.

I looked into it and decided you were right; using the binary tree was the wrong approach. Binary trees just do too much memory thrashing.

I also got off on a really hot project and had to set it aside for a while.

I downloaded a couple of free BTree libraries, but I'm never happy with stuff I get from outside. They will have been developed with their own support libraries, which usually conflict with mine, or they are missing features I want, or are bloated with features I don't want, etc.

I really just wanted an in-memory index, without the burden of a disk-based system. However, I really should consider a product that has disk-based storage in case I want to expand later on.

I'm still interested, just haven't taken "the plunge" yet. It's not the money; it's the time committment in learning a new system and adapting all my stuff to their way of doing things.

I'm just responding to your post without having gone back and reviewed everything that was said to this point. I'm going to do that, plus, if you have anything else new to add, I'd be glad to hear it. Perhaps I should just go ahead and buy the source...

DQNOK27-Apr-09 5:57

27-Apr-09 5:57

A second response:

Have you noticed that Mix now sells a Win32 version of their database toolchest? I don't suppose you've compared their implementation against yours, have you? I know nothing except what they've posted on their web site.

David

Member 419459327-Apr-09 11:22

RePosted from C# Forums: A Job Exam Question

27-Apr-09 11:22

David,

It would be interesting to see how they converted it, however, I have already done that and it works great and I don't need to spend the money to see what they did. The primary fix was to get rid of all of the int's and make them CBT defined types, each one different so you knew what you were dealing with, then define these types as either Signed/Unsigned Longs or Signed/Unsigned Shorts depending on whether you wanted a version to be able to read an old database.

If you do get the source, I can give you my fixes for a PurgeTree to quickly kill a tree that is no longer needed.

Dave.

Bulky Fellow4-Nov-08 20:31

Bulky Fellow

4-Nov-08 20:31

My office is testing a bunch of folks who have applied for Java jobs. I don't know head and tails
of the language other than that its syntax is stolen from here and there (I'm still partyin' with
MFC/Win32, yay!). My buddy was the invigilator and got the questions' sheet. This was one of the
easiest assignments, and shockingly, only 3 out of the 14 people that attempted this problem finished
it correctly on time. Some took more than 45 minutes but got it right, and others didn't solve it at
all! I tried it for the hell of it beat the time by 29 minutes, finishing at 16 minutes (C/DevCPP),
which I guess is not bad. Post your times here, and see if you can bring it down below 16 minutes. Wink | ;)

Q7. (Note: You cannot answer Questions 8, 11 and 14 if you are answering this question.
Pick another 2 from the rest).
Write a command-line program as follows:

First the program will take as input a +ve integer from the user, let's call it
the TARGET.

It will then take in a set of +ve integers from the user, and here the user gets
to specify how many integers he/she wants to enter. Let's call this set the
RANGE.

Now the program should compute as follows:

- Find all combinations in the RANGE, the sum of which add up to the
  target. Print such combinations at the command prompt. In a combination,
  repeats are not allowed. i.e. (x + x + y) = TARGET is forbidden. Every element
  in the combination must be unique.

- In case a user enters a RANGE element greater than the target, it should be immediately
  discarded and the user must be prompted for the element again.
  
- Order shifted cases don't count. i.e. (x+y) and (y+x) being printed
  separately is unnecessary and MUST be avoided.
  
- In case no combinations exist, the program should report so.

- Add exception handling code where necessary.

Example Run:

Enter Target: 15
How many set elements?: 5
Enter Element1: 7
Enter Element2: 8
Enter Element3: 12
Enter Element4: 3
Enter Element5: 22
Sorry. That element is larger than your target. Try again.
Enter Element5: 4

Combinations are:
1] 7+8
2] 8+3+4
3] 12+3

Time: 45 mins.

(At the top of the source file, please comment in your ID details along with the reference
number at the back of the day-card you were provided by the examiner. Write your Full Name,
ID key, TDC number and most importantly the reference number. Good luck.)

ASP - AJAX is SEXY. PERIOD.

Arash Partow4-Nov-08 21:59

Arash Partow

4-Nov-08 21:59

Took roughly ~4mins, the next_combination function (similar to next_permutation) may someday end-up in the STL, if so probably 2mins or so.... Smile | :)

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>

template <class BidirectionalIterator>
bool next_combination(BidirectionalIterator first,
                      BidirectionalIterator k,
                      BidirectionalIterator last);

int main()
{

   std::vector<int> v;
   v.push_back(7);
   v.push_back(8);
   v.push_back(12);
   v.push_back(3);
   v.push_back(4);

   int target = 15;
   int combination_count = 0;

   std::sort(v.begin(),v.end());

   for(unsigned int i = 1; i <= v.size(); ++i)
   {
      std::vector<int> tmp;
      std::copy(v.begin(),v.end(),std::back_inserter(tmp));
      do
      {
         if (std::accumulate(tmp.begin(),tmp.begin() + i, 0) == target)
         {
            std::cout << ++combination_count << "] ";
            for(unsigned int j = 0; j < i; ++j)
            {
               std::cout << tmp[j] << (j < (i-1) ? "+" : "\n");
            }
         }
      }
      while(next_combination(tmp.begin(),tmp.begin() + i,tmp.end()));
   }

   if (0 == combination_count)
   {
      std::cout << "No combinations found." < std::endl;
   }
   return 0;
}


/*
   Credits: Mark Nelson http://marknelson.us/
*/
template <class BidirectionalIterator>
bool next_combination(BidirectionalIterator first,
                      BidirectionalIterator k,
                      BidirectionalIterator last)
{
   if ((first == last) || (k == first) || (k == last)) return false;
   BidirectionalIterator i = first;
   BidirectionalIterator ii = last;
   ++i;
   if (i == last) return false;
   i = last;
   --i;

   i = k;
   --ii;
   while (i != first)
   {
      if (*--i < *ii)
      { // Search down to find first comb entry less than final entry
         BidirectionalIterator j = k;
         while(!(*i < *j)) j++; // Find swap position [good-1-high|good-2-low]
         iter_swap(i,j); // Swap [good-2-high|good-1-low]
         i++;j++;ii=k; // Move to rotation positions
         rotate(i,j,last); // Rotate [good-2-low-high-good-1]
         while(j!=last){ ++j; ++ii;} // Find high good position
         rotate(k,ii,last); // Rotate [good-2-(low/high>|good-1-<low/high)]
         return true;
      }
   }
   rotate (first,k,last);
   return false;
}

Mark Churchill5-Nov-08 1:46

Mark Churchill

5-Nov-08 1:46

My 10 mins. I really wanted to do something cool with yield, but the difficulty of nesting these really stuffed me. I wanted to recursively traverse the mask like a b-tree and yield out results along the way. Frown | :(

    class SubsetSummer<br />
    {<br />
        static IEnumerable< long> SubsetSumCore(IList< int> values, int target)<br />
        {<br />
            long mask = (1 << values.Count);<br />
<br />
            while(mask-- > 0)<br />
            {<br />
                int total = 0;<br />
                for(int bit = 0; bit< values.Count; bit++)<br />
                {<br />
                    if ((mask & 1 << bit) > 0) total += values[bit];<br />
                }<br />
                   <br />
                if (total == target) yield return mask;<br />
            }<br />
        }<br />
<br />
        static IEnumerable< List< int>> SubsetSum(IList< int> values, int target)<br />
        {<br />
            foreach (var resultMask in SubsetSumCore(values,target))<br />
            {<br />
                var results = new List< int>();<br />
                for (int bit = 0; bit < values.Count; bit++)<br />
                {<br />
                    if ((resultMask & 1 << bit) > 0) results.Add(values[bit]);<br />
                }<br />
                yield return results;<br />
            }<br />
        }<br />
<br />
        static void Main(string[] args)<br />
        {<br />
            var values = new []{7, 8, 12, 3, 4};<br />
<br />
            foreach (var result in SubsetSum(values, 15))<br />
            {<br />
                Console.WriteLine(string.Join(" + ", result.ConvertAll(i => i.ToString()).ToArray()));<br />
            }<br />
        }<br />
    }

(wow that really shagged my generics)

Mark Churchill
Director, Dunn & Churchill Pty Ltd
Free Download: Diamond Binding: The simple, powerful, reliable, and effective data layer toolkit for Visual Studio.

Alpha release: Entanglar: Transparant multiplayer framework for .Net games.

riced9-Nov-08 4:03

riced

9-Nov-08 4:03

Sorry you have failed. The program does not meet the requirements. Where is the user input and validation?
If it does not need to meet them then this program will do the job and take a lot less time and thought.

class SubsetSummer
{
   static void Main(string[] args)
   {
      Console.WriteLine("3 + 4 + 8");
      Console.WriteLine("3 + 12");
      Console.WriteLine("7 + 8");
   }
}

Bulky Fellow9-Nov-08 7:45

Bulky Fellow

9-Nov-08 7:45

You forgot to place this in the punintended namespace. Wink | ;)

ASP - AJAX is SEXY. PERIOD.