Algorithms Discussion Boards

This approach has been done before, e.g. the implementation of qsort in the C runtime library transitions from quicksort to a (linear) insertion sort when the partition size gets below a certain threshold.

I want to calculate the surface of an area on a sphere including projection effect.
For this I have a scanned image (which results in a 2D disc). On this image I can highlight an area and I have the index of each pixel of that area relative to the center of that disc (x, y coordinates). I also have the radius (in pixels and meters)

How can I calculate the correction needed for each pixel in order to take in account the projection effect (larger at the edges of the disc)?

Thanks.

[SOLUTION]
It's a pretty long solution, so I'll try to be as brief as possible.
1. You need to scale down the sphere to a unit sphere (radius=1)
2. With that you know that x²+y²+z² = 1 and thus z = sqrt(1-x²-y²) and the surface (of a full sphere) is 4*pi*r² = 4*pi, but you only have half the sphere so S=2*pi
3. What you need is the cosine of the angle of the pixel with the Z-axis. This is the z calculated in step 2 (the sqrt) divided by the radius which you reduced to one.
4. In the end you get this formula for the area per pixel: (1/r² * 1/(2*pi*sqrt(1-x²-y²)) ). If you loop through the pixels of the complete area you just need to to sum up all these pixels.
5. This results in a fraction of the surface compared to the sphere surface which you can then convert to any unit you like.
On first sight this "looks" correct Smile | :)

[/SOLUTION]

V.

(MQOTD rules and previous solutions)

modified 8-Aug-14 7:50am.

open-source GIS libraries have solved this kind of problems in much more general form, iff u get the time to decipher them Sniff | :^)

I've been studying lock free algorithms, specifically the circular lock free ring buffer. I fully understand what they're doing, but am having trouble applying that to some of my code. It seems that the lock free ring buffer is meant for situations where the queue is neither starved nor saturated for any "significant" (in CPU terms) period of time.

For example, I recently had some code with multiple producers and a single consumer. When the server was running at a normal load, the lock free ring buffer would have been an excellent solution. However, when the server load was low it seems the consumer would just be spinning, using quite a bit of CPU doing nothing. Contra-wise, when the server load was high, the producers would be spinning (in the current code, if this happens, the producers have a wait of a certain period at which point, they discard their data and grab the next set [losing data periodically was permissible since the next set would allow a full reconstruction, albeit with stutter]), but how long do they spin? Can't use just a count since time is important in this situation, but grabbing clock is expensive and simply looping the thread consumes a lot of CPU, depriving other producers of their quanta.

Am I missing something?

I posted a lock-free stack a few days ago. There would be no such issues of spinning/locking up the cpu. The only "problem" with it compared to the circular buffer you're using now is that the stack is FILO instead of FIFO.

A Fundamental Lock-Free Building Block - The Lock-Free Stack[^]

Hope it helps, lemme know if you have any questions.

Michael Gazonda wrote:
There would be no such issues of spinning/locking up the cpu

This is the part I struggle with. If the stack is full, how does the CPU not spin? Likewise, if the stack is empty, pop returns false. Then what?

With a stack, there is no "full". It's like a single-linked list. Each item points to the next, and carries with it the space required for doing so.

If the stack is empty, and returns false, then that's up to you to handle in whatever way is appropriate.

For myself, I wrote this to handle memory allocation (where I would manually allocate on false), or as a messaging system where false just meant there was no work to do and I would wait on a condition_variable.

Why not use one of the many open source message queues out there? This is exactly what they do. I'm assuming, of course, that you are trying to implement a circular buffer to solve a problem and not the other way around Smile | :)

.

* a message queue solves the producer / consumer problem for you
* a message queue solves the distributed workload problem for you
* a message queue can be configured to use either memory / database / files / etc. as a backing store, so you never lose a work request or response
* clustering / scaling / fail over, etc. is all built in for you

I used Apache ActiveMQ on a project with an arbitrary number of producers and about 400 - 500 consumers (VMWares) and it worked great. Producers would submit work requests and the 400 - 500 consumers would just cycle through the various queues looking for work they knew how to do.

The cool thing about ActiveMQ is that it supports push notifications so you don't use any CPU time during idle time.

Because I'm trying to understand lock free algorithms.

(I former colleague evaluated ActiveMQ and found it didn't scale well. Neither did RabbitMQ. For the previous app I worked on, both would have been completely inadequate.)

Yeah, that's why I was asking if you were trying to reinvent the wheel Smile | :)

.

Just as an aside with ActiveMQ, it actually scales EXTREMELY well, you just need to configure it exactly right and use it exactly right. They have a TON of configuration options which are pretty poorly documented IMO. Just mis-setting one of them can kill your performance.

One complaint I do have about it is that it doesn't handle a heavy onslaught of large messages very well. My original implementation tried to send 1MB to 2MB messages and it would crash often (probably a configuration issue as we wanted to slam messages through and didn't care if they got lost). I then started compressing the messages with LZ4 and they went down to about 400k to 500k and the system got pretty damn stable for only having one server and 400 to 500 clients and performance was much better. They are actually tweaking the system right now to send really small messages that point to file shares to pick up the work load.

And nothing against you or your ability, but I'd be surprised if you are able to create a production system that can top one of the widely used message queues out there. At least not in a reasonable amount of time. I don't think I could either. That's all those guys do.

If you are doing it as a learning process, well, that's a different story Smile | :)

.

On the other hand, if you have to deliver a system that's going to scale to 1000 clients and millions of messages / day in a few weeks, well... Smile | :)

.

SledgeHammer01 wrote:
One complaint I do have about it is that it doesn't handle a heavy onslaught of large messages very well

Yup.

SledgeHammer01 wrote:
On the other hand, if you have to deliver a system that's going to scale to 1000 clients and millions of messages / day in a few weeks, well...

What, again, was your definition of scalable? Smile | :)

SledgeHammer01 wrote:
And nothing against you or your ability, but I'd be surprised if you are able to create a production system that can top one of the widely used message queues out there. At least not in a reasonable amount of time. I don't think I could either. That's all those guys do.

Actually, I have, though the use was very narrow and specialized. Again, our testing found that all the commercial solutions had performance bottlenecks or just plain fell apart when trying to use them with massive amounts of data.

BTW, in one app, switching from ZLib to LZ4 was a significant performance gain that overwhelmed any decrease due to slightly larger message sizes.

Joe Woodbury wrote:
What, again, was your definition of scalable? Smile |

I meant if your task was to deliver that kind of scale from a home grown solution in a few weeks, that would be tough. If you managed to do that, that's very impressive.

Joe Woodbury wrote:
Actually, I have, though the use was very narrow and specialized. Again, our testing found that all the commercial solutions had performance bottlenecks or just plain fell apart when trying to use them with massive amounts of data.

BTW, in one app, switching from ZLib to LZ4 was a significant performance gain that overwhelmed any decrease due to slightly larger message sizes.

Yes, there are performance penalties due to all the configuration options.

A slight aside on this topic... we found with our testing that sending large payloads through a message queue (and we originally had a home grown solution as well) is very inefficient.

Say for example, your payload is 1MB files.

Your producer needs to:

1) load the file / deserialize it (or just load it as a binary blob)
2) possibly serialize it back to a binary blob, compress it
3) post it to a queue (whether your own or a commercial one)
4) possibly do some preprocessing on the blob before sending it over the wire

The client needs to:

5) get the large blob
6) decompress it
7) deserialize it
8) do the work

It is more efficient to put the payload on a file share and in the payload just send the UNC path to the file share. This way you get rid of the whole compression / decompression step / multiple serialize / de-serialization steps.

In our project, we have to process 4 BILLION 1MB payloads, so...

ANYWAYS... back to your original question...

Whatever you use, the producer "send" should be instant fire-and-forget and this includes if it needs to get a response. It should do that in an async callback (push notification). There should be pretty much 0% CPU usage when the producer is idle and it should never be blocked. If it has something to submit, it should be able to at any time. The server needs to be able to deal with a flood. Unless your server has a ton of memory and you are guaranteed the clients can keep up at any load, you'd need to have a backing store of some kind when the memory gets filled up.

As far as the client side goes, it sounds like you are polling your circular queue for work. NOTHING in your design should be polling. Polling is a big no-no. Everything should be async push notifications. Sleep / poll is not a good design. Should be push / WaitForSingleObject. WaitForSingleObject does not consume any CPU resources.

SledgeHammer01 wrote:
As far as the client side goes, it sounds like you are polling your circular queue for work. NOTHING in your design should be polling. Polling is a big no-no. Everything should be async push notifications. Sleep / poll is not a good design. Should be push / WaitForSingleObject. WaitForSingleObject does not consume any CPU resources.

That's the point of my question. When doing a lock-free algorithm, how do you avoid polling (due to starvation or saturation)? In the discussions of the well-known circular queue, this is never actually answered--it assumes that a natural balance will always be present.

For the record, my various solutions have used locks.

Joe Woodbury wrote:
That's the point of my question. When doing a lock-free algorithm, how do you
avoid polling (due to starvation or saturation)? In the discussions of the
well-known circular queue, this is never actually answered--it assumes that a
natural balance will always be present.

For the record, my
various solutions have used locks.

I'm not familiar with your requirements, but a circular buffer wouldn't be my first choice for this (unless your requirement is to discard data when the buffer is full). I would use something more like a queue. But right now your code probably looks like this:

while (true)
{
someClass theTask;

if (buffer.GetWork(out theTask))
{
DoWork(theTask);
}
}

or something of that nature. That is a *very* tight loop when no work is available and will kill the CPU. A hack in the polling design is to add a sleep to alleviate the pressure on the CPU. You'll still have some CPU usage, but it'll be much less. A drawback would be, of course, that your client couldn't do any work during the sleep time which is a huge waste of CPU time. I.e. if you slept for 1000ms, and you got a message at 10ms in, your client wouldn't do anything for another 990ms.

A push / no polling design would be something like where your buffer class exposes a waitable handle such as a mutex. Now your client code looks like:

while (true)
{
WaitForSingleObject(buffer.theMutex);
DoWork(buffer.GetWork());
}

or something like that. The WaitForSingleObject() will block with 0% CPU usage. It will only return when the mutex is set.

So...

Your buffer class would set the mutex whenever data is inserted into the queue (in the insert method). The GetWork() method would reset the mutex when there are no remaining items (idle time).

Not sure how you are sending over the network, but sockets support async push notifications, so you'd have to make sure you use that as well. In the socket async callback, you'd insert into the queue which would set the mutex.

No offense, but I already know what your writing--I've written several very high performance algorithms to solve this--my question is purely about how various lock-free algorithms handle starvation/saturation. Various articles on lock-free algorithms make the bold statement that the obvious polling condition, with accompanying CPU saturation, will never occur but never explain why not.

Hello,

I'm facing a strange problem. I have a word for example "ABCDE". I want to match this word againsta sentence like "ABC EFGH IJKL DE". As you can see, the query word is splited and distributed across the senetence. How can I design algorithm which can give me result like the query word has acombination in the sentence and their position.

An example;

QUERY word: ABCDEF
TARGET sentence: EFGH UIOP ABC GHY JKLU EF

I expect the output as the query word has a perfect match by the combining words 3,6.

I tried several algorithms like smithwatermann, levenstein distance and others, but I could figure out how I can compare string with randomly distributed sub-strings in a sentence.

One way of doing is, breaking sentence into words and make all possible combinations and start matching with query which would take forever if I have 500 words.

Any help or suggestion would be highly appreciated.

Hi srikanth2321,
probably there are some better ways to do that, but my suggestion is the most simple:

Step1: Do Wildcard search for *ABCDE*
Step2: Do Wildcard search for *ABCD*E*
Step3: Do Wildcard search for *ABC*DE*
Step4: Do Wildcard search for *AB*CDE*
Step5: Do Wildcard search for *A*BCDE*

My friend, I sense the problem, you should be more precise both in testing and explaining, for instance your second example is buggy.

If you want to play you can try my superb search tool Kazahana (downloadable here at CP).
Let me know if you find any difficulties.

The first exmple done with Step3:

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>copy con test.txt
ABC EFGH IJKL DE
EFGH UIOP ABC GHY JKLU EF
^Z
        1 file(s) copied.

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type test.txt
ABC EFGH IJKL DE
EFGH UIOP ABC GHY JKLU EF

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>Kazahana.exe "*ABC*DE*" test.txt 999
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, r. 1-++fix+nowait_critical_nixFIX_Wolfram+fixITER+EX+CS, copyleft Kaze 2014-Mar-25.
Enforcing Case Insensitive wildcard mode ...
Enforcing SLOW wildcard mode ...
Pattern: *abc*de*
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating Master-Buffer 999KB ... OK

Kazahana: Total/Checked/Dumped xgrams: 2/2/1
Kazahana: Performance: 0 KB/clock
Kazahana: Performance: 0 xgrams/clock
Kazahana: Performance: Total/fread() clocks: 17/0
Kazahana: Performance: I/O time, i.e. fread() time, is 0 percents
Kazahana: Performance: RDTSC I/O time, i.e. fread() time, is 0 ticks
Kazahana: Done.

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type Kazahana.txt
ABC EFGH IJKL DE

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>

Okay, this is more serious, 31 patterns are needed to exhaust your request.

0 internal '*':
Step01: *ABCDE*

1 internal '*':
Step02: *ABCD*E*
Step03: *ABC*DE*
Step04: *AB*CDE*
Step05: *A*BCDE*

2 internal '*':
Step06: *A*B*CDE*
Step07: *A*BC*DE*
Step08: *A*BCD*E*
Step09: *AB*C*DE*
Step10: *AB*CD*E*
Step11: *ABC*D*E*

3 internal '*':
Step12: *A*B*C*DE*
Step13: *A*B*CD*E*
Step14: *A*BC*D*E*
Step15: *AB*C*D*E*

Above 15 patterns are optional, they would yield hits with 'high' rank earlier than all the 31 ones below.

Finding all substrings (without elision):
Step01+15: *A*B*C*D*E*

Finding all substrings (with elision):
C(5,4) equals number of wildcard patterns, 5 choose 4 = 5!/(4!*(5-4)!) = 5*4*3*2*1/(4*3*2*1*1) = 5
Step02+15: *B*C*D*E*
Step03+15: *A*C*D*E*
Step04+15: *A*B*D*E*
Step05+15: *A*B*C*E*
Step06+15: *A*B*C*D*

C(5,3) equals number of wildcard patterns, 5 choose 3 = 5!/(3!*(5-3)!) = 5*4*3*2*1/(3*2*1*2*1) = 10
Step07+15: *A*B*C*
Step08+15: *A*B*D*
Step09+15: *A*B*E*
Step10+15: *A*C*D*
Step11+15: *A*C*E*
Step12+15: *A*D*E*

Step13+15: *B*C*D*
Step14+15: *B*C*E*
Step15+15: *B*D*E*

Step16+15: *C*D*E*

C(5,2) equals number of wildcard patterns, 5 choose 2 = 5!/(2!*(5-2)!) = 5*4*3*2*1/(2*1*3*2*1) = 10
Step17+15: *A*B*
Step18+15: *A*C*
Step19+15: *A*D*
Step20+15: *A*E*

Step21+15: *B*C*
Step22+15: *B*D*
Step23+15: *B*E*

Step24+15: *C*D*
Step25+15: *C*E*

Step26+15: *D*E*

C(5,1) equals number of wildcard patterns, 5 choose 1 = 5!/(1!*(5-1)!) = 5*4*3*2*1/(1*4*3*2*1) = 5
Step27+15: *A*
Step28+15: *B*
Step29+15: *C*
Step30+15: *D*
Step31+15: *E*

In essence the number e.g. C(5,4) equals all ways to order any four of five elements, which is 5!/(5-4)! divided by all ways to order any group of 4, which is 4!.
Thus, the number of distinct ways to choose 4 out of 5 is 120 divided by 24.

By chance I wrote the fastest wildcard search function (iterative) so these 31 patterns won't be much of a problem:

int WildcardMatch_Iterative_Kaze(const char* mask, const char* name) {
// Revision 1:
/*
const char* maskSTACK;
const char* nameSTACK;
int BacktrackFlag = 0;
Backtrack:
for (nameSTACK = name, maskSTACK = mask; *nameSTACK; ++nameSTACK, ++maskSTACK) {
	if (*maskSTACK == '&') {
		mask = maskSTACK+1;
		if (!*mask) return 1;
		name = nameSTACK;
		BacktrackFlag = -1;
		goto Backtrack;
	}
	//else if (*maskSTACK == '+') {
	//} else {
	else if (*maskSTACK != '+') {
		//if (tolower(*nameSTACK) != tolower(*maskSTACK)) { // These 'tolower's are outrageous, they hurt speed BADLY, in real-world usage both should have been lowercased outwith the 'for'.
		if (*nameSTACK != *maskSTACK) { 
			if (!BacktrackFlag) return 0;
			name++;
			goto Backtrack;
		}
	} 
} 
while (*maskSTACK == '&') ++maskSTACK;
return (!*maskSTACK);
*/
// Revision 2, 2013-Nov-30:
const char* maskSTACK;
const char* nameSTACK;
//int BacktrackFlag = 0; // No need of it in rev.2
/*
    // Simplest heuristic with SUPEROVERHEAD enforced: trying to skip the whole wildcard section by comparing the two arrays order 1 of mask&name.
    int i;
    unsigned char maskOrder1[256];
    unsigned char nameOrder1[256];
    for (i='a'; i <= 'z'; i++) { maskOrder1[i]=0; nameOrder1[i]=0;}
    for (i=0; i < strlen(mask); i++) maskOrder1[mask[i]]=1;
    for (i=0; i < strlen(name); i++) nameOrder1[name[i]]=1;
    // Assuming the incoming strings are already lowercased (as it should for speed) and if we don't have matching alphabet parts (from mask side) means we don't need to compare any further i.e. the match fails.
    for (i='a'; i <= 'z'; i++) if ( maskOrder1[i] == 1 && nameOrder1[i] == 0 ) return 0;
*/
/*
    int i;
    for (i=0; i < strlen(mask); i++) {
        if ( mask[i] != '&' && mask[i] != '+' )
            if ( !memchr(name,mask[i],strlen(name)) ) return 0;
    }
*/
/*
Backtrack:
for (nameSTACK = name, maskSTACK = mask; *nameSTACK; ++nameSTACK, ++maskSTACK) {
	if (*maskSTACK == '&') {
		mask = maskSTACK+1;
		if (!*mask) return 1;
		name = nameSTACK;
		BacktrackFlag = -1;
		goto Backtrack;
	}
	//else if (*maskSTACK == '+') {
	//} else {
	else if (*maskSTACK != '+') {
		//if (tolower(*nameSTACK) != tolower(*maskSTACK)) { // These 'tolower's are outrageous, they hurt speed BADLY, in real-world usage both should have been lowercased outwith the 'for'.
		if (*nameSTACK != *maskSTACK) { 
			if (!BacktrackFlag) return 0; // Stupid branching, SLOW!
			name++;
			goto Backtrack;
		}
	} 
} 
*/
// Here, outside the main/second 'for', in order to avoid branching we need to set the old/obsolete BacktrackFlag i.e. we need first occurrence of '&':
for (name, mask; *name; ++name, ++mask) {
	if (*mask == '&') {
		goto Backtrack;
	}
	//else if (*mask == '+') {
	//} else {
	else if (*mask != '+') {
		if (*name != *mask) { 
			return 0;
		}
	} 
}
// We are entering the main/second 'for' with mask pointing to '&' as if BacktrackFlag is already set in the very first iteration at first condition:
Backtrack:
for (nameSTACK = name, maskSTACK = mask; *nameSTACK; ++nameSTACK, ++maskSTACK) {
	if (*maskSTACK == '&') {
		mask = maskSTACK+1;
		if (!*mask) return 1;
		name = nameSTACK;
		//BacktrackFlag = -1;
		goto Backtrack;
	}
	//else if (*maskSTACK == '+') {
	//} else {
	else if (*maskSTACK != '+') {
		//if (tolower(*nameSTACK) != tolower(*maskSTACK)) { // These 'tolower's are outrageous, they hurt speed BADLY, in real-world usage both should have been lowercased outwith the 'for'.
		if (*nameSTACK != *maskSTACK) { 
			//if (!BacktrackFlag) return 0; // Stupid branching, SLOW!
			name++;
			goto Backtrack;
		}
	} 
}
while (*maskSTACK == '&') ++maskSTACK;
return (!*maskSTACK);
}

If you can write code generating, in particular, those 31+15 patterns you by all means could name the algorithm after your name, vanity is the strangest teachers of all, one Japanese aikidoka said that.

In case of you being stronger than pride my proposal is to call the whole process 'bunyipping', SOED defines 'bunyip' as:
bunyip, noun.
Austral. M19.
[Aboriginal.]
A fabulous monster of swamps and lagoons.

A ringy word, yes? Quite as 'troll' but not burdened with negativity of internetish jargon.

troll
v. tr.
Slang: To patrol (an area) in search for someone or something: "[Criminals] troll bus stations for young runaways" (Pete Axthelm).
n.
A supernatural creature of Scandinavian folklore, variously portrayed as a friendly or mischievous dwarf or as a giant, that lives in caves, in the hills, or under bridges.
/HERITAGE/

I've created a new sorting algorithm which allows O(1) sorting of ordered data, O(log (root n)) sorting of partially sorted data, and O(log n) sorting of random data.

The underlying mechanism is well suited for a general purpose data structure, which I'm calling a binary search cube, and has notable advantages over binary trees. The data structure is cube-shaped rather than tree-shaped.

I've described the data structure and sorting algorithm on the following webpage:

https://sites.google.com/site/binarysearchcube/

The site also includes three programs written in C, one of them implementing cubesort using a 4 dimensional search cube, and an implementation of a 3 and 4 dimensional binary search cube.

Haven't been able to properly benchmark the software against other sorting algorithms and data structures, but the memory overhead is significantly less.

Hoping for some interesting feedback.

Hi Mr. Hoven,
looks nice feels exciting, surely the showdowns are gonna be must-see.

My suggestion is to make a simple console tool to ease the benchmarking, I will be the first to run some superheavy English phraselists/wordlists (up to 400,000,000 lines) on my laptop T7500.

One very useful filedataset is to sort Knight-Tours (64x2=128 bytes long lines), because you can generate even billions of them in reasonable time, my package (that do that) can be downloaded freely at: www.sanmayce.com/Downloads/Sandokan_r3_vs_Windows-sort.zip[^]

My old console sorter Sandokan outperformed (only for internal sorting) the built-in Windows 7 64bit sort.com:

E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>GENERATE_3million_Knight-Tours_and_SORT_them.bat

Microsoft Windows [Version 6.1.7601]
Generating 3,000,000 Knight-Tours and dumping them into file 390,000,000 bytes ...

E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>Knight-tour_r8dump_Microsoft_V16_32bit_Ox.exe a8 3000000  1>KT3million.txt
Sorting these 3,000,000 Knight-Tours ...

E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>"Sandokan_QuickSortExternal_4+GB_32bit_Microsoft.exe" KT3million.txt /fast /descend 768
Sandokan_QuickSortExternal_4+GB r.3, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 390,000,000
Counting lines ...
Lines encountered: 3,000,000
Longest line (including CR if present): 129
Allocated memory for pointers-to-lines in MB: 22
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 4
Trying to allocate memory for the file itself in MB: 371 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 3,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
| RightEnd: 000,001,500,225; NumberOfSplittings: 0,000,335,016; Done: 100% ...
NumberOfComparisons: 71,052,662
The time to sort 3,000,000 items via Quicksort+Insertionsort was 21,029 clocks.
Dumping the sorted data ...
- Done 100% ...
Dumped 3,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dump time: 1,560 clocks.
Total time: 27,144 clocks.
Performance: 14,366 bytes/clock.
Done successfully.

E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>sha1sum_Microsoft_V16_32bit_Ox.exe "QuickSortExternal_4+GB.txt"
a053faa74ffb6cad41e61e273a1a0e0049cb25e7  QuickSortExternal_4+GB.txt

E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>timer sort /M 1048576 /R /T e: KT3million.txt /O KT3million.txt.Windows
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31

Kernel Time  =     0.530 =    1%
User Time    =    28.610 =   96%
Process Time =    29.140 =   97%
Global Time  =    29.801 =  100%

E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>sha1sum_Microsoft_V16_32bit_Ox.exe KT3million.txt.Windows
a053faa74ffb6cad41e61e273a1a0e0049cb25e7  KT3million.txt.Windows

Simply my proposal is to collect several such tools in one package, it will be very informative and much fun.

modified 28-Jun-14 14:40pm.

In my own tests cubesort is 2 times slower than mergesort for random integers and 2.5 times faster for sorted integers. This is using the latest version which I uploaded today and improves performance by about 25%.

I'm not sure if this gap can be closed as mergesort has superior cache performance for random data. I haven't been able to find a decent quicksort implementation.

Cubesort seems best suited for cases where a data set is for more than 50% in order.

When I have a couple of hours I'll make a string based version of cubesort (very easy) and see how fast it sorts the file.

Does the file contain duplicates?

Edit:

It appears the file is in reverse order. Takes about 3.5 seconds to load the file, 5 seconds to sort it using cubesort.

modified 29-Jun-14 19:08pm.

Having benchmarked Cubesort with Knight-Tours my amateurish evaluation is of two words: Mutsi! Mutsunka!

My Sandokan (r3fix) is inferior compared to Cubesort, 10,000,000 lines (128 bytes each) were sorted and after that sorted file was sorted once more:

------------------------------------------------------------------------------------------
| sorter \ filedataset | KT10000000.txt                      | KT10000000_SORTED.txt     |
------------------------------------------------------------------------------------------
| Cubesort             | 080 seconds                         | 045 seconds               |
| Windows' sort.exe    | 118 seconds                         | 115 seconds               |
| Sandokan             | 131 seconds; swappings: 36,826,242+ | 114 seconds; swappings: 0 |
------------------------------------------------------------------------------------------

The benchmark is downloadable at: www.sanmayce.com/Downloads/TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit.zip[^]

The full log follows:

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>dir
 Volume in drive E is SSD_Sanmayce
 Volume Serial Number is 9CF6-FEA3

 Directory of E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit

06/29/2014  11:45 PM    <DIR>          .
06/29/2014  11:45 PM    <DIR>          ..
06/29/2014  11:45 PM            12,566 cubesort-1.0_non-original.c
06/29/2014  11:45 PM            86,016 cubesort-1.0_non-original.exe
06/29/2014  11:45 PM           202,880 Cubesort.zip
06/29/2014  11:45 PM            24,490 Knight-tour_r8dump.c
06/29/2014  11:45 PM            79,872 Knight-tour_r8dump.exe
06/29/2014  11:45 PM               116 Make_EXEs.bat
06/29/2014  11:45 PM             1,604 MokujIN prompt.lnk
06/29/2014  11:45 PM           301,406 Sandokan_Logo.pdf
06/29/2014  11:45 PM           170,433 Sandokan_QuickSortExternal_4+GB_r3fix.c
06/29/2014  11:45 PM           122,368 Sandokan_QuickSortExternal_4+GB_r3fix.exe
06/29/2014  11:45 PM             6,144 timer64.exe
06/29/2014  11:45 PM             1,111 TriShowdown_Sandokan_vs_Windows_vs_Cubesort.bat
              12 File(s)      1,009,006 bytes
               2 Dir(s)  27,537,633,280 bytes free

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>move ..\KT10000000.txt
        1 file(s) moved.

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>TriShowdown_Sandokan_vs_Windows_vs_Cubesort.bat
Sorting 10,000,000 KT...

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe cubesort-1.0_non-original.exe KT10000000.txt
Cubesort: sorted 10000000 elements in 66503 clocks.

Kernel  Time =     6.754 =    8%
User    Time =    54.756 =   68%
Process Time =    61.511 =   76%    Virtual  Memory =   2676 MB
Global  Time =    80.355 =  100%    Physical Memory =   2447 MB
Sorting 10,000,000 KT...

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe sort /M 1048576 /T D: KT10000000.txt /O WindowsSORT_ResultantFile.txt

Kernel  Time =     1.466 =    1%
User    Time =    88.249 =   74%
Process Time =    89.716 =   75%    Virtual  Memory =   1028 MB
Global  Time =   118.248 =  100%    Physical Memory =   1021 MB

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>fc Cubesort_ResultantFile.txt WindowsSORT_ResultantFile.txt /b
Comparing files Cubesort_ResultantFile.txt and WINDOWSSORT_RESULTANTFILE.TXT
FC: no differences encountered

Sorting 10,000,000 KT...

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe "Sandokan_QuickSortExternal_4+GB_r3fix.exe" KT10000000.txt /fast /ascend 512
Sandokan_QuickSortExternal_4+GB, revision 3fix, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 1,300,000,000
Counting lines ...
Lines encountered: 10,000,000
Longest line (including CR if present): 129
Allocated memory for pointers-to-lines in MB: 76
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 8
Trying to allocate memory for the file itself in MB: 1239 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 10,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
\ RightEnd: 000,001,500,225; NumberOfSplittings: 0,001,676,176; Done: 100% ...
NumberOfSwappings: 36,826,242
NumberOfComparisons: 255,105,484
The time to sort 10,000,000 items via Quicksort+Insertionsort was 106,704 clocks.
Dumping the sorted data ...
\ Done 100% ...
Dumped 10,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dump time: 6,739 clocks.
Total time: 131,258 clocks.
Performance: 9,904 bytes/clock.
Done successfully.

Kernel  Time =     5.397 =    4%
User    Time =   116.173 =   88%
Process Time =   121.571 =   92%    Virtual  Memory =   1320 MB
Global  Time =   131.461 =  100%    Physical Memory =   1321 MB

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>fc QuickSortExternal_4+GB.txt WindowsSORT_ResultantFile.txt /b
Comparing files QuickSortExternal_4+GB.txt and WINDOWSSORT_RESULTANTFILE.TXT
FC: no differences encountered

Sorting sorted 10,000,000 KT...

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe cubesort-1.0_non-original.exe WindowsSORT_ResultantFile.txt
Cubesort: sorted 10000000 elements in 28064 clocks.

Kernel  Time =     6.723 =   14%
User    Time =    27.284 =   60%
Process Time =    34.008 =   74%    Virtual  Memory =   2661 MB
Global  Time =    45.474 =  100%    Physical Memory =   2560 MB
Sorting sorted 10,000,000 KT...

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe sort /M 1048576 /T D: WindowsSORT_ResultantFile.txt /O WindowsSORT_ResultantFile_.txt

Kernel  Time =     1.170 =    1%
User    Time =    76.237 =   66%
Process Time =    77.407 =   67%    Virtual  Memory =   1028 MB
Global  Time =   115.081 =  100%    Physical Memory =   1021 MB

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>fc Cubesort_ResultantFile.txt WindowsSORT_ResultantFile_.txt /b
Comparing files Cubesort_ResultantFile.txt and WINDOWSSORT_RESULTANTFILE_.TXT
FC: no differences encountered

Sorting sorted 10,000,000 KT...

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe "Sandokan_QuickSortExternal_4+GB_r3fix.exe" WindowsSORT_ResultantFile.txt /fast /ascend 512
Sandokan_QuickSortExternal_4+GB, revision 3fix, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 1,300,000,000
Counting lines ...
Lines encountered: 10,000,000
Longest line (including CR if present): 129
Allocated memory for pointers-to-lines in MB: 76
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 8
Trying to allocate memory for the file itself in MB: 1239 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 10,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
\ RightEnd: 000,010,000,000; NumberOfSplittings: 0,001,611,392; Done: 100% ...
NumberOfSwappings: 0
NumberOfComparisons: 214,016,797
The time to sort 10,000,000 items via Quicksort+Insertionsort was 90,278 clocks.
Dumping the sorted data ...
\ Done 100% ...
Dumped 10,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dump time: 5,429 clocks.
Total time: 114,488 clocks.
Performance: 11,354 bytes/clock.
Done successfully.

Kernel  Time =     6.364 =    5%
User    Time =    99.372 =   86%
Process Time =   105.737 =   92%    Virtual  Memory =   1320 MB
Global  Time =   114.691 =  100%    Physical Memory =   1321 MB

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>fc QuickSortExternal_4+GB.txt WindowsSORT_ResultantFile_.txt /b
Comparing files QuickSortExternal_4+GB.txt and WINDOWSSORT_RESULTANTFILE_.TXT
FC: no differences encountered

E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>

Cubesort is one lovely piece of software, Mr. Hoven I salute you with Beth Ditto's Heavy Cross[^]:

It's a cruel cruel world to face on your own
A heavy cross to carry alone
The lights are on but everyone's gone
And it's cruel
It's a funny way to make ends meet
When the lights are out on every street
It feels alright but never complete
without you
I chose you
If it's already have been done
Undo it
It takes two
It is up to me and you to proove it
On the rainy nights even the coldest days
you're moments ago but seconds away
The principle of Nature it's true but it's a cruel world
See we can play it safe or play it cool
Follow the leader or make up all the rules
Whatever you want the choice is yours
So choose me
I chose you

"If it's already have been done... undo it." - I wish Beth was into programming.

The most decent Quicksort I know is this:
https://code.google.com/p/libdivsufsort/[^]

libdivsufsort implements a suffix array, which according to some sources is about as fast as quicksort.

I made some improvements to the memory handling of Cubesort and it now is 1.21 times slower than mergesort for random data, and 5.75 times faster for sorted data.

Might see further improvement gains by setting BSC_L to 1 when comparing strings.

Cubesort is awesome, with big potential on top of that.
The thing that impresses me most is its construction speed, not to mention the 'free' dump.

Couldn't resist not to try my 'secret weapon' - a frozen/unfinished sub-project called 'Tcheburaschkasort'.
It is entirely based on the titaniumESQUEly strong Bayer-McCreight algorithm which I implemented specifically and only as order 3 (three children maximum).
It's been some time since I wrote it, enough to forget this-and-that, but the initial idea was to use next dimension i.e. to use millions of B-trees.

My unfinished Tcheburaschkasort currently uses 24bit pseudo-hash (in fact first 1|2|3 byte(s) ARE a slot) giving a forest of 16,777,216 B-Trees of order 3:

UINT PseudoHash(const char *str, unsigned int wrdlen)
{
	UINT hash32 = 0;
	const char *p = str;
	// Use 256-based system for first 3 bytes i.e. left is more significant:
	// Currently 3bytes long i.e. 24bit:
	if (wrdlen>=1) {
		hash32 =  hash32 | ((*p)<<16);
		p += 1;
	}
	if (wrdlen>=2) {
		hash32 =  hash32 | ((*p)<<8);
		p += 1;
	}
	if (wrdlen>=3) {
		hash32 =  hash32 | ((*p)<<0);
		p += 1;
	}
	return hash32;
}

Each slot houses the root of a tree.
Not using any hash Tcheburaschkasort resembles a hypercube of order 0, using hash makes it 1D i.e. of order 1 - lookup hashtable is the axis.

Blah-blah, in short the randomness of those prefixes (1..3bytes long) decides the speed performance.
In example with KT10000000.txt all first 3 bytes are equal, 'A8C', so there is one only tree and performance is poor.
Because all the 10,000,000 lines start from board position A8.
Knight-tours: each square represents a board position, as A1(bottom-left), H8(top-right):

-----------------
|a8 |b8 |...|h8 |
-----------------
|a7 |...|       |
-----------------
|...|...|       |
-----------------
|a1 |...|...|h1 |
-----------------

When making the data more diverse, e.g. reverse the order of tours then for KT10000000.txt we will have 50 trees, see below.
Let us see what boost comes from this diversity.

So, Tcheburaschkasort vs Cubesort results for KT10M filedataset (10,000,000 Knight-Tours from A8, reversed):

NOTE:
Tcheburaschkasort is based on my superfast multi-purpose text ripper Leprechaun.
Tcheburaschkasort still DOES NOT make in-order dump/traversal, resultant file is unsorted, to be done. This DOES NOT affect results, though.

The full log is as follows:

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>dir
 Volume in drive E is SSD_Sanmayce
 Volume Serial Number is 9CF6-FEA3

 Directory of E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit

07/03/2014  04:19 AM    <DIR>          .
07/03/2014  04:19 AM    <DIR>          ..
07/03/2014  04:20 AM               823 BiShowdown_Cubesort_Tcheburaschkasort.bat
07/03/2014  04:20 AM            12,566 cubesort-1.0_non-original.c
07/03/2014  04:20 AM            86,016 cubesort-1.0_non-original.exe
07/03/2014  04:20 AM           202,880 Cubesort.zip
07/03/2014  04:20 AM            24,490 Knight-tour_r8dump.c
07/03/2014  04:20 AM            79,872 Knight-tour_r8dump.exe
07/03/2014  04:20 AM            25,071 Knight-tour_r8dump_REVERSE.c
07/03/2014  04:20 AM            79,360 Knight-tour_r8dump_REVERSE.exe
07/03/2014  04:20 AM               216 Make_EXEs.bat
07/03/2014  04:20 AM             1,604 MokujIN prompt.lnk
07/03/2014  04:20 AM           170,433 Sandokan_QuickSortExternal_4+GB_r3fix.c
07/03/2014  04:20 AM           122,368 Sandokan_QuickSortExternal_4+GB_r3fix.exe
07/03/2014  04:20 AM           327,819 TcheburaschkaSort.c
07/03/2014  04:20 AM         1,771,134 TcheburaschkaSort.cod
07/03/2014  04:20 AM           140,288 TcheburaschkaSort.exe
07/03/2014  04:20 AM             6,144 timer64.exe
              16 File(s)      3,051,084 bytes
               2 Dir(s)  29,523,390,464 bytes free

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>BiShowdown_Cubesort_Tcheburaschkasort.bat
Dumping 10,000,000 reversed KT...

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>Knight-tour_r8dump_REVERSE A8 10000000 1>KT10000000_from-end-to-start.txt

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>sort /M 1048576 /T D: /R KT10000000_from-end-to-start.txt /O KT10000000_from-end-to-start-SORTED-IN-REVERSE.txt
Sorting 10,000,000 KT...

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>timer64.exe cubesort-1.0_non-original.exe KT10000000_from-end-to-start.txt
Cubesort: sorted 10000000 elements in 94349 clocks.

Kernel  Time =     8.283 =    7%
User    Time =    59.592 =   55%
Process Time =    67.876 =   62%    Virtual  Memory =   2689 MB
Global  Time =   108.217 =  100%    Physical Memory =   2497 MB
Sorting 10,000,000 KT...

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>timer64.exe sort /M 1048576 /T D: KT10000000_from-end-to-start.txt /O KT10000000_from-end-to-start-SORTED.txt

Kernel  Time =     1.778 =    1%
User    Time =    83.132 =   48%
Process Time =    84.911 =   49%    Virtual  Memory =   1028 MB
Global  Time =   173.113 =  100%    Physical Memory =   1021 MB

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>fc Cubesort_ResultantFile.txt KT10000000_from-end-to-start-SORTED.txt /b
Comparing files Cubesort_ResultantFile.txt and KT10000000_FROM-END-TO-START-SORTED.TXT
FC: no differences encountered

Sorting 10,000,000 KT...

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>dir KT10000000_from-end-to-start.txt/b 1>TcheburaschkaSort.lst

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>timer64.exe TcheburaschkaSort TcheburaschkaSort.lst TcheburaschkaSort.txt 2900123 y
Leprechaun_singleton (Fast-In-Future Greedy n-gram-Ripper), rev. 16FIXFIX, written by Svalqyatchx.
Purpose: Rips all distinct 1-grams (1-word phrases) with length 1..128 chars from incoming texts.
Feature1: All words within x-lets/n-grams are in range 1..31 chars inclusive.
Feature2: In this revision 128MB 1-way hash is used which results in 16,777,216 external B-Trees of order 3.
Feature3: In this revision, 1 pass is to be made.
Feature4: If the external memory has latency 99+microseconds then !(look no further), IOPS(seek-time) rules.
Pass #1 of 1:
Size of input file with files for Leprechauning: 34
Allocating HASH memory 134,217,793 bytes ... OK
Allocating memory 2833MB ... OK
Size of Input TEXTual file: 1,300,000,000
\; 00,222,222P/s; Phrase count: 10,000,000 of them 10,000,000 distinct; Done: 64/64
Bytes per second performance: 28,888,888B/s
Phrases per second performance: 222,222P/s
Time for putting phrases into trees: 45 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,952,380P/s
Time for shaking phrases from trees: 21 second(s)
Leprechaun: Current pass done.

Total memory needed for one pass: 2,133,638KB
Total distinct phrases: 10,000,000
Total time: 67 second(s)
Total performance: 149,253P/s i.e. phrases per second
Leprechaun: Done.

Kernel  Time =     6.193 =    9%
User    Time =    59.732 =   89%
Process Time =    65.926 =   99%    Virtual  Memory =   2967 MB
Global  Time =    66.580 =  100%    Physical Memory =   2218 MB

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type Leprechaun.LOG
Leprechaun report:
Number Of Hash Collisions(Distinct WORDs - Number Of Trees): 9,999,950
Number Of Trees(GREATER THE BETTER): 50
Number Of LEAFs(littler THE BETTER) not counting ROOT LEAFs: 7,533,898
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 15
Used value for third parameter in KB: 2,900,123
Use next time as third parameter: 2,133,638
Total Attempts to Find/Put WORDs into B-trees order 3: 164,504,421

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type KT10000000_from-end-to-start.txt|more
E3D1C3E4F6D5B6A4C5E6F4D3B2C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
F6E4C3D1E3D5B6A4C5E6F4D3B2C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
E3D1C3A4B6D5F6E4C5E6F4D3B2C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
B6A4C3D1E3D5F6E4C5E6F4D3B2C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
E3D1C3E4F6D5F4E6C5D3B2A4B6C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
F6E4C3D1E3D5F4E6C5D3B2A4B6C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
E3D1C3D5F6E4C5E6F4D3B2A4B6C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
C3D1E3D5F6E4C5E6F4D3B2A4B6C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
F6D5E3D1C3E4C5E6F4D3B2A4B6C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
...

Oh, and for those who are interested here are the stats (memory requirement and number of attempts i.e. depthness of the B-tree):
Simply you need 15 (in worst case) attempts in order to find a key within 10,000,000 keys.
Nothing impressive, 15 tries, yet with no unpleasant surprises like unbalancing.
Also for data 1,240MB (the 130x10,000,000) the tree is 2,133,638KB=2,083MB in size.

There are three reasons for Tcheburaschkasort's poor performance, first is the slow dumping of sorted data:
66 seconds are divided in:
Sorting: Time for putting phrases into trees: 45 second(s)
Dumping: Time for shaking phrases from trees: 21 second(s)

Second is the way I insert new keys, for speed boost (when incoming data is not of unique keys as here) fast-search is performed and if failed new search-insert is enforced, grmbl.
In simple words, insertion is slow, especially for order 3, double-grmbl.

Third is the parsing of keys, in Cubesort I wrote it synthetically:

fread(&z_array[cnt].key, sizeof(keyType), 1, ifp);

While in Tcheburaschkasort the parsing is byte-by-byte, as in real-world rippers:

                if ( workbyte < '0' ) // Most characters are under alphabet - only one if // Cheburashka
                {
ElStupido:
                        // This fragment is MIRRORed: #1 copy [
                        if (workbyte == 10) {NumberOfLines++;}
...
// Cheburashka [ 
                else if( workbyte <= '9' )
		{
                        //if( wrdlen < 31 )
                        if( wrdlen < 128 ) // Cheburashka
                        //if( wrdlen < LongestLineInclusive )
                        { wrd[ wrdlen ] = workbyte; }
                        wrdlen++;
		}
                else if( workbyte >= 'A' &&  workbyte <= 'Z' )
		{
                        //if( wrdlen < 31 )
                        if( wrdlen < 128 ) // Cheburashka
                        //if( wrdlen < LongestLineInclusive )
                        { wrd[ wrdlen ] = workbyte; }
                        wrdlen++;
		}
// Cheburashka ]
                else if( workbyte >= 'a' &&  workbyte <= 'z' )
		{
                        //if( wrdlen < 31 )
                        if( wrdlen < 128 ) // Cheburashka
                        //if( wrdlen < LongestLineInclusive )
                        { wrd[ wrdlen ] = workbyte; }
                        wrdlen++;
		}
		else
                {
                        // This fragment is MIRRORed: #2 copy [
                goto ElStupido;
                        // This fragment is MIRRORed: #2 copy ]
		}

Scared! Well above fragment handles 'words' of alpha-numerical type 1..128 bytes in length.

Having an old Samsung 470 SSD on my laptop (Core 2 T7500 2200MHz) I ran the scariest sort of all, with all-external memory operations i.e. no physical no virtual used:

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>timer64.exe TcheburaschkaSort TcheburaschkaSort.lst TcheburaschkaSort.txt 2900123 z
Leprechaun_singleton (Fast-In-Future Greedy n-gram-Ripper), rev. 16FIXFIX, written by Svalqyatchx.
Purpose: Rips all distinct 1-grams (1-word phrases) with length 1..128 chars from incoming texts.
Feature1: All words within x-lets/n-grams are in range 1..31 chars inclusive.
Feature2: In this revision 128MB 1-way hash is used which results in 16,777,216 external B-Trees of order 3.
Feature3: In this revision, 1 pass is to be made.
Feature4: If the external memory has latency 99+microseconds then !(look no further), IOPS(seek-time) rules.
Pass #1 of 1:
Size of input file with files for Leprechauning: 34
Allocating HASH memory 134,217,793 bytes ... OK
Allocating/ZEROing 2,969,725,966 bytes swap file ... OK
Size of Input TEXTual file: 1,300,000,000
\; 00,008,643P/s; Phrase count: 10,000,000 of them 10,000,000 distinct; Done: 64/64
Bytes per second performance: 1,123,595B/s
Phrases per second performance: 8,643P/s
Time for putting phrases into trees: 1157 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,056,179P/s
Time for shaking phrases from trees: 356 second(s)
Leprechaun: Current pass done.

Total memory needed for one pass: 2,133,638KB
Total distinct phrases: 10,000,000
Total time: 1528 second(s)
Total performance: 6,544P/s i.e. phrases per second
Leprechaun: Done.

Kernel  Time =  1096.609 =   71%
User    Time =   217.449 =   14%
Process Time =  1314.058 =   86%    Virtual  Memory =    130 MB
Global  Time =  1527.601 =  100%    Physical Memory =    131 MB

E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>

Only 1527/108=14 times slower than Cubesort, not bad in my eyes.

As for Tcheburaschka, few things remain to be done, in-order traversal dump and few more raw ideas, however I lost my momentum, now I am interested in textual decompression.

Ah, and the name, Чебурашка, it is a little supercalm soullet (English lacks double diminutives), in fact the golden Russian animatronics movie, love it.

How to reduce a quantified 2 Sat formula to a quantified horn formula?