|
SledgeHammer01 wrote: As far as the client side goes, it sounds like you are polling your circular queue for work. NOTHING in your design should be polling. Polling is a big no-no. Everything should be async push notifications. Sleep / poll is not a good design. Should be push / WaitForSingleObject. WaitForSingleObject does not consume any CPU resources.
That's the point of my question. When doing a lock-free algorithm, how do you avoid polling (due to starvation or saturation)? In the discussions of the well-known circular queue, this is never actually answered--it assumes that a natural balance will always be present.
For the record, my various solutions have used locks.
|
|
|
|
|
Joe Woodbury wrote: That's the point of my question. When doing a lock-free algorithm, how do you
avoid polling (due to starvation or saturation)? In the discussions of the
well-known circular queue, this is never actually answered--it assumes that a
natural balance will always be present. For the record, my
various solutions have used locks.
I'm not familiar with your requirements, but a circular buffer wouldn't be my first choice for this (unless your requirement is to discard data when the buffer is full). I would use something more like a queue. But right now your code probably looks like this:
while (true)
{
    someClass theTask;
    if (buffer.GetWork(out theTask))
    {
        DoWork(theTask);
    }
}
or something of that nature. That is a *very* tight loop when no work is available and will kill the CPU. A hack in the polling design is to add a sleep to alleviate the pressure on the CPU. You'll still have some CPU usage, but it'll be much less. A drawback would be, of course, that your client couldn't do any work during the sleep time, which is a huge waste of time. E.g. if you slept for 1000ms, and you got a message at 10ms in, your client wouldn't do anything for another 990ms.
A push / no polling design would be something like where your buffer class exposes a waitable handle such as a mutex. Now your client code looks like:
while (true)
{
    WaitForSingleObject(buffer.theMutex);
    DoWork(buffer.GetWork());
}
or something like that. The WaitForSingleObject() will block with 0% CPU usage. It will only return when the mutex is set.
So...
Your buffer class would set the mutex whenever data is inserted into the queue (in the insert method). The GetWork() method would reset the mutex when there are no remaining items (idle time).
Not sure how you are sending over the network, but sockets support async push notifications, so you'd have to make sure you use that as well. In the socket async callback, you'd insert into the queue which would set the mutex.
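Not from the original exchange, but here is a minimal compilable sketch of the push design just described, assuming Win32. The post says "mutex", but the set-when-inserted / reset-when-empty behaviour it describes is really event semantics, so an auto-reset event is used; the 'queue' is just a counter and all names (g_work, Producer, ...) are hypothetical:

#include <windows.h>
#include <stdio.h>

static CRITICAL_SECTION g_lock;
static HANDLE g_workAvailable;     /* auto-reset event: signaled when work is queued */
static int    g_work = 0;          /* stand-in for the real work queue */

static DWORD WINAPI Producer(LPVOID unused)
{
    (void)unused;
    for (int i = 0; i < 5; i++) {
        Sleep(200);                /* pretend work arrives over the network */
        EnterCriticalSection(&g_lock);
        g_work++;
        LeaveCriticalSection(&g_lock);
        SetEvent(g_workAvailable); /* push: wake the consumer, no polling */
    }
    return 0;
}

int main(void)
{
    InitializeCriticalSection(&g_lock);
    g_workAvailable = CreateEvent(NULL, FALSE, FALSE, NULL);   /* auto-reset, unsignaled */
    HANDLE producer = CreateThread(NULL, 0, Producer, NULL, 0, NULL);

    for (int handled = 0; handled < 5; ) {
        WaitForSingleObject(g_workAvailable, INFINITE);        /* blocks at 0% CPU */
        EnterCriticalSection(&g_lock);
        while (g_work > 0) { g_work--; handled++; printf("did work item %d\n", handled); }
        LeaveCriticalSection(&g_lock);
    }
    WaitForSingleObject(producer, INFINITE);
    CloseHandle(producer);
    CloseHandle(g_workAvailable);
    DeleteCriticalSection(&g_lock);
    return 0;
}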
|
|
|
|
|
No offense, but I already know what you're writing--I've written several very high performance algorithms to solve this--my question is purely about how various lock-free algorithms handle starvation/saturation. Various articles on lock-free algorithms make the bold statement that the obvious polling condition, with accompanying CPU saturation, will never occur, but never explain why not.
|
|
|
|
|
Hello,
I'm facing a strange problem. I have a word, for example "ABCDE". I want to match this word against a sentence like "ABC EFGH IJKL DE". As you can see, the query word is split and distributed across the sentence. How can I design an algorithm which can tell me that the query word has a combination in the sentence, and give the positions?
An example:
QUERY word: ABCDEF
TARGET sentence: EFGH UIOP ABC GHY JKLU EF
I expect the output to say that the query word has a perfect match by combining words 3 and 6.
I tried several algorithms like Smith-Waterman, Levenshtein distance and others, but I couldn't figure out how to compare a string with randomly distributed sub-strings in a sentence.
One way of doing it is breaking the sentence into words, making all possible combinations and matching them against the query, which would take forever if I have 500 words.
Any help or suggestion would be highly appreciated.
|
|
|
|
|
Hi srikanth2321,
probably there are better ways to do that, but my suggestion is the simplest:
Step1: Do Wildcard search for *ABCDE*
Step2: Do Wildcard search for *ABCD*E*
Step3: Do Wildcard search for *ABC*DE*
Step4: Do Wildcard search for *AB*CDE*
Step5: Do Wildcard search for *A*BCDE*
My friend, I sense the problem: you should be more precise both in testing and explaining; for instance, your second example is buggy (combining words 3 and 6 gives ABCEF, not ABCDEF).
If you want to play you can try my superb search tool Kazahana (downloadable here at CP).
Let me know if you find any difficulties.
The first example, done with Step 3:
D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>copy con test.txt
ABC EFGH IJKL DE
EFGH UIOP ABC GHY JKLU EF
^Z
1 file(s) copied.
D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type test.txt
ABC EFGH IJKL DE
EFGH UIOP ABC GHY JKLU EF
D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>Kazahana.exe "*ABC*DE*" test.txt 999
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, r. 1-++fix+nowait_critical_nixFIX_Wolfram+fixITER+EX+CS, copyleft Kaze 2014-Mar-25.
Enforcing Case Insensitive wildcard mode ...
Enforcing SLOW wildcard mode ...
Pattern: *abc*de*
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating Master-Buffer 999KB ... OK
Kazahana: Total/Checked/Dumped xgrams: 2/2/1
Kazahana: Performance: 0 KB/clock
Kazahana: Performance: 0 xgrams/clock
Kazahana: Performance: Total/fread() clocks: 17/0
Kazahana: Performance: I/O time, i.e. fread() time, is 0 percents
Kazahana: Performance: RDTSC I/O time, i.e. fread() time, is 0 ticks
Kazahana: Done.
D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type Kazahana.txt
ABC EFGH IJKL DE
D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>
|
|
|
|
|
Okay, this is more serious, 31 patterns are needed to exhaust your request.
0 internal '*':
Step01: *ABCDE*
1 internal '*':
Step02: *ABCD*E*
Step03: *ABC*DE*
Step04: *AB*CDE*
Step05: *A*BCDE*
2 internal '*':
Step06: *A*B*CDE*
Step07: *A*BC*DE*
Step08: *A*BCD*E*
Step09: *AB*C*DE*
Step10: *AB*CD*E*
Step11: *ABC*D*E*
3 internal '*':
Step12: *A*B*C*DE*
Step13: *A*B*CD*E*
Step14: *A*BC*D*E*
Step15: *AB*C*D*E*
The above 15 patterns are optional; they would yield hits with 'high' rank earlier than all the 31 below.
Finding all substrings (without elision):
Step01+15: *A*B*C*D*E*
Finding all substrings (with elision):
C(5,4) equals number of wildcard patterns, 5 choose 4 = 5!/(4!*(5-4)!) = 5*4*3*2*1/(4*3*2*1*1) = 5
Step02+15: *B*C*D*E*
Step03+15: *A*C*D*E*
Step04+15: *A*B*D*E*
Step05+15: *A*B*C*E*
Step06+15: *A*B*C*D*
C(5,3) equals number of wildcard patterns, 5 choose 3 = 5!/(3!*(5-3)!) = 5*4*3*2*1/(3*2*1*2*1) = 10
Step07+15: *A*B*C*
Step08+15: *A*B*D*
Step09+15: *A*B*E*
Step10+15: *A*C*D*
Step11+15: *A*C*E*
Step12+15: *A*D*E*
Step13+15: *B*C*D*
Step14+15: *B*C*E*
Step15+15: *B*D*E*
Step16+15: *C*D*E*
C(5,2) equals number of wildcard patterns, 5 choose 2 = 5!/(2!*(5-2)!) = 5*4*3*2*1/(2*1*3*2*1) = 10
Step17+15: *A*B*
Step18+15: *A*C*
Step19+15: *A*D*
Step20+15: *A*E*
Step21+15: *B*C*
Step22+15: *B*D*
Step23+15: *B*E*
Step24+15: *C*D*
Step25+15: *C*E*
Step26+15: *D*E*
C(5,1) equals number of wildcard patterns, 5 choose 1 = 5!/(1!*(5-1)!) = 5*4*3*2*1/(1*4*3*2*1) = 5
Step27+15: *A*
Step28+15: *B*
Step29+15: *C*
Step30+15: *D*
Step31+15: *E*
In essence, a number such as C(5,4) equals the number of ways to order any four of the five elements, which is 5!/(5-4)!, divided by the number of ways to order a group of 4, which is 4!.
Thus, the number of distinct ways to choose 4 out of 5 is 120 divided by 24, i.e. 5.
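(Not Kaze's code, just a sketch of the bookkeeping.) Every one of those 31 patterns is simply a non-empty subset of the query's letters, each kept letter bracketed by '*'; for a 5-letter query that is 2^5 - 1 = 31 patterns. A minimal generator, which does not produce the 15 optional 'higher-rank' patterns that keep some letters adjacent:

#include <stdio.h>
#include <string.h>

/* Builds one wildcard pattern (e.g. "*A*C*E*") from a bitmask selecting which
   characters of the query survive. */
static void build_pattern(const char *query, unsigned mask, char *out)
{
    size_t n = strlen(query), o = 0;
    out[o++] = '*';
    for (size_t i = 0; i < n; i++) {
        if (mask & (1u << i)) {
            out[o++] = query[i];
            out[o++] = '*';
        }
    }
    out[o] = '\0';
}

int main(void)
{
    const char *query = "ABCDE";
    size_t n = strlen(query);
    char pattern[2 * 31 + 2];

    /* Every non-empty subset of the query's characters gives one pattern. */
    for (unsigned mask = 1; mask < (1u << n); mask++) {
        build_pattern(query, mask, pattern);
        printf("%s\n", pattern);
    }
    return 0;
}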
By chance I wrote the fastest wildcard search function (iterative) so these 31 patterns won't be much of a problem:
int WildcardMatch_Iterative_Kaze(const char* mask, const char* name) {
    // Revision 1:
    /*
    const char* maskSTACK;
    const char* nameSTACK;
    int BacktrackFlag = 0;
    Backtrack:
    for (nameSTACK = name, maskSTACK = mask; *nameSTACK; ++nameSTACK, ++maskSTACK) {
        if (*maskSTACK == '&') {
            mask = maskSTACK+1;
            if (!*mask) return 1;
            name = nameSTACK;
            BacktrackFlag = -1;
            goto Backtrack;
        }
        //else if (*maskSTACK == '+') {
        //} else {
        else if (*maskSTACK != '+') {
            //if (tolower(*nameSTACK) != tolower(*maskSTACK)) { // These 'tolower's are outrageous, they hurt speed BADLY, in real-world usage both should have been lowercased outwith the 'for'.
            if (*nameSTACK != *maskSTACK) {
                if (!BacktrackFlag) return 0;
                name++;
                goto Backtrack;
            }
        }
    }
    while (*maskSTACK == '&') ++maskSTACK;
    return (!*maskSTACK);
    */
    // Revision 2, 2013-Nov-30:
    const char* maskSTACK;
    const char* nameSTACK;
    //int BacktrackFlag = 0; // No need of it in rev.2
    /*
    // Simplest heuristic with SUPEROVERHEAD enforced: trying to skip the whole wildcard section by comparing the two arrays order 1 of mask&name.
    int i;
    unsigned char maskOrder1[256];
    unsigned char nameOrder1[256];
    for (i='a'; i <= 'z'; i++) { maskOrder1[i]=0; nameOrder1[i]=0;}
    for (i=0; i < strlen(mask); i++) maskOrder1[mask[i]]=1;
    for (i=0; i < strlen(name); i++) nameOrder1[name[i]]=1;
    // Assuming the incoming strings are already lowercased (as it should for speed) and if we don't have matching alphabet parts (from mask side) means we don't need to compare any further i.e. the match fails.
    for (i='a'; i <= 'z'; i++) if ( maskOrder1[i] == 1 && nameOrder1[i] == 0 ) return 0;
    */
    /*
    int i;
    for (i=0; i < strlen(mask); i++) {
        if ( mask[i] != '&' && mask[i] != '+' )
            if ( !memchr(name,mask[i],strlen(name)) ) return 0;
    }
    */
    /*
    Backtrack:
    for (nameSTACK = name, maskSTACK = mask; *nameSTACK; ++nameSTACK, ++maskSTACK) {
        if (*maskSTACK == '&') {
            mask = maskSTACK+1;
            if (!*mask) return 1;
            name = nameSTACK;
            BacktrackFlag = -1;
            goto Backtrack;
        }
        //else if (*maskSTACK == '+') {
        //} else {
        else if (*maskSTACK != '+') {
            //if (tolower(*nameSTACK) != tolower(*maskSTACK)) { // These 'tolower's are outrageous, they hurt speed BADLY, in real-world usage both should have been lowercased outwith the 'for'.
            if (*nameSTACK != *maskSTACK) {
                if (!BacktrackFlag) return 0; // Stupid branching, SLOW!
                name++;
                goto Backtrack;
            }
        }
    }
    */
    // Here, outside the main/second 'for', in order to avoid branching we need to set the old/obsolete BacktrackFlag i.e. we need first occurrence of '&':
    for (name, mask; *name; ++name, ++mask) {
        if (*mask == '&') {
            goto Backtrack;
        }
        //else if (*mask == '+') {
        //} else {
        else if (*mask != '+') {
            if (*name != *mask) {
                return 0;
            }
        }
    }
    // We are entering the main/second 'for' with mask pointing to '&' as if BacktrackFlag is already set in the very first iteration at first condition:
Backtrack:
    for (nameSTACK = name, maskSTACK = mask; *nameSTACK; ++nameSTACK, ++maskSTACK) {
        if (*maskSTACK == '&') {
            mask = maskSTACK+1;
            if (!*mask) return 1;
            name = nameSTACK;
            //BacktrackFlag = -1;
            goto Backtrack;
        }
        //else if (*maskSTACK == '+') {
        //} else {
        else if (*maskSTACK != '+') {
            //if (tolower(*nameSTACK) != tolower(*maskSTACK)) { // These 'tolower's are outrageous, they hurt speed BADLY, in real-world usage both should have been lowercased outwith the 'for'.
            if (*nameSTACK != *maskSTACK) {
                //if (!BacktrackFlag) return 0; // Stupid branching, SLOW!
                name++;
                goto Backtrack;
            }
        }
    }
    while (*maskSTACK == '&') ++maskSTACK;
    return (!*maskSTACK);
}
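A hypothetical usage note, inferred from the code rather than stated in the post: '&' plays the role of '*' and '+' the role of '?', and both strings are expected to be pre-lowercased. Compiled together with the routine above, the two sentences from the question could be tested like this:

#include <stdio.h>

int WildcardMatch_Iterative_Kaze(const char* mask, const char* name);  /* defined above */

int main(void)
{
    /* '&' = multi-character wildcard, '+' = single character; inputs pre-lowercased. */
    printf("%d\n", WildcardMatch_Iterative_Kaze("&abc&de&", "abc efgh ijkl de"));          /* 1 */
    printf("%d\n", WildcardMatch_Iterative_Kaze("&abc&de&", "efgh uiop abc ghy jklu ef")); /* 0 */
    return 0;
}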
If you can write code generating, in particular, those 31+15 patterns, you could by all means name the algorithm after yourself; vanity is the strangest teacher of all, as one Japanese aikidoka said.
In case you are stronger than pride, my proposal is to call the whole process 'bunyipping'; the SOED defines 'bunyip' as:
bunyip, noun.
Austral. M19.
[Aboriginal.]
A fabulous monster of swamps and lagoons.
A ringy word, yes? Quite like 'troll', but not burdened with the negativity of internetish jargon.
troll
v. tr.
Slang: To patrol (an area) in search for someone or something: "[Criminals] troll bus stations for young runaways" (Pete Axthelm).
n.
A supernatural creature of Scandinavian folklore, variously portrayed as a friendly or mischievous dwarf or as a giant, that lives in caves, in the hills, or under bridges.
/HERITAGE/
|
|
|
|
|
I've created a new sorting algorithm which allows O(1) sorting of ordered data, O(log (root n)) sorting of partially sorted data, and O(log n) sorting of random data.
The underlying mechanism is well suited for a general purpose data structure, which I'm calling a binary search cube, and has notable advantages over binary trees. The data structure is cube-shaped rather than tree-shaped.
I've described the data structure and sorting algorithm on the following webpage:
https://sites.google.com/site/binarysearchcube/
The site also includes three programs written in C, one of them implementing cubesort using a 4 dimensional search cube, and an implementation of a 3 and 4 dimensional binary search cube.
Haven't been able to properly benchmark the software against other sorting algorithms and data structures, but the memory overhead is significantly less.
Hoping for some interesting feedback.
|
|
|
|
|
Hi Mr. Hoven,
looks nice, feels exciting; surely the showdowns are gonna be must-see.
My suggestion is to make a simple console tool to ease the benchmarking; I will be the first to run some superheavy English phraselists/wordlists (up to 400,000,000 lines) on my laptop T7500.
One very useful filedataset is to sort Knight-Tours (64x2=128 bytes long lines), because you can generate even billions of them in reasonable time; my package (that does that) can be downloaded freely at: www.sanmayce.com/Downloads/Sandokan_r3_vs_Windows-sort.zip[^]
My old console sorter Sandokan outperformed (only for internal sorting) the built-in Windows 7 64bit sort.exe:
E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>GENERATE_3million_Knight-Tours_and_SORT_them.bat
Microsoft Windows [Version 6.1.7601]
Generating 3,000,000 Knight-Tours and dumping them into file 390,000,000 bytes ...
E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>Knight-tour_r8dump_Microsoft_V16_32bit_Ox.exe a8 3000000 1>KT3million.txt
Sorting these 3,000,000 Knight-Tours ...
E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>"Sandokan_QuickSortExternal_4+GB_32bit_Microsoft.exe" KT3million.txt /fast /descend 768
Sandokan_QuickSortExternal_4+GB r.3, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 390,000,000
Counting lines ...
Lines encountered: 3,000,000
Longest line (including CR if present): 129
Allocated memory for pointers-to-lines in MB: 22
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 4
Trying to allocate memory for the file itself in MB: 371 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 3,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
| RightEnd: 000,001,500,225; NumberOfSplittings: 0,000,335,016; Done: 100% ...
NumberOfComparisons: 71,052,662
The time to sort 3,000,000 items via Quicksort+Insertionsort was 21,029 clocks.
Dumping the sorted data ...
- Done 100% ...
Dumped 3,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dump time: 1,560 clocks.
Total time: 27,144 clocks.
Performance: 14,366 bytes/clock.
Done successfully.
E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>sha1sum_Microsoft_V16_32bit_Ox.exe "QuickSortExternal_4+GB.txt"
a053faa74ffb6cad41e61e273a1a0e0049cb25e7 QuickSortExternal_4+GB.txt
E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>timer sort /M 1048576 /R /T e: KT3million.txt /O KT3million.txt.Windows
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Kernel Time = 0.530 = 1%
User Time = 28.610 = 96%
Process Time = 29.140 = 97%
Global Time = 29.801 = 100%
E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>sha1sum_Microsoft_V16_32bit_Ox.exe KT3million.txt.Windows
a053faa74ffb6cad41e61e273a1a0e0049cb25e7 KT3million.txt.Windows
Simply my proposal is to collect several such tools in one package, it will be very informative and much fun.
|
|
|
|
|
In my own tests cubesort is 2 times slower than mergesort for random integers and 2.5 times faster for sorted integers. This is using the latest version which I uploaded today and improves performance by about 25%.
I'm not sure if this gap can be closed as mergesort has superior cache performance for random data. I haven't been able to find a decent quicksort implementation.
Cubesort seems best suited for cases where a data set is more than 50% in order.
When I have a couple of hours I'll make a string based version of cubesort (very easy) and see how fast it sorts the file.
Does the file contain duplicates?
Edit:
It appears the file is in reverse order. Takes about 3.5 seconds to load the file, 5 seconds to sort it using cubesort.
|
|
|
|
|
Having benchmarked Cubesort with Knight-Tours my amateurish evaluation is of two words: Mutsi! Mutsunka!
My Sandokan (r3fix) is inferior to Cubesort; 10,000,000 lines (128 bytes each) were sorted, and after that the sorted file was sorted once more:
------------------------------------------------------------------------------------------
| sorter \ filedataset | KT10000000.txt                      | KT10000000_SORTED.txt     |
------------------------------------------------------------------------------------------
| Cubesort             | 080 seconds                         | 045 seconds               |
| Windows' sort.exe    | 118 seconds                         | 115 seconds               |
| Sandokan             | 131 seconds; swappings: 36,826,242+ | 114 seconds; swappings: 0 |
------------------------------------------------------------------------------------------
The benchmark is downloadable at: www.sanmayce.com/Downloads/TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit.zip[^]
The full log follows:
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>dir
Volume in drive E is SSD_Sanmayce
Volume Serial Number is 9CF6-FEA3
Directory of E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit
06/29/2014 11:45 PM <DIR> .
06/29/2014 11:45 PM <DIR> ..
06/29/2014 11:45 PM 12,566 cubesort-1.0_non-original.c
06/29/2014 11:45 PM 86,016 cubesort-1.0_non-original.exe
06/29/2014 11:45 PM 202,880 Cubesort.zip
06/29/2014 11:45 PM 24,490 Knight-tour_r8dump.c
06/29/2014 11:45 PM 79,872 Knight-tour_r8dump.exe
06/29/2014 11:45 PM 116 Make_EXEs.bat
06/29/2014 11:45 PM 1,604 MokujIN prompt.lnk
06/29/2014 11:45 PM 301,406 Sandokan_Logo.pdf
06/29/2014 11:45 PM 170,433 Sandokan_QuickSortExternal_4+GB_r3fix.c
06/29/2014 11:45 PM 122,368 Sandokan_QuickSortExternal_4+GB_r3fix.exe
06/29/2014 11:45 PM 6,144 timer64.exe
06/29/2014 11:45 PM 1,111 TriShowdown_Sandokan_vs_Windows_vs_Cubesort.bat
12 File(s) 1,009,006 bytes
2 Dir(s) 27,537,633,280 bytes free
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>move ..\KT10000000.txt
1 file(s) moved.
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>TriShowdown_Sandokan_vs_Windows_vs_Cubesort.bat
Sorting 10,000,000 KT...
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe cubesort-1.0_non-original.exe KT10000000.txt
Cubesort: sorted 10000000 elements in 66503 clocks.
Kernel Time = 6.754 = 8%
User Time = 54.756 = 68%
Process Time = 61.511 = 76% Virtual Memory = 2676 MB
Global Time = 80.355 = 100% Physical Memory = 2447 MB
Sorting 10,000,000 KT...
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe sort /M 1048576 /T D: KT10000000.txt /O WindowsSORT_ResultantFile.txt
Kernel Time = 1.466 = 1%
User Time = 88.249 = 74%
Process Time = 89.716 = 75% Virtual Memory = 1028 MB
Global Time = 118.248 = 100% Physical Memory = 1021 MB
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>fc Cubesort_ResultantFile.txt WindowsSORT_ResultantFile.txt /b
Comparing files Cubesort_ResultantFile.txt and WINDOWSSORT_RESULTANTFILE.TXT
FC: no differences encountered
Sorting 10,000,000 KT...
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe "Sandokan_QuickSortExternal_4+GB_r3fix.exe" KT10000000.txt /fast /ascend 512
Sandokan_QuickSortExternal_4+GB, revision 3fix, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 1,300,000,000
Counting lines ...
Lines encountered: 10,000,000
Longest line (including CR if present): 129
Allocated memory for pointers-to-lines in MB: 76
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 8
Trying to allocate memory for the file itself in MB: 1239 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 10,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
\ RightEnd: 000,001,500,225; NumberOfSplittings: 0,001,676,176; Done: 100% ...
NumberOfSwappings: 36,826,242
NumberOfComparisons: 255,105,484
The time to sort 10,000,000 items via Quicksort+Insertionsort was 106,704 clocks.
Dumping the sorted data ...
\ Done 100% ...
Dumped 10,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dump time: 6,739 clocks.
Total time: 131,258 clocks.
Performance: 9,904 bytes/clock.
Done successfully.
Kernel Time = 5.397 = 4%
User Time = 116.173 = 88%
Process Time = 121.571 = 92% Virtual Memory = 1320 MB
Global Time = 131.461 = 100% Physical Memory = 1321 MB
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>fc QuickSortExternal_4+GB.txt WindowsSORT_ResultantFile.txt /b
Comparing files QuickSortExternal_4+GB.txt and WINDOWSSORT_RESULTANTFILE.TXT
FC: no differences encountered
Sorting sorted 10,000,000 KT...
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe cubesort-1.0_non-original.exe WindowsSORT_ResultantFile.txt
Cubesort: sorted 10000000 elements in 28064 clocks.
Kernel Time = 6.723 = 14%
User Time = 27.284 = 60%
Process Time = 34.008 = 74% Virtual Memory = 2661 MB
Global Time = 45.474 = 100% Physical Memory = 2560 MB
Sorting sorted 10,000,000 KT...
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe sort /M 1048576 /T D: WindowsSORT_ResultantFile.txt /O WindowsSORT_ResultantFile_.txt
Kernel Time = 1.170 = 1%
User Time = 76.237 = 66%
Process Time = 77.407 = 67% Virtual Memory = 1028 MB
Global Time = 115.081 = 100% Physical Memory = 1021 MB
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>fc Cubesort_ResultantFile.txt WindowsSORT_ResultantFile_.txt /b
Comparing files Cubesort_ResultantFile.txt and WINDOWSSORT_RESULTANTFILE_.TXT
FC: no differences encountered
Sorting sorted 10,000,000 KT...
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>timer64.exe "Sandokan_QuickSortExternal_4+GB_r3fix.exe" WindowsSORT_ResultantFile.txt /fast /ascend 512
Sandokan_QuickSortExternal_4+GB, revision 3fix, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 1,300,000,000
Counting lines ...
Lines encountered: 10,000,000
Longest line (including CR if present): 129
Allocated memory for pointers-to-lines in MB: 76
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 8
Trying to allocate memory for the file itself in MB: 1239 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 10,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
\ RightEnd: 000,010,000,000; NumberOfSplittings: 0,001,611,392; Done: 100% ...
NumberOfSwappings: 0
NumberOfComparisons: 214,016,797
The time to sort 10,000,000 items via Quicksort+Insertionsort was 90,278 clocks.
Dumping the sorted data ...
\ Done 100% ...
Dumped 10,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dump time: 5,429 clocks.
Total time: 114,488 clocks.
Performance: 11,354 bytes/clock.
Done successfully.
Kernel Time = 6.364 = 5%
User Time = 99.372 = 86%
Process Time = 105.737 = 92% Virtual Memory = 1320 MB
Global Time = 114.691 = 100% Physical Memory = 1321 MB
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>fc QuickSortExternal_4+GB.txt WindowsSORT_ResultantFile_.txt /b
Comparing files QuickSortExternal_4+GB.txt and WINDOWSSORT_RESULTANTFILE_.TXT
FC: no differences encountered
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>
Cubesort is one lovely piece of software; Mr. Hoven, I salute you with Beth Ditto's Heavy Cross[^]:
It's a cruel cruel world to face on your own
A heavy cross to carry alone
The lights are on but everyone's gone
And it's cruel
It's a funny way to make ends meet
When the lights are out on every street
It feels alright but never complete
without you
I chose you
If it's already have been done
Undo it
It takes two
It is up to me and you to proove it
On the rainy nights even the coldest days
you're moments ago but seconds away
The principle of Nature it's true but it's a cruel world
See we can play it safe or play it cool
Follow the leader or make up all the rules
Whatever you want the choice is yours
So choose me
I chose you
"If it's already have been done... undo it." - I wish Beth was into programming.
The most decent Quicksort I know is this:
https://code.google.com/p/libdivsufsort/[^]
|
|
|
|
|
libdivsufsort implements a suffix array, which according to some sources is about as fast as quicksort.
I made some improvements to the memory handling of Cubesort and it now is 1.21 times slower than mergesort for random data, and 5.75 times faster for sorted data.
Might see further improvement gains by setting BSC_L to 1 when comparing strings.
|
|
|
|
|
Cubesort is awesome, with big potential on top of that.
The thing that impresses me most is its construction speed, not to mention the 'free' dump.
Couldn't resist trying my 'secret weapon' - a frozen/unfinished sub-project called 'Tcheburaschkasort'.
It is entirely based on the titaniumESQUEly strong Bayer-McCreight algorithm, which I implemented specifically and only as order 3 (three children maximum).
It's been some time since I wrote it, enough to forget this-and-that, but the initial idea was to use the next dimension, i.e. to use millions of B-trees.
My unfinished Tcheburaschkasort currently uses a 24bit pseudo-hash (in fact the first 1|2|3 byte(s) ARE the slot), giving a forest of 16,777,216 B-Trees of order 3:
UINT PseudoHash(const char *str, unsigned int wrdlen)
{
    UINT hash32 = 0;
    const char *p = str;
    // Use 256-based system for first 3 bytes i.e. left is more significant:
    // Currently 3bytes long i.e. 24bit:
    if (wrdlen>=1) {
        hash32 = hash32 | ((*p)<<16);
        p += 1;
    }
    if (wrdlen>=2) {
        hash32 = hash32 | ((*p)<<8);
        p += 1;
    }
    if (wrdlen>=3) {
        hash32 = hash32 | ((*p)<<0);
        p += 1;
    }
    return hash32;
}
Each slot houses the root of a tree.
Without any hash Tcheburaschkasort resembles a hypercube of order 0; using the hash makes it 1D, i.e. of order 1 - the lookup hashtable is the axis.
In short, the randomness of those prefixes (1..3 bytes long) decides the speed performance.
For example, with KT10000000.txt the first 3 bytes of every line are equal, 'A8C', so there is only one tree and performance is poor.
Because all the 10,000,000 lines start from board position A8.
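A small illustration of how that prefix routing behaves (the sample tours are hypothetical and this snippet is meant to be compiled together with PseudoHash above): two lines starting 'A8C...' fall into the same slot, i.e. the same tree, while a reversed tour starting 'E3D...' lands elsewhere:

#include <stdio.h>
#include <string.h>

typedef unsigned int UINT;
UINT PseudoHash(const char *str, unsigned int wrdlen);   /* as defined above */

int main(void)
{
    /* Hypothetical sample prefixes of Knight-Tour lines. */
    const char *lines[] = { "A8C7E8G7F5", "A8C7A6B8D7", "E3D1C3E4F6" };
    for (int i = 0; i < 3; i++)
        printf("%s -> slot 0x%06X\n", lines[i], PseudoHash(lines[i], (unsigned)strlen(lines[i])));
    return 0;
}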
Knight-tours: each square represents a board position, as A1(bottom-left), H8(top-right):
-----------------
|a8 |b8 |...|h8 |
-----------------
|a7 |...| |
-----------------
|...|...| |
-----------------
|a1 |...|...|h1 |
-----------------
When the data is made more diverse, e.g. by reversing the order of the tours, then for KT10000000.txt we will have 50 trees, see below.
Let us see what boost comes from this diversity.
So, Tcheburaschkasort vs Cubesort results for KT10M filedataset (10,000,000 Knight-Tours from A8, reversed):
NOTE:
Tcheburaschkasort is based on my superfast multi-purpose text ripper Leprechaun.
Tcheburaschkasort still DOES NOT make in-order dump/traversal, resultant file is unsorted, to be done. This DOES NOT affect results, though.
The full log is as follows:
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>dir
Volume in drive E is SSD_Sanmayce
Volume Serial Number is 9CF6-FEA3
Directory of E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit
07/03/2014 04:19 AM <DIR> .
07/03/2014 04:19 AM <DIR> ..
07/03/2014 04:20 AM 823 BiShowdown_Cubesort_Tcheburaschkasort.bat
07/03/2014 04:20 AM 12,566 cubesort-1.0_non-original.c
07/03/2014 04:20 AM 86,016 cubesort-1.0_non-original.exe
07/03/2014 04:20 AM 202,880 Cubesort.zip
07/03/2014 04:20 AM 24,490 Knight-tour_r8dump.c
07/03/2014 04:20 AM 79,872 Knight-tour_r8dump.exe
07/03/2014 04:20 AM 25,071 Knight-tour_r8dump_REVERSE.c
07/03/2014 04:20 AM 79,360 Knight-tour_r8dump_REVERSE.exe
07/03/2014 04:20 AM 216 Make_EXEs.bat
07/03/2014 04:20 AM 1,604 MokujIN prompt.lnk
07/03/2014 04:20 AM 170,433 Sandokan_QuickSortExternal_4+GB_r3fix.c
07/03/2014 04:20 AM 122,368 Sandokan_QuickSortExternal_4+GB_r3fix.exe
07/03/2014 04:20 AM 327,819 TcheburaschkaSort.c
07/03/2014 04:20 AM 1,771,134 TcheburaschkaSort.cod
07/03/2014 04:20 AM 140,288 TcheburaschkaSort.exe
07/03/2014 04:20 AM 6,144 timer64.exe
16 File(s) 3,051,084 bytes
2 Dir(s) 29,523,390,464 bytes free
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>BiShowdown_Cubesort_Tcheburaschkasort.bat
Dumping 10,000,000 reversed KT...
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>Knight-tour_r8dump_REVERSE A8 10000000 1>KT10000000_from-end-to-start.txt
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>sort /M 1048576 /T D: /R KT10000000_from-end-to-start.txt /O KT10000000_from-end-to-start-SORTED-IN-REVERSE.txt
Sorting 10,000,000 KT...
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>timer64.exe cubesort-1.0_non-original.exe KT10000000_from-end-to-start.txt
Cubesort: sorted 10000000 elements in 94349 clocks.
Kernel Time = 8.283 = 7%
User Time = 59.592 = 55%
Process Time = 67.876 = 62% Virtual Memory = 2689 MB
Global Time = 108.217 = 100% Physical Memory = 2497 MB
Sorting 10,000,000 KT...
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>timer64.exe sort /M 1048576 /T D: KT10000000_from-end-to-start.txt /O KT10000000_from-end-to-start-SORTED.txt
Kernel Time = 1.778 = 1%
User Time = 83.132 = 48%
Process Time = 84.911 = 49% Virtual Memory = 1028 MB
Global Time = 173.113 = 100% Physical Memory = 1021 MB
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>fc Cubesort_ResultantFile.txt KT10000000_from-end-to-start-SORTED.txt /b
Comparing files Cubesort_ResultantFile.txt and KT10000000_FROM-END-TO-START-SORTED.TXT
FC: no differences encountered
Sorting 10,000,000 KT...
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>dir KT10000000_from-end-to-start.txt/b 1>TcheburaschkaSort.lst
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>timer64.exe TcheburaschkaSort TcheburaschkaSort.lst TcheburaschkaSort.txt 2900123 y
Leprechaun_singleton (Fast-In-Future Greedy n-gram-Ripper), rev. 16FIXFIX, written by Svalqyatchx.
Purpose: Rips all distinct 1-grams (1-word phrases) with length 1..128 chars from incoming texts.
Feature1: All words within x-lets/n-grams are in range 1..31 chars inclusive.
Feature2: In this revision 128MB 1-way hash is used which results in 16,777,216 external B-Trees of order 3.
Feature3: In this revision, 1 pass is to be made.
Feature4: If the external memory has latency 99+microseconds then !(look no further), IOPS(seek-time) rules.
Pass #1 of 1:
Size of input file with files for Leprechauning: 34
Allocating HASH memory 134,217,793 bytes ... OK
Allocating memory 2833MB ... OK
Size of Input TEXTual file: 1,300,000,000
\; 00,222,222P/s; Phrase count: 10,000,000 of them 10,000,000 distinct; Done: 64/64
Bytes per second performance: 28,888,888B/s
Phrases per second performance: 222,222P/s
Time for putting phrases into trees: 45 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,952,380P/s
Time for shaking phrases from trees: 21 second(s)
Leprechaun: Current pass done.
Total memory needed for one pass: 2,133,638KB
Total distinct phrases: 10,000,000
Total time: 67 second(s)
Total performance: 149,253P/s i.e. phrases per second
Leprechaun: Done.
Kernel Time = 6.193 = 9%
User Time = 59.732 = 89%
Process Time = 65.926 = 99% Virtual Memory = 2967 MB
Global Time = 66.580 = 100% Physical Memory = 2218 MB
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type Leprechaun.LOG
Leprechaun report:
Number Of Hash Collisions(Distinct WORDs - Number Of Trees): 9,999,950
Number Of Trees(GREATER THE BETTER): 50
Number Of LEAFs(littler THE BETTER) not counting ROOT LEAFs: 7,533,898
Highest Tree not counting ROOT Level i.e. CORONA levels(littler THE BETTER): 15
Used value for third parameter in KB: 2,900,123
Use next time as third parameter: 2,133,638
Total Attempts to Find/Put WORDs into B-trees order 3: 164,504,421
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type KT10000000_from-end-to-start.txt|more
E3D1C3E4F6D5B6A4C5E6F4D3B2C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
F6E4C3D1E3D5B6A4C5E6F4D3B2C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
E3D1C3A4B6D5F6E4C5E6F4D3B2C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
B6A4C3D1E3D5F6E4C5E6F4D3B2C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
E3D1C3E4F6D5F4E6C5D3B2A4B6C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
F6E4C3D1E3D5F4E6C5D3B2A4B6C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
E3D1C3D5F6E4C5E6F4D3B2A4B6C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
C3D1E3D5F6E4C5E6F4D3B2A4B6C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
F6D5E3D1C3E4C5E6F4D3B2A4B6C4E5F3D4F5D6B5A3B1D2F1H2G4H6G8E7C8A7C6D8B7A5B3A1C2E1G2H4G6H8F7G5H7F8D7B8A6B4A2C1E2G1H3F2H1G3H5G7E8C7A8
...
Oh, and for those who are interested, here are the stats (memory requirement and number of attempts, i.e. depth of the B-tree):
Simply put, you need 15 attempts (in the worst case) to find a key among 10,000,000 keys.
Nothing impressive, 15 tries, yet with no unpleasant surprises like unbalancing.
Also for data 1,240MB (the 130x10,000,000) the tree is 2,133,638KB=2,083MB in size.
There are three reasons for Tcheburaschkasort's poor performance, first is the slow dumping of sorted data:
66 seconds are divided in:
Sorting: Time for putting phrases into trees: 45 second(s)
Dumping: Time for shaking phrases from trees: 21 second(s)
Second is the way I insert new keys: for a speed boost (when the incoming data is not made of unique keys, as here) a fast search is performed first and, if it fails, a new search-insert is enforced, grmbl.
In simple words, insertion is slow, especially for order 3, double-grmbl.
Third is the parsing of keys; in Cubesort I wrote it synthetically:
fread(&z_array[cnt].key, sizeof(keyType), 1, ifp);
While in Tcheburaschkasort the parsing is byte-by-byte, as in real-world rippers:
if ( workbyte < '0' ) // Most characters are under alphabet - only one if // Cheburashka
{
ElStupido:
    // This fragment is MIRRORed: #1 copy [
    if (workbyte == 10) {NumberOfLines++;}
    ...
// Cheburashka [
else if( workbyte <= '9' )
{
    //if( wrdlen < 31 )
    if( wrdlen < 128 ) // Cheburashka
    //if( wrdlen < LongestLineInclusive )
    { wrd[ wrdlen ] = workbyte; }
    wrdlen++;
}
else if( workbyte >= 'A' && workbyte <= 'Z' )
{
    //if( wrdlen < 31 )
    if( wrdlen < 128 ) // Cheburashka
    //if( wrdlen < LongestLineInclusive )
    { wrd[ wrdlen ] = workbyte; }
    wrdlen++;
}
// Cheburashka ]
else if( workbyte >= 'a' && workbyte <= 'z' )
{
    //if( wrdlen < 31 )
    if( wrdlen < 128 ) // Cheburashka
    //if( wrdlen < LongestLineInclusive )
    { wrd[ wrdlen ] = workbyte; }
    wrdlen++;
}
else
{
    // This fragment is MIRRORed: #2 copy [
    goto ElStupido;
    // This fragment is MIRRORed: #2 copy ]
}
Scared! Well, the above fragment handles alphanumerical 'words' 1..128 bytes in length.
Having an old Samsung 470 SSD in my laptop (Core 2 T7500 2200MHz), I ran the scariest sort of all, with all-external memory operations, i.e. no physical, no virtual memory used:
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>timer64.exe TcheburaschkaSort TcheburaschkaSort.lst TcheburaschkaSort.txt 2900123 z
Leprechaun_singleton (Fast-In-Future Greedy n-gram-Ripper), rev. 16FIXFIX, written by Svalqyatchx.
Purpose: Rips all distinct 1-grams (1-word phrases) with length 1..128 chars from incoming texts.
Feature1: All words within x-lets/n-grams are in range 1..31 chars inclusive.
Feature2: In this revision 128MB 1-way hash is used which results in 16,777,216 external B-Trees of order 3.
Feature3: In this revision, 1 pass is to be made.
Feature4: If the external memory has latency 99+microseconds then !(look no further), IOPS(seek-time) rules.
Pass #1 of 1:
Size of input file with files for Leprechauning: 34
Allocating HASH memory 134,217,793 bytes ... OK
Allocating/ZEROing 2,969,725,966 bytes swap file ... OK
Size of Input TEXTual file: 1,300,000,000
\; 00,008,643P/s; Phrase count: 10,000,000 of them 10,000,000 distinct; Done: 64/64
Bytes per second performance: 1,123,595B/s
Phrases per second performance: 8,643P/s
Time for putting phrases into trees: 1157 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,056,179P/s
Time for shaking phrases from trees: 356 second(s)
Leprechaun: Current pass done.
Total memory needed for one pass: 2,133,638KB
Total distinct phrases: 10,000,000
Total time: 1528 second(s)
Total performance: 6,544P/s i.e. phrases per second
Leprechaun: Done.
Kernel Time = 1096.609 = 71%
User Time = 217.449 = 14%
Process Time = 1314.058 = 86% Virtual Memory = 130 MB
Global Time = 1527.601 = 100% Physical Memory = 131 MB
E:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>
Only 1527/108=14 times slower than Cubesort, not bad in my eyes.
As for Tcheburaschka, a few things remain to be done - the in-order traversal dump and a few more raw ideas - however I lost my momentum; now I am interested in textual decompression.
Ah, and the name, Чебурашка: it is a little supercalm soullet (English lacks double diminutives), from the golden Russian animated movie, love it.
|
|
|
|
|
How does one reduce a quantified 2-SAT formula to a quantified Horn formula?
|
|
|
|
|
Guys,
for 25 or so days I have been wrestling with the simplest approach to LZSS.
Seeing how demanding (critical) some fellow members are, I would like to hear here, from all programmers interested in this topic, what one good article about LZSS should feature.
So, please share your view on what has to be covered, what should be emphasized, and other must-haves in the article.
Maybe I will write an article; the thing that stops me is the unappreciativeness. My way of doing things is all about enjoying the speeds of small etudes in C, that's why I ask preemptively:
Does CODEPROJECT need the article 'Fastest textual decompression in C'?
Here you can see my etude heavily benchmarked against the best in the world:
http://www.sanmayce.com/Nakamichi/[^]
|
|
|
|
|
Sanmayce wrote: Does CODEPROJECT need the article 'Fastest textual decompression in C'? If you think your article covers a subject better than existing articles, or is the first one to describe the problem, then go ahead. But before you start, please spend time reading A Guide To Writing Articles For Code Project[^] and the Posting Guidelines[^]. You should also look at some of the articles by people such as Pete O'Hanlon[^], Sacha Barber[^] etc., to see the sort of submission that is likely to get good reviews.
|
|
|
|
|
Thanks, I checked all the links you gave me, also searched for LZ, LZSS, compression in article names.
Strange, not a single article on LZ/LZSS, not to mention fast, let alone fastest.
You see Mr. MacCutchan, I want to share a stand-alone file-to-file compressor written in C featuring the fastest decompression of textual data. I don't see it as a utility but rather as an illustration/sample of how to decompress some needed text-like data at speeds of 18%-40% of memcpy().
I want the focus to be precisely on DECOMPRESSION; I don't want to be dragged into mumbo-jumbisms as "How do I compress? The LZSS is not explained. This is not an article. Poor description." and similar complaints.
Seeing some fellow members' articles on LZW and Huffman, I see my approach in a different light; I like benchmarks - real-world ones - disassembly listings and comparisons with the best decompressors of today.
That's why I ask for some preemptive feedback/directions.
Personally, it is annoying to want to share some code etude and, instead of making some interesting place/topic for sharing ideas/benching with fellow coders, to receive complaints like "THIS ARTICLE IS NOT GOOD". I am not a journalist who writes articles; after all, the CP logo says 'For those who code', not 'for those who write articles'.
My point: new ideas/improvements should be easily shareable, I think.
|
|
|
|
|
If you are going to write then write, rather than discussing the what and how in these forums. If your article does not meet the requirements of the site then you have to accept the feedback, and either improve or change it as necessary.
|
|
|
|
|
Sanmayce wrote: I don't want to be dragged into mumbo-jumbisms as "How do I compress?
it's unlikely that many people are going to be familiar with the LZSS algorithm. and people are going to ask you for the compression side of it. so, i'd suggest including a simpler compressor with a description of the algorithm. you have to at least get the readers up to the point where they can follow your discussion of the decompressor. otherwise, you're starting the story in the middle.
|
|
|
|
|
Mr.Losinger,
have you seen Wikipedia's article on LZSS: http://en.wikipedia.org/wiki/LZSS[^]
Now check my amateurish attempt to shed light on LZSS: http://www.sanmayce.com/Nakamichi/[^]
LZSS is simple enough, yet writing an article is not easy; my wish is to contribute to the decompression part with one of the fastest (top 3) etudes, benchmarks and C/Assembly snippets, not to define/explain the algorithm.
For a start, I need a modern machine to test on; sadly I own only a Core 2, and that's a nasty setback for me since I use DWORD fetching (which sucks on Core 2), XMM and even ZMM, which would be so interesting to see what it can bring to the table.
My naive expectation was that some fellow coders would help me at least with its benchmarking; sadly I see no coders interested in this topic.
One of my wishes is to show an etude, Nakamichi 'Sanagi' to be exact, targeting 512bit registers; luckily the Intel C optimizer v14 (which I used) supports them, but still no computer within my reach can run the executable.
"Tianhe-2 the world's fastest supercomputer according to the TOP500 list for June and November 2013 utilizes Xeon Phi accelerators based on Knights Corner."
/Wikipedia/
In my view such a benchmark is both important and interesting; the incoming Knights Corner/Landing processors will show the power of ZMMWORD moves.
"... Far more interesting is that Intel's Knights Landing not only will be an expansion card but also will serve as a standalone platform or processor ..."
/http://www.admin-magazine.com/Articles/Exploring-the-Xeon-Phi/
Simply, the near future will bring monstrous bandwidths; not utilizing them is just ... lame. I see how some 30-year-old algorithms are being / have to be 'pimped' and pumped.
>it's unlikely that many people are going to be familiar with the LZSS algorithm.
Agreed, but it is like that with everything, isn't it?
>i'd suggest including a simpler compressor with a description of the algorithm.
Sure.
>you have to at least get the readers up to the point where they can follow your discussion of the decompressor. otherwise, you're starting the story in the middle.
Of course, I am not some villain wanting to obscure information, but as I said my focus is entirely on textual decompression speed boosts; they alone took a lot of my time to wrestle with.
And for those who want to see what one of the best in the world has done:
http://fastcompression.blogspot.com/p/lz4.html
Comparing an eventual etude with two of the fastest (LzTurbo being the other) is a must, I see no better way to evaluate its worth.
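For readers who have never seen the scheme at all, here is a generic, textbook-style LZSS decode loop - explicitly NOT Nakamichi's format: the control byte, the 12-bit offset and the 4-bit length below are illustrative assumptions, and the byte-by-byte copy is what allows overlapping matches:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* flag bit 1 = literal byte; flag bit 0 = 16-bit (offset,length) pair that copies
   already-decoded output. Assumes well-formed input. */
static size_t lzss_decode(const uint8_t *src, size_t srclen, uint8_t *dst)
{
    const uint8_t *end = src + srclen;
    uint8_t *out = dst;

    while (src < end) {
        uint8_t flags = *src++;
        for (int bit = 0; bit < 8 && src < end; bit++, flags >>= 1) {
            if (flags & 1) {                          /* literal */
                *out++ = *src++;
            } else {                                  /* back reference */
                uint16_t pair   = (uint16_t)(src[0] | (src[1] << 8));
                size_t   offset = (size_t)(pair & 0x0FFF) + 1;   /* 1..4096 back */
                size_t   length = (size_t)(pair >> 12) + 3;      /* 3..18 bytes  */
                const uint8_t *from = out - offset;
                src += 2;
                while (length--)                      /* overlapping copies are fine */
                    *out++ = *from++;
            }
        }
    }
    return (size_t)(out - dst);
}

int main(void)
{
    /* "ABABABAB": two literals, then a 6-byte copy from 2 bytes back. */
    const uint8_t packed[] = { 0x03, 'A', 'B', 0x01, 0x30 };
    uint8_t out[64];
    size_t n = lzss_decode(packed, sizeof packed, out);
    printf("%.*s\n", (int)n, (char *)out);            /* prints ABABABAB */
    return 0;
}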
|
|
|
|
|
The formula that I'm using to create this is from the book "Mathematical Tools in Computer Graphics with C# Implementations", however it only had the formula and no code. You can find the same kind of formulation available online here[^], and skip back and forwards through the slides to see a bit more.
My Implementation looks like this:
Public Function PointOnBesselOverhauserCurve(ByVal p0 As System.Windows.Point, ByVal p1 As System.Windows.Point, ByVal p2 As System.Windows.Point, ByVal p3 As System.Windows.Point, ByVal t() As Double, ByVal u As Double) As System.Windows.Point
    Dim result As New System.Windows.Point()
    Dim ViXPlusHalf, VixMinusHalf, ViYPlusHalf, ViYMinusHalf, ViX, ViY As Double
    ViXPlusHalf = (p2.X - p1.X) / (t(2) - t(1))
    VixMinusHalf = (p1.X - p0.X) / (t(1) - t(0))
    ViYPlusHalf = (p2.Y - p1.Y) / (t(2) - t(1))
    ViYMinusHalf = (p1.Y - p0.Y) / (t(1) - t(0))
    ViX = ((t(2) - t(1)) * VixMinusHalf + (t(1) - t(0)) * ViXPlusHalf) / (t(2) - t(0))
    ViY = ((t(2) - t(1)) * ViYMinusHalf + (t(1) - t(0)) * ViYPlusHalf) / (t(2) - t(0))
    Dim PointList As New PointCollection
    PointList.Add(p1)
    PointList.Add(New Point(p1.X + (1 / 3) * (t(2) - t(1)) * ViX, p1.Y + (1 / 3) * (t(2) - t(1)) * ViY))
    ViXPlusHalf = (p3.X - p2.X) / (t(3) - t(2))
    VixMinusHalf = (p2.X - p1.X) / (t(2) - t(1))
    ViYPlusHalf = (p3.Y - p2.Y) / (t(3) - t(2))
    ViYMinusHalf = (p2.Y - p1.Y) / (t(2) - t(1))
    ViX = ((t(3) - t(2)) * VixMinusHalf + (t(2) - t(1)) * ViXPlusHalf) / (t(3) - t(1))
    ViY = ((t(3) - t(2)) * ViYMinusHalf + (t(2) - t(1)) * ViYPlusHalf) / (t(3) - t(1))
    PointList.Add(New Point(p2.X - (1 / 3) * (t(3) - t(2)) * ViX, p2.Y - (1 / 3) * (t(3) - t(2)) * ViY))
    PointList.Add(p2)
    Return PointBezierFunction(PointList, u)
End Function
I assumed that t(), an array of length equal to the number of points, would be the same in both X and Y directions. Now, how should I adjust my t() values, and should they be dependent on the X and Y values, meaning that they have one value for X and one for Y? And is the implementation correct?
|
|
|
|
|
|
Does anyone have any comment on my factoring algo below:
A new method to factor large semi primes.
The plan is to factor large semi prime numbers. Start by picking 2 Prime Numbers of
appropriate digit size. Choose one as a 200 digit Prime Number, and the other as a 300
digit Prime Number, where their product would be a 500 digit prime number. Then convert
the two decimal digit prime numbers to binary values, and multiply them to form the binary
semi prime product. Calculate the bit size of the binary product. Save a copy of this semi
prime into another value whose bit size is the size of the semi prime rounded up to a
DWORD (mod 32 bit) bound, but left shift it until it fills the largest DWORD. Then save
another copy of this left shifted semi prime, but this time right shifted by 1 bit. These
copies of the semi prime will be used to compare the products of the highest test DWORDS
calculated in the algorithm below (see below for the explanation of why two copies are
used).
The semi prime product would then be factored using a new algorithm.
It has been suggested (by S. C. Coutinho in "The Mathematics of Ciphers", Chapter 11) that
to select the prime numbers for a key of a digit size of r, select one between a digit
size of ((4 / 10) * r) and ((45 / 100) * r). The starting point for the factoring will be
chosen as the middle value between these limits, and will step alternately higher and
lower until the factors are discovered. The first starting factor will then be the p (the
smaller number), and the other starting factor will be q (the larger number) which will be
set to ((10^r ) / p).
With an r of 500, the lowest limit (in digits) will be:
p = ((4 / 10) * r)
p = ((4 / 10) * 500)
p = (4 * 50)
p = 200 digits
The upper limit will be:
p = ((45 / 100) * r)
p = ((45 / 100) * 500)
p = (45 * 5)
p = 225 digits
The mid-point will be:
p = ((200 + 225) / 2)
p = 212 digits
Thus the starting p is a 212 digit number. The other starting factor will be:
q = ((10^r) / p)
q = ((10^500) / (10^212))
q = 288 digits
The p value is then the calculated digit size (converted to bits and then DWORDS), but
the q value must be calculated by subtracting the selected p value bit size from the
selected semi prime bit size (and then rounded up to DWORDS). These bit sizes define bit
fields that will be filled in from both ends with actual values during the factoring
process.
The bit size, digit size, and DWORD sizes of these values are as follows (all of these
powers of 2 are the largest exponent that would not exceed the associated power of 10,
and the DWORD count is the minimum count of DWORDS to contain that binary bit count):
2^704 = 10^212 = 23 DWORDS starting low
2^956 = 10^288 = 30 DWORDS starting high
2^996 = 10^300 = 32 DWORDS max high
2^1660 = 10^500 = 52 DWORDS max product
Three sets of pre-computed tables must be constructed to aid (actually enable) the
factoring of this semi prime.
The first table is saved as 32 bit odd factors (DWORDS), which, when multiplied and
masked to a DWORD, give the same masked product value. This table is saved as multiple
files with names matching the masked product value.
A second, similar, table is constructed for factors which all have the most significant
bit set and saved as files with names matching the highest 32 bits of the product value
(masked to 32 bits).
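As a worked aside (not part of Dave's text): the defining property of those first two tables - DWORD pairs whose product, masked to 32 bits, equals a given end value - can also be seen through modular arithmetic. Every odd 32-bit number has a multiplicative inverse modulo 2^32, so for an odd target each odd candidate A forces a unique partner B. A minimal sketch with a hypothetical target value:

#include <stdio.h>
#include <stdint.h>

/* Multiplicative inverse of an odd number modulo 2^32 (Newton/Hensel lifting). */
static uint32_t inv32(uint32_t a)
{
    uint32_t x = a;               /* a is its own inverse modulo 8 (3 correct bits) */
    for (int i = 0; i < 4; i++)   /* each step doubles the number of correct bits   */
        x *= 2 - a * x;
    return x;
}

int main(void)
{
    uint32_t target = 0x9E3779B1u;   /* hypothetical low DWORD of the semi prime (odd) */
    for (uint32_t a = 3; a < 20; a += 2) {
        uint32_t b = target * inv32(a);           /* a * b == target (mod 2^32) */
        printf("A=%2u  B=0x%08X  A*B masked=0x%08X\n",
               (unsigned)a, (unsigned)b, (unsigned)(a * b));
    }
    return 0;
}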
The third table is just a list of the prime numbers to be considered (all prime numbers
less than 300 decimal digits), but saved in a special way as a multi-level directory
structure as described below.
The entries for this third table consist of DWORD pairs where one of the DWORDS is the
lowest DWORD of the prime number and the other DWORD is the highest 32 bits of the prime
number. All prime numbers that share both of these end values and have the same bit size
will be saved in a directory whose name matches the two DWORDS (high DWORD value and low
DWORD value). The directory sets will be saved under a directory that relates to the bit
size.
One bit of explanation about the information that follows. When a string is enclosed in
quotation marks (""), it is an encoded alphanumeric string. When the characters are not
enclosed in quotes, then the characters are treated as independent decoded DWORD values.
At this point, a diagram may be appropriate. Consider a large prime number with DWORDS
indicated as Hx and Lx, and bit size as a 3 BYTE value indicated by Sx:
H1 H2 H3 H4 ... L4 L3 L2 L1 length Sx
The lowest level directory structure would be (where "Sx" are the different prime number
bit sizes from 2 to 1664):
"Sx"
"Sy"
...
"Sz"
Beneath each of these size directories would be a layer of directories named "H1x L1x"
(whose prime number bit size is "Sx" and where "H1x" and "L1x" are the first level
different highest and lowest DWORD values of the semi prime):
"H1x L1x"
"H1y L1y"
...
"H1z L1z"
Beneath each of these directories would be another layer of directories named "H2x L2x":
"H2x L2x"
"H2y L2y"
...
"H2z L2z"
Essentially, you are creating piecemeal path names on some drive or drives (at some
directory level) as a relative path name:
".\Sx\H1x L1x\H2x L2x\H3x L3x\H4x L4x\ ..."
".\Sx\H1y L1y\H2y L2y\H3y L3y\H4y L4y\ ...
...
".\Sx\H1z L1z\H2z L2z\H3z L3z\H4z L4z\ ..."
".\Sy\H1x L1x\H2x L2x\H3x L3x\H4x L4x\ ..."
".\Sy\H1y L1y\H2y L2y\H3y L3y\H4y L4y\ ...
...
".\Sy\H1z L1z\H2z L2z\H3z L3z\H4z L4z\ ..."
".\Sz\H1x L1x\H2x L2x\H3x L3x\H4x L4x\ ..."
".\Sz\H1y L1y\H2y L2y\H3y L3y\H4y L4y\ ...
...
".\Sz\H1z L1z\H2z L2z\H3z L3z\H4z L4z\ ..."
One other thing to note is that this method greatly reduces the duplication of some of
the DWORD values that would otherwise occur if the actual prime numbers were given as
just a list of full N digit values, i.e., consider 32 DWORD prime numbers of the
following form (where each prime number contained the same low 4 DWORDS and same high 4
DWORDS):
H1 H2 H3 H4 ... ... ... L4 L3 L2 L1
The "... ... ..." represent 24 DWORDS, some of which must be different (remember, this
is from a list of prime numbers that are all unique), yet H1 H2 H3 H4 and L4 L3 L2 L1
DWORDS would all be the same and thus not duplicated in this level structure. In the
maximum duplication form, the H1 through H15 and L1 through L15 DWORDS could all be the
same, and only the H16 and L16 DWORDS would be different.
Note that the data is not really binary data, but directory names which must be formed
with alphanumeric or special characters. What will be done to reduce the size of the data
is that the numeric DWORD values will be converted to encoded character values. Several
things were considered in this quest. If decimal conversion were done (0-9), then the
DWORDS would need 10 characters for each DWORD or 20 characters total. If 4 bit hex
conversions were used (0-9 and A-F), then the DWORDS would need 8 characters for each
DWORD or 16 characters total. If 6 bit ASCII64 encoding were used, then the DWORDS would
need 6 characters for each DWORD or 12 characters total but the resulting upper and lower
case letters would require the use of Unicode filenames, and this alone would double the
size of the data to 24 BYTES. Using a character set of upper case and numeric characters
and allowed special characters (26 + 10 + 16 or a 52 character conversion set) would allow
6 character conversions for each DWORD or 12 characters total, but would require divisions
to be used to extract the actual character indexes. Using a 32 character conversion set
(A-Z and 0-5) would result in 7 character conversions for each DWORD (splitting each
DWORD into seven 5 bit fields and then encoding each 5 bits) giving 14 characters for
both DWORDS, but since it would be better to treat both DWORDS as a single QWORD and
split the 64 bits into 13 parts saving even one more character, that method will be
selected. A multi-level table lookup will be used to convert the encoded 13 characters
back into a binary value (a column of 32 QWORD values indexed by the character value, and
13 columns of these QWORD values indexed by which character of the encoded value was
being decoded, all 13 resulting values would be accumulated by adding to yield the QWORD
binary value, then split to two DWORDS).
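A compact sketch of that 13-character QWORD encoding (the alphabet order and the strchr-based decode below are mine; Dave's decoder uses a 13-by-32 lookup table instead):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 32-character set: A-Z then 0-5; 5 bits per character, 13 characters per QWORD. */
static const char ALPHABET[33] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345";

static void encode_qword(uint64_t v, char out[14])
{
    for (int i = 12; i >= 0; i--) {    /* most significant character first */
        out[i] = ALPHABET[v & 31];
        v >>= 5;
    }
    out[13] = '\0';
}

static uint64_t decode_qword(const char *s)
{
    uint64_t v = 0;
    for (int i = 0; i < 13; i++)
        v = (v << 5) | (uint64_t)(strchr(ALPHABET, s[i]) - ALPHABET);
    return v;
}

int main(void)
{
    uint64_t qword = 0x0123456789ABCDEFULL;   /* hypothetical high:low DWORD pair */
    char name[14];
    encode_qword(qword, name);
    printf("%016llX -> \"%s\" -> %016llX\n", (unsigned long long)qword, name,
           (unsigned long long)decode_qword(name));
    return 0;
}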
The size values in the first (lowest level) directory (three BYTE size values) also need
to be converted to character values. For small prime number size values (2, 3, 4, ...)
this would create character strings such as "AAAAB", and there would also be no need for
the High DWORD, thus such small prime numbers would be saved as only a truncated size
field (delete any leading 'A' characters) and would only contain a single DWORD value in
the next lower directory level (second level) such as:
First Second
Level Level
"B" for 2 bits
"AAAAAAB" for prime number 2
"AAAAAAC" for prime number 3
"C" for 3 bits
"AAAAAAE" for prime number 5
"AAAAAAG" for prime number 7
"D" for 4 bits
"AAAAAAK" for prime number 11
...
...
No attempt will be made to delete the leading 'A' characters in these short prime numbers.
When the bit size exceeds 32 bits, the second DWORD would be created to precede the first
DWORD in the directory name. This second (most significant) DWORD would consist of the
most significant 32 bits of the prime number and would always have the most significant
bit set (always be at least an encoded "B..."), and would result in an encoded 13
character directory name.
The max size of a prime number of 300 digits is 32 DWORDS or 16 QWORDS which would convert
to a directory name of 232 characters (((13 + 1) * 16) + (5 + 1) + (1 + 1)) (max directory
name size plus a separator times 16 sub directory levels plus size directory name plus a
separator plus the drive name plus a separator). This is well within the MAX_PATH for a
directory entry which is 260 BYTES so unicode path names are not required.
The factoring algorithm starts out by factoring the semi prime's lowest DWORD and highest
32 bits into lists of DWORD pairs (masked to a low DWORD for the low DWORD pair and to a
high DWORD for a high 32 bit pair) whose DWORD products match the appropriate end values
of the semi prime. Prime numbers whose ends match these pairs are the only prime numbers
that need to be considered in the factoring process, all other prime numbers cannot
possibly be the correct values. All four of the possibilities must be considered (one at
a time) when considering which value of a pair belongs to the larger prime number and
which value belongs to the smaller prime number. Assume that the end values for the first
(odd number) table pair values were A and B and the second (MSD bit set) table values were
C and D. The possible test primes would then be:
Small and Large
C ... A and D ... B
C ... B and D ... A
D ... A and C ... B
D ... B and C ... A
Start the building process by creating four number buffers (bit fields) whose bit size is
the bit size of the semi prime extended to a mod 32 bit size. Now process each of the
above mentioned 4 possible pairs, let us say, starting with the C ... A and D ... B pair.
Multiply them as follows:
C ... A
* D ... B
--------
RS = B * A (R ignored)
TU = D * C (U ignored)
One word about my selection of variable names R, S, T, U, W, X, Y, and Z (in the
example above and in the example below). I did not use the letter "V" because in my
text editor (PFE), while using the OEM fixed pitch font, the "U" (Uniform) and "V"
(Victor) letters appear almost the same and are easily confused.
Now, S would be equal to the lowest DWORD of the semi prime because the A and B characters
were entries in the factor table for the lowest DWORD in the semi prime, but R needs to be
calculated and saved. T would be equal to the highest 32 bits of the semi prime for the
same reason and U needs to be calculated. S and T will always match the end DWORDS of the
left shifted semi prime (also, see below). The R must be calculated as (A * B) >> 32 and
U must be calculated as (C * D) masked to a DWORD. This will be done once as the factor pair lists are
read and the R values and U values (1 for each pair) saved in two arrays for later use.
Look up the directory ".\Ss\C A" and the directory ".\Sl\D B" (where "Ss" is the small
test prime number size and "Sl" is the large prime number size), and the lowest of their next
level sub-directories (treat the sub-directories as a list of possible useful matches). If
either such directory or sub-directory entry does not exist, then skip to the next low or
high list entry and retry.
Access the first of the ".\Ss\C A" sub-directory entries and assume that it contains the 2
DWORDS as E F and that the first of the ".\Sl\D B" sub-directory entries contains the 2
DWORDS as G H. Combine the sub directory values with the parent directory values to
continue building the prime numbers as:
C E ... F A
* D G ... H B
------------
Multiply F A and H B and mask the product to a QWORD and check if the result matches the
least significant QWORD of the semi prime:
F A
* H B
----
RS = B * A (already computed above)
TU = B * F (T ignored)
WX = H * A (W ignored)
YZ = H * F (YZ ignored)
----
IJKL (IJ ignored)
K ((R + U + X) masked to a DWORD) should match the second lowest DWORD of the semi prime. If not, select the next low pair and try again.
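A compilable sanity check of that low-end test (limb names follow the text; the sample values are hypothetical). The second lowest DWORD of the product is R + U + X taken modulo 2^32, which the sketch compares against a directly computed reference:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 64-bit stand-ins for the bottom two DWORDs of each test prime:
       the small prime ends in (F A), the large prime ends in (H B). */
    uint64_t small_low = 0x12345678DEADBEEFULL;   /* F:A */
    uint64_t large_low = 0x0FEDCBA987654321ULL;   /* H:B */
    uint32_t A = (uint32_t)small_low, F = (uint32_t)(small_low >> 32);
    uint32_t B = (uint32_t)large_low, H = (uint32_t)(large_low >> 32);

    /* Reference: the low 64 bits of the semi prime depend only on these limbs. */
    uint64_t want = small_low * large_low;

    /* Limb-wise reconstruction as in the text: S, then K = R + U + X (mod 2^32). */
    uint64_t RS = (uint64_t)B * A;                 /* R = high DWORD, S = low DWORD */
    uint32_t U  = (uint32_t)((uint64_t)B * F);
    uint32_t X  = (uint32_t)((uint64_t)H * A);
    uint32_t S  = (uint32_t)RS;
    uint32_t K  = (uint32_t)(RS >> 32) + U + X;    /* wraps modulo 2^32, as intended */
    uint64_t got = ((uint64_t)K << 32) | S;

    printf("%s\n", got == want ? "low 64 bits match" : "mismatch");
    return 0;
}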
the next low pair and try again,
Multiply C E and D G and check if the most significant 64 bits of the product matches the
most significant 64 bits of the semi prime:
C E
* D G
----
RS = G * E
TU = G * C
WX = D * E
YZ = D * C (already computed above)
----
MNOP (OP ignored)
MNOP may possibly need to be adjusted because the product may just be 127 bits and not 128
bits. If and only if (IFF) the most significant two bits of C and D are both set, then the
product would be 128 bits, otherwise it would be 127 bits and require a shift to be
correct for a compare with the high semi prime value (the semi prime high end would be
already left shifted to set the most significant bit of a DWORD). Because of this anomaly,
all products starting with these pairs (for all 16 levels) must be left shifted by one
bit. This is time consuming. To avoid this, the left shifted semi prime has been also
saved as a right shifted by one bit value. When the high factor pair is first selected, a
check will be made and the correct pointer to the semi prime check value will be
established for that pair, and product shifting will not be required.
M and N should match the highest 64 bits of the check semi prime. If not, select a new
high pair and reset the low pair to the beginning (working with a new high pair) and start
again until both pairs match. Note that O and P cannot be checked at this time because it
is unknown how much the carry DWORDS will be until the next level testing can select the
third DWORD pair.
If the end of either factor list is reached before the double match, change which of the
four initial pairs is being used, and if all four pairs have been checked, then change the
test number bit sizes (alternate lower and higher around the smaller value mid-point that
was selected and then recalculate the larger number bit count), and reset the lists and
start again until the actual factors are found.
If both end points match then check if you are at the lowest level. Do this by maintaining
the bit sizes by decrementing by 64 as you go to each next level - when the smaller test
number lower size is less than 64, then hold the smaller test number and just process the
larger test number for subsequent levels.
If both sizes are less than 64 then you have the final test primes. Just multiply out the
DWORD values (to get all middle carries) to verify whether this is a valid solution. To
get these values (they are split into 4 parts during this testing), the low DWORD values
are correct, it is only the high values that need to be adjusted and concatenated to the
top of the low values and then multiplied. To do this for the smaller test value, subtract
the bit size of the smaller test value lower test DWORDS (DWORD count * 32) from the total
bit size of the smaller test value. Divide the result by 32 (a shift works really well)
and what remains is the bit count to right shift the smaller test value high DWORD values
deleting low bits that are shifted out of the lowest DWORD value in the high piece (use
right shift double on pairs of adjacent DWORDS), then concatenate the result to the low
DWORDS. Do the same thing to adjust the larger test value high DWORDS. Then multiply and
check the product with the semi prime itself.
If both sizes are > 63 then drop down both of the directory levels and get the next two
pairs of DWORDS for the test prime numbers and extend the multiplies by a DWORD each and
check again until you have the final test primes.
Obviously, quit when you find the two final prime numbers that, when multiplied, result in
the semi prime product.
Dave.
|
|
|
|
|
That's bigger than the common homework questions here. Is it for your thesis?
|
|
|
|
|
Bernhard,
No this is not for any homework assignment. It is just an attempt to find another algorithm for factoring numbers - primarily to factor a semi prime. I know that this will be Big Data, but before I start implementing, is there any problem within the algorithm itself?
Dave.
|
|
|
|
|
|