Click here to Skip to main content
15,896,497 members
Articles / General Programming / Algorithms

A few thoughts on faster sorting

Rate me:
Please Sign up or sign in to vote.
5.00/5 (3 votes)
28 Apr 2024CPOL4 min read 3.6K   4   1
In this article, I am presenting a sorting schema, hugely inspired by the Bucket and Proxmap sorting algorithms.

In this article, I am presenting a sorting schema, hugely inspired by the Bucket and Proxmap sorting algorithms. I call it "schema", rather than "algorithm", simply because it introduces a (or another, depending on the sorting algorithm) "devide and conquer" element to the existing sorting algorithms, aiming (imho) to make them faster.

Also, let's recall that most of the sorting algorithms come with the time complexity

${\displaystyle O(N \log{N})} \tag{1}$

With more details ...

 

1. The description of the schema

Let's suppose we have a list L with N (list size, thus a number) objects to sort. Now, let's imagine that we have M (a number) buckets which are pre-sorted (buckets, not the content) according to the same criteria applied to sort the objects from the list L. Let's use strings as an example, if

$L=\{\text{BBBB}, \text{BBAA}, \text{AAAA},\text{...}\}$

then we can "bucket" by the 1st letter of the string, giving us a total of 26 letters in the English alphabet (and yes let's ignore the case sensitivity and numbers and ... everything to keep things easy). We keep these buckets sorted as

$B_1=\{A\}, B_2=\{B\},...,B_{26}=\{Z\}$

Then we partition/distribute the content of L accross the buckets, sort the buckets individually and (finally) merge the content of the buckets to the final sorted list (simple traversal of the buckets, since they are already sorted). Something like this

 

What we have (in terms of time complexity) as a result

  • List L traversal:
    ${\displaystyle O(N)}$
  • It's important that the distribution of objects to buckets is
    ${\displaystyle O(1)}$
    per object, e.g. using hash maps
  • Sorting per bucket (from (1)) is
    ${\displaystyle O\left(\frac{N}{M} \cdot\log{\frac{N}{M}}\right)}$
    Assumptions is that the content of the list L is "uniformly" distributed (!!!) across the buckets.
  • Merging into the final list is
    ${\displaystyle O(M)}$
    Again, buckets are pre-sorted.

2. The mathematical argument

Time complexity in the schema described above is

${\displaystyle O\left(N + M \cdot \frac{N}{M}\cdot \log{\left(\frac{N}{M}\right)}+M\right)}$

which is

${\displaystyle O\left(N + N\cdot\log{\left(\frac{N}{M}\right)}+M\right)} \tag{2}$

If we compare (1) and (2)

$\frac{N + N\cdot\log{\left(\frac{N}{M}\right)}+M}{N \cdot\log{N}}= 1-\frac{\log{M} - 1}{\log{N}}+\frac{M}{N\cdot\log{N}} \tag{3}$

Assuming large enough M < N we have

$\frac{N + N\cdot\log{\left(\frac{N}{M}\right)}+M}{N \cdot\log{N}}< 1-\frac{\log{M} - 2}{\log{N}}< 1 \tag{4}$

In other words, for large N and M we should expect this schema to make most of the sorting algorithms "a little bit" faster. By how much? Let's see ...

 

3. Specific example

Let's consider lists of strings as an example. Here is the source code of a little PoC project which generates lists of randomly generated strings and sorts them using

  1. the Java's Collections::sort and
  2. an implementation of the schema above, which internally uses Collections::sort as well to sort the content of the buckets

and compares the results. No parallelisation is used, although sorting buckets in parallel could significantly improve the speed in the second (B) case, but no cheating!

 

Most of the aspects in the code are easy to tune, but we will consider only

  • lists of size N=1000000 and N=3000000 and
  • bucketing by the first 2 letters, giving the total number of buckets M=262

Plugging these numbers into the formula (3) we should expect

M=262 and N=1000000 M=262 and N=3000000
0.60 0.63

An improvement of nearly 40%!? No way ...

4. Actual results

One more test parameter I should mention (and which is not part of the formula (3)) is X the length of each string added to the list L, this is to count for Java string comparison where string size plays a role. We will look for X=10, X=50 and X=100. Here are some results from running the PoC code on my fairly old computer with an Intel i5-3330 3.00GHz 4 Cores CPU on board, using Java 17

X=10, M=262 and N=1000000 X=10, M=262 and N=3000000
Reported result:
Fast Sort avg: 422.33ms
Std Sort avg: 677.96ms
Reported result:
Fast Sort avg: 1484.24ms
Std Sort avg: 2395.39ms
$\frac{422.33}{677.96}\approx 0.6229$
$\frac{1484.24}{2395.39}\approx 0.6196$

A few more results

X=50, M=262 and N=3000000 X=100, M=262 and N=3000000
Reported result:
Fast Sort avg: 1545.82ms
Std Sort avg: 2773.65ms
Reported result:
Fast Sort avg: 1561.84ms
Std Sort avg: 2883.08ms
$\frac{1545.82}{2773.65}\approx 0.5573$
$\frac{1561.84}{2883.08}\approx 0.5417$

5. Conclusions

Well, the test results are not too far off from the calculated results. So, the schema does improve the sorting ... However, I should mention the following:

  • It works well when the list L is "uniformly distributed" across the buckets (already mentioned in section 1). Imagine that all the objects fall into one bucket, then we are back to
    ${\displaystyle O(N \log{N})}$
    or slightly worse.
  • The test times don't include the time required to "build" the buckets. Buckets are meant to be "built" once and re-used.

And finally, the schema allows for

  • parallelisation, content of the buckets can be sorted in parallel and
  • imagine a data streaming scenario, if one bucket is updated, we don't have to re-sort the entire list L

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) BlackRock
United Kingdom United Kingdom
My name is Ruslan Ciurca. Currently, I am a Software Engineer at BlackRock.

Comments and Discussions

 
GeneralMy vote of 5 Pin
Ștefan-Mihai MOGA4-May-24 7:10
professionalȘtefan-Mihai MOGA4-May-24 7:10 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.