## Introduction

Sometimes no matter how fast sorting data is, more speed is required. Here are a few ideas about algorithms and other ways to decrease sort time.

**Sort time for some systems may nearly disappear.**

Questions are added to keep challenging our assumptions.

## Background

There is some data to sort. It may be a large to a VERY large amount of data.

However much data exists, there is significant time (that we humans can perceive) that is needed for the sorting to finish. If the sort time is less than 1 second, it may still be useful for us to keep on with this investigation.

#### Lazy Optimization: Can we just skip sorting altogether?

No. Follow on processing needs the data in a specific order.

### Find the Sorting Characteristics

Because we want to know more about where the sorting effort is going, please consider either instrumenting the code and/or using a profiling tool.

Absolute minimum sort instrumentation would be time used and number of data pieces to sort.

Better sort instrumentation would also include: number of comparisons, number of exchanges, time spent doing comparisons, time spent doing exchanges, time used by other operations of the sorting code (number of memory allocations/reallocations for example).

We should also know more about the data itself as it effects sort performance and which sort algorithm would be likely to work well with the data.

What is the structure of the data? Are there other structures that would be more efficient?

We don't typically write code that jumps all over the place when running. This helps the processor fetch instructions with fewer CPU cache misses. The same thing applies to data structures. **Locality of reference **(for data) can definitely be our friend.

Do we have to worry about the data structure sizes or is there a little slack to support faster sorting by adding a helper member to the data structure?

Where is the data stored and can it be moved to faster access storage mediums while sorting?

Sorting a file temporarily loaded into memory is usually significantly faster that sorting it on disk. For example, on the old Palm OS handhelds, everything was already in memory but the file APIs were slower than using an array in memory.

So we search this website - insertion_sorted_list.aspx

(above is similar to what is suggested here)

http://www.codeproject.com/KB/tips/optimizationenemy.aspx

(and try not to get carried away with optimization)

and elsewhere on the web - http://en.wikipedia.org/wiki/Sorting_algorithm

to find better sorting algorithms.

We learn about **divide and conquer algorithms**: especially as related to sorting.

http://en.wikipedia.org/wiki/Divide-and-conquer_algorithm

We find a new variation on Quicksort that divides into 3 parts instead of 2.

We try out the 3 way Quicksort and it is somewhat faster. Let's say total time used drops by 1/3.

**Is this fast enough? **

Maybe not if we review some assumptions.

## What's Wrong with this Picture?

Let's step back and try to reason by analogy.

### Warehouse This!

We have an empty warehouse. Trucks are arriving all the time. Some of the trucks have full containers and others have a few small boxes. We don't have much time so we just pile everything into one corner of the warehouse and keep putting more stuff in there expanding towards the rest of the warehouse as the pile fills up. Finally the last truck (for now) has been unloaded. A customer shows up and wants a list of items from the pile.

Time to turn loose the sorting helpers. Perhaps in a few days or weeks, all the items from the customers' list are found.

**Nobody runs a warehouse this way. Why do we get computer data this way?**

We should clarify what is **required **by some formal or informal architecture or design approach and what is **really needed** by follow on processing.

**Follow on processing really needs sorted data** for more analysis or UI presentation or whatever. Follow on processing does not care how or when the data sorting happens so long as it happens. So what is really required here is the sorting and not **when **the sorting happens.

Going back to our warehouse analogy: It makes a lot more sense to sort the items from the trucks as they are unloaded into the warehouse. Then getting the list of customer items is much easier and faster. This is the approach we should very much consider for computer data: sort it as we get it.

Let's go back to **Sorting Algorithms definition** (sort of Computer Science approach):

We have a list (or array or whatever) of N items. Sort the list in some wanted order.

We should consider the (maybe more practical) **system design definition**:

We have a system that both acquires and sorts data for later processing. Data may be sorted after all acquisition is finished or during acquisition. We should prototype and benchmark sorting during acquisition and compare with sorting after all acquisition. We will revise this definition as needed (there are other trade-offs to consider here as well).

### But wait! There is no sorting algorithm to sort data as it is acquired! ? !

That is exactly right. We are now thinking outside the box of Computer Science sorting algorithms definition and trying to make software more effective.

### Sorted Insertion "Algorithm" Pseudo-code

- Allocate an empty list (but consider allocating for N items if N is known).
- Call outside to get some data (may be more than 1 piece at a time).
- Put the first data piece in the list without comparing.
- Set the insertion point variable to be the first list position.
*While *there is more data:
Use the inner loop of the Insertion Sort algorithm to do a sorted insertion into the already sorted list 1 data piece at a time.

It may be faster to first compare with the insertion point and/or the list element just after it to find the proper insertion point for this data piece. Save the current insertion point for later comparison.

// The insertion point approach works because sometimes
// data will be already ordered correctly and only a couple
// of compares are needed to find the proper place.

- Call outside to get some more data or else done.
- Loop back to
*While* until there is no more data to get.
- (Done.)

### Rough Estimates For O(n) Sorted Insertion Performance

Keep in mind that the number of sorted items to check with here goes from 1 to N.

All other sorting algorithms have N items to wade through.

So the **average number of items to wade through is N/2 and not always N**.

**Best Case**: Data to sort is already in sorted order.

Overhead is 2 comparisons per data item and 1 list insertion.

I think this is something like O(n) (linear cost).

**Average Case**: Not really sure how to estimate this. Highly dependent on exactly what the variations of actual or test data patterns are and the exact data structures used.

**Worst Cases**: Data is in reverse of sorted order or Data is in random order.

I think this is something like O(n^2) (quadratic cost). But n is a lower value here than other algorithms. If sorting using arrays, then a Binary search can help find the insertion point.

Reverse sorted order data performance can be helped by backing up to previous point from last known insertion point. We could expect that the number of list items to back up through would be small. Then overhead would be similar to best case.

**Is this the best we can hope for? Let's go from code to our analogy.**

### Warehouse This! (Part 2)

We will now apply our pseudo code back to the warehouse. When a truck arrives, a worker unloads a box and then goes and puts it in the right place in the warehouse. Then the worker goes back to get another box from the truck. Continue until all the boxes are unloaded. Finally the last truck (for now) has been unloaded.

**This is another poor example of warehouse operations!**

Most warehouses would use 2 or more workers and 1 of them would unload the truck and place the boxes inside at some temporary area. Another worker would take from the temporary area to where each box should be. We now need to consider 2 or more worker threads in our algorithm.

### Sorted Insertion "Algorithm" Pseudo-code (Multiple Threads)

- Allocate an empty list (but consider allocating for N items if N is known).
- Set the insertion point variable to be the first list position.
- Create Thread 2 and have it wait to be started with data to sort.
*While *there is more data:
- Thread 1: Get some data (may be more than 1 piece at a time).
- Thread 1: May place data in a queue or just wake up Thread 2.
- Thread 1: Loop back to
*While*.
- Thread 2: Wakes up and starts doing a Sorted Insertion
Thread 2: Use the inner loop of the Insertion Sort algorithm to do a sorted insertion into the already sorted list 1 data piece at a time.

Thread 2: It may be faster to first compare with the insertion point and/or the list element just after it to find the proper insertion point for this data piece. Save the current insertion point for later comparison.

// The insertion point approach works because sometimes
// data will be already ordered correctly and only a couple
// of compares are needed to find the proper place.

- Thread 2: Keep sorting data until no data remains.

## Estimates For O(n) Multi Thread Sorted Insertion Performance

IF the time to get the data is greater or equal to the time to sort the previous data, then assuming that there was no CPU starvation of the sort thread(s), the sort time would only be as much as the very last few items after no more data was found.

Let S represent the Small (when compared to N) number of items in the last sort call.

**Best Case**: Data to sort is already in sorted order.

Overhead is 2 comparisons per data item and 1 list insertion.

I think this is something like O(s) (small linear cost).

**Average Case**: Not really sure how to estimate this. Highly dependent on exactly what the variations of actual or test data patterns are and the exact data structures used. For now, let's use O(s^2) (small quadratic cost).

**Worst Cases**: Data is in reverse of sorted order or Data is in random order.

I think this is something like O(s^2) (small quadratic cost). But n is a lower value here than other algorithms. If sorting using arrays, then a Binary search can help find the insertion point.

Reverse sorted order data performance can be helped by backing up to previous point from last known insertion point. We could expect that the number of list items to back up through would be small. Then overhead would be similar to best case.

## Experience 1: Sorting a Linked List

Backing up when sorting a **forward linked list** is going to be a problem. Not sure if there is a graceful solution here. A not so graceful approach that worked for me was to also keep track of the 1/4 and 1/2 and 3/4 positions of the forward linked list and then move from there forward through the links to find the insertion point. This is like a coarse grained list of starting points kind of like a crude **Binary **search to then run a forward **Linear **search. Because the Linear search was started at the nearest 1/4 way point and the average number of items is N/2, then the average number of items covered by the Linear search at any time is N/16 (1/2 of 1/4 of N/2). This refinement decreased sort time by about another 30% compared to using only the insertion point approach to sort a forward linked list of over 10,000+ items. Keeping track of the 1/8 positions MAY see another worthwhile speed improvement.

Sort times were originally over an hour for the 10,000+ items. A team member changed code to use Quicksort and the time was just under an hour (about 25% faster). Change to Sorted Insertion (no helping with faster Linear search) dropped the time down to about 1 minute (60x faster). Sorted Insertion with help for Linear search dropped the time to around 40 seconds. Environment was original IBM PC 8088 CPU at 4.77 MHz running graphics engine written in 16 bit ASM as a DOS TSR. The forward linked list was an edge list that had to be sorted top to bottom to work with the supported 4096 x 1367 24 bit RGB digital film recorder and/or various printers and displays.

(Below is about using divide and conquer to work with data larger than the main DOS 1 MB area. This is pretty much half of a sort merge where the merge was not needed.)

Later on, the code was enhanced to work with even larger edge lists that would no longer fit in memory. Sorted Insertion was used until about to run out of memory and then the 1/2 way insertion position was used to split the list in memory out to 2 RAM disk files. More edges could then be added as the memory was also split and used as 2 buffers. Buffers would be written until they filled up. Then a filled up buffer would be split again. Splitting could continue until up to 26 temporary RAM disk files and 26 memory buffers were in use. Organization of the disk files was kept consecutive so later processing where the edge lists are changed to high res RGB raster in 3 separate passes went fast enough that there was no big film recorder throughput decrease.

### Experience 2: Sorting an Array of Strings

Sort times were originally over 90 minutes for the 30,000+ items. A team member changed code to use Quicksort and the time was just over an hour (about 30% faster). Change to Sorted Insertion dropped the time down to about 45 seconds (80x faster and this includes time to acquire the data from the Windows and NetWare network APIs). Because an array was used, there was no code needed to help with Linear searching. Environment was Win95 Pentium 90 MHz running OS and app install setup over a LAN written in 32 bit C++. The array list was a list that had to be sorted for display in the UI showing all the network node names or user names of a large corporate intranet.

### Sorted Insertion Summary

There are several other approaches that should also help with faster sorting not covered here.

Alternate ways to structure data (with related algorithms) and throwing HW at the problem with multiprocessing and/or multithreading seem to be top candidates. See parallelSort.aspx.

### Pros

**When **to sort data along with worker threads may greatly reduce sort time
- IF time to get Data >= time to Sort previous data, THEN Sort time can be nearly zero!
- For better performance, a pool of threads getting Data and another pool of threads Sorting may be needed.

### Cons

- Added complexity to the data acquisition logic
- Strongly recommend prototyping/benchmarking for your use
- I have not seen any research on this approach (maybe I missed those web sites)

## Points of Interest

Sometimes you just have to follow your programmer's instincts even when coworkers scoff at the idea of sorting faster than Quicksort.

When using an analogy, try going both to and from the analogy and check for unreasonable statements. Keeping in mind that an analogy is an approximation, the question remains: is it useful for this context?

## History

- 11
^{th} June, 2011: Initial post
- 14
^{th} June, 2011: Article updated