|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
IntroductionWhen we sit down to solve a computational problem, the most obvious solution is often the worst. Let's look at Bubble Sort, for example. It is likely the first algorithm to appear to someone who is just beginning to think about the problem of sorting, given a computer to do the work: start with the first two items, compare them, swap them if the order in the list is reversed, and move to the next item. This algorithm performs equally "well" regardless of the initial condition of the list it is used to sort. Given n items in a list, there will always need to be n2-n comparisons. Usually, we'd move on from this algorithm quickly, noting that this O(n2) performance can get out of hand very, very quickly, and this makes Bubble Sort virtually useless in the face of much better algorithms which are not too much more difficult to implement. There is something we can use it for, however - testing how quickly we can run a bad algorithm. Since performance degrades geometrically (specifically: quadratically), we see performance differences between implementations more clearly, since doubling the number of items will much more than double the time that the algorithm takes. Plus, Bubble Sort is very easy to implement, and when you are hand-coding SSE2 assembly instructions for the first time, this is a big help. BackgroundI'm working on a set of computational geometry algorithms for NetTopologySuite, and some of them have O(n2) performance. Since the algorithms are mostly numerical and perform sorting, and alternative algorithms appear to be hard to implement, I wanted to find a way to take advantage of modern hardware to speed up the operations. Since SSE2 is present on most peoples' computers these days (both Intel and AMD), it seemed like a good choice to work with. I also found, while investigating various data structures to hold numbers, and to my dismay, that Java does a better job of sorting arrays of numbers, since apparently (after browsing the process' code pages during run-time) Sun's HotSpot Java VM takes advantage of SSE2 instructions, and MS's CLR JIT compiler doesn't. It was a shocker. I wanted to set things right again, and out came the C++/CLI guns to regain some ownership on the platform. Et Voilá: a managed assembly which smokes all comers. Too bad it isn't verifiable, though. I think MS has some work to do on the JIT compiler in this area to get the JIT to create more specialized code, especially for vector arrays. Preparing to use SSE2The first thing to know when writing to SSE2 instructions is that we will be using registers and memory locations which are 16 bytes (128 bits) wide. The names of the registers are XMM0 through XMM7 (64-bit systems also have XMM8 through XMM15). They are usually divided into two 64-bit regions called "quadwords" (for double precision floating-point values). They can also be divided into four 32-bit regions called "doublewords" (for single precision floating-point values). Similar divisions exist for integers, words, and bytes. The whole length is called a "double quadword". The registers look like this:
If you are used to the general-purpose registers on the x86 platform (EAX, EBX, etc.), these registers are four times larger. In order to move data in and out of these registers, you need1 to work with a block of memory which is aligned to a 16-byte (128-bit) boundary. The MS CRT has a function to do this for us: Plodding along with the codeThe next thing I found out is that there is very little guidance in writing the SSE2 assembly. Sure, there are some simple examples of basic operations usage or data movement as well as examples about fairly advanced numerical methods on the 'Net (and here at CodeProject), but I couldn't find any which shows how to construct a basic general-purpose algorithm based on the instructions. The Intel processor manuals were essential, although they were much more about reference, and any usage had to be inferred after much trial and error. Further, having done regular 32-bit x86 assembly programming, working with SSE2 was like a separate world in the processor altogether. There are a few ways to communicate from the XMM registers to the general registers (like EAX, EBX, etc.). Eventually, I found the I also had to go through three or four approaches before I figured out how to move data around from register to register. I mostly struggled with The codeThe code provided consists of a method which contains the implementation of the Bubble Sort algorithm in an "asm" code block and a managed C++/CLI framework around it to run it and output some interesting stats to the console. This approach also highlights that it is possible to use C++/CLI to hand code assembly blocks and easily use them from any managed language like C#, VB.NET or IronPython. After setting up an aligned buffer (see above), we can get the address of the list and set up our loops: // values is a double* to an _aligned_malloc() block
// byteCount is the size of the aligned buffer at *values
_asm
{
// save state
pusha;
begin:
mov esi, values; // get pointer to data
// init loop counters to itemCount
// ECX: outer loop counter
// EDX: inner loop counter
mov ecx, 0;
outer_loop:
// check if counter is 0; end sorting if true
cmp ecx, byteCount;
je end_outer;
// reset inner loop counter
mov edx, byteCount;
sub edx, ecx;
// setup data indexer
mov ebx, 0
inner_loop:
cmp edx, 16;
je end_inner;
Then comes the body, where the SSE2 registers and instructions are used. There are generally two variants of each SSE2 instruction: a packed variant, for working on multiple data values, and a scalar variant, for working on single values. We want the packed variants so we can compare two numbers at once. The packed variants end in "pd" for "packed double". First, we move data in from the buffer: // load xmm registers with data to sort
movapd xmm0, [esi+ebx];
movapd xmm1, xmm0;
movapd xmm2, [esi+ebx+16];
movapd xmm3, xmm2;
Now, we have two copies of each of the first four entries in the list: the first two in the XMM0 and XMM1 registers, and the second two in the XMM2 and XMM3 registers. The XMM0 and XMM1 registers look like this:
Since we are not sure that these two values are ordered within the register, we need to compare them. To do this, we will swap the doubles within one of the registers for each pair (this is why we loaded each pair into two registers): // reverse values to compare
shufpd xmm1, xmm1, 1;
shufpd xmm3, xmm3, 1;
The operation
When called with the same register, The registers XMM0, XMM1, XMM2, and XMM3 now exist in a condition such that some choice of XMM0 or XMM1 and some choice of XMM2 or XMM3 results in four sorted elements. The trick was to find SSE2 instructions which would make the decision on which pair of registers to keep. After rummaging through the Intel processor manuals for a few hours over a month or so, I found some that work: // save value for use in max compare
movapd xmm4, xmm0;
Now comes a series of comparisons to push the smallest values into XMM0 and XMM1 and the largest into XMM2 and XMM3: // push the minimum values down into xmm[0-1]
minpd xmm0, xmm2;
minpd xmm1, xmm2;
// push the maximum values up into xmm[2-3]
maxpd xmm2, xmm4;
maxpd xmm3, xmm4;
The code isn't quite obvious, so let's look at it a bit. The instruction Finally, we need to compare intra-register to make sure the two double values within a register are ordered: // order the doubles in xmm[0-1]
order_min0:
movapd xmm4, xmm0;
shufpd xmm4, xmm4, 1;
cmpltpd xmm4, xmm0;
movmskpd eax, xmm4;
cmp eax, 2;
je order_min1;
shufpd xmm0, xmm0, 1;
order_min1:
movapd xmm4, xmm1;
shufpd xmm4, xmm4, 1;
cmpltpd xmm4, xmm1;
movmskpd eax, xmm4;
cmp eax, 2;
je finish_min;
shufpd xmm1, xmm1, 1;
finish_min:
minpd xmm0, xmm1;
// order the doubles in xmm[2-3]
order_max0:
movapd xmm4, xmm2;
shufpd xmm4, xmm4, 1;
cmpltpd xmm4, xmm2;
movmskpd eax, xmm4;
cmp eax, 2;
je order_max1;
shufpd xmm2, xmm2, 1;
order_max1:
movapd xmm4, xmm3;
shufpd xmm4, xmm4, 1;
cmpltpd xmm4, xmm3;
movmskpd eax, xmm4;
cmp eax, 2;
je finish_max;
shufpd xmm3, xmm3, 1;
finish_max:
maxpd xmm2, xmm3;
The trick here is to use
The value of 2 in the At this point, the registers XMM0 and XMM2 are ordered correctly, and we save them back to the data buffer, moving on to the next pair in the Bubble Sort. ResultsThe sort appears to be almost, but not quite, twice as fast as a scalar pointer implementation. This is about what we should expect, since the SSE2 version is making twice the compares in the same number of executions - which is the whole benefit of a vector processor, after all. This isn't true when dealing with a small number of items, say under 64. In this case, it appears that the overhead of SSE2 makes the pointer implementation faster. Perhaps, there are ways to eliminate this overhead by getting the processor to fetch pages from the RAM into the cache before processing them. The following shows a typical run:
If we compare a scalar implementation using pointers, we find the following timings:
Further investigationI've made this code part of a project on CodePlex: Code Rule. The idea here is to implement algorithms across a number of languages / frameworks to help visualize the relative performance of algorithms and the machines they are run on. Some of the more interesting things to try out with this example would be to use pre-fetching to control the caching of upcoming data, if the buffer is too big to fit in the processor's data cache. Exploiting the extra registers on a 64-bit system or the other cores in a multi-core machine would also be useful. Of course, this would ultimately be practical only if a better algorithm were implemented, like QuickSort. ReferenceI mostly used the Intel processor manuals, Volumes 1, 2A, and 2B to construct this example. These, and more, can also be found here. Footnotes1) OK, you don't need to, however, you probably really, really want to, since it is a big performance hit otherwise. History
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||