# Using SSE/SSE2 for optimization

3 Oct 2004
A beginner's introduction to one of the optimization methods.

## Introduction

This article demonstrates Intel's SIMD (Single Instruction, Multiple Data) extensions. An SSE2 instruction such as `movdqa` moves (copies) 16 bytes at a time, so a copy loop built on it can run faster than a typical loop that copies one `int` at a time.

## Recall

Before we move on, let's recall what we already know. Most of us use 32-bit processors at home, and even in industry. General-purpose registers such as `eax` and `ebx` are 32 bits wide, and `sizeof(int)` = 4 (bytes). But not all registers are 32 bits; some are wider. About a decade ago, Intel introduced the MMX extension, which provides eight 64-bit registers, `mm0`, `mm1` .. `mm7`. After that, Intel introduced the SSE extension, which adds another eight registers, `xmm0`, `xmm1` .. `xmm7`, each 128 bits wide. If you want more details, please go to my Links section and look for Intel.
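To make the register widths concrete, here is a minimal sketch. It assumes the compiler's SSE2 intrinsics header `<emmintrin.h>`, which the article itself does not use (the article works in inline assembly). `__m128i` is the C/C++ type that maps onto one 128-bit `xmm` register; `roundtrip_through_xmm` and `kIntsPerXmm` are hypothetical names chosen for this illustration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (an assumption; not part of the article's sample)
#include <cassert>

// One xmm register is 128 bits wide: 16 bytes, or four 32-bit ints.
static_assert(sizeof(__m128i) == 16, "an xmm register holds 16 bytes");
constexpr int kIntsPerXmm = sizeof(__m128i) / sizeof(int);

// Hypothetical helper: move four ints through one 128-bit value.
inline void roundtrip_through_xmm(const int in[4], int out[4]) {
    assert(in != nullptr && out != nullptr);
    // One 16-byte load (unaligned form, so any int array works here).
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    // One 16-byte store writes all four ints back.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

Calling `roundtrip_through_xmm` with an array of four `int`s copies all of them with a single load and a single store, which is the whole trick the rest of the article builds on.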

## Requirement

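Ask yourself first what machine you are using. SSE itself is available on the Intel Pentium III and newer, but `movdqa`, which the sample uses, is an SSE2 instruction and needs a Pentium 4 or newer. Bear in mind that this optimization method is machine dependent: if your hardware does not support these instructions, the optimised function will fail with an illegal-instruction exception instead of showing you any difference.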
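If you are not sure what your processor supports, you can ask it at run time. The following is a sketch, not part of the original sample: it reads CPUID leaf 1, where bit 25 of `EDX` reports SSE and bit 26 reports SSE2. The struct `SimdSupport` and the function `query_simd_support` are hypothetical names, and the code assumes either MSVC (`__cpuid` from `<intrin.h>`) or GCC/Clang (`__get_cpuid` from `<cpuid.h>`).

```cpp
#include <cassert>
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid (MSVC)
#else
#include <cpuid.h>    // __get_cpuid (GCC/Clang)
#endif

struct SimdSupport {
    bool sse;   // CPUID.1:EDX bit 25
    bool sse2;  // CPUID.1:EDX bit 26 (needed for movdqa)
};

// Hypothetical helper: query the SSE/SSE2 feature bits of the running CPU.
inline SimdSupport query_simd_support() {
    unsigned int edx = 0;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);                       // regs = {EAX, EBX, ECX, EDX}
    edx = static_cast<unsigned int>(regs[3]);
#else
    unsigned int eax = 0, ebx = 0, ecx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return {false, false};              // leaf 1 not available
    }
#endif
    SimdSupport r = {((edx >> 25) & 1u) != 0, ((edx >> 26) & 1u) != 0};
    assert(!r.sse2 || r.sse);               // SSE2 hardware always has SSE too
    return r;
}
```

A program could call `query_simd_support()` once at start-up and fall back to the plain `int` loop when `sse2` is false.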

## Code

I purposely kept the sample simple, so it runs in console mode. Please don't just cut and paste; I would rather the reader understand it and try it themselves. Here is where the sample starts.

The demo code lets you see the difference between two functions that serve the same purpose. From here on I won't explain much; you are on your own, so please read the comments within the code. I'm sure you will be able to catch up. =)

Wait! Get your breakpoints ready first and sit tight. When you debug, step through both functions and you will notice the difference.

`DataTransferTypical` copies one `int` per loop iteration (`sizeof(int)` = 4 bytes), whereas `DataTransferOptimised` copies four `int`s per iteration (4 * `sizeof(int)` = 16 bytes).

Set up your Watch window: watch `piDst, 101`, and you will see how the destination changes as you step.

P.S.: You need to install the Processor Pack for MSVC++ to compile this code. See the Links section.
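The idea described above, four `int`s per iteration instead of one, can also be written with compiler intrinsics instead of inline assembly. The sketch below is an alternative formulation, not the author's code: it assumes `<emmintrin.h>` is available, and `copy_ints_sse2` is a hypothetical name. `_mm_load_si128` and `_mm_store_si128` are the aligned 128-bit load and store, the intrinsic counterparts of `movdqa`.

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_store_si128
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical intrinsics version of the packed copy.
// Both pointers must be 16-byte aligned, as movdqa requires.
inline void copy_ints_sse2(int* dst, const int* src, std::size_t count) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);

    std::size_t i = 0;
    // Main loop: one 128-bit load plus one 128-bit store moves four ints.
    for (; i + 4 <= count; i += 4) {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_store_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    // Tail: copy the remaining 0..3 ints one at a time.
    for (; i < count; ++i) {
        dst[i] = src[i];
    }
}
```

Unlike the assembly loop in the listing, this version handles element counts that are not a multiple of four, and it also compiles for 64-bit targets, where MSVC does not accept inline assembly.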

```cpp
#include <stdio.h>
#include <string.h>   // memset
#include <time.h>     // clock
#include <conio.h>    // getche (MSVC)
#include <malloc.h>   // _aligned_malloc, _aligned_free (MSVC)

// DATA_SIZE and ITERATION are not shown in the original listing. The values
// below are placeholders (an assumption). DATA_SIZE must be a multiple of 4
// so that the buffer size is a multiple of 16 bytes.
#define DATA_SIZE 100000
#define ITERATION 1000

int DataTransferTypical(int* piDst, int* piSrc, unsigned long SizeInBytes);
int DataTransferOptimised(int* piDst, int* piSrc, unsigned long SizeInBytes);

int main(int argc, char* argv[])
{
    // Start and end time. This is a simple timer; if you want an accurate
    // one, please look for another article.
    unsigned long dwTimeStart = 0;
    unsigned long dwTimeEnd = 0;

    // temporary variables
    int *piSrc = NULL;
    int *piDst = NULL;

    int i = 0;
    char cKey = 0;

    unsigned long dwDataSizeInBytes = sizeof(int) * DATA_SIZE;

    // You need to install the Processor Pack for MSVC++ to compile this.
    // movdqa needs 16-byte aligned addresses, and the alignment argument
    // of _aligned_malloc must be a power of two.
    piSrc = (int *)_aligned_malloc(dwDataSizeInBytes, 16);
    piDst = (int *)_aligned_malloc(dwDataSizeInBytes, 16);

    do
    {
        // initialise
        memset(piSrc, 1, dwDataSizeInBytes);
        memset(piDst, 0, dwDataSizeInBytes);

        dwTimeStart = clock();
        for(i = 0; i < ITERATION; i++)
            DataTransferTypical(piDst, piSrc, dwDataSizeInBytes);
        dwTimeEnd = clock();
        printf("== Typical Transfer of %d * %d times of %u bytes data ==\n"
               "Time Elapsed = %lu msec\n\n",
               ITERATION, DATA_SIZE, (unsigned)sizeof(int), dwTimeEnd - dwTimeStart);

        // initialise
        memset(piSrc, 1, dwDataSizeInBytes);
        memset(piDst, 0, dwDataSizeInBytes);

        dwTimeStart = clock();
        for(i = 0; i < ITERATION; i++)
            DataTransferOptimised(piDst, piSrc, dwDataSizeInBytes);
        dwTimeEnd = clock();
        printf("== Optimised Transfer of %d * %d times of %u bytes data ==\n"
               "Time Elapsed = %lu msec\n\n",
               ITERATION, DATA_SIZE, (unsigned)sizeof(int), dwTimeEnd - dwTimeStart);

        printf("Rerun? (y/n) ");
        cKey = getche();
        printf("\n\n");
    }while(cKey == 'y');

    _aligned_free(piSrc);
    _aligned_free(piDst);

    return 0;
}

#pragma warning(push)
#pragma warning(disable:4018 4102)

int DataTransferTypical(int* piDst, int* piSrc, unsigned long SizeInBytes)
{
    unsigned long dwNumElements = SizeInBytes / sizeof(int);

    // Copy one int (4 bytes) per iteration.
    for(unsigned long i = 0; i < dwNumElements; i++)
    {
        // i is the element offset.
        *(piDst + i) = *(piSrc + i);
    }

    return 0;
}

int DataTransferOptimised(int* piDst, int* piSrc, unsigned long SizeInBytes)
{
    unsigned long dwNumElements = SizeInBytes / sizeof(int);
    // Not really used, just for debugging: the number of loop iterations,
    // which is also the number of 128-bit packs.
    unsigned long dwNumPacks = dwNumElements / (128/(sizeof(int)*8));

    // 32-bit MSVC inline assembly. piDst and piSrc must be 16-byte aligned.
    _asm
    {
        // remember for cleanup
        pusha;
    begin:
        // init counter to SizeInBytes
        mov  ecx,SizeInBytes;
        // get destination pointer
        mov  edi,piDst;
        // get source pointer
        mov  esi,piSrc;
    begina:
        // end the loop when fewer than 16 bytes remain (a tail of 1..15
        // bytes is not copied). The original test, cmp ecx,0 / jz end,
        // never ends if SizeInBytes is not a multiple of 16.
        cmp  ecx,16;
        jb   end;
    body:
        // calculate offset
        mov  ebx,SizeInBytes;
        sub  ebx,ecx;
        // copy 16 bytes of the source to a 128-bit register
        movdqa xmm1,[esi+ebx];
        // copy the 128-bit register to the destination
        movdqa [edi+ebx],xmm1;

    bodya:
        // we've done "1 packed == 4 * sizeof(int)" already.
        sub  ecx,16;
        jmp  begina;
    end:
        // cleanup
        popa;
    }

    return 0;
}

#pragma warning(pop)
```
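`movdqa` faults if either address is not a multiple of 16, which is why the listing allocates its buffers with `_aligned_malloc`. The sketch below shows the two usual ways to deal with that, under the assumption that the intrinsics headers are available: allocate on a 16-byte boundary with `_mm_malloc`/`_mm_free`, or use the unaligned load/store intrinsics (the counterparts of `movdqu`) when you cannot control the alignment. `is_aligned16`, `alloc_ints16`, `free_ints16` and `copy_ints_unaligned` are hypothetical helper names.

```cpp
#include <emmintrin.h>  // _mm_loadu_si128, _mm_storeu_si128; also provides _mm_malloc/_mm_free
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p sits on a 16-byte boundary, as movdqa requires.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Allocate `count` ints on a 16-byte boundary; release with free_ints16().
inline int* alloc_ints16(std::size_t count) {
    int* p = static_cast<int*>(_mm_malloc(count * sizeof(int), 16));
    assert(p == nullptr || is_aligned16(p));
    return p;
}

inline void free_ints16(int* p) {
    _mm_free(p);
}

// Copy that tolerates any alignment by using unaligned 128-bit accesses.
inline void copy_ints_unaligned(int* dst, const int* src, std::size_t count) {
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    for (; i < count; ++i) {   // remaining 0..3 ints
        dst[i] = src[i];
    }
}
```

The unaligned form is the safer default when pointers come from elsewhere, for example when copying from the middle of a buffer; the aligned form is only valid when both addresses are known to be multiples of 16.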

## Finally

This is my first article on Code Project, so please bear with me if something is not right. I also hope the demo that I uploaded here is simple enough for beginners. Nothing fancy. Learning is fun, right? =)

## History

I will only update this article when people request it. The sample code will not be maintained.


## About the Author

Software Developer (Senior)
Singapore
He started programming in dBase, Pascal, C, then assembly. He actively works on image processing algorithms and customised vision applications. His major is actually more in control engineering, motion control, machine vision and statistics.
He likes to work on projects that require careful analytical methods.
He can be reached at albertoycc@hotmail.com.

Comment by unicon88 (10-Dec-04 7:54): Unfortunately, the above sample is not optimized.

I applied the SSE memory copy to an 800*600*sizeof(short) memory block (I am currently absorbed in programming a PC game at 800*600, 16-bit resolution).

My test results are different from the above sample, and other memory block copy tests give the same result.

The above sample's results are related to the CPU cache.

If the source and destination pointers passed to `DataTransferOptimised` change between calls, the SSE memory copy gets no cache gains.

My test program is here.

```cpp
//---------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <malloc.h>

#ifndef __STANDARD_TYPEDEFS__
#define __STANDARD_TYPEDEFS__
typedef unsigned char U8;
typedef unsigned short U16;
typedef unsigned int U32;
#endif

// MEMCPY32 is not safe with misaligned memory.
void __fastcall MEMCPY32(U16 *lpiDst, U16 *lpiSrc, int nSize)
{
    _asm
    {
        mov ecx, nSize
        mov edi, lpiDst
        shr ecx, 2
        mov esi, lpiSrc
        rep movsd
    }
}

unsigned long g_dwCLOCK_HI;
unsigned long g_dwCLOCK_LO;

// reset clock counter.
void StartClockCounter(void)
{
    _asm
    {
        rdtsc
        mov g_dwCLOCK_HI, EDX
        mov g_dwCLOCK_LO, EAX
    }
}

// get clock counter.
unsigned long lGetClockCounter(void)
{
    _asm
    {
        rdtsc
        sub EDX, g_dwCLOCK_HI
        sub EAX, g_dwCLOCK_LO
        mov g_dwCLOCK_HI, EDX
        mov g_dwCLOCK_LO, EAX
    }

    return g_dwCLOCK_LO;
}

// DataTransferTypical() and DataTransferOptimised() are repeated here
// unchanged from the article's listing.

int main(void)
{
    int n, nLoop;
    int nBufWidth, nBufHeight;
    int nDataSize;
    char chKey;
    unsigned long Time1, Time2;

    U16 *lp1, *lp2;
    U16 *lp1_aligned, *lp2_aligned;

    nBufWidth = 800;
    nBufHeight = 600;

    nDataSize = nBufWidth*nBufHeight*sizeof(U16);
    // make 800*600 (16BIT) buffer.
    lp1_aligned = (U16 *)_aligned_malloc(nDataSize, 16);
    lp2_aligned = (U16 *)_aligned_malloc(nDataSize, 16);

    nLoop = nBufHeight;

    do
    {
        lp1 = lp1_aligned;
        lp2 = lp2_aligned;

        StartClockCounter();
        for(n=0; n < nLoop; n++)
        {
            MEMCPY32(lp1, lp2, nBufWidth*2); // 32BIT block copy by using MOVSD.
            lp1 += nBufWidth;
            lp2 += nBufWidth;
        }

        Time1 = lGetClockCounter();

        printf("Elapsed Time1:%lu\n", Time1);

        lp1 = lp1_aligned;
        lp2 = lp2_aligned;

        StartClockCounter();

        for(n=0; n < nLoop; n++)
        {
            DataTransferOptimised((int *)lp1, (int *)lp2, nBufWidth*2);
            lp1 += nBufWidth;
            lp2 += nBufWidth;
        }

        Time2 = lGetClockCounter();

        printf("Elapsed Time2:%lu\n", Time2);

        if (Time2 < Time1)
            printf("%.2f%% Faster.\n", (float)((float)Time1/(float)Time2)*100-100.);
        else
            printf("%.2f%% Slower.\n", (float)((float)Time2/(float)Time1)*100-100.);

        printf("More?(y/n)");
        chKey = getche();
        printf("\n\n");

    }while(chKey == 'y');

    _aligned_free(lp1_aligned);
    _aligned_free(lp2_aligned);

    return 0;
}
//---------------------------------------------------------------------------
```