Using SSE/SSE2 for optimization

3 Oct 2004
A beginner's introduction to one of the optimization methods.

Sample Image - Fast_Data_Transfer_Sample.jpg

Objective

My objective in posting this article is to share some simple optimization methods. In the future, I will try to find time to write more articles.

Introduction

This article demonstrates Intel's SIMD (Single Instruction, Multiple Data) extensions. Newer Intel instructions such as movdqa move 128 bits at a time, so they can move (copy) data faster than the typical 32-bit instructions.

Recall

Before we move on, let's recall what we already know. Most of us use 32-bit processors, at home and in industry. General-purpose registers such as eax and ebx are 32 bits wide, and sizeof(int) = 4 (bytes). But not all registers are 32 bits; some are wider. About a decade ago, Intel introduced the MMX extension, which provides eight 64-bit registers, mm0 to mm7. After that, Intel introduced the SSE extension, which adds eight more registers, xmm0 to xmm7, each 128 bits wide. SSE2 then added integer instructions, including movdqa, that operate on these xmm registers. For more details, see the Intel entries in the Links section.
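The register widths recalled above can be checked from C++. The following is only a small sketch, assuming a compiler that ships the SSE2 header `<emmintrin.h>`; the helper names `BitWidth` and `IntsPerXmmRegister` are made up for illustration and are not part of the sample.

```cpp
#include <climits>      // CHAR_BIT
#include <cstddef>      // std::size_t
#include <emmintrin.h>  // SSE2 header: defines the 128-bit type __m128i

// Hypothetical helper: width of a type in bits.
template <typename T>
constexpr std::size_t BitWidth()
{
    return sizeof(T) * CHAR_BIT;
}

// Hypothetical helper: how many int values fit into one XMM register.
constexpr std::size_t IntsPerXmmRegister()
{
    return sizeof(__m128i) / sizeof(int);
}

// Checked at compile time: an XMM register holds 128 bits.
static_assert(BitWidth<__m128i>() == 128, "an XMM register holds 128 bits");
```

With a 4-byte int, `IntsPerXmmRegister()` evaluates to 4, which is why one 128-bit move can replace four 32-bit moves.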

Requirement

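First, ask yourself which machine you are using. The sample uses movdqa, which is an SSE2 instruction, so the processor should be an Intel Pentium 4 or newer (the Pentium III supports SSE but not SSE2). Bear in mind that this optimization method is machine dependent: if your hardware does not support the instruction, the optimised function will not run and you won't be able to see the difference.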
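Because the method is machine dependent, a program can check for support at run time before taking the optimised path. This is a minimal sketch, not part of the original sample: `CpuSupportsSse2` is a hypothetical helper name, the MSVC branch relies on the `__cpuid` intrinsic from `<intrin.h>`, and the GCC/Clang branch relies on `__builtin_cpu_supports`.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>  // __cpuid
#endif

// Hypothetical helper (an assumption, not from the article's download):
// returns true when the CPU reports SSE2 support.
bool CpuSupportsSse2()
{
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                    // CPUID leaf 1: feature flags
    return (regs[3] & (1 << 26)) != 0;   // EDX bit 26 signals SSE2
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;                        // unknown compiler or non-x86 target
#endif
}
```

A caller could use the result to choose between DataTransferOptimised and DataTransferTypical, falling back to the typical version when the check fails.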

Code

I purposely kept the sample simple: it runs in console mode. Rather than cutting and pasting, I would like readers to understand it and try it themselves. Here is where the sample starts.

The demo code lets you see the difference between two functions that serve the same purpose. From here on I won't explain much; you are on your own, so please read the comments in the code. I'm sure you will be able to follow. =)

Wait! Get your breakpoints ready first and sit tight. When debugging, step through both functions and you will notice the difference.

"DataTransferTypical" copies one int per loop iteration (sizeof(int) = 4 bytes), whereas "DataTransferOptimised" copies four ints per iteration (4 * sizeof(int) = 16 bytes).

Setting up your Watch window: add the expression "piDst, 101" to the Watch window, then watch how the destination buffer changes as you step.

P.S.: You need to install the Processor Pack for MSVC++ to compile this code. See the Links section.

C++
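#include <conio.h>   // getche
#include <malloc.h>  // _aligned_malloc, _aligned_free
#include <stdio.h>   // printf
#include <string.h>  // memset
#include <time.h>    // clock

// DATA_SIZE and ITERATION must be defined before main; their values are
// not shown in this listing.
int DataTransferTypical(int* piDst, int* piSrc, unsigned long SizeInBytes);
int DataTransferOptimised(int* piDst, int* piSrc, unsigned long SizeInBytes);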
C++
int main(int argc, char* argv[])
{
 // Variables that keep the start and end time. This is a simple approach;
 // for more accurate timing, please look for another article.
 unsigned long dwTimeStart = 0;
 unsigned long dwTimeEnd = 0;

 // temporary variable
 int *piSrc = NULL;
 int *piDst = NULL;

 int i = 0;
 char cKey = 0;

 unsigned long dwDataSizeInBytes = sizeof(int) * DATA_SIZE;

 // You need to install the Processor Pack for MSVC++ to compile this
 // code; see the Links section.
 // movdqa requires 16-byte-aligned addresses. The alignment argument must
 // be a power of two, so request 16 rather than the buffer size.
 piSrc = (int *)_aligned_malloc(dwDataSizeInBytes, 16);
 piDst = (int *)_aligned_malloc(dwDataSizeInBytes, 16);

 do
 {
  // initialise
  memset(piSrc, 1, dwDataSizeInBytes);
  memset(piDst, 0, dwDataSizeInBytes);

  dwTimeStart = clock();
  for(i = 0; i < ITERATION; i++)
   DataTransferTypical(piDst, piSrc, dwDataSizeInBytes);
  dwTimeEnd = clock();
  printf("== Typical Transfer of %d * %d times of %d bytes data ==\n"
         "Time Elapsed = %lu msec\n\n",
         ITERATION, DATA_SIZE, (int)sizeof(int), dwTimeEnd - dwTimeStart);

  // initialise
  memset(piSrc, 1, dwDataSizeInBytes);
  memset(piDst, 0, dwDataSizeInBytes);

  dwTimeStart = clock();
  for(i = 0; i < ITERATION; i++)
   DataTransferOptimised(piDst, piSrc, dwDataSizeInBytes);
  dwTimeEnd = clock();
  printf("== Optimised Transfer of %d * %d times of %d bytes data ==\n"
         "Time Elapsed = %lu msec\n\n",
         ITERATION, DATA_SIZE, (int)sizeof(int), dwTimeEnd - dwTimeStart);

  printf("Rerun? (y/n) ");
  cKey = getche();
  printf("\n\n");
 }while(cKey == 'y');

 _aligned_free(piSrc);
 _aligned_free(piDst);

 return 0;
}

#pragma warning(push)
#pragma warning(disable:4018 4102)

int DataTransferTypical(int* piDst, int* piSrc, unsigned long SizeInBytes)
{
 unsigned long dwNumElements = SizeInBytes / sizeof(int);

 for(unsigned long i = 0; i < dwNumElements; i++)
 {
  // i is offset.
  *(piDst + i) = *(piSrc + i);
 }

 return 0;
}

int DataTransferOptimised(int* piDst, int* piSrc, unsigned long SizeInBytes)
{
 unsigned long dwNumElements = SizeInBytes / sizeof(int);
 // Not used by the copy itself; kept for debugging. It holds the number of
 // loop iterations, which is also the number of 128-bit packs.
 unsigned long dwNumPacks = dwNumElements / (128/(sizeof(int)*8));

 _asm
 {
  // remember for cleanup
  pusha;
begin:
  // init counter to SizeInBytes
  mov  ecx,SizeInBytes;
  // get destination pointer
  mov  edi,piDst;
  // get source pointer
  mov  esi,piSrc;
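begina:
  // check if counter is 0; if yes, end loop.
  // note: SizeInBytes must be a multiple of 16, otherwise the counter
  // never reaches 0 and the loop runs past the end of the buffers.
  cmp  ecx,0;
  jz  end;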
body:
  // calculate offset
  mov  ebx,SizeInBytes;
  sub  ebx,ecx;
  // copy source's content to 128 bits registers
  movdqa xmm1,[esi+ebx];
  // copy 128 bits registers to destination
  movdqa [edi+ebx],xmm1;

bodya:
  // we've done "1 packed == 4 * sizeof(int)" already.
  sub  ecx,16;
  jmp  begina;
end:
  // cleanup
  popa;
 }

 return 0;
}

#pragma warning(pop)
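The inline assembly above builds only with 32-bit MSVC. The same 16-bytes-per-iteration copy can be sketched with SSE2 compiler intrinsics instead. This is an illustrative alternative, not the author's code: `DataTransferIntrinsics` is a hypothetical name, and the sketch assumes both pointers are 16-byte aligned, as the aligned load and store require. Unlike the assembly version, it copies any leftover bytes one at a time, so the size need not be a multiple of 16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics: _mm_load_si128, _mm_store_si128

// Hypothetical intrinsics counterpart of DataTransferOptimised.
// Assumption: piDst and piSrc are both 16-byte aligned.
int DataTransferIntrinsics(int* piDst, const int* piSrc, unsigned long SizeInBytes)
{
    assert(reinterpret_cast<std::uintptr_t>(piDst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(piSrc) % 16 == 0);

    const unsigned long numPacks = SizeInBytes / 16;  // whole 128-bit packs
    const __m128i* src = reinterpret_cast<const __m128i*>(piSrc);
    __m128i* dst = reinterpret_cast<__m128i*>(piDst);

    for (unsigned long i = 0; i < numPacks; ++i)
    {
        const __m128i pack = _mm_load_si128(src + i);  // aligned 128-bit load
        _mm_store_si128(dst + i, pack);                // aligned 128-bit store
    }

    // Copy the bytes that do not fill a whole pack, one byte at a time.
    const unsigned char* tailSrc =
        reinterpret_cast<const unsigned char*>(piSrc) + numPacks * 16;
    unsigned char* tailDst =
        reinterpret_cast<unsigned char*>(piDst) + numPacks * 16;
    for (unsigned long i = 0; i < SizeInBytes % 16; ++i)
    {
        tailDst[i] = tailSrc[i];
    }

    return 0;
}
```

Buffers returned by `_aligned_malloc(size, 16)` satisfy the alignment assumption. The compiler chooses the actual instructions, so stepping through the disassembly is the way to confirm what was emitted.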

Finally

This is my first article on Code Project, so please bear with me if something is not right. I hope the demo I uploaded is simple enough for beginners. Nothing fancy. Learning is fun, right? =)

Links

History

I will update this article only on request. The sample code will not be maintained.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Written By
f2
Software Developer (Senior)
Singapore Singapore
He started programming in dBase, Pascal, C, and then assembly. He actively works on image processing algorithms and customised vision applications. His major is in control engineering, motion control, machine vision, and statistics.
He likes to work on projects that require careful analytical methods.
He can be reached at albertoycc@hotmail.com.
