
# Introduction to SSE Programming

10 Jul 2003, CPOL
This article describes programming floating-point calculations using Streaming SIMD Extensions (SSE).

## Introduction

The Intel Streaming SIMD Extensions (SSE) technology enhances the performance of floating-point operations. Visual Studio .NET 2003 supports a set of SSE Intrinsics that allow the use of SSE instructions directly from C++ code, without writing Assembly instructions. The MSDN SSE topics [2] may be confusing for programmers who are not familiar with SSE Assembly programming. However, reading the Intel Software manuals [1] together with MSDN makes it possible to understand the basics of SSE programming.

SIMD stands for single-instruction, multiple-data: an execution model in which one instruction operates on several data elements at once. Consider the following programming task: computing the square root of each element in a long floating-point array. The algorithm for this task may be written this way:

```
for each f in array
    f = sqrt(f)
```
Let's be more specific:
```
for each f in array
{
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}
```
Processors with Intel SSE support have eight 128-bit registers, each of which may contain 4 single-precision floating-point numbers. SSE is a set of instructions that load floating-point numbers into the 128-bit registers, perform arithmetic and logical operations on them, and write the results back to memory. Using SSE technology, the algorithm may be written as:
```
for each 4 members in array
{
    load 4 members to the SSE register
    calculate 4 square roots in one operation
    write the result from the register to memory
}
```
A C++ programmer writing a program with SSE Intrinsics does not need to care about registers. The programmer has a 128-bit `__m128` type and a set of functions to perform arithmetic and logical operations. It is up to the C++ compiler to decide which SSE register to use and to optimize the code. SSE technology may be used when the same operation is applied to each element of a long floating-point array.
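The SSE pseudocode above might look like the following with intrinsics. This is a minimal sketch, not code from the demo projects: the function name `SqrtInPlace` is hypothetical, and unaligned loads and stores (`_mm_loadu_ps`, `_mm_storeu_ps`) are used so that the sketch does not depend on the alignment rules discussed later. Elements left over when the count is not a multiple of 4 are handled by a plain scalar loop.

```cpp
#include <xmmintrin.h>  // SSE intrinsics and the __m128 type
#include <cmath>        // std::sqrt for the scalar tail
#include <cstddef>      // std::size_t

// Hypothetical helper (not part of the article's demo projects):
// replaces every element of data[0..count) with its square root.
void SqrtInPlace(float* data, std::size_t count)
{
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
    {
        __m128 v = _mm_loadu_ps(data + i);  // load 4 floats into an SSE register
        v = _mm_sqrt_ps(v);                 // calculate 4 square roots in one operation
        _mm_storeu_ps(data + i, v);         // write the 4 results back to memory
    }
    for (; i < count; ++i)                  // leftover 0..3 elements, one at a time
        data[i] = std::sqrt(data[i]);
}
```

Calling `SqrtInPlace(a, 6)` on a six-element array processes the first four elements in one SSE iteration and the remaining two in the scalar loop.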

## SSE Programming Details

### Include Files

All SSE intrinsics and the `__m128` data type are declared in the `xmmintrin.h` file:
`#include <xmmintrin.h>`
Since SSE instructions are compiler intrinsics and not functions, there are no lib-files.

### Data Alignment

Each `float` array processed by SSE instructions should have 16-byte alignment. A static array is declared using the `__declspec(align(16))` keyword:
`__declspec(align(16)) float m_fArray[ARRAY_SIZE];`
A dynamic array should be allocated using the `_aligned_malloc` function:
`m_fArray = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);`
An array allocated by the `_aligned_malloc` function is released using the `_aligned_free` function:
`_aligned_free(m_fArray);`
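The declarations above are specific to Visual C++. As a hedged sketch for other toolchains, the standard C++11 `alignas(16)` specifier and the `_mm_malloc`/`_mm_free` pair can play the same roles; neither is used in the article's projects, and `_mm_malloc` is assumed here to be reachable through `xmmintrin.h`, as it is with common compilers. The names `IsAligned16`, `AllocFloats16`, `FreeFloats16` and `g_staticArray` are hypothetical.

```cpp
#include <xmmintrin.h>  // assumed to provide _mm_malloc / _mm_free
#include <cstddef>      // std::size_t
#include <cstdint>      // std::uintptr_t

// Hypothetical helper: true if p lies on a 16-byte boundary.
bool IsAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Static storage: alignas(16) is the standard counterpart of __declspec(align(16)).
alignas(16) float g_staticArray[8];

// Dynamic storage: 16-byte aligned block of `count` floats, or a null pointer on failure.
float* AllocFloats16(std::size_t count)
{
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
}

// Memory obtained from _mm_malloc must be released with _mm_free.
void FreeFloats16(float* p)
{
    _mm_free(p);
}
```

A check such as `IsAligned16(pArray)` is cheap enough to place in a debug assertion before casting a `float*` to `__m128*`.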

### __m128 Data Type

Variables of this type are used as operands of SSE instructions. Their contents should not be accessed directly. Variables of type `__m128` are automatically aligned on 16-byte boundaries.
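Since the contents of an `__m128` value should not be touched directly, values are normally moved in and out through intrinsics. The sketch below illustrates this with `_mm_setr_ps`, `_mm_set_ps` and `_mm_storeu_ps`; the helper names `ToFloats` and `MakeOneToFour` are hypothetical. Note that `_mm_set_ps` takes its arguments from the highest component down to the lowest, while `_mm_setr_ps` takes them in memory order.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: copies the four components of v into out[0..3].
// _mm_storeu_ps places no alignment requirement on `out`.
void ToFloats(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}

// Hypothetical helper: builds a value whose component 0 is 1.0f
// and whose component 3 is 4.0f (arguments given in memory order).
__m128 MakeOneToFour()
{
    return _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
}
```

Passing `_mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f)` to `ToFloats` instead yields the reversed order, with `4.0f` in `out[0]`.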

### Detection of SSE Support

SSE instructions may be used only if they are supported by the processor. The Visual C++ CPUID sample [4] shows how to detect support for SSE, MMX and other processor features. Detection is done using the `cpuid` Assembly instruction. See the details in this sample and in the Intel Software manuals [1].
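As a rough illustration of the check, CPUID function 1 reports SSE support in bit 25 of the EDX register. The sketch below is not the code of the CPUID sample [4]: it assumes a compiler that offers either the `__cpuid` intrinsic in `<intrin.h>` (newer Visual C++ versions) or `__get_cpuid` in `<cpuid.h>` (GCC and Clang), and the function name `HasSSE` is hypothetical.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid (assumed: a Visual C++ version that ships it)
#else
#include <cpuid.h>      // __get_cpuid (GCC and Clang)
#endif

// Hypothetical helper: true if CPUID function 1 reports SSE (EDX bit 25).
bool HasSSE()
{
#if defined(_MSC_VER)
    int info[4] = {0, 0, 0, 0};
    __cpuid(info, 1);                        // info = {EAX, EBX, ECX, EDX}
    return (info[3] & (1 << 25)) != 0;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                        // CPUID function 1 is not available
    return (edx & (1u << 25)) != 0;
#endif
}
```

A program would typically call such a function once at startup and fall back to the plain C++ code path when it returns `false`.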

## SSETest Demo Project

The `SSETest` project is a dialog-based application which performs the following calculation with three `float` arrays:
```
fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5

i = 0, 1, 2 ... ARRAY_SIZE-1
```
`ARRAY_SIZE` is defined as 30000. The source arrays are filled using the `sin` and `cos` functions. The Waterfall chart control written by Kris Jearakul [3] is used to show the source arrays and the result of the calculation. The calculation time (ms) is shown in the dialog. The calculation may be done in one of three ways:
• C++ code;
• C++ code with SSE Intrinsics;
• Inline Assembly with SSE instructions.
C++ function:
```
void CSSETestDlg::ComputeArrayCPlusPlus(
    float* pArray1,                   // [in] first source array
    float* pArray2,                   // [in] second source array
    float* pResult,                   // [out] result array
    int nSize)                        // [in] size of all arrays
{
    int i;

    float* pSource1 = pArray1;
    float* pSource2 = pArray2;
    float* pDest = pResult;

    for ( i = 0; i < nSize; i++ )
    {
        *pDest = (float)sqrt((*pSource1) * (*pSource1) + (*pSource2)
                 * (*pSource2)) + 0.5f;

        pSource1++;
        pSource2++;
        pDest++;
    }
}
```
Now let's rewrite this function using the SSE Intrinsics. To find the required SSE Intrinsics, I proceed as follows:
• Find the Assembly SSE instruction in the Intel Software manuals [1]. I first look for the instruction in Volume 1, Chapter 9, and then find its detailed description in Volume 2. This description also contains the name of the corresponding C++ Intrinsic.
• Search for the SSE Intrinsic name in the MSDN Library.
Some SSE Intrinsics are composite and cannot be found this way. They should be looked up directly in the MSDN Library (the descriptions are very short but readable). The results of such a search are shown in the following table:

| Required Function | Assembly Instruction | SSE Intrinsic |
|---|---|---|
| Assign float value to 4 components of 128-bit value | movss + shufps | _mm_set_ps1 (composite) |
| Multiply 4 float components of 2 128-bit values | mulps | _mm_mul_ps |
| Add 4 float components of 2 128-bit values | addps | _mm_add_ps |
| Compute the square root of 4 float components in 128-bit values | sqrtps | _mm_sqrt_ps |
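To see the four intrinsics from the table working together on a single group of 4 values, consider the following small sketch. It is illustrative only: the function name `Hypot4PlusHalf` is hypothetical, and unaligned loads and stores are used so that ordinary local arrays can be passed in.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: out[k] = sqrt(a[k]*a[k] + b[k]*b[k]) + 0.5 for k = 0..3.
void Hypot4PlusHalf(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    __m128 half = _mm_set_ps1(0.5f);             // composite: 0.5 in all 4 components
    __m128 sum  = _mm_add_ps(_mm_mul_ps(va, va), // mulps twice, then addps
                             _mm_mul_ps(vb, vb));
    __m128 res  = _mm_add_ps(_mm_sqrt_ps(sum),   // sqrtps, then addps
                             half);

    _mm_storeu_ps(out, res);
}
```

For example, with `a = {3, 6, 5, 8}` and `b = {4, 8, 12, 15}` the result is `{5.5, 10.5, 13.5, 17.5}`.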

C++ function with SSE Intrinsics:

```
// Assumes that all three arrays are 16-byte aligned and that
// nSize is a multiple of 4 (leftover elements are not processed).
void CSSETestDlg::ComputeArrayCPlusPlusSSE(
    float* pArray1,                   // [in] first source array
    float* pArray2,                   // [in] second source array
    float* pResult,                   // [out] result array
    int nSize)                        // [in] size of all arrays
{
    int nLoop = nSize / 4;

    __m128 m1, m2, m3, m4;

    __m128* pSrc1 = (__m128*) pArray1;
    __m128* pSrc2 = (__m128*) pArray2;
    __m128* pDest = (__m128*) pResult;

    __m128 m0_5 = _mm_set_ps1(0.5f);        // m0_5[0, 1, 2, 3] = 0.5

    for ( int i = 0; i < nLoop; i++ )
    {
        m1 = _mm_mul_ps(*pSrc1, *pSrc1);        // m1 = *pSrc1 * *pSrc1
        m2 = _mm_mul_ps(*pSrc2, *pSrc2);        // m2 = *pSrc2 * *pSrc2
        m3 = _mm_add_ps(m1, m2);                // m3 = m1 + m2
        m4 = _mm_sqrt_ps(m3);                   // m4 = sqrt(m3)
        *pDest = _mm_add_ps(m4, m0_5);          // *pDest = m4 + 0.5

        pSrc1++;
        pSrc2++;
        pDest++;
    }
}
```
The function using inline Assembly is not shown here. Anyone who is interested may read it in the demo project. Calculation times on my computer:
• C++ code - 26 ms
• C++ with SSE Intrinsics - 9 ms
• Inline Assembly with SSE instructions - 9 ms
Execution time should be measured in the Release configuration, with compiler optimizations enabled.
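One possible way to take such measurements is sketched below. It is not the timing code of the demo project: it relies on the C++11 `<chrono>` header, which Visual Studio .NET 2003 does not provide, and the name `AverageMilliseconds` is hypothetical. Repeating the calculation several times and averaging reduces the influence of timer resolution on calculations that take only a few milliseconds.

```cpp
#include <chrono>

// Hypothetical helper: calls work() `repeats` times and returns
// the average duration of one call in milliseconds.
template <typename Work>
double AverageMilliseconds(Work work, int repeats)
{
    using Clock = std::chrono::steady_clock;

    const Clock::time_point start = Clock::now();
    for (int r = 0; r < repeats; ++r)
        work();
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;

    return elapsed.count() / repeats;
}
```

The helper accepts any callable, so the plain C++ version and the SSE version of a calculation can each be wrapped in a lambda and timed with the same number of repeats.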

## SSESample Demo Project

The `SSESample` project is a dialog-based application which performs the following calculation with a `float` array:
```
fResult[i] = sqrt(fSource[i]*2.8)

i = 0, 1, 2 ... ARRAY_SIZE-1
```
The program also calculates the minimum and maximum values in the result array. `ARRAY_SIZE` is defined as 100000. The result array is shown in the listbox. The calculation time (ms) for each way is shown in the dialog:
• C++ code - 6 ms on my computer;
• C++ code with SSE Intrinsics - 3 ms;
• Inline Assembly with SSE instructions - 2 ms.

Here the Assembly code performs better because of its intensive use of the SSE registers. Usually, however, C++ code with SSE Intrinsics performs as well as Assembly code or better, because it is difficult to write Assembly code that runs faster than the optimized code generated by the C++ compiler.

C++ function:

```
// Input: m_fInitialArray
// Output: m_fResultArray, m_fMin, m_fMax
void CSSESampleDlg::OnBnClickedButtonCplusplus()
{
    m_fMin = FLT_MAX;
    m_fMax = -FLT_MAX;    // lowest float; FLT_MIN is the smallest positive value

    int i;

    for ( i = 0; i < ARRAY_SIZE; i++ )
    {
        m_fResultArray[i] = sqrt(m_fInitialArray[i] * 2.8f);

        if ( m_fResultArray[i] < m_fMin )
            m_fMin = m_fResultArray[i];

        if ( m_fResultArray[i] > m_fMax )
            m_fMax = m_fResultArray[i];
    }
}
```
C++ function with SSE Intrinsics:
```
// Input: m_fInitialArray
// Output: m_fResultArray, m_fMin, m_fMax
void CSSESampleDlg::OnBnClickedButtonSseC()
{
    __m128 coeff = _mm_set_ps1(2.8f);       // coeff[0, 1, 2, 3] = 2.8
    __m128 tmp;

    __m128 min128 = _mm_set_ps1(FLT_MAX);   // min128[0, 1, 2, 3] = FLT_MAX
    // lowest float; FLT_MIN is the smallest positive value
    __m128 max128 = _mm_set_ps1(-FLT_MAX);  // max128[0, 1, 2, 3] = -FLT_MAX

    __m128* pSource = (__m128*) m_fInitialArray;
    __m128* pDest = (__m128*) m_fResultArray;

    for ( int i = 0; i < ARRAY_SIZE/4; i++ )
    {
        tmp = _mm_mul_ps(*pSource, coeff);      // tmp = *pSource * coeff
        *pDest = _mm_sqrt_ps(tmp);              // *pDest = sqrt(tmp)

        min128 = _mm_min_ps(*pDest, min128);    // per-component running minimum
        max128 = _mm_max_ps(*pDest, max128);    // per-component running maximum

        pSource++;
        pDest++;
    }

    // extract minimum and maximum values from min128 and max128
    union u
    {
        __m128 m;
        float f[4];
    } x;

    x.m = min128;
    m_fMin = min(x.f[0], min(x.f[1], min(x.f[2], x.f[3])));

    x.m = max128;
    m_fMax = max(x.f[0], max(x.f[1], max(x.f[2], x.f[3])));
}
```
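The final step above, reducing the 4 components of `min128` and `max128` to a single value, can also be done without a union. The following is a sketch of one alternative, not the method used in the demo project; the names `HorizontalMin` and `HorizontalMax` are hypothetical, and `_mm_cvtss_f32` is assumed to be available in the compiler's `xmmintrin.h`.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: smallest of the 4 components of v.
float HorizontalMin(__m128 v)
{
    // _mm_movehl_ps(v, v) = {v2, v3, v2, v3}, so after this step
    // component 0 = min(v0, v2) and component 1 = min(v1, v3).
    __m128 t = _mm_min_ps(v, _mm_movehl_ps(v, v));
    // Broadcast component 1 and combine it with component 0.
    t = _mm_min_ss(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(t);                 // read component 0 as a float
}

// Hypothetical helper: largest of the 4 components of v.
float HorizontalMax(__m128 v)
{
    __m128 t = _mm_max_ps(v, _mm_movehl_ps(v, v));
    t = _mm_max_ss(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(t);
}
```

With these helpers the extraction step would reduce to `m_fMin = HorizontalMin(min128);` and `m_fMax = HorizontalMax(max128);`.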

## Sources

1. Intel Software manuals.
2. MSDN, Streaming SIMD Extensions (SSE). http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vcrefstreamingsimdextensions.asp
3. Waterfall chart control written by Kris Jearakul. http://www.codeguru.com/controls/Waterfall.shtml
4. Microsoft Visual C++ CPUID sample. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/vcsamcpuiddeterminecpucapabilities.asp
5. Matt Pietrek. Under The Hood. February 1998 issue of Microsoft Systems Journal. http://www.microsoft.com/msj/0298/hood0298.aspx
