![]() |
General Programming »
Algorithms & Recipes »
Math
Intermediate
License: The Code Project Open License (CPOL)
Introduction to SSE ProgrammingBy Alex FrAn article describes programming floating-point calculations using Streaming SIMD Extensions |
VC7.1Win2K, WinXP, Win2003, MFC, VS.NET2003, Dev
|
|
Advanced Search |
|
|
|
||||||||||||||||
SIMD is a single-instruction, multiple-data (SIMD) execution model. Consider the following programming task: computing of the square root of each element in a long floating-point array. The algorithm for this task may be written by such way:
for each f in array
f = sqrt(f)
Let's be more specific: for each f in array
{
load f to the floating-point register
calculate the square root
write the result from the register to memory
}
Processor with the Intel SSE support have eight 128-bit registers, each of which may contain 4 single-precision floating-point numbers. SSE is a set of instructions which allow to load the floating-point numbers to 128-bit registers, perform the arithmetic and logical operations with them and write the result back to memory. Using SSE technology, algorithms may be written as: for each 4 members in array
{
load 4 members to the SSE register
calculate 4 square roots in one operation
write the result from the register to memory
}
The C++ programmer writing a program using SSE Intrinsics doesn't care about registers. He has a 128-byte __m128 type and a set of functions to perform the arithmetic and logical operations. It's up to the C++ compiler to decide which SSE register to use and to make code optimizations. SSE technology may be used when some operation is done with each element of a long floating-point arrays.
__m128 data type are defined in xmmintrin.h file: #include <xmmintrin.h>Since SSE instructions are compiler intrinsics and not functions, there are no lib-files.
float array processed by SSE instructions should have 16 byte alignment. A static array is declared using the __declspec(align(16)) keyword: __declspec(align(16)) float m_fArray[ARRAY_SIZE];Dynamic array should be allocated using new
_aligned_malloc function: m_fArray = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);Array allocated by the
_aligned_malloc function is released using the _aligned_free function: _aligned_free(m_fArray);
_m128 are automatically aligned on 16-byte boundaries.
cpuid Assembly command. See details in this sample and in the Intel Software manuals [1].
SSETest project is a dialog-based application which makes the following calculation with three float arrays: fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5
i = 0, 1, 2 ... ARRAY_SIZE-1
ARRAY_SIZE is defined as 30000. Source arrays are filled using sin and cos functions. The Waterfall chart control written by Kris Jearakul [3] is used to show the source arrays and the result of calculations. Calculation time (ms) is shown in the dialog. Calculation may be done using one of three possible ways:
void CSSETestDlg::ComputeArrayCPlusPlus( float* pArray1, // [in] first source array float* pArray2, // [in] second source array float* pResult, // [out] result array int nSize) // [in] size of all arrays { int i; float* pSource1 = pArray1; float* pSource2 = pArray2; float* pDest = pResult; for ( i = 0; i < nSize; i++ ) { *pDest = (float)sqrt((*pSource1) * (*pSource1) + (*pSource2) * (*pSource2)) + 0.5f; pSource1++; pSource2++; pDest++; } }Now let's rewrite this function using the SSE Instrinsics. To find the required SSE Instrinsics I use the following way:
| Required Function | Assembly Instruction | SSE Intrinsic |
| Assign float value to 4 components of 128-bit value | movss + shufps | _mm_set_ps1 (composite) |
| Multiply 4 float components of 2 128-bit values | mulps | _mm_mul_ps |
| Add 4 float components of 2 128-bit values | addps | _mm_add_ps |
| Compute the square root of 4 float components in 128-bit values | sqrtps | _mm_sqrt_ps |
C++ function with SSE Intrinsics:
void CSSETestDlg::ComputeArrayCPlusPlusSSE( float* pArray1, // [in] first source array float* pArray2, // [in] second source array float* pResult, // [out] result array int nSize) // [in] size of all arrays { int nLoop = nSize/ 4; __m128 m1, m2, m3, m4; __m128* pSrc1 = (__m128*) pArray1; __m128* pSrc2 = (__m128*) pArray2; __m128* pDest = (__m128*) pResult; __m128 m0_5 = _mm_set_ps1(0.5f); // m0_5[0, 1, 2, 3] = 0.5 for ( int i = 0; i < nLoop; i++ ) { m1 = _mm_mul_ps(*pSrc1, *pSrc1); // m1 = *pSrc1 * *pSrc1 m2 = _mm_mul_ps(*pSrc2, *pSrc2); // m2 = *pSrc2 * *pSrc2 m3 = _mm_add_ps(m1, m2); // m3 = m1 + m2 m4 = _mm_sqrt_ps(m3); // m4 = sqrt(m3) *pDest = _mm_add_ps(m4, m0_5); // *pDest = m4 + 0.5 pSrc1++; pSrc2++; pDest++; } }This doesn't show the function using inline Assembly. Anyone who is interested may read it in the demo project. Calculation times on my computer:
SSESample project is a dialog-based application which makes the following calculation with float array: fResult[i] = sqrt(fSource[i]*2.8)
i = 0, 1, 2 ... ARRAY_SIZE-1
The program also calculates the minimum and maximum values in the result array. ARRAY_SIZE is defined as 100000. Result array is shown in the listbox. Calculation time (ms) for each way is shown in the dialog:
Assembly code performs better because of intensive using of the SSX registers. However, usually C++ code with SSE Intrinsics performs like Assembly code or better, because it is difficult to write an Assembly code which runs faster than optimized code generated by C++ compiler.
C++ function:
// Input: m_fInitialArray // Output: m_fResultArray, m_fMin, m_fMax void CSSESampleDlg::OnBnClickedButtonCplusplus() { m_fMin = FLT_MAX; m_fMax = FLT_MIN; int i; for ( i = 0; i < ARRAY_SIZE; i++ ) { m_fResultArray[i] = sqrt(m_fInitialArray[i] * 2.8f); if ( m_fResultArray[i] < m_fMin ) m_fMin = m_fResultArray[i]; if ( m_fResultArray[i] > m_fMax ) m_fMax = m_fResultArray[i]; } }C++ function with SSE Intrinsics:
// Input: m_fInitialArray // Output: m_fResultArray, m_fMin, m_fMax void CSSESampleDlg::OnBnClickedButtonSseC() { __m128 coeff = _mm_set_ps1(2.8f); // coeff[0, 1, 2, 3] = 2.8 __m128 tmp; __m128 min128 = _mm_set_ps1(FLT_MAX); // min128[0, 1, 2, 3] = FLT_MAX __m128 max128 = _mm_set_ps1(FLT_MIN); // max128[0, 1, 2, 3] = FLT_MIN __m128* pSource = (__m128*) m_fInitialArray; __m128* pDest = (__m128*) m_fResultArray; for ( int i = 0; i < ARRAY_SIZE/4; i++ ) { tmp = _mm_mul_ps(*pSource, coeff); // tmp = *pSource * coeff *pDest = _mm_sqrt_ps(tmp); // *pDest = sqrt(tmp) min128 = _mm_min_ps(*pDest, min128); max128 = _mm_max_ps(*pDest, max128); pSource++; pDest++; } // extract minimum and maximum values from min128 and max128 union u { __m128 m; float f[4]; } x; x.m = min128; m_fMin = min(x.f[0], min(x.f[1], min(x.f[2], x.f[3]))); x.m = max128; m_fMax = max(x.f[0], max(x.f[1], max(x.f[2], x.f[3]))); }
General
News
Question
Answer
Joke
Rant
Admin
|
PermaLink |
Privacy |
Terms of Use
Last Updated: 10 Jul 2003 Editor: Rob Manderson |
Copyright 2003 by Alex Fr Everything else Copyright © CodeProject, 1999-2009 Web13 | Advertise on the Code Project |