
Inner Product Experiment: CPU, FPU vs. SSE*

20 May 2008 · GPLv3 · 2 min read
This article demonstrates the inner product operation performed on shorts, ints, floats and doubles with the CPU/FPU and with SSE, for comparison.

Introduction

The inner product (or dot product, scalar product) operation is one of the fundamental operations in the digital signal processing field. It is used everywhere: Fourier analysis (FFT, DCT), wavelet analysis, filtering operations and so on. With SSE technology you can parallelize this operation, performing multiplication and addition on several numbers at once. But what precision should you choose for the calculations: integers, floats, or doubles? In this article I demonstrate the inner product operation on shorts, ints, floats and doubles, performed with both the CPU/FPU and with SSE/SSE2/SSE3 optimized versions.

Background

You need an understanding of the inner product operation and of SSE technology. Wikipedia has answers to every question; have a look at its inner product page. In short, you take two std::vector arrays (of floats, ints, shorts, or doubles) of equal length, multiply them element-wise, and sum up the entries of the resulting vector, producing a single number. For example, (1, 2, 3) · (4, 5, 6) = 1·4 + 2·5 + 3·6 = 32. For SSE programming there is a nice article, Introduction to SSE Programming, here at CodeProject. A minimal scalar reference implementation is sketched below.
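
For reference, a minimal scalar implementation of the operation might look like this (a sketch; the function name is illustrative, and the measured versions appear below):

C++
#include <cstddef>
#include <vector>

// Reference scalar inner product: multiply element-wise, sum the products.
template <typename T>
T inner_product_ref(const std::vector<T>& v1, const std::vector<T>& v2)
{
        T sum = T(0);
        for (std::size_t i = 0; i < v1.size(); i++)
                sum += v1[i] * v2[i];
        return sum;
}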

Using the code

Just run the console application and provide the length of the arrays for the inner product as the first argument. It creates two vectors of that length with random entries and computes their inner product, printing the results and processing times for chars, shorts/shorts SSE2, ints, floats/floats SSE/floats SSE3, and doubles/doubles SSE2.

>inner.exe 5000000
 chars          processing time: 13 ms
 -119
 shorts         processing time: 17 ms
 3241
 shorts sse2    processing time: 6 ms
 3241
 ints           processing time: 16 ms
 786540
 floats         processing time: 30 ms
 1339854
 sse2 intrin    processing time: 13 ms
 1339854
 sse3 assembly  processing time: 12 ms
 1339854
 doubles        processing time: 22 ms
 1107822
 doubles sse2   processing time: 25 ms
 1107822

These are the results for the inner product of two vectors of size 5,000,000; the line after each type shows the rounded result of the operation. I ran it on my AMD Turion 64 2.2 GHz processor. We can see that short precision with SSE2 is the fastest. The novel haddps instruction in SSE3 allows faster processing than SSE2. Double precision on the FPU outperformed float precision on the FPU. However, SSE2 optimization of doubles decreases the speed of computation, whereas for floats we get more than a 2x increase in speed with SSE compared to the FPU.

If you perform DSP on image data, use SSE2 with shorts (using integer filters) or use floats. Do not optimize doubles with SSE2; that leads to a decrease in performance. If you are not going to use SSE optimization, use double precision: it is faster than floats.

I did not try it on other processors; if your results differ, let me know. The code for performing the inner product operation is presented below.

On floats with FPU:

C++
std::vector<float> v1;
std::vector<float> v2;
v1.resize(size);
v2.resize(size);
float* pv1 = &v1[0];
float* pv2 = &v2[0];

// Fill both vectors with random integers in [-32, 31].
for (unsigned int i = 0; i < size; i++) {
        pv1[i] = float(rand() % 64 - 32);
        pv2[i] = float(rand() % 64 - 32);
}

float sum = 0;
for (unsigned int i = 0; i < size; i++)
    sum += pv1[i] * pv2[i];        
wprintf(L" %d\n", (int)sum);
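
The same scalar loop can also be expressed with the standard library; a sketch using std::inner_product (shown for comparison, not part of the benchmark):

C++
#include <numeric>

// Equivalent scalar version; the 0.0f initial value keeps the
// accumulation in float, matching the hand-written loop above.
float sum2 = std::inner_product(v1.begin(), v1.end(), v2.begin(), 0.0f);
wprintf(L" %d\n", (int)sum2);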

SSE2 optimized shorts:

C++
short sse2_inner_s(const short* p1, const short* p2, unsigned int size)
{
        const __m128i* mp1 = (const __m128i *)p1;
        const __m128i* mp2 = (const __m128i *)p2;
        // Zero accumulator with eight 16-bit lanes; the partial sums wrap
        // modulo 2^16, just as plain short arithmetic does.
        __m128i mres = _mm_setzero_si128();
        
        for(unsigned int i = 0; i < size/8; i++) {                 
                __m128i mtmp = _mm_mullo_epi16(_mm_loadu_si128(mp1),
                    _mm_loadu_si128(mp2)); 
                mres = _mm_add_epi16(mres, mtmp);
                mp1++;
                mp2++;
        }

        short res[8];
        __m128i* pmres = (__m128i *)res;
        _mm_storeu_si128(pmres, mres);

        return res[0]+res[1]+res[2]+res[3]+res[4]+res[5]+res[6]+res[7];
}
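
Note that the eight 16-bit partial sums above wrap around for long vectors, just as plain short arithmetic does. If a wider result is needed, SSE2 also offers pmaddwd, which multiplies 16-bit pairs and adds adjacent products into 32-bit lanes; a sketch of that variant (the function name is illustrative, and it is not part of the benchmark):

C++
#include <emmintrin.h>

// Variant accumulating in 32-bit lanes via _mm_madd_epi16 (pmaddwd),
// so the partial sums do not wrap at 16 bits.
int sse2_inner_s32(const short* p1, const short* p2, unsigned int size)
{
        const __m128i* mp1 = (const __m128i *)p1;
        const __m128i* mp2 = (const __m128i *)p2;
        __m128i macc = _mm_setzero_si128();

        for (unsigned int i = 0; i < size/8; i++) {
                macc = _mm_add_epi32(macc, _mm_madd_epi16(
                        _mm_loadu_si128(mp1), _mm_loadu_si128(mp2)));
                mp1++;
                mp2++;
        }

        int res[4];
        _mm_storeu_si128((__m128i *)res, macc);
        return res[0] + res[1] + res[2] + res[3];
}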

SSE optimized floats:

C++
float sse_inner(const float* a, const float* b, unsigned int size)
{
        float z = 0.0f, fres = 0.0f;
        __declspec(align(16)) float ftmp[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
        __m128 mres;

        if ((size / 4) != 0) {
                mres = _mm_load_ss(&z);
                for (unsigned int i = 0; i < size / 4; i++)
                        mres = _mm_add_ps(mres, _mm_mul_ps(_mm_loadu_ps(&a[4*i]),
                        _mm_loadu_ps(&b[4*i])));

                //mres = a,b,c,d
                __m128 mv1 = _mm_movelh_ps(mres, mres);     //a,b,a,b
                __m128 mv2 = _mm_movehl_ps(mres, mres);     //c,d,c,d
                mres = _mm_add_ps(mv1, mv2);                //res[0],res[1]

                _mm_store_ps(ftmp, mres);                

                fres = ftmp[0] + ftmp[1];
        }

        if ((size % 4) != 0) {
                for (unsigned int i = size - size % 4; i < size; i++)
                        fres += a[i] * b[i];
        }

        return fres;
}

SSE3 optimized floats:

C++
float sse3_inner(const float* a, const float* b, unsigned int size)
{
        float z = 0.0f, fres = 0.0f;
        
        if ((size / 4) != 0) {
                const float* pa = a;
                const float* pb = b;
                __asm {
                        movss   xmm0, dword ptr [z]
                }
                for (unsigned int i = 0; i < size / 4; i++) {
                        __asm {
                                mov     eax, dword ptr[pa]
                                mov     ebx, dword ptr[pb]
                                movups  xmm1, [eax]
                                movups  xmm2, [ebx]
                                mulps   xmm1, xmm2
                                addps   xmm0, xmm1
                        }
                        pa += 4;
                        pb += 4;
                }  
                __asm {
                        haddps  xmm0, xmm0
                        haddps  xmm0, xmm0
                        movss   dword ptr[fres], xmm0                        
                }                
        }

        return fres;
}
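
The same horizontal addition is also available through the _mm_hadd_ps intrinsic, which is useful where inline assembly is unavailable (e.g. MSVC targeting x64); an equivalent sketch (function name illustrative):

C++
#include <pmmintrin.h>  // SSE3 intrinsics

float sse3_inner_intrin(const float* a, const float* b, unsigned int size)
{
        float fres = 0.0f;
        __m128 mres = _mm_setzero_ps();

        for (unsigned int i = 0; i < size / 4; i++)
                mres = _mm_add_ps(mres, _mm_mul_ps(_mm_loadu_ps(&a[4*i]),
                    _mm_loadu_ps(&b[4*i])));

        mres = _mm_hadd_ps(mres, mres);  // a+b, c+d, a+b, c+d
        mres = _mm_hadd_ps(mres, mres);  // full sum in every lane
        _mm_store_ss(&fres, mres);

        // Handle the remaining (size % 4) elements.
        for (unsigned int i = size - size % 4; i < size; i++)
                fres += a[i] * b[i];

        return fres;
}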

SSE2 optimized doubles:

C++
double sse_inner_d(const double* a, const double* b, unsigned int size)
{
        double z = 0.0, fres = 0.0;
        __declspec(align(16)) double ftmp[2] = { 0.0, 0.0 };
        __m128d mres;
        
        if ((size / 2) != 0) {
                mres = _mm_load_sd(&z);
                for (unsigned int i = 0; i < size / 2; i++)
                        mres = _mm_add_pd(mres, _mm_mul_pd(_mm_loadu_pd(&a[2*i]),
                        _mm_loadu_pd(&b[2*i])));                

                _mm_store_pd(ftmp, mres);                

                fres = ftmp[0] + ftmp[1];
        }

        if ((size % 2) != 0) {
                for (unsigned int i = size - size % 2; i < size; i++)
                        fres += a[i] * b[i];
        }

        return fres;
}
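
For completeness, a minimal driver in the spirit of the console application might look as follows (a sketch assuming clock()-based timing; the actual harness ships with the article's download):

C++
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>

int main(int argc, char* argv[])
{
        if (argc < 2) return 1;
        unsigned int size = (unsigned int)atoi(argv[1]);

        // Two random vectors with entries in [-32, 31], as in the article.
        std::vector<float> v1(size), v2(size);
        for (unsigned int i = 0; i < size; i++) {
                v1[i] = float(rand() % 64 - 32);
                v2[i] = float(rand() % 64 - 32);
        }

        clock_t start = clock();
        float res = sse_inner(&v1[0], &v2[0], size);
        clock_t stop = clock();

        printf(" floats sse     processing time: %d ms\n %d\n",
                (int)((stop - start) * 1000 / CLOCKS_PER_SEC), (int)res);
        return 0;
}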

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3).


Written By
Engineer
Russian Federation
Highly skilled engineer with 14 years of experience in academia, R&D, and commercial product development, supporting the full software life-cycle from idea to implementation and further support. During my academic career I succeeded in the MIT Computers in Cardiology 2006 international challenge; as an R&D and SW engineer I earned CodeProject MVP status and found algorithmic solutions to quickly resolve tough customer problems and meet product requirements under tight deadlines. My key areas of expertise are object-oriented analysis and design (OOAD), OOP, machine learning, natural language processing, face recognition, computer vision and image processing, wavelet analysis, and digital signal processing in cardiology.

Comments and Discussions

 
Question: Results for Core2Duo E7500 (ijkmailru, 22-Mar-12 6:11)
I'm using a Core2Duo CPU and obtained the following results:
chars           processing time: 6 ms
-27
shorts          processing time: 8 ms
-25874
shorts sse2     processing time: 4 ms
-25874
ints            processing time: 9 ms
1087688
floats          processing time: 17 ms
1166843
sse2 intrin     processing time: 8 ms
1166843
sse3 assembly   processing time: 8 ms
1166843
doubles         processing time: 18 ms
1533395
doubles sse2    processing time: 17 ms
1533395

Then I studied the assembly code produced by the Visual C++ 2008 compiler and found something strange for floats (IDA output for the attached precompiled binary):
_text:00401D0C ;
_text:00401D0C                 fld     [esp+8Ch+var_70]
_text:00401D10                 fld     dword ptr [eax+ebp-10h]
_text:00401D14                 fmul    dword ptr [eax-10h]
_text:00401D17                 faddp   st(1), st
_text:00401D19                 fstp    [esp+8Ch+var_70]
_text:00401D1D ;
_text:00401D1D                 fld     [esp+8Ch+var_70]
_text:00401D21                 fld     dword ptr [eax-0Ch]
_text:00401D24                 fmul    dword ptr [ecx-14h]
_text:00401D27                 faddp   st(1), st
_text:00401D29                 fstp    [esp+8Ch+var_70]
_text:00401D2D ;

It looks very strange that the compiler stores the intermediate result into memory with fstp and immediately loads it back onto the FPU stack with fld. After studying this, I found that it is done to truncate the intermediate result from extended precision to float (FPU stack registers store data in extended format). So, after changing the result accumulator type to double, this time-killing code disappeared and performance significantly increased. Here are the new results:
chars           processing time: 6 ms
90
shorts          processing time: 8 ms
-24012
shorts sse2     processing time: 4 ms
-24012
ints            processing time: 9 ms
1238466
floats          processing time: 9 ms
1228690
sse2 intrin     processing time: 9 ms
1228690
sse3 assembly   processing time: 8 ms
1228690
doubles         processing time: 18 ms
1025315
doubles sse2    processing time: 18 ms
1025315

Now SSE optimization does nothing for floats...
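
In code, the fix described above amounts to changing the accumulator type in the article's FPU float loop (a sketch):

C++
// Accumulate in double: the compiler can keep the running sum on the
// FPU stack instead of spilling it to memory to truncate it to float
// on every iteration.
double sum = 0;
for (unsigned int i = 0; i < size; i++)
    sum += pv1[i] * pv2[i];
wprintf(L" %d\n", (int)sum);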
General: Better use of SSE (riffmaster, 5-Feb-10 1:16)
General: Re: Better use of SSE (BorisVidolov, 27-Nov-11 11:46)
General: vector usage (Hatem Mostafa, 27-May-08 2:57)
General: Experiment GPU vs. ... (soho, 1-Apr-08 2:27)
General: Re: Experiment GPU vs. ... (Rei Miyasaka, 5-Apr-08 11:11)
General: Re: Experiment GPU vs. ... (asdfasd23423f, 25-Sep-08 6:27)
General: results on my computer: Intel Pentium 4 3000 MHz (epitalon, 16-Jan-08 8:44)
Answer: Re: results on my computer: Intel Pentium 4 3000 MHz (Chesnokov Yuriy, 16-Jan-08 18:58)
General: Re: results on my computer: Intel Pentium 4 3000 MHz (epitalon, 16-Jan-08 22:57)
Answer: Re: results on my computer: Intel Pentium 4 3000 MHz (Chesnokov Yuriy, 17-Jan-08 0:20)
General: Re: results on my computer: Intel Pentium 4 3000 MHz (Stuart Dootson, 1-Apr-08 10:11)
General: Change MS VS compiler setting, I can get performance improved (isomer, 14-Nov-07 8:07)
Answer: Re: Change MS VS compiler setting, I can get performance improved (Chesnokov Yuriy, 14-Nov-07 19:59)
General: Convolution Operation (Jeffrey Walton, 25-Oct-07 1:07)
