
High performance computing from C++ to MMX

30 Jul 2003 · 9 min read
Boosting your application's performance by using hardware acceleration.

Image 1

Introduction

This article first looks at the kind of algorithm a programmer commonly writes for compute-intensive image processing applications. It then explains how the existing code can be improved, moving from native C++ code to MMX acceleration. MMX is a feature of Intel processors that can speed up multimedia software by executing integer instructions on several data elements in parallel. Although the processor supports the acceleration, your application does not make use of it automatically; the code must be written specifically to take advantage of it.

Once you are able to make use of MMX acceleration, you can go one step further by utilizing SSE (Streaming SIMD Extensions), which adds parallel floating-point calculations, and SSE2, which extends both the floating-point and integer capabilities. Be aware that not all hardware supports the acceleration described here. To keep the scope small, I have separated that explanation into another article (to be posted soon).

Now let's take a look at the list of Intel processors that support each feature.

  • Pentium MMX, Pentium II - MMX
  • Pentium III, Xeon - MMX, SSE
  • Pentium 4 and above - MMX, SSE, SSE2, HT (Hyper-Threading)

The article mainly applies to developers using Intel processors; for AMD users, some of the MMX assembly instructions are compatible as well. Some basic assembly language knowledge is needed, and I will also assume that you have some knowledge of the MFC doc/view architecture.

The reason I wrote this article is that I think there is a need to highlight some high performance computing knowledge, especially for developers building time-critical applications. Image processing is a good example for demo purposes.

To see the results of the downloadable demo, you will need at least a Pentium processor with MMX.

Background

In image processing, 3D graphics, and similar fields, performance is critical. The less time an algorithm takes, the better, because more functionality can be squeezed in.

In my examples, I generate two images of size 1024 x 1024, each one megabyte in size. For the sake of simplicity, both are 8-bit grayscale images. I create synthetic images so that dependencies on other classes/objects/DLLs are kept to a minimum. Image 1 contains white vertical lines on a black background; image 2 contains a gray gradient fill.

The two images will be added together to create a third image. The final product is an image with both the gradient fill and the vertical lines.
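
For reference, the two synthetic test images can be produced with a couple of simple loops. The sketch below is mine, not the demo's actual generation routine; the line spacing and gradient direction are assumptions, and only the image format (1024 x 1024, 8-bit grayscale) comes from the description above.

void GenerateTestImages(BYTE *pImg1, BYTE *pImg2, int iWidth, int iHeight)
{
    for (int y = 0; y < iHeight; y++)
    {
        for (int x = 0; x < iWidth; x++)
        {
            int i = x + y * iWidth;
            // image 1: white vertical lines on a black background (assumed spacing)
            pImg1[i] = (x % 16 == 0) ? 255 : 0;
            // image 2: horizontal gray gradient (assumed direction)
            pImg2[i] = (BYTE)(x * 255 / iWidth);
        }
    }
}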

I could easily present only the best-performing algorithm in the demo, but instead I want to show you the difference between various kinds of code. I have developed four algorithms that demonstrate image addition: two in C++ and two in assembly language. For each algorithm, I use a high resolution timer to measure the elapsed time. The final image is shown on screen in the main window, and the two source images are displayed in two mini windows on the right. The results on different kinds of PC are presented in a later section of the article.

Writing the algorithms

For some basic understanding, let's go through this C++ code.

void CMMXDemoDoc::OnRunbasic() 
{
    int x=0, y=0, i=0, iGray=0;

    // Start timing
    m_el.Begin();

    // Add 2 images using direct memory access
    // Assume both images are the same size
    for (y=0; y<m_iHeight; y++)
    {
        for (x=0; x<m_iWidth; x++)
        {
            // calc index
            i = x + y*m_iWidth;
            // add 2 pixels
            iGray = m_pImg1[i] + m_pImg2[i];
            // keep saturation value
            if (iGray > 255) iGray = 255;
            // save to destination image
            m_pImg[i] = iGray;

        }
    }

    // End Timing
    m_el.End();

    // Force redraw
    UpdateAllViews(0);
}

m_pImg, m_pImg1 and m_pImg2 are all one-dimensional arrays of unsigned char. All images are the same size.

There are two for loops: the outer loop iterates through each line in the image, and the inner loop iterates through each pixel in the line. For every pixel we calculate the index into the array, add the pixels from image 1 and image 2, clamp the result if it saturates (> 255), and store it in the destination image.

What are the problems with this algorithm?

  1. There are a lot of arithmetic operations: the index calculation needs an addition and a multiplication for every pixel, and multiplication is slower than addition. Try changing the addition operator to a multiplication in the demo and see the difference. A sketch that hoists this work out of the inner loop is shown after this list.
  2. Array indexing is used to access the data in memory. This causes some slowdown because the processor has to convert the array index into an actual memory address.
  3. The two levels of for loops cause the processor to execute stack instructions, because the counter register (ecx) is pushed and popped often. This costs extra processor cycles.
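
One simple mitigation for points 1 and 2, before going all the way to pointer arithmetic, is to hoist the index calculation out of the inner loop so the multiplication happens once per row instead of once per pixel. This is a sketch of that idea, not code from the demo; it reuses the same variables as the loop above.

for (y = 0; y < m_iHeight; y++)
{
    // one multiplication per row instead of one per pixel
    int iRow = y * m_iWidth;
    for (x = 0; x < m_iWidth; x++)
    {
        // add 2 pixels
        iGray = m_pImg1[iRow + x] + m_pImg2[iRow + x];
        // keep saturation value
        if (iGray > 255) iGray = 255;
        // save to destination image
        m_pImg[iRow + x] = iGray;
    }
}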

Optimizing using pointer arithmetic

void CMMXDemoDoc::OnRunopt() 
{
    // Begin timing
    m_el.Begin();

    // Precalculate the pointers
    BYTE *pSrc = m_pImg2;
    BYTE *pSrcEnd = m_pImg2 + m_iSize;
    BYTE *pDest = m_pImg;
    BYTE *pDestEnd = m_pImg + m_iSize;
    int iGray;

    // Copy from img1 to tmp
    memcpy(m_pImg, m_pImg1, m_iSize);
    
    // loop each pixel and Add
    while (pSrc < pSrcEnd)
    {
        iGray = *pDest + *pSrc;
        if (iGray > 255) iGray=255;
        *pDest = iGray;
        pSrc++;
        pDest++;
    }

    // End Timing
    m_el.End();
    // Force redraw
    UpdateAllViews(0);
}

Notice that I copy image 1 into the destination image first. This saves a pointer addition inside the loop, since memcpy() is an optimized function in the VC runtime (though memcpy() itself still takes time).

The code above uses pointers to access the data. There is little more to optimize here in C++, since each pointer already holds the memory address of the data. The pointers move sequentially from one element to the next: the source and destination pointers are each incremented by one after a pixel has been calculated.

Converting to assembly

The code below shows how to write the algorithm in assembly. The elapsed time of this function is much more consistent from run to run. Note that on overflow the result is clamped to 255, just as in the C++ versions.

ASM
int AsmAdd(BYTE *d, BYTE *s, int w, int h)
{
    int iCount = w * h;

    // we assume the destination already contains the first image,
    // and that the registers used below are free for our use
    __asm
    {
        // Assign pointers to registers
        mov     esi, [s]        ; put the src address in esi
        mov     edi, [d]        ; put the dest address in edi
        mov     ecx, [iCount]   ; put the pixel count in ecx

    codeloop:
        mov     al, [edi]       ; load a destination byte into al (low byte of eax)
        add     al, [esi]       ; add the corresponding source byte
        jnc     store           ; no carry: the sum fits in 8 bits
        mov     al, 255         ; carry: saturate to 255
    store:
        mov     [edi], al       ; write the result back to the destination
        inc     esi
        inc     edi
        dec     ecx             ; decrement the pixel count
        jnz     codeloop        ; loop until all pixels are processed
    }

    return 1;
}

Utilising MMX

We have 8 general-purpose registers in an 80x86/Pentium processor: eax, ebx, ecx, edx, esi, edi, ebp and esp. These registers hold operands for logical and arithmetic operations, operands for address calculations, and memory pointers. Register esp holds the stack pointer most of the time, which leaves us 7 registers. When we have lots of data, we have to make sure it is handled well with just these 7 registers. Besides these, there are another eight 80-bit registers located in the FPU (floating point unit). If the processor has the MMX feature, we can use these 8 FPU registers as MMX registers, accessing 64 bits of each 80-bit register. Why only 64? Because the MMX registers are aliased onto the 64-bit mantissa field of the FPU registers. That is not an issue, though, as we will be happy with 8 additional registers. These registers are named mm0-mm7.

Image 2

MMX Technology Execution Environment

So what can we do with these 8 additional MMX registers? Surely they cannot run any faster, since the CPU's clock speed fixes the number of operations it can execute in a given amount of time? Actually, they can, by executing MMX instructions that operate on a preloaded stream of bytes in parallel.

Image 3

SIMD Execution model

Image 4

Data Types introduced with MMX Technologies

The parallel execution saves clock cycles per element processed. You can load eight 8-bit values, four 16-bit values, or two 32-bit values into a 64-bit MMX register. To load or store, you use the MOVD (doubleword) or MOVQ (quadword) instruction, then perform calculations with the MMX instructions. The MMX instruction set consists of 47 instructions, grouped into several categories:

Data transfer, arithmetic, comparison, conversion, unpacking, logical, shift, and the empty-MMX-state instruction (EMMS).

Study the code below: an unsigned saturated add of eight 8-bit values is performed on two 64-bit MMX registers by executing these lines.

ASM
  // mask = eight bytes of 0x01, one to be added to each character
  unsigned char mask[8] = { 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01 };
  // txt = "ABCDEFGH"
  unsigned char txt[8]  = { 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48 };
  __asm {
      movq    mm0, txt   ; mm0 = ( 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 )
      movq    mm1, mask  ; mm1 = ( 01 | 01 | 01 | 01 | 01 | 01 | 01 | 01 )
      paddusb mm0, mm1   ; mm0 = ( 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 )
      movq    txt, mm0   ; move the result back from the register to memory
      emms               ; empty the MMX state so the FPU can be used again
  }

The array txt[], which contained "ABCDEFGH", now contains "BCDEFGHI".

By using MMX, only 4 instructions are executed and 8 bytes of data are processed at a time. A typical assembly algorithm without MMX would have to loop 8 times, executing a move and an add in every iteration, for a total of at least 16 instructions. What is the speed improvement? I bet you will get a time gain of at least 200%.
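
Scaling the example above up to a whole image, a full MMX add routine can be written in inline assembly along the lines of the sketch below. This is my sketch rather than a verbatim copy of the demo function; it assumes the image size is a multiple of 8 bytes and that the destination buffer already holds the first image, as in the earlier versions.

ASM
int MmxAdd(BYTE *d, BYTE *s, int w, int h)
{
    int iCount = (w * h) / 8;   // 8 pixels are processed per iteration

    __asm
    {
        mov     esi, [s]        ; source pointer
        mov     edi, [d]        ; destination pointer
        mov     ecx, [iCount]   ; number of 8-byte blocks

    addloop:
        movq    mm0, [edi]      ; load 8 destination pixels
        movq    mm1, [esi]      ; load 8 source pixels
        paddusb mm0, mm1        ; 8 unsigned saturated byte adds in one instruction
        movq    [edi], mm0      ; store 8 result pixels
        add     esi, 8
        add     edi, 8
        dec     ecx             ; one block done
        jnz     addloop         ; loop until the whole image is processed

        emms                    ; empty the MMX state so the FPU can be used again
    }

    return 1;
}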

In Visual C++ 6.0, the easiest way to make use of MMX is to write assembly code using the inline assembler, as the code above illustrates. The full list of MMX instructions is available on the MSDN web site. Intrinsics (wrapper functions) are available in VC++ .NET; a sketch using them follows.
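
For comparison, with the MMX intrinsics in mmintrin.h (which, as noted above, are available in VC++ .NET) the same saturated add can be written without inline assembly. This is a sketch under the same assumptions as before (size is a multiple of 8 bytes, destination already holds the first image); the function name is mine.

#include <mmintrin.h>   // MMX intrinsics: __m64, _mm_adds_pu8, _mm_empty

void MmxAddIntrinsics(unsigned char *pDest, const unsigned char *pSrc, int iSize)
{
    __m64 *d = (__m64 *)pDest;
    const __m64 *s = (const __m64 *)pSrc;

    for (int i = 0; i < iSize / 8; i++)
        d[i] = _mm_adds_pu8(d[i], s[i]);    // 8 unsigned saturated byte adds

    _mm_empty();    // equivalent of EMMS: release the MMX/FPU state
}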

For this demo you need at least a Pentium processor with MMX to produce the desired results.

Executing the demo

Building and executing the demo is pretty straightforward. The source code is presented together with some inline assembler instructions (well commented). There are also some helper classes and functions:

  • CElapsed - provides high resolution timing using QueryPerformanceCounter().
  • CMiniView - serves additional views for a CDocument class.
  • Some raster drawing code to assist the image display.
  • CreateNewViews() shows how to create additional views for a typical document.
  • CheckSSEAvail() checks whether SSE is present in your processor; you can also check for other features such as MMX and HT (see the sketch after this list).
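
The sketch below shows how such a feature check is typically written with the CPUID instruction; it is an assumed example, not necessarily the demo's exact CheckSSEAvail() implementation. CPUID function 1 returns the standard feature flags in EDX, where bit 23 indicates MMX, bit 25 SSE, and bit 26 SSE2.

ASM
bool HasFeature(int iBit)      // e.g. 23 = MMX, 25 = SSE, 26 = SSE2
{
    unsigned int uFeatures = 0;

    __asm
    {
        mov     eax, 1          ; CPUID function 1: processor info and feature bits
        cpuid
        mov     uFeatures, edx  ; standard feature flags are returned in EDX
    }

    return (uFeatures & (1u << iBit)) != 0;
}

// Usage: bool bMMX = HasFeature(23);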

Test Results

  System                                              Array index   Pointer arithmetic   MMX instructions
  Clone Pentium 2, 400 MHz                            46 ms         35 ms                16 ms
  Clone Pentium 3 (EB), 600 MHz                       30.6 ms       24 ms                11 ms
  Acer TravelMate 332T, Mobile Pentium 2, 300 MHz     71 ms         66 ms                58 ms
  Acer TravelMate 270, Mobile Pentium 4, 1.7 GHz      34 ms         20 ms                5.9 ms
  Toshiba Satellite 2410, Mobile Pentium 4, 1.8 GHz   22 ms         14 ms                4.5 ms
  Clone, Pentium 4, 1.7 GHz                           30 ms         18 ms                5.5 ms
  Dell Dimension 2400, Pentium 4, 2.2 GHz             18.5 ms       12.5 ms              6 ms

Test results for various PC platforms, adding two 1024 x 1024 images.

Please note that the speeds are affected by several factors, including the processor's clock speed, the size of the level 1 and level 2 caches, the system bus speed, and the DRAM speed. A fast system bus and fast DRAM result in faster data transfer into the processor's registers: data is fetched from DRAM into the cache memory and from there into the processor's register set.

Points of interest

I recently read through Alex Farber's Introduction to SSE Programming article and found that he had just posted a similar topic. So I modified my content and separated it into two parts: the first emphasizes C++ and MMX, and the second describes an SSE/SSE2 implementation using an external assembler (the Netwide Assembler, NASM). The version Alex posted is a .NET implementation of SSE using intrinsics, which leaves room for me to post a VC++ 6.0 version in my next article.

MMX development is meant for intermediate programmers, and there is a learning curve: you have to catch up on assembly language, Intel processor architecture, the MMX instruction set, and assembly optimization and its terminology. I hope this article really helps!

Based on the test results, my laptop, the Toshiba Satellite 2410, seems to be the best performer when utilizing MMX instructions. It is bizarre that the Dell Dimension desktop PC with a 2.2 GHz processor loses out by so much, even to a 1.7 GHz clone system (Gigabyte motherboard). I double-checked this by testing on several Dell systems.

And finally, if you like the article, please give me a vote!

Reference

You can refer to the Intel processor manuals and the MSDN documentation mentioned above for further details.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



Written By
Web Developer, Cortex Imaging
Malaysia
Interests: computer vision, biometrics, image processing and software optimization. Known languages: C/C++, VC++ MFC, Win32, COM/ATL, Assembly.

Comments and Discussions

 
Index addressing is faster - Erik Oscarsson, 5-Aug-03
Here is a version with index addressing and two loops, and it is much faster than even the assembly version. Since modern compilers are so good at optimizing, going to assembly code rarely improves the speed in any significant way.

1. Index addressing is FASTER than pointer addressing because only the index needs to be incremented. 'Conversion' from index to actual address does not take any extra time.
2. The time for the outer loop is completely insignificant, since the inner loop loops 1024 times for each outer loop.
3. Multiplication is much slower than addition. Moving the multiplication to the outer loop makes the multiplication time insignificant.

for (y=0; y<m_iHeight; y++)
{
    BYTE *pL1 = &m_pImg1[ y*m_iWidth ];
    BYTE *pL2 = &m_pImg2[ y*m_iWidth ];
    BYTE *pL  = &m_pImg[  y*m_iWidth ];
    for (x=0; x<m_iWidth; x++)
    {
        // add 2 pixels
        iGray = pL1[x] + pL2[x];
        // keep saturation value
        if (iGray > 255) iGray = 255;
        // save to destination image
        pL[x] = iGray;
    }
}

I got the following execution times (in ms) on my P4 2.4 GHz:
Basic (the above code): 6.9, Optimized: 11.0, Assembly: 10.7, MMX: 3.9
For Optimized, Assembly and MMX I used the original code.

ErikO