
Introduction to MMX Programming

By Alex Fr, 8 Jul 2003
 

Introduction

The Intel MMX™ technology can improve performance in many applications, such as image processing and 2D and 3D graphics. Intel MMX™ technology is typically applied to repetitive operations on large arrays of small data elements: bytes, words or double-words.

Visual Studio .NET 2003 supports a set of MMX Intrinsics which allow the use of MMX instructions directly from C++ code, without writing Assembly instructions. Reading the MSDN MMX topics [2] together with the Intel Software manuals [1] is a good way to learn the basics of MMX programming.

MMX technology implements the SIMD (single-instruction, multiple-data) execution model. Consider the following programming task: adding some value to each element of a BYTE array. The algorithm for this task may be written this way:

for each  b in array
    b = b + n
In more detail:
for each  b in array
{
    load b to the register
    add n to the register
    write the result from the register back to memory
}
Processors with Intel MMX support have eight 64-bit registers, each of which can hold 8 bytes, 4 words or 2 double-words. MMX is a set of instructions that load numeric data (bytes, words, double-words) into the MMX registers, perform arithmetic and logical operations on it, and write the results back to memory. Using the MMX technology, the algorithm may be written this way:
for each  8 members in array
{
    load 8 members to the MMX register
    add n to each byte in one operation
    write the result from the register back into memory
}
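The "8 members per step" idea can be tried out without any MMX support. The following is only a portable sketch that emulates eight byte lanes inside an ordinary 64-bit integer; it is not MMX code, and the names AddBytesWrap and AddToEachByte are made up for this illustration. It uses wraparound addition.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Adds the bytes of y to the bytes of x lane by lane, with wraparound.
// Each 64-bit value is treated as 8 independent byte lanes.
inline std::uint64_t AddBytesWrap(std::uint64_t x, std::uint64_t y)
{
    const std::uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;  // low 7 bits of every lane
    const std::uint64_t top  = 0x8080808080808080ULL;  // top bit of every lane

    // Adding only the low 7 bits cannot carry into the neighbouring lane;
    // the top bit of each lane is then completed with XOR.
    return ((x & low7) + (y & low7)) ^ ((x ^ y) & top);
}

// Hypothetical helper: adds n to every element of data, 8 elements per step.
void AddToEachByte(std::uint8_t* data, std::size_t count, std::uint8_t n)
{
    std::uint64_t n8;
    std::memset(&n8, n, sizeof n8);            // n copied into all 8 bytes

    std::size_t i = 0;
    for (; i + 8 <= count; i += 8)
    {
        std::uint64_t block;
        std::memcpy(&block, data + i, 8);      // load 8 members
        block = AddBytesWrap(block, n8);       // add n to each byte in one step
        std::memcpy(data + i, &block, 8);      // write the result back
    }

    for (; i < count; ++i)                     // leftover elements, one by one
        data[i] = static_cast<std::uint8_t>(data[i] + n);
}
```

With real MMX the masking trick is unnecessary, because the processor keeps the lanes separate in hardware.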
A C++ programmer writing a program with MMX Intrinsics doesn't work with the MMX registers directly. Instead, the programmer uses the 64-bit __m64 type and a set of functions that perform arithmetic and logical operations. The C++ compiler takes care of register allocation and code optimization.

The Visual C++ MMXSwarm sample [4] shows the use of the MMX technology in image processing. It contains a set of wrapper classes that simplify work with MMX Intrinsics, and shows how to perform image processing operations on various types of images (monochrome, RGB 24 bits, RGB 32 bits etc.). This article is a simple introduction to C++ MMX programming. Everyone who is interested in this technology is strongly encouraged to read the MMXSwarm sample.

MMX Programming Details

Include Files

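MMX intrinsics are declared in the mmintrin.h header. The demo projects include emmintrin.h, the SSE2 header, which pulls in the MMX declarations as well:
#include <emmintrin.h>
Since MMX intrinsics are expanded by the compiler and are not library functions, there are no .lib files to link.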

__m64 Data Type

Variables of this type are used as operands of the MMX intrinsics. Their contents should not be accessed directly. Variables of type __m64 are automatically aligned on 8-byte boundaries.

Detection of MMX Support

MMX instructions may be used only if they are supported by the processor. The Visual C++ CPUID sample [3] shows how to detect support for SSE, MMX and other processor features. Detection is done with the cpuid Assembly instruction. See details in this sample and in the Intel Software manuals [1].
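As a rough sketch of what such detection code has to decode: CPUID function 1 returns a set of feature flags in the EDX register, where bit 23 reports MMX, bit 25 SSE and bit 26 SSE2. The struct and function below are hypothetical names for this illustration, and they only interpret an EDX value that has already been obtained elsewhere (for example by inline Assembly, as in the CPUID sample).

```cpp
#include <cstdint>

// Feature flags reported in EDX by CPUID function 1.
const std::uint32_t kMmxBit  = 1u << 23;
const std::uint32_t kSseBit  = 1u << 25;
const std::uint32_t kSse2Bit = 1u << 26;

// Hypothetical result type for this sketch.
struct CpuFeatures
{
    bool mmx;
    bool sse;
    bool sse2;
};

// Hypothetical helper: decodes an EDX value returned by CPUID function 1.
// It does not execute cpuid itself.
CpuFeatures DecodeFeatureFlags(std::uint32_t edx)
{
    CpuFeatures f;
    f.mmx  = (edx & kMmxBit)  != 0;
    f.sse  = (edx & kSseBit)  != 0;
    f.sse2 = (edx & kSse2Bit) != 0;
    return f;
}
```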

Saturation Arithmetic and Wraparound Mode

The MMX technology supports an arithmetic capability known as saturating arithmetic. In saturation mode, results of an operation that overflow or underflow are clipped (saturated) to a data-range limit for the data type [1]. Saturation mode is widely used in image processing. A simple example shows the difference between saturation and wraparound mode. Consider adding 1 to a BYTE variable which has the value 255. In wraparound mode the result is 0 (the carry bit is ignored). In saturation mode the result is 255. The same applies at the low end of the range: for example, 1 - 2 = 0 for the BYTE type in saturation mode. The MMX addition and subtraction instructions come in both wraparound and saturating forms. The demo project from this article uses only the saturating instructions.
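The two modes can be written out in plain C++ for a single byte. This is a scalar sketch with made-up function names, meant only to mirror the numbers in the paragraph above; the MMX instructions do the same for 8 bytes at once.

```cpp
#include <cstdint>

// Wraparound: the carry out of the byte is simply dropped.
std::uint8_t AddWrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

// Unsigned saturation: results above 255 are clipped to 255.
std::uint8_t AddSaturate(std::uint8_t a, std::uint8_t b)
{
    const unsigned sum = static_cast<unsigned>(a) + b;
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

// Unsigned saturation: results below 0 are clipped to 0.
std::uint8_t SubSaturate(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a > b ? a - b : 0);
}
```

Here AddWrap(255, 1) gives 0, AddSaturate(255, 1) gives 255 and SubSaturate(1, 2) gives 0.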

MMX8 Demo Project

MMX8 is an SDI application that performs simple processing of a monochrome 8 bits per pixel image. The source image or the result of its processing is shown in the window. The new ATL class CImage is used to load the image from resources and to show it in the window. Two operations are performed on the image: inversion and a change of brightness. Each operation may be done in one of the following ways:
  • C++ code;
  • C++ code with MMX Intrinsics;
  • Inline Assembly with MMX instructions.
Calculation time is shown in the status bar.

C++ image inversion function:

void CImg8Operations::InvertImageCPlusPlus(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels)
{
    for ( int i = 0; i < nNumberOfPixels; i++ )
    {
        *pDest++ = 255 - *pSource++;
    }
}
The best way to find the required MMX instruction is to read the Intel Software manuals [1]. The name of the required Assembly MMX instruction can be found in the short MMX technology overview (Volume 1, Chapter 8). The detailed instruction definition is in Volume 2; it also gives the name of the corresponding C++ compiler intrinsic. Some C++ MMX intrinsics are composite (translated to more than one Assembly instruction). These should be looked up directly in the MSDN documentation [2].

The summary of all MMX instructions used in the MMX8 sample is shown in the following table:

Required Function | Assembly Instruction | MMX Intrinsic
Empty MMX state (prevents collisions with floating-point operations) | emms | _mm_empty
Unsigned subtraction with saturation of each byte in two 64-bit operands | psubusb | _mm_subs_pu8
Unsigned addition with saturation of each byte in two 64-bit operands | paddusb | _mm_adds_pu8

Image inversion function in C++ with MMX Intrinsics:

void CImg8Operations::InvertImageC_MMX(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels)
{
    __int64 allOnes = 0;
    allOnes = ~allOnes;                     // 0xffffffffffffffff    

    // 8 pixels are processed in one loop;
    // the last nNumberOfPixels % 8 pixels (if any) are not processed here
    int nLoop = nNumberOfPixels/8;

    __m64* pIn = (__m64*) pSource;          // input pointer
    __m64* pOut = (__m64*) pDest;           // output pointer

    __m64 tmp;                              // work variable

    _mm_empty();                            // emms

    __m64 n1 = Get_m64(allOnes);            // 255 in each of the 8 bytes

    for ( int i = 0; i < nLoop; i++ )
    {
        tmp = _mm_subs_pu8 (n1 , *pIn);     // Unsigned subtraction with 
                                            // saturation.
                                            // tmp = n1 - *pIn  for each byte

        *pOut = tmp;

        pIn++;                              // next 8 pixels
        pOut++;
    }

    _mm_empty();                            // emms
}

__m64 CImg8Operations::Get_m64(__int64 n)
{
    union __m64__m64
    {
        __m64 m;
        __int64 i;
    } mi;

    mi.i = n;
    return mi.m;
}
Since each function executes in a very short time, I call it a number of times to get a measurable difference. Calculation times on my computer:
  • C++ code - 43 ms
  • C++ with MMX Intrinsics - 26 ms
  • Inline Assembly with MMX instructions - 26 ms
Execution time should be measured in the Release configuration, with compiler optimizations enabled.
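The intrinsic version above processes nNumberOfPixels/8 complete groups of 8 pixels, so an image whose size is not a multiple of 8 needs a separate pass for the remaining pixels. The following portable sketch shows one way to structure that. It is not MMX code: it relies on the fact that 255 - x equals ~x for a byte, so a 64-bit NOT inverts 8 pixels at once. The function name is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: inverts count pixels, 8 per step, then the leftovers.
void InvertImageBlocks(const std::uint8_t* src, std::uint8_t* dst, std::size_t count)
{
    std::size_t i = 0;

    // Main loop: complete groups of 8 pixels.
    for (; i + 8 <= count; i += 8)
    {
        std::uint64_t block;
        std::memcpy(&block, src + i, 8);   // load 8 pixels
        block = ~block;                    // 255 - x for each byte
        std::memcpy(dst + i, &block, 8);   // store 8 pixels
    }

    // Tail loop: the last count % 8 pixels, one at a time.
    for (; i < count; ++i)
        dst[i] = static_cast<std::uint8_t>(255 - src[i]);
}
```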

Changing the brightness is done in the simplest way: by adding some value to, or subtracting it from, each pixel in the image. These functions are slightly more complicated because two different branches are needed for positive and negative changes.

C++ function for changing an image brightness:

void CImg8Operations::ChangeBrightnessCPlusPlus(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels, 
    int nChange)
{
    if ( nChange > 255 )
        nChange = 255;
    else if ( nChange < -255 )
        nChange = -255;

    BYTE b = (BYTE) abs(nChange);

    int i, n;

    if ( nChange > 0 )
    {
        for ( i = 0; i < nNumberOfPixels; i++ )
        {
            n = (int)(*pSource++ + b);

            if ( n > 255 )
                n = 255;

            *pDest++ = (BYTE) n;
        }
    }
    else
    {
        for ( i = 0; i < nNumberOfPixels; i++ )
        {
            n = (int)(*pSource++ - b);

            if ( n < 0 )
                n = 0;
            *pDest++ = (BYTE) n;
        }
    }
}
Changing an image brightness using C++ with MMX Intrinsics:
void CImg8Operations::ChangeBrightnessC_MMX(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels, 
    int nChange)
{
    if ( nChange > 255 )
        nChange = 255;
    else if ( nChange < -255 )
        nChange = -255;

    BYTE b = (BYTE) abs(nChange);

    // make 64 bits value with b in each byte
    __int64 c = b;

    int i;                                  // also used by the loops below

    for ( i = 1; i <= 7; i++ )
    {
        c = c << 8;
        c |= b;
    }

    // 8 pixels are processed in one loop
    int nNumberOfLoops = nNumberOfPixels / 8;

    __m64* pIn = (__m64*) pSource;          // input pointer
    __m64* pOut = (__m64*) pDest;           // output pointer

    __m64 tmp;                              // work variable


    _mm_empty();                            // emms

    __m64 nChange64 = Get_m64(c);

    if ( nChange > 0 )
    {
        for ( i = 0; i < nNumberOfLoops; i++ )
        {
            tmp = _mm_adds_pu8(*pIn, nChange64); // Unsigned addition 
                                                 // with saturation.
                                                 // tmp = *pIn + nChange64
                                                 // for each byte

            *pOut = tmp;

            pIn++;                               // next 8 pixels
            pOut++;
        }
    }
    else
    {
        for ( i = 0; i < nNumberOfLoops; i++ )
        {
            tmp = _mm_subs_pu8(*pIn, nChange64); // Unsigned subtraction 
                                                 // with saturation.
                                                 // tmp = *pIn - nChange64
                                                 // for each byte

            *pOut = tmp;

            pIn++;                                      // next 8 pixels
            pOut++;
        }
    }

    _mm_empty();                            // emms
}
 
Notice that the sign of the nChange parameter is checked once, outside the loop, and not thousands of times inside it. Calculation times on my computer:
  • C++ code - 49 ms
  • C++ with MMX Intrinsics - 26 ms
  • Inline Assembly with MMX instructions - 26 ms
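The step that builds a 64-bit value with b in each byte can be looked at in isolation. The sketch below shows the shift-and-OR loop used above next to an equivalent multiplication; both function names are made up, and an unsigned 64-bit type is used here instead of __int64 so that shifting into the top bit is well defined.

```cpp
#include <cstdint>

// Loop form: shift left by one byte and OR in b, seven times.
std::uint64_t ReplicateByteLoop(std::uint8_t b)
{
    std::uint64_t c = b;
    for (int k = 1; k <= 7; ++k)
    {
        c <<= 8;
        c |= b;
    }
    return c;
}

// Multiplication form: 0x0101010101010101 has a 1 in every byte,
// so the product holds a copy of b in every byte.
std::uint64_t ReplicateByteMul(std::uint8_t b)
{
    return 0x0101010101010101ULL * b;
}
```

Both forms run once per call, outside the pixel loop, so the choice between them is a matter of readability rather than speed.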

MMX32 Demo Project

The MMX32 project performs operations on a 32 bits per pixel RGB image. The operations are inversion and a change of the image color balance (multiplication of each color channel by some value).

MMX multiplication is more complicated than addition or subtraction, because the result of a multiplication does not have the same size as the operands. For example, if the operands have the BYTE type, the result needs the WORD type. This requires additional conversions, and the difference between C++ and MMX execution times is minimal (5-10%).
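The size problem can be seen with one channel in scalar code. The sketch below, with made-up function names, scales a byte by a coefficient stored as fixed point with 8 fractional bits (coefficient * 256). The first function keeps the full product and clips it; the second keeps only the low 16 bits of the product, which is what a word-sized multiply that returns the low word does. It assumes a small non-negative coefficient.

```cpp
#include <cstdint>

// Full-width product, clipped to 0..255.
std::uint8_t ScaleChannel(std::uint8_t value, float coefficient)
{
    int fixed = static_cast<int>(coefficient * 256.0f);   // 8 fractional bits
    if (fixed < 0)
        fixed = 0;                                         // assumption: no negative gain

    const unsigned scaled =
        (static_cast<unsigned>(value) * static_cast<unsigned>(fixed)) >> 8;
    return static_cast<std::uint8_t>(scaled > 255u ? 255u : scaled);
}

// Same calculation, but only the low word of the product survives.
std::uint8_t ScaleChannelLow16(std::uint8_t value, float coefficient)
{
    int fixed = static_cast<int>(coefficient * 256.0f);
    if (fixed < 0)
        fixed = 0;

    const std::uint16_t low = static_cast<std::uint16_t>(
        static_cast<unsigned>(value) * static_cast<unsigned>(fixed));
    return static_cast<std::uint8_t>(low >> 8);            // always 0..255
}
```

For a coefficient of 0.5 both functions turn 200 into 100. For a coefficient of 2.0 the full-width version clips 200 to 255, while the low-word version returns 144, because the product 102400 no longer fits in 16 bits. Coefficients up to 1.0 keep the product of a byte and the fixed-point value within a word.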

Changing an image color balance using C++ with MMX Intrinsics:

void CImg32Operations::ColorsC_MMX(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels, 
    float fRedCoefficient, 
    float fGreenCoefficient, 
    float fBlueCoefficient)
{
    int nRed = (int)(fRedCoefficient * 256.0f);
    int nGreen = (int)(fGreenCoefficient * 256.0f);
    int nBlue = (int)(fBlueCoefficient * 256.0f);

    // make multiplication coefficient
    __int64 c = 0;
    c = nRed;
    c = c << 16;
    c |= nGreen;
    c = c << 16;
    c |= nBlue;

    __m64 nNull = _m_from_int(0);           // null
    __m64 tmp = _m_from_int(0);             // work variable

    _mm_empty();                            // emms

    __m64 nCoeff = Get_m64(c);

    DWORD* pIn = (DWORD*) pSource;          // input pointer
    DWORD* pOut = (DWORD*) pDest;           // output pointer

    for ( int i = 0; i < nNumberOfPixels; i++ )
    {
        tmp = _m_from_int(*pIn);                // tmp = *pIn (write to low
                                                // 32 bits)

        tmp = _mm_unpacklo_pi8(tmp, nNull );    // convert low 4 bytes of
                                                // tmp to 4 words
                                                // high byte for each word
                                                // is taken from nNull

        tmp =  _mm_mullo_pi16 (tmp , nCoeff);   // multiply each word in
                                                // tmp by the word in nCoeff
                                                // and keep the low word of
                                                // each result

        tmp = _mm_srli_pi16 (tmp , 8);          // shift each word in tmp
                                                // right by 8 bits (/256)

        tmp = _mm_packs_pu16 (tmp, nNull);      // Pack with unsigned
                                                // saturation.
                                                // Convert 4 words from tmp
                                                // to 4 bytes and write them
                                                // to low 32 bits of tmp.
                                                // Convert 4 words from nNull
                                                // to 4 bytes and write them
                                                // to high 32 bits of tmp.

        *pOut = _m_to_int(tmp);                 // *pOut = tmp (low 32 bits)
        
        pIn++;
        pOut++;

    }

    _mm_empty();                          // emms
}
See additional details in the demo project source code.
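To follow the lane layout in the function above, it may help to model one loop iteration in scalar code. The sketch below is such a model, not the demo project's code: ScalePixel is a hypothetical name, and it assumes the pixel DWORD holds blue in its lowest byte, then green, then red, with the fourth byte on top, which matches the order in which the coefficient value c is assembled.

```cpp
#include <cstdint>

// Scalar model of one iteration: unpack 4 bytes to 4 words, multiply each
// word by its coefficient (low word kept), shift right by 8, pack to bytes.
std::uint32_t ScalePixel(std::uint32_t pixel, int nRed, int nGreen, int nBlue)
{
    // Same lane order as the 64-bit coefficient:
    // word 0 = blue, word 1 = green, word 2 = red, word 3 = 0.
    const std::uint16_t coeff[4] = {
        static_cast<std::uint16_t>(nBlue),
        static_cast<std::uint16_t>(nGreen),
        static_cast<std::uint16_t>(nRed),
        0
    };

    std::uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane)
    {
        // unpack: one byte becomes one word with a zero high byte
        std::uint16_t word =
            static_cast<std::uint16_t>((pixel >> (8 * lane)) & 0xffu);

        // multiply and keep the low 16 bits of the product
        word = static_cast<std::uint16_t>(
            static_cast<unsigned>(word) * coeff[lane]);

        // divide by 256
        word = static_cast<std::uint16_t>(word >> 8);

        // pack: the word is already in 0..255, so saturation changes nothing
        result |= static_cast<std::uint32_t>(word) << (8 * lane);
    }
    return result;
}
```

Because the coefficient for the fourth lane is 0, the top byte of every output pixel is 0 in this model.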

SSE2 Technology

SSE2 technology contains a set of integer MMX-like Intrinsics operating on the 128-bit SSE registers. Changing an image color balance using the SSE2 technology, for example, can be executed significantly faster than with pure C++ code. SSE2 also extends the SSE technology by adding operations on the double-precision floating-point data type. The MMXSwarm C++ sample works with both MMX and integer SSE2 instructions.
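As an illustration of how close the integer SSE2 intrinsics are to their MMX counterparts, here is a minimal sketch of a brightness increase that handles 16 pixels per step with _mm_adds_epu8, the 128-bit relative of _mm_adds_pu8. It is not taken from the demo projects; the function name is hypothetical, and it assumes an x86 compiler that provides emmintrin.h and a processor with SSE2.

```cpp
#include <emmintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: adds delta to every pixel with unsigned saturation.
void BrightenSse2(const std::uint8_t* src, std::uint8_t* dst,
                  std::size_t count, std::uint8_t delta)
{
    // delta copied into all 16 bytes of a 128-bit value
    const __m128i d = _mm_set1_epi8(static_cast<char>(delta));

    std::size_t i = 0;
    for (; i + 16 <= count; i += 16)
    {
        // unaligned load, saturating add on 16 bytes, unaligned store
        __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        px = _mm_adds_epu8(px, d);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), px);
    }

    // leftover pixels, one at a time
    for (; i < count; ++i)
    {
        const unsigned sum = static_cast<unsigned>(src[i]) + delta;
        dst[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}
```

Unlike the MMX versions, this code needs no _mm_empty call, because the SSE registers are not shared with the floating-point unit.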

Sources

  1. Intel Software manuals.
  2. MSDN, MMX Technology. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vcrefsupportformmxtechnology.asp
  3. Microsoft Visual C++ CPUID sample. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/vcsamcpuiddeterminecpucapabilities.asp
  4. Microsoft Visual C++ MMXSwarm sample. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/vcsamMMXSwarmSampleDemonstratesCImageVisualCsMMXSupport.asp
  5. Matt Pietrek. Under The Hood. February 1998 issue of Microsoft Systems Journal. http://www.microsoft.com/msj/0298/hood0298.aspx

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Alex Fr
Software Developer
Israel

Comments and Discussions

 
[Question] How about MMX contrast adjustment (Pantelis Georgiadis, 6 Dec '04 - 6:54)
Great article !!!!
Can anyone help me with the MMX contrast adjustment ?
[General] SSE Question (Strange Behavior) (Dicky Wong, 19 Nov '04 - 5:55)
I would like to ask questions about SSE.
If you have time, you can see the strange behavior.
Normally, we use "paddd" and "psubd" in SSE command.
There is no problem in "paddd". However, "psubd" is abnormal.
E.g.
a = [1 1 1 1]
b = [2 2 2 2]
c = a - b
c should be [-1 -1 -1 -1]
However, in my case,
c is equal to [-1 -2 -1 -2]
 
Do you know why?
 
Dicky
[General] MMXSwarm... (JoeRohde, 9 Oct '03 - 13:17)
If anyone does look at this sample, please look at the VS.Net 2003 version and not VS7. The VS7 one has some poorly thought out code. Not to say the 2003 one is perfect, but it's better.
 
I applaud anyone playing with this stuff. The world needs more well written high perf native code!
 
Oh, the 2003 version should, in theory, execute correctly using SSE2 on the AMD64 chip (in 32 bit mode for sure) - I need to check that out one of these days. :)
 
Joe
 

[General] division instruction (Jürgen Eidt, 2 Aug '03 - 21:39)
From the Intel-Specs I don't see a division instruction for MMX. This is too bad because I have a performance critical loop for a scaling function that deserves some boost ;)
Do you know why there is no division instruction?
 
Jürgen
cpicture.de
[General] Re: division instruction (Alex Farber, 2 Aug '03 - 22:57)
I think the reason is technological. Division is an expensive operation and they couldn't implement it in MMX. You may try to replace division with multiplication and shift, if possible.
[General] Re: division instruction (Jürgen Eidt, 3 Aug '03 - 20:05)
Thanks Alex,
actually the division is done by a constant value in my case and replacing it with a mult/shift works fine. The error is minimal using a 16bit shift.
The performance gain is not so high because of the 32bit values I have.
I think MMX plays its full advantage for 8bit operands using the instructions with value saturation. For example the image processing on RGB images.
The SSE2 instruction set is only available on the newer processors but allows 4 32bit values processed in parallel for example. This is in my opinion a huge improvement to MMX.
Anyway, thanks again for your great article which makes people aware of a processor feature that can boost the performance (if used correctly of course).

 
Jürgen
cpicture.de
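The multiply-and-shift replacement discussed in this thread can be sketched as follows. This is only an illustration with made-up function names, not code from either poster: the constant divisor is turned once into a reciprocal scaled by 2^16 (rounded up), and each division then becomes one multiplication and a 16-bit shift.

```cpp
#include <cstdint>

// Hypothetical helper: reciprocal of divisor scaled by 2^16, rounded up.
std::uint32_t MakeReciprocal16(std::uint32_t divisor)
{
    return (65536u + divisor - 1u) / divisor;
}

// Approximates value / divisor as (value * reciprocal) >> 16.
std::uint32_t DivideByConstant(std::uint32_t value, std::uint32_t reciprocal)
{
    // 64-bit product so that large inputs cannot overflow
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(value) * reciprocal) >> 16);
}
```

For 8-bit inputs and the divisors 3 and 10 this matches integer division exactly. Because the reciprocal is rounded up, the result can be one too large for bigger inputs (for example 32768 / 3), which is the small error mentioned above.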
[General] Some suggestions (Anthony_Yio, 22 Jul '03 - 15:40)
How about AMD 3DNow! instructions?
 
By the way, a good article you had presented.
[General] Re: Some suggestions (Alex Farber, 22 Jul '03 - 18:02)
Yes, I would like to try also these technologies, but my processor doesn't support them. Thanks.
[General] Tested only in VS 2003 (Alex Farber, 8 Jul '03 - 20:20)
I made this project in Visual Studio .NET 2003. The publisher also added VC7 to the list of supported versions; I am not sure VC7 supports this.
[General] Re: Tested only in VS 2003 (sharlila, 9 Jul '03 - 5:21)

correct me if I'm wrong, but it may also work with vc processor pack 5,
installed on VC 6. I haven't tried it yet, but it looks pretty cool,
and you've brought an interesting idea. I saw that you shaved about
1/2 of the time. I'm writing a filter for directshow which processes
30 frames per second, and MMX might be what I need.
 
thank you
[General] Re: Tested only in VS 2003 (John M. Drescher, 9 Jul '03 - 10:15)
WARNING: Do not install the processor pack. It will cause you more problems than it is worth because it breaks exceptions and causes stack corruption...

 
John
[General] Re: Tested only in VS 2003 (Alex Farber, 9 Jul '03 - 18:06)
In previous VC versions you can use inline Assembly with MMX instructions.
[General] Re: Tested only in VS 2003 (sharlila, 10 Jul '03 - 3:37)
Alex Farber wrote:
In previous VC versions you can use inline Assembly with MMX instructions
 
with or without the processor pack?
 
is it really that bad? I was kind of hoping it'll be fine. I haven't
installed it yet but I was about to. can you confirm that it is much
worse? you mean that try-catch clauses won't work or only in MMX
instructions? what do you mean, it breaks exceptions?
 
thanks
[General] Re: Tested only in VS 2003 (John M. Drescher, 10 Jul '03 - 11:13)
The processor pack causes a stack optimization bug that is completely unrelated to using MMX or any of the other extra instructions it contains. Exceptions from COM are totally broken and other problems occur that cause stack corruption. There is a guy in our group who still uses the processor pack and has a workaround for the bug in some cases. I will talk to him about it when I get a chance.

 
John
[General] Re: Tested only in VS 2003 (John M. Drescher, 10 Jul '03 - 11:19)
Here is a little follow up on the bug:
 
I have recently installed both SP5 and the Processor Pack to visual C++ 6 and I am having the following problem. When an exception occurs inside a try catch block inside a function that calls ado methods, the function crashes (in the debug build) on return. The following code runs fine when compiled with an earlier service pack and no processor pack, but with SP5 + Processor Pack it crashes during the return from the test function. Also if I compile it on a pc with an earlier sp, it will debug and run fine on a machine with the latest sp (as long as you don't try to compile!).
 
The following code has been reduced to the smallest part that I could get it to crash. I compiled it as a Win32 console application and did not change any of the default values. In SP5 + processor pack an access violation occurs when test() returns. This was tested on win2k. I compiled and tested it in debug build only.
 
#include "stdafx.h"
 
// You must change this to match your path
#import "D:\program files\common files\system\ado\msado15.dll" \
            no_namespace \
            rename( "EOF", "adoEOF" )
 
void test()
{
   _RecordsetPtr pRst = NULL;
   try
   {
      throw "a";
 
      pRst->Open(_variant_t(), _variant_t(),adOpenStatic ,
            adLockReadOnly, adCmdText);
   }
   catch (...)
   {
 
   }
}
 
int main(int argc, char* argv[])
{
   test();
   return 0;
}
 

 
My question is 1) Has anyone seen this problem? 2) am I doing something wrong?
 
John
[General] Re: Tested only in VS 2003 (sharlila, 10 Jul '03 - 20:42)

ok, I guess I won't install processor pack. Although it's a pretty
unique case (I don't use ado). Are you sure you didn't download
the beta processor pack? because they say it's unstable.
[General] Re: Tested only in VS 2003 (John M. Drescher, 11 Jul '03 - 2:33)
No. There are more cases that cause the problem. The coworker I talked about above does not use ADO and he experienced the problem. And no this was not the beta version.
 
John
[General] Re: Tested only in VS 2003 (sharlila, 11 Jul '03 - 4:34)
ok,
you talked me out of installing it. :~
[General] Re: Tested only in VS 2003 (John M. Drescher, 11 Jul '03 - 4:39)
You can install it on an alternate machine and use the .lib, .obj or .dll file in your application. This is what we do. My coworker has his algorithms compiled with the processor pack and I use them (.lib files) in my application without problems.
 
John
[General] Re: Tested only in VS 2003 (John M. Drescher, 11 Jul '03 - 4:52)
I found out what was causing my coworker's problem. It was not ADO but it was still COM. During a COM call (pSeriesCol is a com pointer to excel using the #import directive) the EBX and EBP registers are trashed. His workaround fixes the EBX register so the return address of the function will be correct. He did not investigate fixing the EBP register so there can be problems with stack variables.
 
#define	Save_EBX	_asm	mov	saved_ebx,ebx
#define	Restore_EBX	_asm	mov	ebx,saved_ebx
 

void some_member_function()
{
...
 
try{
Save_EBX;
pSeriesCol->Add((Range*)pRange,xlColumns,VARIANT_TRUE,VARIANT_TRUE,VARIANT_FALSE);
// EBX and EBP registers are trashed with the processor pack 5 installed in vc6 with sp5.
Restore_EBX;
}
catch( _com_error &e){ e;};
 
...
}
 

 
John
[General] Re: Tested only in VS 2003 (sharlila, 11 Jul '03 - 21:53)
you know, I tried searching microsoft's KB and came up with nothing.
well, there is something with SP4 installed but that's it.
maybe you have something unusual in your pc. I don't know, an old
OS, multiple processors, maybe something else. could it be?
[General] Re: Tested only in VS 2003 (John M. Drescher, 12 Jul '03 - 2:39)
It is not in the KB. It has happened on every pc I tried it on and it is very reproducible. The only thing about our machines is we are using AMD processors. If you do not use COM I would not worry about the problem. My coworker has had it installed on his pc for about 2 years, but for the most part he does not use COM.

 
John
[General] Re: Tested only in VS 2003 (sharlila, 12 Jul '03 - 3:48)
well,
I use COM sometimes, however, I have an Intel processor. maybe that's
the problem? you certainly confused me, I think I'll develop a
split personality now. no, I'm kidding, thanks for taking the time
to explain it to me.
[General] Re: Tested only in VS 2003 (John M. Drescher, 12 Jul '03 - 4:51)
sharlila wrote:
I use COM sometimes, however, I have an Intel processor.
 
I do remember that taking an executable produced with the processor pack to an intel machine caused the crash also. I would say you can try it for yourself.
 
sharlila wrote:
you certainly confused me, I think I'll develop a
split personality now. no, I'm kidding

 
You convinced me that it may not be so bad to install. The worst that can happen is that you will have to uninstall vc++ delete the folder and then reinstall. I did this to remove it on my pc. Glenn my coworker friend did not do this but he only uses COM for one application that I wrote for him (Excel automation).

 
John
[General] Re: Tested only in VS 2003 (sharlila, 13 Jul '03 - 3:11)
John M. Drescher wrote:
The worst that can happen is that you will have to uninstall vc++ delete the folder and then reinstall
 
I don't know. This seems one of those things that might force you
to format your HD. Can't know when dealing with those patches.
Another thing, I don't want to run between two computers to debug
something.

Last Updated 9 Jul 2003
Article Copyright 2003 by Alex Fr
Everything else Copyright © CodeProject, 1999-2013