
Introduction to MMX Programming

By Alex Fr, 8 Jul 2003
 

Introduction

Intel MMX™ technology allows enhanced performance in many applications, such as image processing and 2D and 3D graphics. The typical situation where Intel MMX™ technology may be applied is the execution of repetitive operations on large arrays of data elements such as bytes, words or double-words.

Visual Studio .NET 2003 supports a set of MMX Intrinsics that allow MMX instructions to be used directly from C++ code, without writing Assembly instructions. Reading the MSDN MMX topics [2] together with the Intel Software manuals [1] is a good way to learn the basics of MMX programming.

MMX technology implements the SIMD (single-instruction, multiple-data) execution model. Consider the following programming task: adding some value to each element of a BYTE array. The algorithm for this task may be written this way:

for each b in array
    b = b + n

In more detail:

for each b in array
{
    load b into a register
    add n to the register
    write the result from the register back to memory
}

Processors with Intel MMX support have eight 64-bit registers, each of which may hold 8 bytes, 4 words, or 2 double-words. MMX is a set of instructions that load numeric data (bytes, words, double-words) into the MMX registers, perform arithmetic and logical operations on them, and write the results back to memory. Using MMX technology, the algorithm may be written this way:
for each  8 members in array
{
    load 8 members to the MMX register
    add n to each byte in one operation
    write the result from the register back into memory
}
A C++ programmer using MMX Intrinsics doesn't work with the MMX registers directly. Instead, the programmer has the 64-bit __m64 type and a set of functions that perform arithmetic and logical operations. The C++ compiler takes care of register allocation and code optimization.

The Visual C++ MMXSwarm sample [4] shows the use of MMX technology in image processing. It contains a set of wrapper classes that simplify work with MMX Intrinsics, and shows how to perform image processing operations on various types of images (monochrome, RGB 24 bits, RGB 32 bits etc.). This article is a simple introduction to C++ MMX programming. Everyone who is interested in this technology is strongly encouraged to read the MMXSwarm sample.

MMX Programming Details

Include Files

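The MMX Intrinsics are declared in the mmintrin.h file. The emmintrin.h file (the SSE2 header) includes it indirectly, so a single include covers both MMX and SSE2: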
#include <emmintrin.h>
Since MMX Intrinsics are built into the compiler and are not library functions, there are no lib-files to link.

__m64 Data Type

Variables of this type are used as MMX instruction operands. Their contents should not be accessed directly. Variables of type __m64 are automatically aligned on 8-byte boundaries.

Detection of MMX Support

MMX instructions may be used only if the processor supports them. The Visual C++ CPUID sample [3] shows how to detect support for SSE, MMX and other processor features. Detection is done with the cpuid Assembly instruction. See the details in this sample and in the Intel Software manuals [1].
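
The detection step can be sketched as follows. This is not the code of the CPUID sample [3]; it is a minimal illustration that assumes a compiler providing either the `__cpuid` intrinsic (newer Visual C++ versions, `<intrin.h>`) or `__get_cpuid` (GCC/Clang, `<cpuid.h>`). The helper names `HasFeature` and `ReadCpuidLeaf1Edx` are hypothetical. The bit positions are those documented for CPUID function 1, register EDX.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#endif

// Feature flags reported in EDX by CPUID function 1.
const std::uint32_t kMmxBit  = 1u << 23;
const std::uint32_t kSseBit  = 1u << 25;
const std::uint32_t kSse2Bit = 1u << 26;

// Hypothetical helper: true if the given feature bit is set in EDX.
inline bool HasFeature(std::uint32_t edx, std::uint32_t bit)
{
    return (edx & bit) != 0;
}

// Hypothetical helper: returns EDX of CPUID function 1,
// or 0 when the compiler or target is not covered by this sketch.
inline std::uint32_t ReadCpuidLeaf1Edx()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = { 0, 0, 0, 0 };           // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return static_cast<std::uint32_t>(regs[3]);
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return edx;
    return 0;
#else
    return 0;
#endif
}
```

A caller would typically evaluate `HasFeature(ReadCpuidLeaf1Edx(), kMmxBit)` once at startup and fall back to the plain C++ code path when it returns false.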

Saturation Arithmetic and Wraparound Mode

MMX technology supports saturating arithmetic. In saturation mode, a result that overflows or underflows is clipped (saturated) to the limit of the data type's range ([1]). Saturation mode is useful in image processing, where a pixel value that wraps around turns a bright pixel dark or vice versa. A simple example shows the difference between saturation and wraparound mode. Consider adding 1 to a BYTE variable with the value 255. In wraparound mode the result is 0 (the carry bit is ignored). In saturation mode the result is 255. The same holds at the low end of the range: 1 - 2 = 0 for the BYTE type in saturation mode (255 in wraparound mode). The MMX addition and subtraction instructions for bytes and words come in both wraparound and saturating forms. The demo project from this article uses only the saturating forms.
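
The two modes can be illustrated for a single byte in plain C++. This is only a scalar sketch of the arithmetic, not MMX code; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Wraparound: the carry out of the top bit is discarded (arithmetic modulo 256).
inline std::uint8_t AddWrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

// Saturation: a result above 255 is clipped to 255.
inline std::uint8_t AddSaturate(std::uint8_t a, std::uint8_t b)
{
    const int sum = a + b;
    return static_cast<std::uint8_t>(sum > 255 ? 255 : sum);
}

// Wraparound: a negative result wraps to the top of the range.
inline std::uint8_t SubWrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a - b);
}

// Saturation: a result below 0 is clipped to 0.
inline std::uint8_t SubSaturate(std::uint8_t a, std::uint8_t b)
{
    const int diff = a - b;
    return static_cast<std::uint8_t>(diff < 0 ? 0 : diff);
}
```

With these definitions, `AddWrap(255, 1)` gives 0 and `AddSaturate(255, 1)` gives 255. The saturating intrinsics listed later in the article apply the same per-byte rule to 8 bytes at once.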

MMX8 Demo Project

MMX8 is an SDI application that performs simple processing of a monochrome, 8 bits per pixel image. The source image or the result of its processing is shown in the window. The new ATL class CImage is used to load the image from resources and to show it in the window. Two operations are performed on the image: inversion and brightness change. Each operation may be done in one of the following ways:
  • C++ code;
  • C++ code with MMX Intrinsics;
  • Inline Assembly with MMX instructions.
Calculation time is shown in the status bar.

C++ image inversion function:

void CImg8Operations::InvertImageCPlusPlus(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels)
{
    for ( int i = 0; i < nNumberOfPixels; i++ )
    {
        *pDest++ = 255 - *pSource++;
    }
}
The best way to find the required MMX instruction is to read the Intel Software manuals [1]. The name of the required Assembly MMX instruction can be found in the short MMX technology overview (Volume 1, Chapter 8). The detailed instruction definition is in Volume 2. This definition also contains the name of the corresponding C++ compiler intrinsic. Some C++ MMX intrinsics are composite (translated to more than one Assembly instruction). They should be looked up directly in the MSDN documentation [2].

The summary of all MMX instructions used in the MMX8 sample is shown in the following table:

Required Function | Assembly Instruction | MMX Intrinsic
Empty MMX state (prevents collisions with floating-point operations) | emms | _mm_empty
Unsigned subtraction with saturation of each byte in two 64-bit operands | psubusb | _mm_subs_pu8
Unsigned addition with saturation of each byte in two 64-bit operands | paddusb | _mm_adds_pu8

Image inversion function in C++ with MMX Intrinsics:

void CImg8Operations::InvertImageC_MMX(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels)
{
    __int64 allOnes = 0;
    allOnes = ~allOnes;                     // 0xffffffffffffffff

    // 8 pixels are processed in one loop
    int nLoop = nNumberOfPixels/8;

    __m64* pIn = (__m64*) pSource;          // input pointer
    __m64* pOut = (__m64*) pDest;           // output pointer

    __m64 tmp;                              // work variable

    _mm_empty();                            // emms

    __m64 n1 = Get_m64(allOnes);            // 255 in each of the 8 bytes

    for ( int i = 0; i < nLoop; i++ )
    {
        tmp = _mm_subs_pu8 (n1 , *pIn);     // Unsigned subtraction with 
                                            // saturation.
                                            // tmp = n1 - *pIn  for each byte

        *pOut = tmp;

        pIn++;                              // next 8 pixels
        pOut++;
    }

    _mm_empty();                            // emms

    // invert the remaining pixels (fewer than 8) with plain C++
    for ( int j = nLoop * 8; j < nNumberOfPixels; j++ )
    {
        pDest[j] = 255 - pSource[j];
    }
}

__m64 CImg8Operations::Get_m64(__int64 n)
{
    union __m64__m64
    {
        __m64 m;
        __int64 i;
    } mi;

    mi.i = n;
    return mi.m;
}
Each function executes in a very short time, so the demo calls it a number of times to make the difference measurable. Calculation times on my computer:
  • C++ code - 43 ms
  • C++ with MMX Intrinsics - 26 ms
  • Inline Assembly with MMX instructions - 26 ms
Execution time should be estimated in the Release configuration, with compiler optimizations.
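
The repeated-call measurement can be sketched as below. This is not the timing code of the demo project; it assumes a C++11 compiler, and the name `TimeRepeatedMs` is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: calls fn() the given number of times and
// returns the total elapsed time in milliseconds.
template <typename Fn>
long long TimeRepeatedMs(Fn fn, int repetitions)
{
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();

    for (int r = 0; r < repetitions; ++r)
    {
        fn();                               // the operation being measured
    }

    const std::chrono::steady_clock::time_point stop =
        std::chrono::steady_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(
        stop - start).count();
}
```

A call such as `TimeRepeatedMs([&] { ops.InvertImageC_MMX(pSrc, pDst, n); }, 100)` would time 100 inversions, where `ops`, `pSrc`, `pDst` and `n` stand for the caller's own objects. In an optimized build the compiler may remove work whose result is never used, so the output buffer should be read after the measurement.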

The brightness change is done in the simplest way: by adding some value to, or subtracting it from, each pixel in the image. The conversion functions are slightly more complicated because two different branches are needed, for positive and negative changes.

C++ function for changing an image brightness:

void CImg8Operations::ChangeBrightnessCPlusPlus(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels, 
    int nChange)
{
    if ( nChange > 255 )
        nChange = 255;
    else if ( nChange < -255 )
        nChange = -255;

    BYTE b = (BYTE) abs(nChange);

    int i, n;

    if ( nChange > 0 )
    {
        for ( i = 0; i < nNumberOfPixels; i++ )
        {
            n = (int)(*pSource++ + b);

            if ( n > 255 )
                n = 255;

            *pDest++ = (BYTE) n;
        }
    }
    else
    {
        for ( i = 0; i < nNumberOfPixels; i++ )
        {
            n = (int)(*pSource++ - b);

            if ( n < 0 )
                n = 0;
            *pDest++ = (BYTE) n;
        }
    }
}
Changing an image brightness using C++ with MMX Intrinsics:
void CImg8Operations::ChangeBrightnessC_MMX(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels, 
    int nChange)
{
    if ( nChange > 255 )
        nChange = 255;
    else if ( nChange < -255 )
        nChange = -255;

    BYTE b = (BYTE) abs(nChange);

    // make 64 bits value with b in each byte
    __int64 c = b;

    int i;                                  // declared here because the
                                            // loops below use it as well
    for ( i = 1; i <= 7; i++ )
    {
        c = c << 8;
        c |= b;
    }

    // 8 pixels are processed in one loop
    int nNumberOfLoops = nNumberOfPixels / 8;

    __m64* pIn = (__m64*) pSource;          // input pointer
    __m64* pOut = (__m64*) pDest;           // output pointer

    __m64 tmp;                              // work variable


    _mm_empty();                            // emms

    __m64 nChange64 = Get_m64(c);

    if ( nChange > 0 )
    {
        for ( i = 0; i < nNumberOfLoops; i++ )
        {
            tmp = _mm_adds_pu8(*pIn, nChange64); // Unsigned addition 
                                                 // with saturation.
                                                 // tmp = *pIn + nChange64
                                                 // for each byte

            *pOut = tmp;

            pIn++;                               // next 8 pixels
            pOut++;
        }
    }
    else
    {
        for ( i = 0; i < nNumberOfLoops; i++ )
        {
            tmp = _mm_subs_pu8(*pIn, nChange64); // Unsigned subtraction 
                                                 // with saturation.
                                                 // tmp = *pIn - nChange64
                                                 // for each byte

            *pOut = tmp;

            pIn++;                                      // next 8 pixels
            pOut++;
        }
    }

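    _mm_empty();                            // emms

    // process the remaining pixels (fewer than 8) with plain C++
    for ( int j = nNumberOfLoops * 8; j < nNumberOfPixels; j++ )
    {
        int n = ( nChange > 0 ) ? pSource[j] + b : pSource[j] - b;
        pDest[j] = (BYTE) ( n > 255 ? 255 : ( n < 0 ? 0 : n ) );
    }
}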
 
Notice that the sign of the nChange parameter is checked once outside the loop, not thousands of times inside it. Calculation times on my computer:
  • C++ code - 49 ms
  • C++ with MMX Intrinsics - 26 ms
  • Inline Assembly with MMX instructions - 26 ms
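
The 64-bit constant with the same byte in every position, which the function above builds with a shift-and-or loop, can also be produced with one multiplication. The sketch below is an alternative shown for illustration, not the code of the demo project; `ReplicateByte` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: copies b into each of the 8 bytes of a 64-bit value.
// Multiplying by 0x0101010101010101 adds b shifted by 0, 8, 16, ... 56 bits;
// the shifted copies never overlap, so no carries occur between bytes.
inline std::uint64_t ReplicateByte(std::uint8_t b)
{
    return static_cast<std::uint64_t>(b) * 0x0101010101010101ULL;
}
```

For example, `ReplicateByte(0x12)` yields `0x1212121212121212`. The `_mm_set1_pi8` intrinsic, used in one of the comments below, fills an `__m64` value in the same way.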

MMX32 Demo Project

The MMX32 project performs operations on a 32 bits per pixel RGB image. The operations are inversion and changing the image color balance (multiplying each color by some value).

MMX multiplication is more complicated than addition or subtraction, because the result of a multiplication does not have the same size as the operands. For example, if the multiplication operands have the BYTE type, the result needs the WORD type. This requires additional conversions, and the difference between the C++ and MMX execution times is minimal (5-10%).
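
The arithmetic applied to each color channel can be written for a single byte in plain C++. This scalar sketch is an illustration only; the name `ScaleChannel` is hypothetical, and the coefficient is assumed to lie in the range 0.0 to 1.0 so that the product fits in 16 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: scales one 8-bit channel by a coefficient in [0.0, 1.0].
inline std::uint8_t ScaleChannel(std::uint8_t value, float coefficient)
{
    // The coefficient becomes an integer factor with 8 fractional bits.
    const int factor = static_cast<int>(coefficient * 256.0f);

    // BYTE * WORD: the widest product here is 255 * 256 = 65280,
    // which still fits in an unsigned 16-bit word.
    const int product = value * factor;

    // Dropping the 8 fractional bits divides by 256.
    return static_cast<std::uint8_t>(product >> 8);
}
```

For example, `ScaleChannel(200, 0.5f)` returns 100. The MMX function below performs the same widen, multiply, shift and narrow steps on the channels of one pixel at a time.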

Changing an image color balance using C++ with MMX Intrinsics:

void CImg32Operations::ColorsC_MMX(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels, 
    float fRedCoefficient, 
    float fGreenCoefficient, 
    float fBlueCoefficient)
{
    int nRed = (int)(fRedCoefficient * 256.0f);
    int nGreen = (int)(fGreenCoefficient * 256.0f);
    int nBlue = (int)(fBlueCoefficient * 256.0f);

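    // make multiplication coefficient:
    // word 0 = blue, word 1 = green, word 2 = red, word 3 = 0
    // (the code assumes blue is the low byte of each pixel DWORD).
    // Coefficients are expected in the range 0.0 - 1.0; a larger value
    // can overflow the 16-bit product kept by _mm_mullo_pi16.
    __int64 c = 0;
    c = nRed;
    c = c << 16;
    c |= nGreen;
    c = c << 16;
    c |= nBlue;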

    __m64 nNull = _m_from_int(0);           // null
    __m64 tmp = _m_from_int(0);             // work variable

    _mm_empty();                            // emms

    __m64 nCoeff = Get_m64(c);

    DWORD* pIn = (DWORD*) pSource;          // input pointer
    DWORD* pOut = (DWORD*) pDest;           // output pointer

    for ( int i = 0; i < nNumberOfPixels; i++ )
    {
        tmp = _m_from_int(*pIn);                // tmp = *pIn (write to low
                                                // 32 bits)

        tmp = _mm_unpacklo_pi8(tmp, nNull );    // convert low 4 bytes of
                                                // tmp to 4 words
                                                // high byte for each word
                                                // is taken from nNull

        tmp =  _mm_mullo_pi16 (tmp , nCoeff);   // multiply each word in
                                                // tmp to word in nCoeff
                                                // get low word of each
                                                // result

        tmp = _mm_srli_pi16 (tmp , 8);          // shift each word in tmp
                                                // right to 8 bits (/256)

        tmp = _mm_packs_pu16 (tmp, nNull);      // Pack with unsigned
                                                // saturation.
                                                // Convert 4 words from tmp
                                                // to 4 bytes and write them
                                                // to low 32 bits of tmp.
                                                // Convert 4 words from nNull
                                                // to 4 bytes and write them
                                                // to high 32 bits of tmp.

        *pOut = _m_to_int(tmp);                 // *pOut = tmp (low 32 bits)
        
        pIn++;
        pOut++;

    }

    _mm_empty();                          // emms
}
See additional details in the demo project source code.

SSE2 Technology

SSE2 technology contains a set of integer, MMX-like Intrinsics that operate on the 128-bit SSE registers. Changing an image's color balance using SSE2, for example, can be executed significantly faster than with pure C++ code. SSE2 also extends SSE by adding operations on the double-precision floating-point data type. The MMXSwarm C++ sample works with both MMX and integer SSE2 instructions.
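
As an illustration of the MMX-like integer Intrinsics, the brightness increase from the MMX8 project could be sketched with SSE2 as below. This is not code from the demo projects or from MMXSwarm. The name `AddBrightnessSse2` is hypothetical, unaligned loads and stores are used so that no alignment is assumed, and a plain C++ loop handles the pixels left over after the 16-byte blocks (and all pixels when the compiler does not target SSE2).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper: dst[i] = min(255, src[i] + delta) for each pixel.
void AddBrightnessSse2(const std::uint8_t* src, std::uint8_t* dst,
                       std::size_t count, std::uint8_t delta)
{
    std::size_t i = 0;

#if SKETCH_HAS_SSE2
    // delta replicated into all 16 bytes of a 128-bit value
    const __m128i d = _mm_set1_epi8(static_cast<char>(delta));

    // 16 pixels are processed in one loop iteration
    for (; i + 16 <= count; i += 16)
    {
        __m128i px = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(src + i));
        px = _mm_adds_epu8(px, d);          // unsigned add with saturation
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), px);
    }
#endif

    // remaining pixels with plain C++
    for (; i < count; ++i)
    {
        const int sum = src[i] + delta;
        dst[i] = static_cast<std::uint8_t>(sum > 255 ? 255 : sum);
    }
}
```

Because SSE2 integer instructions use the XMM registers and not the MMX registers, this sketch does not call `_mm_empty`.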

Sources

  1. Intel Software manuals.
  2. MSDN, MMX Technology. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vcrefsupportformmxtechnology.asp
  3. Microsoft Visual C++ CPUID sample. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/vcsamcpuiddeterminecpucapabilities.asp
  4. Microsoft Visual C++ MMXSwarm sample. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/vcsamMMXSwarmSampleDemonstratesCImageVisualCsMMXSupport.asp
  5. Matt Pietrek. Under The Hood. February 1998 issue of Microsoft Systems Journal. http://www.microsoft.com/msj/0298/hood0298.aspx

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Alex Fr
Software Developer
Israel


Comments and Discussions

 
Swap array (X3m, 9 May '08 - 0:12)
Congratulations for your great article.
I have the following question:
How can I swap elements in an array? I.e. given an array of RGBA values like this:
0xFFFFFFFF, 0xFF00FF00, 0x00FF00FF, 0xFFFF0000 to swap the elements so it looks like this:
0xFFFF0000, 0x00FF00FF, 0xFF00FF00, 0xFFFFFFFF?
I need this in order to mirror a bitmap in memory before displaying it with BitBlt, because StretchBlt is too slow.
 
Best regards,
X3m

Great article and a question (X3m, 30 Jan '08 - 23:35)
Hello,
 
First of all let me congratulate you on a job well done.
Your article is very helpful for MMX newbies like me (although experienced in C++).
 
I have the following question:
Can you please give me some instructions and maybe a simple example of how can I do a bilinear (or any kind of) image resampling/resizing?
 
Thanks in advance and keep up good work.
 
Best regards,
X3m

Re: Great article and a question (Alex Fr, 31 Jan '08 - 1:17)
Thank you.
Before you start with these algorithms, take into account that this article was written a long time ago. On Pentium III, MMX gives a great performance boost. On Pentium IV and later, the effect of these optimizations is minimal, I guess because of the large cache size. When the whole image fits into the cache, a simple C++ loop works almost as fast as an MMX loop.
If you still want to optimize, use SSE2 instead of MMX. SSE2 works with integer numbers in the same way as MMX, but uses 128 bit registers instead of 64 bit.
About your question - unfortunately, I don't have experience in implementing these algorithms. If you or your company can pay some money for effective image processing, consider using a third party library. The best and fastest library is Intel IPP, and its cost is acceptable.

Help me! (KienNT78, 3 Oct '07 - 0:23)
Please show me how to programming mmx in c#

Re: Help me! (Lloyd Atkinson, 2 Sep '10 - 0:42)
Use Inline assembly :)


"People demand freedom of speech to make up for the freedom of thought which they avoid."

Alpha blending (Dickymoe5125 Sep '05 - 22:36)
For the begining, sorry for my bad english.
I'm a newbie in MMX/SSE. I would like to do alpha blending (it is only a test to test the performance) but when I use MMX/SSE i'm less speed than optimized c++.
 
The biggest problem, it is the multiplication 8*8 not exists, and the conversion are slowly.
 
Do you have example of alpha blending with intrasec functions, or explication ?
 
Thank you
 
Good Bye

use of EMMS/_mm_empty (Hlabbe, 5 Aug '05 - 8:07)
Hi There,
 
Thx for a great article! I was curious about the usage of _mm_empty in your example. You actually call it twice in your function even though no floating point operation will happen in between (resulting in wasted cycles for the first call), am I right? Any particular reason why or you're just being extra careful by always calling _mm_empty after any MMX instr. and before either returning or calling another function?
 
Thx in advance,
 
Hugues.

Trouble with performance (Jahnotto, 21 Feb '05 - 2:57)
Hi, thank you for a very good article!
 
I'm trying to implement a little MMX test for myself without luck :(
 
Here is the code:
 

 

#include
#include
 

#include
 
#define REPS 100
 
void doItCpp(unsigned char* src, unsigned char* dst, unsigned char inc, int iSize)
{
    for (int i = 0; i < iSize; i++)
    {
        *dst = *src + inc;

        src++;
        dst++;
    }
}

void doItMmx(unsigned char* src, unsigned char* dst, unsigned char inc, int iSize)
{
    __m64* pSrc = (__m64*)src;
    __m64* pDest = (__m64*)dst;
    __m64 tmp;

    _mm_empty();

    __m64 increment = _mm_set1_pi8(inc);

    int iNumElems = iSize / 8;
    for (int i = 0; i < iNumElems; i++)
    {
        tmp = _mm_add_pi8(*pSrc, increment);

        *pDest = tmp;

        pSrc++;
        pDest++;
    }

    _mm_empty();
}

int main(int, char**)
{
    printf("Starting...\n");

    // Define dimensions
    int iNumPix = 2048 * 2048;

    // Create original data
    unsigned char* originalpix = (unsigned char*) _aligned_malloc(iNumPix, 16);
    unsigned char* procpix = (unsigned char*) _aligned_malloc(iNumPix, 16);
    unsigned char* procpix2 = (unsigned char*) _aligned_malloc(iNumPix, 16);

    // Fill with random data
    for (int i = 0; i < iNumPix; i++)
        originalpix[i] = (unsigned char)(256 * rand() / (float)RAND_MAX);

    //
    //

    long lStartTime = GetTickCount();

    unsigned char smallrandnum = unsigned char(lStartTime % 5);

    for (int REP = 0; REP < REPS; REP++)
    {
        doItCpp(originalpix, procpix, smallrandnum, iNumPix);
    }

    printf("Time 1: %i ms\n", GetTickCount() - lStartTime);

    //
    //
    lStartTime = GetTickCount();

    for (int REP = 0; REP < REPS; REP++)
    {
        doItMmx(originalpix, procpix2, smallrandnum, iNumPix);
    }

    printf("Time 2: %i\n", GetTickCount() - lStartTime);

    if (memcmp(procpix, procpix2, iNumPix) != 0)
        printf("*** RESULT IS NOT VALID");

    _aligned_free(originalpix);
    _aligned_free(procpix);
    _aligned_free(procpix2);

    return 0;
}
 

 

The result is valid, but both the C++ and C++/MMX version are achieving the same performance results: approx. 1000 ms on my Pentium M. I am using Visual Studio .NET 2003.
 
What can be wrong? It wonder if I have forgot to initialize something.
 
By the way, your MMX examples are showing a performance leap of 3-5x when going from C++ to C++/MMX !

Re: Trouble with performance (Alex Fr, 21 Feb '05 - 3:40)
Testing your code on Pentium III, I got some performance boost with the MMX code. On Pentium IV the results may be different. For example, if the source and destination arrays fit in the CPU cache, the MMX code doesn't give any advantage. Try to increase the size of the input/output arrays significantly, because Pentium IV has a large cache.
Many optimization techniques that were valid a few years ago become unnecessary or less important with the release of new processors and new compilers. To see the whole picture, you need to read the optimized Assembly code generated by the C++ compiler, know the CPU parameters, play with the C++ compiler options etc.
You can write the same with SSE2 instructions, this may make a difference.

Re: Trouble with performance (Jahnotto, 21 Feb '05 - 22:49)
Hello Alex, thank you for your quick answer!
 
I get your point - optimization techniques that were good yesterday may be obsolete today.
 
However, I find it very strange that your MMX8 example is so much faster than my test. Your "change brightness" example spends 15 ms with C++ and 3 ms with C++/MMX. As far as I can tell, the operation in your MMX8 example is just the same as what I do: add a number to each 8-bit pixel.
 
I also tried to measure the number of pixels processed per millisecond:
 
Your demo with C++: 320,000 pixels/ms
Your demo with MMX: 1,600,000 pixels/ms
My test with C++: 350,000 pixels/ms
My test with MMX: 400,000 pixels/ms
 
I have also tried to use SSE2 instead without any difference:
 

 
__m128i* pSrc = (__m128i*)src;
__m128i* pDest = (__m128i*)dst;
__m128i tmp;

__m128i increment = _mm_set1_epi8(inc);

int iNumElems = iSize / 16;
for (int i = 0; i < iNumElems; i++)
{
    tmp = _mm_add_epi8(*pSrc, increment);

    *pDest = tmp;

    pSrc++;
    pDest++;
}

_mm_empty();
 

 
The input/output arrays I'm using are 16 MB, far bigger than the CPU's 512 kB cache.
 
I am scratching my head here.. I just can't see what you're doing differently.

Last Updated 9 Jul 2003
Article Copyright 2003 by Alex Fr