Introduction
Intel MMX™ technology can improve performance in many applications, such as image processing and 2D and 3D graphics. It is typically applied where the same operation is repeated over large arrays of data elements such as bytes, words or double-words.
Visual Studio .NET 2003 supports a set of MMX Intrinsics that allow MMX instructions to be used directly from C++ code, without writing assembly instructions. The MSDN MMX topics [2], read together with the Intel Software manuals [1], cover the basics of MMX programming.
MMX technology implements the SIMD (single-instruction, multiple-data) execution model. Consider the following programming task: adding a value to each element of a BYTE array. The algorithm for this task may be written as follows:
for each b in array
b = b + n
In more detail:
for each b in array
{
load b to the register
add n to the register
write the result from the register back to memory
}
Processors with Intel MMX support have eight 64-bit registers, each of which may hold 8 bytes, 4 words, or 2 double-words. MMX is a set of instructions that load numeric data (bytes, words, double-words) into the MMX registers, perform arithmetic and logical operations on them, and write the results back to memory. Using MMX technology, the algorithm may be written as follows:
for each 8 members in array
{
load 8 members to the MMX register
add n to each byte in one operation
write the result from the register back into memory
}
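The pseudocode above can be sketched with MMX Intrinsics roughly as follows. This is a minimal illustration, not code from the demo projects: the function name AddToEachByte is hypothetical, it assumes an x86 compiler that provides mmintrin.h, and it uses the wraparound addition _mm_add_pi8.

```cpp
#include <mmintrin.h>   // MMX intrinsics (x86 targets only)
#include <cstddef>
#include <cstring>

// Hypothetical helper: adds n to every byte of data, eight bytes per
// MMX operation, using wraparound arithmetic.
void AddToEachByte(unsigned char* data, std::size_t count, unsigned char n)
{
    const __m64 addend = _mm_set1_pi8(static_cast<char>(n)); // n in all 8 bytes
    std::size_t i = 0;
    for (; i + 8 <= count; i += 8)
    {
        __m64 v;
        std::memcpy(&v, data + i, 8);   // load 8 members
        v = _mm_add_pi8(v, addend);     // add n to each byte in one operation
        std::memcpy(data + i, &v, 8);   // write the result back to memory
    }
    _mm_empty();                        // leave the MMX state clean for the FPU
    for (; i < count; ++i)              // remaining 0..7 bytes, one at a time
        data[i] = static_cast<unsigned char>(data[i] + n);
}
```

The scalar loop at the end handles arrays whose length is not a multiple of 8.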
A C++ programmer using MMX Intrinsics does not work with the MMX registers directly. Instead, the programmer uses the 64-bit
__m64
type and a set of functions that perform arithmetic and logical operations. The C++ compiler takes care of register allocation and code optimization.
Visual C++ MMXSwarm
sample [4] shows the use of MMX technology in image processing. It contains a set of wrapper classes that simplify work with MMX Intrinsics, and shows how to perform image processing operations on various types of images (monochrome, RGB 24 bits, RGB 32 bits, etc.). This article is a simple introduction to C++ MMX programming. Everyone who is interested in this technology is strongly encouraged to read the MMXSwarm
sample.
MMX Programming Details
Include Files
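The MMX intrinsics are declared in mmintrin.h. Including the SSE2 header
emmintrin.h
also makes them available, because it includes the earlier intrinsic headers: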
#include <emmintrin.h>
Because MMX intrinsics are expanded by the compiler rather than called as library functions, no lib-files are needed.
__m64 Data Type
Variables of this type are used as MMX instruction operands. They should not be accessed directly. Variables of type
__m64
are automatically aligned on 8-byte boundaries.
Detection of MMX Support
MMX instructions may be used only if the processor supports them. The Visual C++
CPUID sample
[3] shows how to detect support of the SSE, MMX and other processor features. It is done using the
cpuid
assembly instruction. See the details in this sample and in the Intel Software manuals
[1].
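As a rough sketch of such a check, the function below tests bit 23 of EDX returned by cpuid leaf 1, which is the MMX feature flag. It is not taken from the CPUID sample: the names IsFeatureBitSet and CpuHasMmx are hypothetical, and it assumes a compiler newer than Visual Studio .NET 2003 that provides the __cpuid intrinsic (or GCC/Clang with cpuid.h).

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#endif

// True if bit `bit` of a cpuid register value is set.
inline bool IsFeatureBitSet(unsigned int reg, int bit)
{
    return ((reg >> bit) & 1u) != 0;
}

// Hypothetical helper: reports MMX support (cpuid leaf 1, EDX bit 23).
bool CpuHasMmx()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int info[4] = {0, 0, 0, 0};         // EAX, EBX, ECX, EDX
    __cpuid(info, 1);
    return IsFeatureBitSet(static_cast<unsigned int>(info[3]), 23);
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                   // leaf 1 is not available
    return IsFeatureBitSet(edx, 23);
#else
    return false;                       // not an x86 target
#endif
}
```

The same leaf reports SSE in EDX bit 25 and SSE2 in EDX bit 26, so IsFeatureBitSet can be reused for those checks.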
Saturation Arithmetic and Wraparound Mode
The MMX technology supports a new arithmetic capability known as saturating arithmetic. In saturation mode, results of an operation that overflow or underflow are clipped (saturated) to a data-range limit for the data type (
[1]). Saturation mode is widely used in image processing. A simple example shows the difference between saturation and wraparound modes. Consider adding 1 to a
BYTE
variable that holds the value 255. In wraparound mode the result is 0 (the carry bit is ignored). In saturation mode the result is 255. The same applies at the low end of the range: for example, 1 - 2 = 0 (for
BYTE
type, in saturation mode). MMX packed addition and subtraction come in both wraparound and saturating forms. The demo project from this article uses only saturating instructions.
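The difference between the two modes can be illustrated for a single byte with plain C++. This is only a scalar sketch of what the MMX instructions do for each byte; the function names are hypothetical and unsigned char stands in for BYTE.

```cpp
// Wraparound addition: the carry out of the byte is discarded.
unsigned char AddWraparound(unsigned char a, unsigned char b)
{
    return static_cast<unsigned char>(a + b);
}

// Saturated addition: results above 255 are clipped to 255.
unsigned char AddSaturated(unsigned char a, unsigned char b)
{
    const int sum = a + b;
    return static_cast<unsigned char>(sum > 255 ? 255 : sum);
}

// Saturated subtraction: results below 0 are clipped to 0.
unsigned char SubSaturated(unsigned char a, unsigned char b)
{
    const int diff = a - b;
    return static_cast<unsigned char>(diff < 0 ? 0 : diff);
}
```

With these definitions, AddWraparound(255, 1) gives 0, AddSaturated(255, 1) gives 255, and SubSaturated(1, 2) gives 0, matching the example in the text.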
MMX8 Demo Project
MMX8
is an SDI application that performs simple processing on a monochrome 8-bits-per-pixel image. The source image or the result of its processing is shown in the window. The new ATL class
CImage
is used to load an image from resources and to show it in the window. Two operations are performed on the image: inversion and brightness change. Each operation can be done in one of the following ways:
- C++ code;
- C++ code with MMX Intrinsics;
- Inline Assembly with MMX instructions.
Calculation time is shown in the status bar.
C++ image inversion function:
void CImg8Operations::InvertImageCPlusPlus(
    BYTE* pSource,
    BYTE* pDest,
    int nNumberOfPixels)
{
    for ( int i = 0; i < nNumberOfPixels; i++ )
    {
        *pDest++ = 255 - *pSource++;
    }
}
The best way to find the required MMX instruction is to read the Intel Software manuals
[1]. The name of the required assembly MMX instruction can be found in the short MMX technology overview (Volume 1, Chapter 8). The detailed instruction definition is in Volume 2; it also gives the name of the corresponding C++ compiler intrinsic. Some C++ MMX intrinsics are composite (translated into more than one assembly instruction). These should be looked up directly in the MSDN documentation
[2].
The summary of all MMX instructions used in the MMX8 sample is shown in the following table:
Required Function | Assembly Instruction | MMX Intrinsic
Empty MMX state (prevents collisions with floating-point operations) | emms | _mm_empty
Unsigned subtraction with saturation of each byte in two 64-bit operands | psubusb | _mm_subs_pu8
Unsigned addition with saturation of each byte in two 64-bit operands | paddusb | _mm_adds_pu8
Image inversion function in C++ with MMX Intrinsics:
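void CImg8Operations::InvertImageC_MMX(
    BYTE* pSource,
    BYTE* pDest,
    int nNumberOfPixels)
{
    __int64 nAllOnes = ~(__int64)0;          // 0xFF in each of the 8 bytes
    int nLoop = nNumberOfPixels / 8;         // number of 8-pixel groups
    __m64* pIn = (__m64*) pSource;
    __m64* pOut = (__m64*) pDest;
    __m64 tmp;
    _mm_empty();
    __m64 n1 = Get_m64(nAllOnes);
    for ( int i = 0; i < nLoop; i++ )
    {
        tmp = _mm_subs_pu8 (n1 , *pIn);      // 255 - pixel for 8 pixels at once
        *pOut = tmp;
        pIn++; pOut++;
    }
    _mm_empty();
    // Pixels left over when nNumberOfPixels is not a multiple of 8
    for ( int j = nLoop * 8; j < nNumberOfPixels; j++ )
        pDest[j] = 255 - pSource[j];
}
// Reinterprets a 64-bit integer as an __m64 value
__m64 CImg8Operations::Get_m64(__int64 n)
{
    union __m64__m64
    {
        __m64 m;
        __int64 i;
    } mi;
    mi.i = n;
    return mi.m;
}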
Because each function runs in a very short time, I call it many times in a row so that the difference becomes measurable. Calculation times on my computer:
- C++ code - 43 ms
- C++ with MMX Intrinsics - 26 ms
- Inline Assembly with MMX instructions - 26 ms
Execution time should be estimated in the Release configuration, with compiler optimizations.
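A measurement of this kind can be sketched as follows. This is not the timing code of the demo project: MeasureMilliseconds is a hypothetical helper, and it assumes a C++11 compiler with std::chrono, which is newer than the Visual Studio version used in this article.

```cpp
#include <chrono>

// Hypothetical helper: calls work() `repetitions` times and returns the
// total elapsed time in milliseconds.
template <typename Work>
double MeasureMilliseconds(Work work, int repetitions)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Each of the image functions can be wrapped in a lambda and passed as work, with the same repetition count for the C++ and MMX versions so that the totals are comparable.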
The brightness change is done in the simplest way: by adding a value to, or subtracting it from, each pixel in the image. The conversion functions are slightly more complicated because they need two separate branches for positive and negative changes.
C++ function for changing an image brightness:
void CImg8Operations::ChangeBrightnessCPlusPlus(
    BYTE* pSource,
    BYTE* pDest,
    int nNumberOfPixels,
    int nChange)
{
    if ( nChange > 255 )
        nChange = 255;
    else if ( nChange < -255 )
        nChange = -255;
    BYTE b = (BYTE) abs(nChange);
    int i, n;
    if ( nChange > 0 )
    {
        for ( i = 0; i < nNumberOfPixels; i++ )
        {
            n = (int)(*pSource++ + b);
            if ( n > 255 )
                n = 255;
            *pDest++ = (BYTE) n;
        }
    }
    else
    {
        for ( i = 0; i < nNumberOfPixels; i++ )
        {
            n = (int)(*pSource++ - b);
            if ( n < 0 )
                n = 0;
            *pDest++ = (BYTE) n;
        }
    }
}
Changing an image brightness using C++ with MMX Intrinsics:
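void CImg8Operations::ChangeBrightnessC_MMX(
    BYTE* pSource,
    BYTE* pDest,
    int nNumberOfPixels,
    int nChange)
{
    if ( nChange > 255 )
        nChange = 255;
    else if ( nChange < -255 )
        nChange = -255;
    BYTE b = (BYTE) abs(nChange);
    // Replicate b into all 8 bytes of a 64-bit value
    __int64 c = b;
    int i;      // declared once so that every loop below can use it
    for ( i = 1; i <= 7; i++ )
    {
        c = c << 8;
        c |= b;
    }
    int nNumberOfLoops = nNumberOfPixels / 8;
    __m64* pIn = (__m64*) pSource;
    __m64* pOut = (__m64*) pDest;
    __m64 tmp;
    _mm_empty();
    __m64 nChange64 = Get_m64(c);
    if ( nChange > 0 )
    {
        for ( i = 0; i < nNumberOfLoops; i++ )
        {
            tmp = _mm_adds_pu8(*pIn, nChange64);    // saturated add, 8 pixels
            *pOut = tmp;
            pIn++; pOut++;
        }
    }
    else
    {
        for ( i = 0; i < nNumberOfLoops; i++ )
        {
            tmp = _mm_subs_pu8(*pIn, nChange64);    // saturated subtract, 8 pixels
            *pOut = tmp;
            pIn++; pOut++;
        }
    }
    _mm_empty();
    // Pixels left over when nNumberOfPixels is not a multiple of 8
    for ( i = nNumberOfLoops * 8; i < nNumberOfPixels; i++ )
    {
        int n = pSource[i] + nChange;
        pDest[i] = (BYTE)( n > 255 ? 255 : ( n < 0 ? 0 : n ) );
    }
}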
Notice that the sign of the
nChange
parameter is checked once outside the loop rather than thousands of times inside it. Calculation times on my computer:
- C++ code - 49 ms
- C++ with MMX Intrinsics - 26 ms
- Inline Assembly with MMX instructions - 26 ms
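The loop in ChangeBrightnessC_MMX that copies b into all eight bytes of c can also be expressed as a single multiplication. The sketch below shows this alternative; ReplicateByte is a hypothetical name and is not part of the demo project.

```cpp
// Hypothetical helper: copies b into each of the eight bytes of a 64-bit
// value by multiplying with a constant that has 0x01 in every byte.
unsigned long long ReplicateByte(unsigned char b)
{
    return static_cast<unsigned long long>(b) * 0x0101010101010101ULL;
}
```

For example, ReplicateByte(0x12) yields 0x1212121212121212. The _mm_set1_pi8 intrinsic performs the same broadcast and returns the result directly as an __m64 value.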
MMX32 Demo Project
MMX32
project performs operations on a 32-bits-per-pixel RGB image. The operations are inversion and changing the image color balance (multiplying each color channel by a coefficient).
MMX multiplication is more complicated than addition or subtraction, because the result of a multiplication does not have the same size as its operands. For example, if the operands have a BYTE
type, the result should have a WORD
type. This requires additional conversions, so the difference between the C++ and MMX execution times is small (5-10%).
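The widening can be illustrated for a single color channel in plain C++. The sketch below scales one byte by a coefficient stored as a fixed-point value with 8 fractional bits (1.0 becomes 256); ScaleChannel is a hypothetical name, and the clamping of large results is an assumption of this sketch.

```cpp
// Hypothetical helper: scales one 8-bit channel by a coefficient using
// fixed-point arithmetic with 8 fractional bits.
unsigned char ScaleChannel(unsigned char value, float coefficient)
{
    const int fixed = static_cast<int>(coefficient * 256.0f); // 1.0 -> 256
    if (fixed <= 0)
        return 0;                       // zero or negative coefficient
    const int product = value * fixed;  // no longer fits in 8 bits
    const int result = product >> 8;    // drop the 8 fractional bits
    return static_cast<unsigned char>(result > 255 ? 255 : result);
}
```

Note that the 16-bit _mm_mullo_pi16 intrinsic keeps only the low 16 bits of each product, so an MMX version built on it matches this sketch only while value * fixed stays below 65536, which holds for coefficients up to 1.0.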
Changing an image color balance using C++ with MMX Intrinsics:
void CImg32Operations::ColorsC_MMX(
    BYTE* pSource,
    BYTE* pDest,
    int nNumberOfPixels,
    float fRedCoefficient,
    float fGreenCoefficient,
    float fBlueCoefficient)
{
    // Coefficients as fixed-point values with 8 fractional bits (1.0 -> 256)
    int nRed = (int)(fRedCoefficient * 256.0f);
    int nGreen = (int)(fGreenCoefficient * 256.0f);
    int nBlue = (int)(fBlueCoefficient * 256.0f);
    // Pack the coefficients into 16-bit words: 0 | red | green | blue
    __int64 c = 0;
    c = nRed;
    c = c << 16;
    c |= nGreen;
    c = c << 16;
    c |= nBlue;
    __m64 nNull = _m_from_int(0);
    __m64 tmp = _m_from_int(0);
    _mm_empty();
    __m64 nCoeff = Get_m64(c);
    DWORD* pIn = (DWORD*) pSource;
    DWORD* pOut = (DWORD*) pDest;
    for ( int i = 0; i < nNumberOfPixels; i++ )
    {
        tmp = _m_from_int(*pIn);                // load one 32-bit pixel
        tmp = _mm_unpacklo_pi8(tmp, nNull );    // widen 4 bytes to 4 words
        tmp = _mm_mullo_pi16 (tmp , nCoeff);    // multiply, keep low 16 bits
        tmp = _mm_srli_pi16 (tmp , 8);          // divide each word by 256
        tmp = _mm_packs_pu16 (tmp, nNull);      // narrow words back to bytes
        *pOut = _m_to_int(tmp);                 // store the pixel
        pIn++;
        pOut++;
    }
    _mm_empty();
}
See additional details in the demo project source code.
SSE2 Technology
SSE2 technology contains a set of integer MMX-like Intrinsics that operate on the 128-bit SSE registers. Changing an image color balance using SSE2, for example, can run significantly faster than pure C++ code. SSE2 also extends SSE with operations on double-precision floating-point data. The
MMXSwarm
C++ sample works both with MMX and integer SSE2 instructions.
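To give an idea of how the integer SSE2 Intrinsics mirror the MMX ones, the sketch below increases the brightness of an 8-bits-per-pixel image 16 pixels at a time with _mm_adds_epu8, the 128-bit counterpart of _mm_adds_pu8. It is not taken from the demo projects or from MMXSwarm; the name BrightenSse2 is hypothetical, and an x86 compiler with emmintrin.h is assumed.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 targets only)
#include <cstddef>

// Hypothetical helper: adds `amount` to every byte with unsigned
// saturation, 16 bytes per SSE2 operation.
void BrightenSse2(const unsigned char* src, unsigned char* dst,
                  std::size_t count, unsigned char amount)
{
    const __m128i delta = _mm_set1_epi8(static_cast<char>(amount));
    std::size_t i = 0;
    for (; i + 16 <= count; i += 16)
    {
        // Unaligned load and store, so the buffers need no special alignment
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        v = _mm_adds_epu8(v, delta);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    for (; i < count; ++i)              // remaining 0..15 bytes
    {
        const int sum = src[i] + amount;
        dst[i] = static_cast<unsigned char>(sum > 255 ? 255 : sum);
    }
}
```

Unlike MMX code, this function does not need _mm_empty, because the SSE registers are separate from the floating-point registers.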
Sources
[1] Intel Software manuals.
[2] MSDN, MMX Technology. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vcrefsupportformmxtechnology.asp
[3] Microsoft Visual C++ CPUID sample. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/vcsamcpuiddeterminecpucapabilities.asp
[4] Microsoft Visual C++ MMXSwarm sample. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/vcsamMMXSwarmSampleDemonstratesCImageVisualCsMMXSupport.asp
[5] Matt Pietrek. Under The Hood. February 1998 issue of Microsoft Systems Journal. http://www.microsoft.com/msj/0298/hood0298.aspx