# Optimized Image Inversion Using SSE2

By Tolga Birdal, 2 May 2009

## Introduction

Image inversion is the altering of intensities or colors of an image from normal form to an opposite representation.

## Background

An inverted image can be interpreted as a digital version of a photographic negative. After inversion, every color value is replaced by its exact opposite (I know this terminology is not very scientific, but it is useful conceptually). Let's put this in more precise terms. A positive image is a normal, original RGB or grayscale image. A negative image is a tonal inversion of a positive image, in which light areas appear dark and dark areas appear light. In negative images, colors are also reversed: red areas appear cyan, greens appear magenta, and blues appear yellow. In the simplest case, a grayscale image using 0 for black and 255 for white, a near-black pixel value of 5 is converted to 250, which is near-white.

Image inversion is one of the easiest techniques in image processing, which makes it very suitable for demonstrating performance, acceleration, and optimization. Many state-of-the-art image processing libraries, such as OpenCV, Gandalf, and VXL, already perform this operation as fast as possible, even though further acceleration using parallel hardware is still possible.

In this article, I will demonstrate three implementations of image inversion: a basic implementation, an optimized one (in C level), and an optimized one using enhanced instructions of modern CPUs (for our case, SSE2).

The code uses OpenCV, not for the implementation of the algorithms, but only for reading the image and displaying it. So, if you don’t know how to use or configure it, please refer to the OpenCV documentation, or simply: http://opencv.willowgarage.com/wiki/VisualC%2B%2B.

## Algorithm

For every pixel I(x,y), the inversion mapping is defined by Y(x,y) = 255 - I(x,y). For color images, the same formula is applied to all three channels (R, G, B). In computer terms, this is Y(x,y) = ~I(x,y), since bitwise NOT of an 8-bit unsigned value is exactly 255 minus that value. Here is how the basic algorithm looks:

```
int i = 0;
for (; i != src->imageSize; i++)
    dst->imageData[i] = ~((uchar)src->imageData[i]);
```

## Optimizations

##### Bit-wise optimizations

Image inversion is the same as inverting every bit of the pixel value: 1 becomes 0 and 0 becomes 1. In short, it is a NOT operation. Many modern processors have 32-bit registers (we do not care about 64-bit machines right now). Because bit inversion is independent of data size, we should fetch data as wide as our register to fully utilize it. That is why we access the byte image through an integer pointer, which requires a quarter as many iterations. In other words:

`sizeof(uchar) * imageSize = sizeof(int) * (imageSize / 4)`

This way, we could gain up to 3x speed.
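As a minimal sketch of this word-at-a-time idea (the helper name is mine, not the article's; `memcpy` is used instead of a pointer cast to sidestep alignment and aliasing concerns):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Invert `size` bytes by flipping 32 bits per iteration instead of 8.
 * Assumes size is a multiple of 4, as the article requires. */
static void invert_words(unsigned char *dst, const unsigned char *src,
                         size_t size)
{
    for (size_t i = 0; i != size; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, 4);   /* load four pixels as one 32-bit word */
        w = ~w;                   /* a single NOT inverts all four */
        memcpy(dst + i, &w, 4);
    }
}
```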

##### Loop unrolling

The goal of loop unwinding is to increase the program's speed by reducing (or eliminating) the "end of loop" test on each iteration. Loops can be re-written as a sequence of independent statements which eliminates the loop controller overhead. Significant gains can be realized if the speed gained (by eliminating the "end of loop" test) offsets the performance reduction caused by the increased size of the program. Furthermore, if the statements in the loop are independent of each other, statements that occur earlier in the loop do not affect statements that follow them, and the statements can be executed in parallel. (Wikipedia)

Here is a simple loop unrolling example:

A procedure in a computer program is to delete 100 items from a collection. This is accomplished by means of a `for`-loop which calls the function `delete(item_number)`:

```
for (int x = 0; x < 100; x++)
{
    delete(x);
}
```

If this part of the program is to be optimized, and the overhead of the loop requires significant resources compared to that for `delete(x)`, loop unwinding can be used to speed it up. This will result in an optimized code fragment like:

```
for (int x = 0; x < 100; x += 5)
{
    delete(x);
    delete(x+1);
    delete(x+2);
    delete(x+3);
    delete(x+4);
}
```

As a result of this modification, the new program makes only twenty iterations instead of a hundred. There are now a fifth as many jumps and conditional branches to take, which over many iterations is a significant saving in loop administration time. The final algorithm for processing one row looks like this:

```
for (; i <= step1 - 16; i += 16)
{
    const int* src1i = (const int*)(src1 + i);
    int* dst1i = (int*)(dst1 + i);

    int t0 = ~src1i[0];
    int t1 = ~src1i[1];
    int t2 = ~src1i[2];
    int t3 = ~src1i[3];

    dst1i[0] = t0;
    dst1i[1] = t1;
    dst1i[2] = t2;
    dst1i[3] = t3;
}
```

##### SSE2 Integer Instructions

Because we process the data through an int pointer, we can make use of SSE2 integer arithmetic. This way, we can process four `int` values (sixteen pixels) at the same time. Discussing the entire SSE2 architecture is not the aim of this article; if you are not sure how to use these instructions, please refer to resources such as Intel's Software Developer's Manuals.

To implement inversion using SSE2, we take the optimized inversion code and convert it into SSE2. However, SSE2 does not have a direct NOT instruction. Instead, we make use of the `pandn` (packed AND-NOT) instruction, which computes the bitwise AND of the NOT of its destination with its source. If we fill a register with all ones, `pandn` against it yields exactly the bitwise NOT: Y(x,y) = ~I(x,y) & 0xFF for every byte. There is one extra AND compared to a plain NOT, but it is still worth it, because the instruction operates on 128 bits of data directly in hardware. One can then simply loop through all the pixels and apply the same operation in SSE2.
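The same `pandn` trick can be illustrated with compiler intrinsics, which most compilers map directly to the instruction (a hedged sketch; the helper name is mine, and `_mm_andnot_si128(a, b)` computes `(~a) & b`):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Invert 16 pixels per step: AND-NOT against an all-ones register
 * turns pandn into a plain bitwise NOT. Assumes size % 16 == 0. */
static void invert_sse2(uint8_t *dst, const uint8_t *src, size_t size)
{
    const __m128i ones = _mm_set1_epi32(-1);          /* all bits set */
    for (size_t i = 0; i != size; i += 16) {
        __m128i px  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i inv = _mm_andnot_si128(px, ones);     /* (~px) & 1...1 */
        _mm_storeu_si128((__m128i *)(dst + i), inv);
    }
}
```

Unaligned loads and stores are used here for safety; with buffers known to be 16-byte aligned, `_mm_load_si128`/`_mm_store_si128` (which compile to `movdqa`) are the faster choice.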

Here are the instructions I use for this application:

• `movdqa`: moves 128 bits of aligned data between memory and an `XMM` register, in either direction.
• `pandn`: computes the bitwise AND of the NOT of the destination `XMM` register with the source.
• `loop`: decrements `ecx` and jumps to the label while `ecx` is non-zero (an x86 instruction, not SSE2).

Memory alignment has always been an important issue. To prepare the data for SSE processing, we should align our pointers to at least 16-byte boundaries, which is what `movdqa` requires. In this article, I use `align(32)` to be on the safe side. Alignment is generally beneficial, because the CPU can perform faster memory accesses when reading aligned data.
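For heap buffers, one common way to get the required alignment is `_mm_malloc` (a hedged sketch; the wrapper name is mine, and the article itself aligns its data with `align(32)` instead):

```c
#include <emmintrin.h>  /* _mm_malloc / _mm_free */
#include <stddef.h>

/* Allocate a buffer aligned to a 16-byte boundary so that movdqa /
 * _mm_load_si128 can be used on it safely. */
static unsigned char *alloc_aligned(size_t size)
{
    return (unsigned char *)_mm_malloc(size, 16);
}
```

A buffer obtained this way must be released with `_mm_free`, not `free`.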

Finally, the entire processing of the image can be written as:

```
Traverse:
    // unroll the loop four times: 64 bytes per iteration,
    // as much as four XMM registers can hold.
    // xmm7 is preloaded with all ones; ecx holds the iteration count.
    movdqa xmm0, [esi]
    movdqa xmm1, [esi+16]
    movdqa xmm2, [esi+32]
    movdqa xmm3, [esi+48]

    // 255 - pixel for 64 pixels: pandn computes (~dest) & src
    pandn  xmm0, xmm7
    pandn  xmm1, xmm7
    pandn  xmm2, xmm7
    pandn  xmm3, xmm7

    // write back the computed content
    movdqa [edi], xmm0
    movdqa [edi+16], xmm1
    movdqa [edi+32], xmm2
    movdqa [edi+48], xmm3

    // advance the source and destination pointers
    add esi, 64
    add edi, 64

    // decrement ecx and continue while non-zero
    loop Traverse
```

## Conclusion

Please note that C-level optimizations are quite effective for applications that need performance. Only dig into assembly code if you really need to squeeze out more; for some problems, you may not gain any performance at all.

## Issues

Images with widths that are not divisible by 4 will have some leftover pixels at the end of each row. I left the inversion of these leftover pixels out, so it is best if you add this functionality yourself, or simply avoid processing images whose sizes are not divisible by 4. The code also assumes that the image "C:\\Data\\Waterfall.jpg" exists on your hard drive; if it does not, modify the first line inside `main` to read an image you like.
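One way to patch this (a hedged sketch, with a helper name of my choosing): process the bulk of each row with the wide path, then finish the remaining pixels with a plain scalar loop:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Invert a row of arbitrary length: 4 bytes at a time for the bulk,
 * then a plain byte loop for the 0-3 leftover pixels. */
static void invert_row(unsigned char *dst, const unsigned char *src,
                       size_t size)
{
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {     /* optimized bulk */
        uint32_t w;
        memcpy(&w, src + i, 4);
        w = ~w;
        memcpy(dst + i, &w, 4);
    }
    for (; i < size; i++)               /* scalar tail */
        dst[i] = (unsigned char)~src[i];
}
```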

CEO Gravi Information Technologies and Consultancy Ltd
Turkey
Currently also an MSc student at the Technical University of Munich, I have been developing practical computer vision applications for more than 5 years. I design real-time solutions to industrial and practical vision problems, both 3D and 2D, and am very interested in developing algorithms in C relating math and vision.

"Great minds never die, they just tend to infinity..."
