Posted 25 Jul 2017

Image Blending in Windows - an Assembly Language Approach

CMalcheski, 4 Oct 2017
Ditching the slowdowns and applying a little elbow grease can create dramatic speed improvements in GDI image blending.

[Once again, the code snippets in this article refuse to format correctly.  When entering the article text, the scroll bars appear and all is well.  When the article is submitted, the scroll bars disappear and all the code snippets suddenly form a fanatical devotion to a right margin that's too narrow, and unwanted line breaks begin appearing all over the snippets.  The problem has reared its ugly head in other articles I've written, and better minds than mine have worked on it - all to no avail.  The problem has not gone unnoticed, and is not the result of lazy indifference.]

[Worse, the article editor appears to be forcing the image URL references to https:// after they're entered as http://.  It's not known why this occurs but as the author I know of no way I can fix it.  Images display fine in the editor; URL's are auto-modified to https:// against my will after I submit the article.  To counter this as best I can, each image caption includes the full URL where the image lives.]

To C, or Not to C?

Quality code from a C++ perspective could well be horribly wasteful and inefficient from the CPU’s viewpoint.  If you’re not playing directly in the CPU’s arena, you’re not likely to come anywhere near maximizing its potential.  With high-demand tasks such as image blending, you can’t rely on the Windows API – especially GDI functions – to get the job done efficiently.  Windows is designed to be all things to all developers; it places ritual far ahead of performance.  If you’re going to eliminate bottlenecks efficiently, you can’t go wrong by turning to assembly language. 

The caveat here is that intrinsics are not a safe alternative.  To squeeze every last iota of performance out of the CPU, you can’t write generic code that doesn’t know or care whether it’s running on a lightspeed hundredth generation i9 or a clunky lightweight ARM processor.  In short, you can’t milk quality out of shortcuts.  To get the best performance, sometimes you have to work for it: you must tailor your work to the hardware it’s running on.  In the end, that work pays off in spades.

The Pitfalls of BitBlt

The first thing to note is that when you create a bitmap with CreateCompatibleBitmap, the very limited kernel memory is used to store that image.  If Microsoft documents this fact at all, that discussion is well hidden from prying eyes.  If such documentation exists, I never saw it.  (And it may well exist; maybe I have no excuse for never having seen it!)  When creating bitmaps on the fly, I’ve hit the “insufficient memory” wall exactly once; that experience was what led me to begin scouring development forums to track down where I was going wrong.  That was when I learned about which bitmaps go into which memory areas, and in the course of adjusting for it, that particular problem never returned.

CreateDIBSection uses memory off the current process’ heap; making that switch solved my resource problem once and for all.  I developed my own local function, CreateLocalBitmap, which takes the same parameters as CreateCompatibleBitmap (with one extra added) but creates a DIB section instead – and its functionality includes returning the location of the bitmap data.  That little data pointer is worked with directly, across the board, in the image blending process discussed in this article.  If you load an image from a resource, you can also use GetDIBits to retrieve a bitmap's data.  (The documentation for GetBitmapBits says that function is deprecated; use GetDIBits instead.)

My locally declared function CreateLocalBitmap always creates a 32-bit bitmap.  This is done because 4 bytes per pixel work exceedingly well for processing through XMM registers.  With a 24-bit format, you just can’t do it.  You’ll have to convert the 24-bit format to a 32-bit layout first – a taxing process that is wholly unnecessary, given a little preparation that creates 32-bit bitmaps from the outset.
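To give a sense of what that conversion involves, here is a minimal sketch in portable C.  The function name expand_24_to_32 is hypothetical (it is not part of this article's code), and the sketch assumes the 24-bit rows are padded to a multiple of four bytes, as they are in a DIB, with bytes in blue, green, red order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand 24-bit BGR pixels to 32-bit pixels.
   Assumes each 24-bit row is padded to a multiple of 4 bytes, as in a DIB. */
void expand_24_to_32(const uint8_t *src, uint32_t *dst, size_t width, size_t height)
{
    assert(src != NULL && dst != NULL);
    size_t src_stride = (width * 3 + 3) & ~(size_t)3;    /* padded row size in bytes */
    for (size_t y = 0; y < height; ++y) {
        const uint8_t *row = src + y * src_stride;
        for (size_t x = 0; x < width; ++x) {
            uint32_t b = row[x * 3 + 0];
            uint32_t g = row[x * 3 + 1];
            uint32_t r = row[x * 3 + 2];
            dst[y * width + x] = (r << 16) | (g << 8) | b;   /* fourth byte left zero */
        }
    }
}
```

Every pixel is touched once and every row needs its own stride arithmetic, which is the extra pass that creating 32-bit bitmaps from the outset avoids.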

[NOTE: the following two paragraphs were accurate at the time they were written.  They were inserted as a direct result of experiencing a forced change from 32 to 24 bits in a bitmap I was working with.  However in the course of creating the sample images for this article, the effect appears to have vanished.  I used BitBlt to copy from a LoadImage 24-bit format to a CreateDIBSection 32-bit format, and the 32-bit format remained.  So take the following two paragraphs with a grain of salt - in true Windows tradition, it could happen but doesn't always.  Let your own experience guide you.]

BitBlt likes 24-bit formats, and it’s quite aggressive about forcing its views on every bitmap passing through it.  Even if you’re blitting to a bitmap or DIB section that was explicitly created as a 32-bit image, BitBlt will overwrite the 32-bit data with 24-bit values.  The amount of allocated bitmap memory remains unchanged from its initial size, but the expected 32-bit data format does not survive a call to BitBlt.  The function is completely unaware of alpha channels and will not hesitate to perform genocide on every byte of bitmap data referring to one.

Instead, AlphaBlend, designed to process alpha channels, respects the sovereignty of alpha channel bytes and leaves bitmaps formatted as 32-bit entities.  Ideally, if you can get the function to work at all under Windows 10, it’s called with an opacity value of 255 (fully opaque) to emulate BitBlt without destroying the format of the bitmap data.  What happens when a 24-bit image is AlphaBlended onto a 32-bit destination is a question I can’t answer – my last attempt to use that function resulted in an error that cannot and should not occur, and that was the end of my toying with it.  After 22 years of developing under Windows (almost exclusively in assembly language), my intolerance for Windows API failures has grown to the point where I will no longer research most bugs and inexcusably recalcitrant functions – I simply go straight to a workaround.  For example, in the case of filling a bitmap with a solid color, the silly and quite stupid (my opinion) process of creating a brush, calling FillRect, and deleting the brush, not to mention creating a memory DC to hold the bitmap while this is done, selecting the bitmap into it, then deselecting the bitmap and finally deleting the DC – is all so far over the line of sheer insanity that I couldn’t even begin to explain the thinking behind it.  Instead, seven simple assembly language instructions fill the bitmap.  No DC is required, no brush, no objects to select, deselect, or delete.  As long as you have that bitmap data pointer, all is well.

lea     rdi, BitmapData
mov     rax, BitmapWidth
mov     rcx, BitmapHeight
mul     rcx
mov     rcx, rax
mov     rax, FillColor
rep     stosd

Done … and … done.  It’s just that simple.  There’s no reason to make it any more complicated than that.
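For readers who want to check the fill against something familiar, the same operation can be sketched in portable C.  The name fill_bitmap32 is hypothetical; the sketch assumes a 32-bit bitmap whose rows are contiguous, which is what the rep stosd version assumes as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill a 32-bit bitmap with one color.
   Writes width * height dwords, the same count the rep stosd loop uses. */
void fill_bitmap32(uint32_t *bits, size_t width, size_t height, uint32_t color)
{
    assert(bits != NULL);
    size_t count = width * height;      /* one dword per pixel */
    for (size_t i = 0; i < count; ++i)
        bits[i] = color;
}
```

Calling fill_bitmap32(bits, 640, 480, 0x00336699) would write 307,200 dwords, with no DC, brush, or selected object involved.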

Blending

This article outlines a process for blending two bitmaps of 32-bit format.  Alpha values are not used (premultiplied or otherwise); the blend is the final result of merging two images.  The source opacity is passed as an integer percentage (the fractional opacity times one hundred, so 20 means 20%), which is converted within the blend function to a single-precision float; the destination opacity is presumed to be one minus that value.  For example, declaring a 20% opacity for the source bitmap would place the destination opacity at 80%.  I have not found a way to do this with GDI (everything seems to be additive), and I won’t go beyond that to get the job done.  DirectX requires a relatively enormous setup time that I find difficult to justify in most apps that I create.  Moving beyond that to something like DirectComposition or Direct2D is for me unthinkable – I liken it to bringing a full-sized aircraft carrier to the grocery store just to haul your groceries home, when a little red wagon would do just as well.  That doesn’t mean those APIs are that powerful – they’re just that needlessly complex (and embarrassingly bloated).
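The blend rule itself is small enough to state as a scalar model.  The following C sketch is only a reference for a single 8-bit channel; the name blend_channel is hypothetical, and it rounds ties upward, whereas the assembly relies on the processor's current rounding mode, so the two can differ on an exact .5 result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the blend rule for one 8-bit channel.
   percent is the bitmap one opacity as a whole number, expected in 0..100. */
int blend_channel(uint8_t one, uint8_t two, int percent)
{
    assert(percent >= 0 && percent <= 100);
    float f1 = (float)percent / 100.0f;     /* bitmap one weight */
    float f2 = 1.0f - f1;                   /* bitmap two weight */
    float sum = (float)one * f1 + (float)two * f2;
    return (int)(sum + 0.5f);               /* round to nearest, ties go up */
}
```

With a 20% source opacity, a channel value of 200 in bitmap one and 100 in bitmap two blends to 200 * 0.2 + 100 * 0.8 = 120.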

For the process outlined here, the focus is on speed, and to maximize speed, two main rules must be followed:

  1. Avoid memory access.
  2. Use XMM registers to process all color component values in parallel – this results in one multiplication per pixel for each image, not three.

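YMM registers offer little benefit for the loop as written.  A YMM register is 256 bits wide and holds eight single-precision values, but each pass through the loop handles one pixel, whose four channels widen to four 32-bit values and fill an XMM register exactly.  For the process of blending images, 32-bit values are ultimately moved down to 8 bits for each color channel, so one pixel per XMM register is the natural granularity for this code.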

Thinking it Through

Forethought is the precursor of performance.  The impact of tailoring the task at hand for the specifics of the processor that task will run on cannot be overstated.  A little analysis and planning can go a long way in creating unprecedented speed improvements; customizing code for the hardware it will run on has no parallel in its virtue.

In the actual blend loop, for optimal performance, there should be only three occasions when memory is accessed: to read each pixel from image 1, to read each pixel from image 2, and to write the final blended result.  Complicating the process is the inherent need to separate each pixel into color channels before performing the blend.  After the blend is complete, the data again needs to be converted back from float values, reassembled into a single 32-bit value, and stored at its destination. 

Logically, the process is so simple that most developers (if they don’t hand the task off to Windows) simply run a loop to execute it.  The problem is that in C++ or any other language, the simplicity of the algorithm is usually mistaken as equating to speed in execution, with the particulars of actual implementation not being worth much attention.

The function created in this article will take two incoming bitmaps – one and two – and write the final blend data over the first bitmap’s incoming data.  Since everything transpires across the CPU’s registers, no memory needs to be allocated (or subsequently freed) to achieve the blend.

The real key to the process lies in breaking up the pixel data into color channels, then getting that data into an XMM register where the required multiplications can be executed. 

The multiplier values will not change across the life of the loop, so they should be placed into XMM registers directly from the registers they’re passed to the blend function in, then left alone.

On entry into the function:

RCX      > bitmap one bits
RDX      > bitmap two bits
R8       = bitmap one blend %, passed as an integer (the fractional blend factor multiplied by one hundred)
R9       = bitmap width
[rsp+40] = bitmap height

The reason for passing the blend % as an integer is that the 64-bit calling convention becomes confusing and convoluted when trying to pass a float.  It can certainly be done (most languages do it as a matter of course), but under the Windows 64-bit convention a floating point argument arrives in the XMM register that matches its position in the parameter list, so as the third parameter the blend factor would come into the function in XMM2.  You can do it if you want to; the code in this article won’t.  As a general rule, I’ve found that a lot of mistakes and debugging time are saved by utilizing XMM registers in some kind of sensible order where they’re not so easily lost track of.  There are sixteen of the things and it’s quite easy to forget who’s doing what and why if you implement them at random.  In the end, the entire issue is purely aesthetic, but I personally find it far preferable to doing things any other way: parameter 3, R8, holds the source bitmap blend factor times a hundred (a whole-number percent) - allowing an integer value to be passed in a general purpose register. 

For reasons discussed shortly, there is no danger of overflowing any 8-bit color channel anywhere in the blend process.

On entry to the function, immediately point RDI at the first bitmap’s data, and RSI at the second.  The assembly STOSD instruction writes to the location where RDI is pointing, and bitmap one is where the blended output is written.  So it may as well be set up as the first order of business.

mov rdi, rcx ; On entry, RCX points to bitmap one data – copy that pointer to RDI, the output
mov rsi, rdx ; On entry, RDX points to bitmap two data

With this done, it’s time to decide on XMM register usage.  XMM0 will get data from bitmap one; XMM1 will hold the blend factor for bitmap one.  XMM2 will get data from bitmap two, and XMM3 will hold bitmap two’s blend factor.

There’s little left to do beyond setting up XMM1 and XMM3 with the blend factor values before diving into the process loop.  Of critical importance is one final note: it would be insane to divide every output pixel by 100.0.  After moving the incoming R8 into XMM1 and converting it to floating point, it’s divided by 100.0 immediately, before the loop begins, so that the division – the slowest instruction on Earth – only needs to be performed once.

The following variables are declared as data:

align       16
one_hundred real4 4 dup ( 100.0 )

Then the entry code for the function:

mov       rdi, rcx                    ; Set write pointer @ bitmap 1 
mov       rsi, rdx                    ; Set read pointer @ bitmap 2
movd      xmm1, r8                    ; Move the incoming bitmap 1 blend factor 
cvtdq2ps  xmm1, xmm1                  ; Convert value to floating point 
divss     xmm1, real4 ptr one_hundred ; Divide the blend factor by 100 
shufps    xmm1, xmm1, 0               ; Copy the low dword of XMM1 into all four XMM1 dwords 

Setting XMM3 to (1 - XMM1) requires just a little creativity.  There are no instructions for moving immediate data (data that’s encoded with an instruction) into an XMM register, and loading 1.0 from memory is a last resort. 

The PCMPEQD instruction compares whatever random data is in XMM3 with itself as packed integers, checking for XMM3 = XMM3.  The result must be true, regardless of what the register holds, so all bits in XMM3 are set.  (A floating point compare such as CMPPS is not safe here: if a dword of the leftover data happens to decode as a NaN, the comparison is false, because a NaN never equals itself.)  Shifting each dword right by 31 bits leaves an integer 1 in each 32-bit division of XMM3, and CVTDQ2PS converts that to 1.0.

pcmpeqd   xmm3, xmm3                  ; Integer compare XMM3 = XMM3 to set all bits of register
psrld     xmm3, 31                    ; Right align bit 31, shift out all other bits
cvtdq2ps  xmm3, xmm3                  ; Convert to floating point value
subps     xmm3, xmm1                  ; Subtract bitmap one blend factor from 1.0

The blend factors are now correctly set as packed (present and repeated in all four 32-bit sections) floating point values in XMM1 for image one, and XMM3 for image two.
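The way 1.0 is produced without a memory load can be checked with a small scalar model.  This C sketch follows one 32-bit lane through the shift and the conversion; the names lane_one and bitmap_two_weight are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 32-bit lane: all bits set, shifted right by 31,
   then converted from integer to float, as PSRLD and CVTDQ2PS do. */
float lane_one(void)
{
    uint32_t lane = 0xFFFFFFFFu;    /* lane contents after an all-true compare */
    lane >>= 31;                    /* logical shift leaves integer 1 */
    return (float)(int32_t)lane;    /* integer 1 becomes 1.0f */
}

/* Bitmap two weight for a given bitmap one percent (0..100). */
float bitmap_two_weight(int percent)
{
    assert(percent >= 0 && percent <= 100);
    float f1 = (float)percent / 100.0f;     /* the single division */
    return lane_one() - f1;                 /* 1.0 minus bitmap one weight */
}
```

For a bitmap one factor of 25, the model yields 0.25 for bitmap one and 0.75 for bitmap two, matching what XMM1 and XMM3 should hold in every lane.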

Breaking Down and Building Up

The next parts of the process needing special attention are separating and recombining the color channel values.  Fortunately, SSE will handle most of this for you, at least on the uptake.  Curiously, there is no such convenience when writing each final blended pixel.

Things would be much simpler if the blend factor didn’t come into play.  All operations could be performed with SSE integer instructions, and there would be no need for ever converting values to and from floating point.  But the blend factor is the entire reason for the function, and it has to be stored as a floating point value since within the function it’s always <= 1.0.  So there’s little choice in the matter but to convert the color data to floats before multiplying.

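The following instructions load the source data into XMM0 (bitmap one) and XMM2 (bitmap two) and convert it to floating point.  PMOVZXBD is an SSE4.1 instruction, so the processor must support SSE4.1: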

pmovzxbd       xmm0, dword ptr [ rdi ]      ; Load the three color channels & the alpha channel: bitmap one
cvtdq2ps       xmm0, xmm0                   ; Convert values to floating point
pmovzxbd       xmm2, dword ptr [ rsi ]      ; Load the three color channels & the alpha channel: bitmap two
cvtdq2ps       xmm2, xmm2                   ; Convert values to floating point

The above instructions separate each byte from the dword value loaded and conveniently place the data into consecutive dword locations within an XMM register.  This saves oodles of execution time over having to do it manually.
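What the load and conversion produce for a single pixel can be modelled in a few lines of C.  The name unpack_pixel is hypothetical; the lane order shown assumes the usual 32-bit DIB layout, where the lowest byte of each pixel is blue.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of PMOVZXBD followed by CVTDQ2PS for one pixel:
   each byte of the dword is zero-extended into its own lane, then made a float.
   Lane 0 is the lowest byte (blue in a 32-bit DIB), lane 3 the highest. */
void unpack_pixel(uint32_t pixel, float lanes[4])
{
    assert(lanes != 0);
    for (int i = 0; i < 4; ++i)
        lanes[i] = (float)((pixel >> (8 * i)) & 0xFFu);
}
```

A pixel of 0x11223344 unpacks to 68.0, 51.0, 34.0, and 17.0 in lanes 0 through 3.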

Next, the multiplication occurs:

mulps          xmm0, xmm1                   ; Multiply color data by bitmap one blend factor
mulps          xmm2, xmm3                   ; Multiply color data by bitmap two blend factor

Add the two values together:

addps          xmm0, xmm2                   ; Add the two values

Finally, build the output dword, store it, and advance both bitmap data pointers.

cvtps2dq       xmm0, xmm0                   ; Convert result back to integer values
shufps         xmm0, xmm0, 4Eh              ; Shift result 2 slots or 64 bits (either direction, same result)
movd           ebx, xmm0                    ; Get the red channel value
shufps         xmm0, xmm0, 93h              ; Shift result 1 slot left
movd           eax, xmm0                    ; Get the green channel value
shl            ebx, 8                       ; Shift the accumulator left 1 byte
mov            bl, al                       ; Set green in BL
shufps         xmm0, xmm0, 93h              ; Shift result 1 slot left
movd           eax, xmm0                    ; Get the blue channel value
shl            ebx, 8                       ; Shift the accumulator left 1 byte
or             eax, ebx                     ; OR the red and green with the blue currently in AL

stosd                                       ; Store the final result and advance write pointer (bitmap 1) 4 bytes
add            rsi, 4                       ; Advance the read pointer (bitmap 2) 4 bytes

The above code comprises the entirety of the merging process.
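The repacking sequence can be traced with a scalar model that mirrors the register moves one for one.  This is a sketch in C, with the hypothetical name pack_pixel; the lanes are given in register order (blue, green, red, alpha) as they stand after the conversion back to integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the repacking sequence.  lanes[] holds the converted
   integers in register order: blue, green, red, alpha. */
uint32_t pack_pixel(const int32_t lanes[4])
{
    assert(lanes != 0);
    uint32_t ebx = (uint32_t)lanes[2];                          /* movd ebx: red   */
    ebx <<= 8;                                                  /* shl ebx, 8      */
    ebx = (ebx & 0xFFFFFF00u) | ((uint32_t)lanes[1] & 0xFFu);   /* mov bl, al: green */
    ebx <<= 8;                                                  /* shl ebx, 8      */
    uint32_t eax = (uint32_t)lanes[0];                          /* movd eax: blue, whole dword */
    return eax | ebx;                                           /* alpha lane is never read */
}
```

Two things stand out in the model: the fourth byte of every output pixel is written as zero, and the blue lane is ORed in as a whole dword, so the sequence depends on every lane already being in the range 0 to 255.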

No memory access occurs inside the loop, other than to load the source values and store the merged result.  The only math performed is the multiplication of each color channel by the blend factor, and even this operation is performed simultaneously on all color channels via the XMM registers.

The complete function is shown below.  Note that the images passed to this function are assumed to be the same size, and are merged in their entirety.  A more sophisticated function that allowed specified areas to be selected out of each image would be considerably more complicated as far as loop controls – individual rows would need to be looped through, then columns within each row, while properly tracking the data pointers through both loops.  For simplicity, this article merges the entire incoming set of bitmap data - its intent is to focus on utilizing assembly language, particularly XMM registers, to perform the actual blend.

The required data for the function:

                   align          16                           ; Required for XMM access
one_hundred        real4          4 dup ( 100.0 )              ; Divisor of 100

Finally, the entire function:

;**********************************************************************************************************************
; BlendImages
;
; In: 1 RCX           > bitmap one bits
;     2 RDX           > bitmap two bits
;     3 R8            = bitmap one blend factor (integer percent, fraction * 100; bitmap 2 gets 1 minus the fraction)
;     4 R9            = bitmap width (same for both bitmaps)
;     5 [ RSP + 28h ] = bitmap height (28h = 40 decimal, measured on entry)
;
; RBX, RSI, and RDI are nonvolatile in the 64-bit calling convention; they are saved in the caller's shadow space.

BlendImages        proc                                        ; Declare the function

                   mov            [ rsp + 8 ], rbx             ; Save nonvolatile RBX in shadow space
                   mov            [ rsp + 16 ], rsi            ; Save nonvolatile RSI in shadow space
                   mov            [ rsp + 24 ], rdi            ; Save nonvolatile RDI in shadow space

                   mov            rdi, rcx                     ; Set write pointer @ bitmap 1
                   mov            rsi, rdx                     ; Set read pointer @ bitmap 2

                   movd           xmm1, r8                     ; Move the incoming bitmap 1 blend factor
                   cvtdq2ps       xmm1, xmm1                   ; Convert value to floating point
                   divss          xmm1, real4 ptr one_hundred  ; Divide the blend factor by 100
                   shufps         xmm1, xmm1, 0                ; Copy the low dword of XMM1 into all four XMM1 dwords

                   pcmpeqd        xmm3, xmm3                   ; Integer compare XMM3 = XMM3 to set all bits (NaN-safe, unlike CMPPS)
                   psrld          xmm3, 31                     ; Right align bit 31, shift out all other bits
                   cvtdq2ps       xmm3, xmm3                   ; Convert to floating point value
                   subps          xmm3, xmm1                   ; Subtract bitmap one blend factor from 1.0

                   mov            rax, [ rsp + 40 ]            ; Get the bitmap height
                   mul            r9                           ; Multiply by width for pixel count
                   mov            rcx, rax                     ; Set the count through the loop

BlendLoop:         pmovzxbd       xmm0, dword ptr [ rdi ]      ; Load the three color channels & the alpha channel: bitmap one
                   cvtdq2ps       xmm0, xmm0                   ; Convert values to floating point
                   pmovzxbd       xmm2, dword ptr [ rsi ]      ; Load the three color channels & the alpha channel: bitmap two
                   cvtdq2ps       xmm2, xmm2                   ; Convert values to floating point

                   mulps          xmm0, xmm1                   ; Multiply color data by bitmap one blend factor
                   mulps          xmm2, xmm3                   ; Multiply color data by bitmap two blend factor

                   addps          xmm0, xmm2                   ; Add the two values

                   ; Build and store the final output dword

                   cvtps2dq       xmm0, xmm0                   ; Convert result back to integer values
                   shufps         xmm0, xmm0, 4Eh              ; Shift result 2 slots or 64 bits (either direction, same result)
                   movd           ebx, xmm0                    ; Get the red channel value
                   shufps         xmm0, xmm0, 93h              ; Shift result 1 slot left
                   movd           eax, xmm0                    ; Get the green channel value
                   shl            ebx, 8                       ; Shift the accumulator left 1 byte
                   mov            bl, al                       ; Set green in BL
                   shufps         xmm0, xmm0, 93h              ; Shift result 1 slot left
                   movd           eax, xmm0                    ; Get the blue channel value
                   shl            ebx, 8                       ; Shift the accumulator left 1 byte
                   or             eax, ebx                     ; OR the red and green with the blue currently in AL

                   stosd                                       ; Store the final result and advance write pointer (bitmap 1) 4 bytes
                   add            rsi, 4                       ; Advance the source pointer (bitmap two)

                   loop           BlendLoop                    ; Return to top of loop

                   mov            rbx, [ rsp + 8 ]             ; Restore RBX
                   mov            rsi, [ rsp + 16 ]            ; Restore RSI
                   mov            rdi, [ rsp + 24 ]            ; Restore RDI

                   xor            rax, rax                     ; Zero the return value

                   ret                                         ; Return from function

BlendImages        endp                                        ; End function declaration

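The function takes its parameters according to the 64-bit calling convention, allowing it to be exported from a DLL and called from any language that allows calling external functions.  Under that convention RBX, RSI, and RDI are nonvolatile, so a caller written in a compiled language expects them to hold their original values when the function returns.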
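When testing the assembly, it helps to have a slow but obvious reference to compare its output against.  The following is a minimal sketch of such a reference in portable C, not a replacement for the function above.  The name blend_images_ref is hypothetical, and the sketch rounds ties upward, whereas CVTPS2DQ follows the processor's current rounding mode, so results could differ by one on an exact .5 value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference for checking the assembly output.
   Blends bitmap two into bitmap one in place; percent is the bitmap one
   weight, 0..100.  Like the assembly, it writes zero to the fourth byte. */
void blend_images_ref(uint32_t *one, const uint32_t *two,
                      int percent, size_t width, size_t height)
{
    assert(one != NULL && two != NULL);
    assert(percent >= 0 && percent <= 100);
    float f1 = (float)percent / 100.0f;     /* divide once, outside the loop */
    float f2 = 1.0f - f1;
    size_t count = width * height;
    for (size_t i = 0; i < count; ++i) {
        uint32_t out = 0;
        for (int shift = 0; shift < 24; shift += 8) {       /* blue, green, red */
            float a = (float)((one[i] >> shift) & 0xFFu);
            float b = (float)((two[i] >> shift) & 0xFFu);
            uint32_t c = (uint32_t)(a * f1 + b * f2 + 0.5f);
            out |= c << shift;
        }
        one[i] = out;
    }
}
```

Running both versions over the same pair of bitmaps and comparing the buffers dword by dword is a quick way to catch a wrong shuffle constant or a swapped width and height.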

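Limit checking is not required as long as the blend factor passed in R8 stays between 0 and 100.  In that range the two multipliers are both between 0 and 1 and sum to 1, so each blended channel is a weighted average of two values no greater than 255 and cannot itself exceed 255.  A factor outside that range is a different matter.  If 125 were passed, the multipliers would be 1.25 and 1 - 1.25 or -0.25.  When both pixels being blended are 255, the bitmap one pixel factors in as 255 * 1.25 or 318.75 and the bitmap two pixel weighs in at 255 * -0.25 or -63.75; summing the two values still yields 255.  But with 255 in bitmap one and 0 in bitmap two, the sum is 318.75, which no longer fits in 8 bits and would spill into the neighboring channel when the output dword is built.  So with the implementation shown, the caller is responsible for keeping the factor within 0 to 100.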
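One case worth checking by hand is a factor above 100 combined with unequal pixel values.  The C sketch below models a single channel with hypothetical names (blend_channel_raw and blend_channel_clamped); it is an illustration of where a clamp would go if out-of-range factors had to be tolerated, not something the assembly in this article does.

```c
#include <stdint.h>

/* Hypothetical per-channel blend with no range check on percent. */
float blend_channel_raw(uint8_t one, uint8_t two, int percent)
{
    float f1 = (float)percent / 100.0f;
    return (float)one * f1 + (float)two * (1.0f - f1);
}

/* Hypothetical variant that clamps the result to 0..255 before rounding. */
int blend_channel_clamped(uint8_t one, uint8_t two, int percent)
{
    float sum = blend_channel_raw(one, two, percent);
    if (sum < 0.0f)   return 0;
    if (sum > 255.0f) return 255;
    return (int)(sum + 0.5f);               /* round to nearest, ties go up */
}
```

With a factor of 125, two pixels of 255 give exactly 255, but 255 against 0 gives 318.75 before clamping and 0 against 255 gives -63.75, so a factor outside 0 to 100 needs either a clamp like this or a check in the caller.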

Certified Runnable

The code in this article has actually been compiled and executed, blending the following images at 45% image 1 (the sunset) and 55% image 2 (the forest); the third (blended) image was captured from actual output generated by this article's source code:

Image 1: http://www.starjourneygames.com/images/sunset.bmp
Image 2: http://www.starjourneygames.com/images/forest.bmp
Image 3: http://www.starjourneygames.com/images/blend.bmp

Conclusion

This article demonstrated how drastically custom tailoring code for the hardware it’s running on can speed up an operation.  Garbage in, garbage out; shortcuts are the fast path to the next bottleneck.

The function outlined here is hardly made up of thousands of lines of code, and it’s almost completely self-contained.  In most (if not all) modern languages, the attendant declarations, includes, and endless compiler instructions far outweigh any savings of time in the coding process.  To use an analogy, every day, more and more workers are replaced with dead weight managers.  The workers (actual functions) may or may not get the job done faster, but the explosion of dead weight management makes the company as a whole larger and larger each year, while producing less and less – and what’s produced is slower and slower to execute.

A forthcoming article covers a much more complex subject: implementing Gaussian blur in assembly language. The article's code will use an 11x11 convolution kernel and will run a single pass.  The speed improvement in that code is nothing short of staggering, resulting from the application of the same logic that was presented in this article: customizing code to maximize what the CPU has to offer.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

CMalcheski
Software Developer (Senior) Star Journey Games, LLC
United States
I began life working as a Morse Intercept Operator in Army Intelligence in 1978. I've worked at a number of aerospace companies (only ever as a word processor; those were the very early days).

All of my software development knowledge came from books. In the 1990's I developed drivers for QIC tape drives. These were for a company that went out of business before I could finish; the entire QIC market dried up shortly thereafter.

I've developed drivers for everything from network packet handlers to disk drives; from keyboards to mice; from tape drives to file system filters. My latest efforts were focused on an NTFS file system parser that worked on raw disk sector reads.

All of my development is under Windows, and is in assembly language exclusively.

In 2008 I worked for Microsoft as an SDET III (Software Development Engineer in Test) on the Ford Sync project, which was relatively new at the time.

I'm currently working on creating the resources required to develop a scaled-down OS in all assembly for the purposes of running 3D games without all the corporate fluff, slowdown, and bureaucracy. On top of this will run an all-assembly 3D game engine. Speed-ups from today's norm are expected to be staggering, but only the final product will stand on its own merit.

Last Updated 4 Oct 2017
Article Copyright 2017 by CMalcheski