Click here to Skip to main content
11,639,861 members (60,846 online)
Click here to Skip to main content

Fast Image Blurring with CUDA

, 10 Sep 2009 CPOL 62.9K 2.7K 32
Rate this:
Please Sign up or sign in to vote.
High performance and good quality of image blurring

Introduction

I take an existing image blurring algorithm, developed by Mario Klingemann, to demonstrate the way to do image blurring with CUDA. Please visit http://incubator.quasimondo.com for more details about the stack blurring algorithm. I reckon that Stack Blur is already the fastest blur with a good looking algorithm. In this sample, there are some minor code changes with CUDA for this algorithm and we see how CUDA can speed up the performance.

Background

Blur image which is always a time consuming task. Blurring quality and processing speed cannot always have good performance for both. CUDA might help programmers resolve this issue. This code is tested on Windows 7 with NVIDIA GeForce G210M.

Stack Blur in a Conventional Way

Stack Blur needs to process image rows first and then columns. There are two while loops to process rows and columns of image consecutively. The time consuming parts are the outer while loops for image rows and columns. Therefore, they are the targets to be modified by CUDA.

Parameters

  • unsigned long* pImage [in/out]: 32-bit image buffer
  • unsigned w [in]: Image width
  • unsigned h [in]: Image height
  • unsigned r [in]: Blur level
void stack_blur_rgba32(unsigned long* pImage, unsigned w, unsigned h, unsigned r)
{
    // ...
    // Process image rows, this outer while-loop will be parallel computed by CUDA instead
    do
    {
        // Get input and output weights
        do
        {
            // ...
        }
        while(++i <= r);
        
        // Blur image rows
        do
        {
            // ...        
        }
        while(++x < w);      
    }
    while(++y < h);
    
    // Process image columns, this outer while-loop will be parallel 
    // computed by CUDA instead
    do
    {
        // Get input and output weights
        do
        {
            // ...
        }
        while(++i <= r);
        
        // Blur image columns
        do
        {
            // ...
        }
        while(++y < h);      
    }
    while(++x < w);  
    // ...
}

In this sample, processing time is 0.063553 (ms), tested by CPU. We will then see how performance can be sped up by CUDA.

Stack Blur with CUDA

The most significant parts are stack buffers which need to have independent buffers for each row and column. Because threads are running in parallel, stack buffers have to be separated and used by individual rows and columns. The rest of the code has nothing much changed except for the CUDA codes.

Parameters

  • uchar4* pImage [in/out]: 32-bit image buffer
  • uchar4* stack_data_horiz_ptr [in]: Stack buffer for rows
  • uchar4* stack_data_vert_ptr [in]: Stack buffer for columns
  • unsigned w [in]: Image width
  • unsigned h [in]: Image height
  • unsigned r [in]: Blur level
  • bool bMapped [in]: Flag of support of "host memory mapping to device memory"
void StackBlur_GPU(uchar4* pImage, uchar4* stack_data_horiz_ptr, 
	uchar4* stack_data_vert_ptr, unsigned w, unsigned h, unsigned r, bool bMapped)
{
    unsigned div = ((r + r) + 1);
    unsigned divLenHoriz = (sizeof(uchar4) * div * h);
    unsigned divLenVert = (sizeof(uchar4) * div * w);
    unsigned sizeImage = ((w * h) << 2);
    uchar4* stack_dev_horiz_ptr = NULL;
    uchar4* stack_dev_vert_ptr = NULL;
    uchar4* pImage_dev_ptr = NULL;
    unsigned mul_sum = *(stack_blur8_mul + r);
    unsigned shr_sum = *(stack_blur8_shr + r);

    if (false == bMapped)
    {
        cudaMalloc((void**)&stack_dev_horiz_ptr, divLenHoriz);
        cudaMalloc((void**)&stack_dev_vert_ptr, divLenVert);
        cudaMalloc((void**)&pImage_dev_ptr, sizeImage);

        cudaMemcpy(pImage_dev_ptr, pImage, sizeImage, cudaMemcpyHostToDevice);
    }
    else
    {
        cudaHostGetDevicePointer((void**)&stack_dev_horiz_ptr, 
					(void*)stack_data_horiz_ptr, 0);
        cudaHostGetDevicePointer((void**)&stack_dev_vert_ptr, 
					(void*)stack_data_vert_ptr, 0);
        cudaHostGetDevicePointer((void**)&pImage_dev_ptr, (void*)pImage, 0);
    }

    StackBlurHorizontal_Device<<<(unsigned)::ceil((float)(h + 1) / 
	(float)_THREADS), _THREADS>>>(pImage_dev_ptr, stack_dev_horiz_ptr, 
	mul_sum, shr_sum, w, h, r);
    StackBlurVertical_Device<<<(unsigned)::ceil((float)(w + 1) / 
	(float)_THREADS), _THREADS>>>(pImage_dev_ptr, stack_dev_vert_ptr, 
	mul_sum, shr_sum, w, h, r);

    if (false == bMapped)
    {
        cudaMemcpy(pImage, pImage_dev_ptr, sizeImage, cudaMemcpyDeviceToHost);

        cudaFree( stack_dev_horiz_ptr );
        stack_dev_horiz_ptr = NULL;

        cudaFree( stack_dev_vert_ptr );
        stack_dev_vert_ptr = NULL;

        cudaFree( pImage_dev_ptr );
        pImage_dev_ptr = NULL;        
    }
}

Parameters

  • unsigned long* lpHostBuf [in/out]: 32-bit image buffer
  • unsigned w [in]: Image width
  • unsigned h [in]: Image height
  • unsigned r [in]: Blur level
  • unsigned bMapped [in]: Flag of support of "host memory mapping to device memory"
void StackBlur_Device(unsigned long* lpHostBuf, unsigned w, 
			unsigned h, unsigned r, bool bMapped)
{
    if (NULL == lpHostBuf)
    {
        return;
    }
    else if ((r < 1) || (w < 1) || (h < 1))
    {
        return;
    }
    else if (r > 254)
    {
        r = 254;
    }

    uchar4* stack_data_horiz_ptr = NULL;
    uchar4* stack_data_vert_ptr = NULL;
    unsigned div = ((r + r) + 1);
    unsigned divLenHoriz = (sizeof(uchar4) * div * h);
    unsigned divLenVert = (sizeof(uchar4) * div * w);

    if (false == bMapped)
    {
        stack_data_horiz_ptr = (uchar4*)malloc( divLenHoriz );
        stack_data_vert_ptr = (uchar4*)malloc( divLenVert );
    }
    else
    {
        cudaHostAlloc((void**)&stack_data_horiz_ptr, divLenHoriz, cudaHostAllocMapped);
        cudaHostAlloc((void**)&stack_data_vert_ptr, divLenVert, cudaHostAllocMapped);
    }

    StackBlur_GPU((uchar4*)lpHostBuf, stack_data_horiz_ptr, 
			stack_data_vert_ptr, w, h, r, bMapped);
    DebugPrintf("StackBlur_GPU: %x\n", cudaGetLastError());

    if (false == bMapped)
    {
        free( stack_data_horiz_ptr );
        stack_data_horiz_ptr = NULL;

        free( stack_data_vert_ptr );
        stack_data_vert_ptr = NULL;
    }
    else
    {
        cudaFreeHost( stack_data_horiz_ptr );
        stack_data_horiz_ptr = NULL;

        cudaFreeHost( stack_data_vert_ptr );
        stack_data_vert_ptr = NULL;
    }
}

In this sample, the processing time is only 0.000150 (ms), tested by GPU. The processing time with CUDA is 300x or more faster than the conventional way.

Does This Code Work?!

  1. Check this link and see what NVIDIA GPUs support CUDA.
  2. This code will spew debug messages. Download dbgview from this link
  3. Higher threads per block may not work for some NVIDIA GPUs. Please change _THREADS to a smaller value and re-compile this code.

Points of Interest

The result apparently tells that parallel computing with CUDA is amazing.

History

  • Sept. 8, 2009: Initial release
  • Sept. 10, 2009: Changed value of threads per block to 256, adapted to most of NVIDIA GPUs

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

ChaoJui
Software Developer (Senior) http://home.so-net.net.tw/lioucy
Taiwan Taiwan
I've been a coding guy for 15 years, using C/C++ and assembly. Also using database to do information presenation with graphics applications.

You may also be interested in...

Comments and Discussions

 
GeneralMy vote of 1 Pin
Member 1119520430-Oct-14 23:41
memberMember 1119520430-Oct-14 23:41 
BugThe timing on this is incorrect as noted above & VS2010 + CUDA 5.0 Support Pin
Jhomes4-Feb-13 14:36
memberJhomes4-Feb-13 14:36 
BugRe: The timing on this is incorrect as noted above & VS2010 + CUDA 5.0 Support Pin
ido00015-Feb-13 0:06
memberido00015-Feb-13 0:06 
GeneralRe: The timing on this is incorrect as noted above & VS2010 + CUDA 5.0 Support Pin
Jhomes13-Mar-13 17:35
memberJhomes13-Mar-13 17:35 
Buggpu mode do not work for win7 x64 with latest driver\sdk\toolkit. Pin
mrgloom22-Apr-12 8:02
membermrgloom22-Apr-12 8:02 
GeneralRe: gpu mode do not work for win7 x64 with latest driver\sdk\toolkit. Pin
h_nik00716-Nov-12 13:35
memberh_nik00716-Nov-12 13:35 
GeneralMy vote of 5 Pin
mbue14-Apr-11 11:44
membermbue14-Apr-11 11:44 
GeneralVS2010 Pin
Majid Shahabfar9-Jul-10 9:56
memberMajid Shahabfar9-Jul-10 9:56 
GeneralRe: VS2010 Pin
Jeffry Torres Alvarez2-Oct-13 1:53
memberJeffry Torres Alvarez2-Oct-13 1:53 
Questionif my laptop vdo card not nvdia,It work? Pin
poi1191-May-10 9:35
memberpoi1191-May-10 9:35 
I will test your program for implement my project.Thank
GeneralElapsed time problems Pin
d3v14-Apr-10 8:58
memberd3v14-Apr-10 8:58 
GeneralRe: Elapsed time problems [modified] Pin
Member 770296525-Feb-11 9:46
memberMember 770296525-Feb-11 9:46 
QuestionAny pixel differences between CUDA & non-CUDA image? Pin
H. Gohel15-Sep-09 14:39
memberH. Gohel15-Sep-09 14:39 
AnswerRe: Any pixel differences between CUDA & non-CUDA image? Pin
ChaoJui16-Sep-09 2:06
memberChaoJui16-Sep-09 2:06 
GeneralRe: Any pixel differences between CUDA & non-CUDA image? Pin
Julekmen20-Oct-09 23:34
memberJulekmen20-Oct-09 23:34 
GeneralRe: Any pixel differences between CUDA & non-CUDA image? Pin
ChaoJui22-Oct-09 4:39
memberChaoJui22-Oct-09 4:39 
Generalcutil.h can't be found Pin
Collin Jasnoch15-Sep-09 6:12
memberCollin Jasnoch15-Sep-09 6:12 
GeneralRe: cutil.h can't be found [modified] Pin
ChaoJui15-Sep-09 23:49
memberChaoJui15-Sep-09 23:49 
GeneralRe: cutil.h can't be found Pin
Collin Jasnoch16-Sep-09 3:26
memberCollin Jasnoch16-Sep-09 3:26 
GeneralRe: cutil.h can't be found Pin
ChaoJui16-Sep-09 4:14
memberChaoJui16-Sep-09 4:14 
GeneralRe: cutil.h can't be found Pin
thorolfo7-Aug-11 0:08
memberthorolfo7-Aug-11 0:08 
GeneralDoes not work with NVidia G8400 mobile (Win7, latest drivers) Pin
xliqz8-Sep-09 4:17
memberxliqz8-Sep-09 4:17 
GeneralRe: Does not work with NVidia G8400 mobile (Win7, latest drivers) [modified] Pin
ChaoJui8-Sep-09 4:35
memberChaoJui8-Sep-09 4:35 
GeneralRe: Does not work with NVidia G8400 mobile (Win7, latest drivers) Pin
xliqz9-Sep-09 18:46
memberxliqz9-Sep-09 18:46 
GeneralRe: Does not work with NVidia G8400 mobile (Win7, latest drivers) Pin
xliqz10-Sep-09 5:34
memberxliqz10-Sep-09 5:34 
GeneralRe: Does not work with NVidia G8400 mobile (Win7, latest drivers) [modified] Pin
ChaoJui10-Sep-09 15:18
memberChaoJui10-Sep-09 15:18 
GeneralRe: Does not work with NVidia G8400 mobile (Win7, latest drivers) Pin
xliqz12-Sep-09 22:03
memberxliqz12-Sep-09 22:03 
GeneralRe: Does not work with NVidia G8400 mobile (Win7, latest drivers) Pin
ChaoJui12-Sep-09 23:06
memberChaoJui12-Sep-09 23:06 
GeneralNot work ! Pin
Apex Liu8-Sep-09 3:21
memberApex Liu8-Sep-09 3:21 
GeneralRe: Not work ! [modified] Pin
ChaoJui8-Sep-09 3:48
memberChaoJui8-Sep-09 3:48 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web03 | 2.8.150731.1 | Last Updated 10 Sep 2009
Article Copyright 2009 by ChaoJui
Everything else Copyright © CodeProject, 1999-2015
Layout: fixed | fluid