Background subtraction (BGS) is a fundamental task in many computer vision applications: segmenting the foreground objects of a video from its background. In the "easiest" case, when the camera is static, the background is often defined as the pixels that stay relatively constant, and the foreground is everything else moving around. Since BGS is often used as a pre-processing step, it should be as fast as possible. Figure 1 gives some results of our implementation.
This article is based on our publication at the IEEE-RIVF 2010 conference. It is intended as a quick start guide for people who want to apply the implementation to their own projects. The full source code and a sample project can also be found on Google Code.
Fig. 1. The segmentation result on VSSN06, video7. The left column shows two frames from the video; the right column shows the corresponding segmentation results. Foreground pixels are marked in white.
In real-life scenarios, the video can be affected by various factors: illumination changes, moving objects in the background, camera motion, and so on. To take these issues into account, numerous BGS methods have been developed in the community; they are often divided into two groups: parametric models and non-parametric models.
The extended Gaussian mixture model (GMM) [2, 3] by Zivkovic and van der Heijden is a parametric approach to BGS that maintains a mixture of Gaussians as the underlying distribution of each pixel's color values. For each new frame, the mean and covariance of each component in the mixture are updated to reflect any change in the pixel values. If the new value is far enough from the mixture (for instance, the Mahalanobis distance from the RGB value to a component's mean is larger than three times the standard deviation), the pixel is considered foreground. Beyond this basic version, Zivkovic and van der Heijden added some control parameters and modified the model updating equations, which not only decreases the processing time per frame but also gives better segmentation. The details and some wild 'n' crazy mathematics can be found in [2, 3].
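The per-pixel test and update can be sketched as follows. This is a simplified, single-Gaussian version of the full mixture, with illustrative names rather than the actual code from the project; it uses Zivkovic's recursive update equations with learning rate alpha:

```cpp
#include <cassert>
#include <cmath>

// Simplified per-pixel model: one Gaussian with a single shared variance
// (the real method keeps a mixture of several such components per pixel).
struct PixelGaussian {
    float mean[3];   // mean RGB value
    float var;       // shared variance for all channels
    float weight;    // mixture weight
};

// Returns true if the new RGB value is foreground: its squared Mahalanobis
// distance to the component mean exceeds (3 sigma)^2.
bool isForeground(const PixelGaussian& g, const float rgb[3]) {
    float d2 = 0.0f;
    for (int c = 0; c < 3; ++c) {
        float d = rgb[c] - g.mean[c];
        d2 += d * d;
    }
    return d2 > 9.0f * g.var;   // 3-sigma test
}

// Online update with learning rate alpha:
//   mean += alpha * (x - mean);   var += alpha * (||x - mean||^2 - var)
void updateGaussian(PixelGaussian& g, const float rgb[3], float alpha) {
    float d2 = 0.0f;
    for (int c = 0; c < 3; ++c) {
        float d = rgb[c] - g.mean[c];   // deviation from the *old* mean
        g.mean[c] += alpha * d;
        d2 += d * d;
    }
    g.var += alpha * (d2 - g.var);
}
```

In the full method, each pixel keeps several such components with weights, and the components with the largest weights form the background model.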
Nonetheless, we observed that Zivkovic's CPU implementation is still not fast enough for demanding real-life applications, especially when the system has to deal with high-resolution video. Moreover, even though Moore's law may resolve the performance problem in the near future, offloading the CPU and transferring the burden of the GMM computation to the GPU is still worth trying. Hence we have implemented a GPU version of this algorithm in CUDA. Experiments showed that our GPU implementation provides an average speedup of 11x. In most of the tests, the frame rate stays above 50 frames per second (FPS), even on HD video formats.
Using the Code
The first thing is to ensure that you have a working system with CUDA and OpenCV installed and configured correctly. The sample project on Google Code includes the headers and libraries for OpenCV 2.2; however, you can use any version you like, as the source code does not depend on a specific version.
In order to use the code, add CvFastBgGMM.h and CvFastBgGMM_ver4.cu to your project. You can also use CvFastBgGMM_ver3.cu or CvFastBgGMM_ver5.cu; each version corresponds to a level of optimization described in the Levels of Optimization section. In our experience, version 4 gives the best results most of the time.
In CvFastBgGMM.h, you have to make sure this line is correct with the version you chose above:
#define CUDAGMM_VERSION 4
CvFastBgGMM is used much like many other classes in OpenCV: create the data structure, set the appropriate parameters, and call a function to update the model for each new frame.
CvFastBgGMMParams* params = cvCreateFastBgGMMParams(FRAME_WIDTH, FRAME_HEIGHT);
params->fAlphaT = 0.008f;
CvFastBgGMM* gmm = cvCreateFastBgGMM(params, videoFrame);
Using CvFastBgGMM in a standard webcam capture loop is also straightforward:
capturer = cvCaptureFromCAM(-1);
videoFrame = cvQueryFrame(capturer);
params = cvCreateFastBgGMMParams(videoFrame->width, videoFrame->height);
params->fAlphaT = 0.005f;
pGMM = cvCreateFastBgGMM(params, videoFrame);
while(key != 'q')
{
    videoFrame = cvQueryFrame(capturer);
    if (!videoFrame)
        break;
    // update the model with the new frame and get the foreground mask
    cvUpdateFastBgGMM(pGMM, videoFrame);
    key = cvWaitKey(10);
}
cvReleaseFastBgGMM(&pGMM);
cvReleaseCapture(&capturer);
The CPU and GPU implementations were executed on the dataset from the VSSN 2006 competition to compare robustness, and on the PETS2009 dataset to evaluate the speedup. All experiments in this section were performed on a system with a Core 2 Quad Q9400 2.6 GHz CPU (4 GB of RAM) and a GeForce 9600GT GPU. Although the CPU implementation is not multithreaded, we still executed it on the quad-core system to get the best performance.
In this section, I will show some of the benchmarking results presented in our paper. For the detailed benchmarking results, please refer to the paper.
Optimization Levels and Occupancy
To estimate the effectiveness of the various optimization techniques (discussed in the Levels of Optimization section), we ran the implementation with different levels of optimization on the VSSN2006 and PETS2009 datasets. The results are shown in Figure 2. In this figure, kernel 1 is the basic GPU implementation using constant memory. Kernel 2 adds memory coalescing on 4-channel input images. Kernel 3 uses the SoA pattern, memory coalescing on the Gaussian parameters, and pinned memory when transferring data between host and device. Kernel 4 is the asynchronous implementation. You will notice that the source package also has a version 5, the "template" version; however, I found that the templates do not improve performance, so it is not included in this test.
Figure 2. Effectiveness of optimization techniques.
On the VSSN 2006 dataset, kernel 4 outperforms all others, but on PETS2009 it is slightly slower than kernel 3. This can be roughly explained by the difference in frame size between the two datasets: PETS2009 has 768x576 sequences, while VSSN2006 contains smaller 384x240 sequences. The asynchronous implementation in kernel 4 needs more time to transfer the larger frames between host and device, leading to a lower frame rate. Nevertheless, we prefer kernel 4 since it is better at interleaving CPU processing with GPU processing. We use kernel 4 for all remaining experiments.
In CUDA, the multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. It depends on the number of threads per block, the number of registers per thread, the amount of shared memory per block that a kernel uses, and the compute capability of the device. Our implementation uses 128 threads per block, 20 registers per thread, and 36 bytes of shared memory per block. Executed on a GeForce 9600GT, which supports compute capability 1.1, the kernel therefore has an occupancy of 0.5. This configuration was determined experimentally: when we tried to increase the occupancy by using fewer registers per thread and more (or fewer) threads per block, the overall performance fell.
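The 0.5 figure can be verified with a back-of-the-envelope calculation. The sketch below uses the per-SM limits of compute capability 1.1 devices such as the 9600GT and ignores allocation granularity, which does not change the result for this configuration:

```cpp
#include <algorithm>
#include <cassert>

// Per-SM resource limits for compute capability 1.1 (e.g. GeForce 9600GT).
const int MAX_WARPS_PER_SM  = 24;     // 768 threads
const int MAX_BLOCKS_PER_SM = 8;
const int REGISTERS_PER_SM  = 8192;
const int SHARED_MEM_PER_SM = 16384;  // bytes
const int WARP_SIZE         = 32;

// Occupancy = active warps / MAX_WARPS_PER_SM, where the number of resident
// blocks is capped by the warp, register and shared-memory budgets.
double occupancy(int threadsPerBlock, int regsPerThread, int sharedPerBlock) {
    int warpsPerBlock  = (threadsPerBlock + WARP_SIZE - 1) / WARP_SIZE;
    int blocksByWarps  = MAX_WARPS_PER_SM / warpsPerBlock;
    int blocksByRegs   = REGISTERS_PER_SM / (threadsPerBlock * regsPerThread);
    int blocksByShared = sharedPerBlock ? SHARED_MEM_PER_SM / sharedPerBlock
                                        : MAX_BLOCKS_PER_SM;
    int blocks = std::min(std::min(blocksByWarps, blocksByRegs),
                          std::min(blocksByShared, MAX_BLOCKS_PER_SM));
    return (double)(blocks * warpsPerBlock) / MAX_WARPS_PER_SM;
}
```

With 128 threads (4 warps) and 20 registers per thread, a block needs 2560 registers, so only 3 blocks fit in the 8192-register file: 12 active warps out of 24, i.e. occupancy 0.5, with registers as the limiting resource.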
The PETS2009 dataset is used to test the speedup of the GPU implementation over various frame resolutions. We down-sample and up-sample frames to get 11 sequences at different resolutions. At each resolution level, we run both implementations and measure the frame rate and speedup. The results are given in Figure 3. The CPU implementation performs quite well on the low-resolution sequences, but for sequences larger than 640x480 the frame rates are lower than 30 fps, and for HD sequences (1280x720, 1920x1080, 2560x1440) the frame rates are below 10 fps. The GPU version, even on HD sequences, always stays above 50 fps.
Fig. 3. Comparison of the CPU version and GPU version in terms of frame rate over different resolutions. Green lines are the speedup. The average speedup for each sequence is about 11x. (Click on the image for the detailed view)
Levels of Optimization
This section gives some "behind the scenes" insights. You can safely skip it if you just want to apply this implementation to your project.
The Basic Implementation
The basic GPU version is easy to develop because each pixel in the video frame is processed independently. Anyone who has attended a parallel computing course will recognize this as a classic case of embarrassing parallelism. On GPUs, we can implement it by assigning a thread to each pixel. However, the number of threads on the GPU is limited, while the number of pixels per frame varies from video to video. Our implementation deals with this by automatically choosing an appropriate number of pixels for each thread. When a new frame arrives, the algorithm is run and the background model is updated. Pixels in the output image are classified as either foreground or background, depending on their intensity values.
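One common way to let each thread cover several pixels (the exact scheme in the released code may differ) is a grid-stride assignment: thread t processes pixels t, t + N, t + 2N, and so on, where N is the total number of threads. The host-side sketch below models that assignment:

```cpp
#include <cassert>
#include <vector>

// Pixels assigned to logical thread `tid` out of `numThreads` threads:
// {tid, tid + numThreads, tid + 2*numThreads, ...}. Every pixel is covered
// exactly once, for any frame size, and consecutive threads touch
// consecutive pixels (which also favors coalesced memory accesses).
std::vector<int> pixelsForThread(int tid, int numThreads, int numPixels) {
    std::vector<int> idx;
    for (int p = tid; p < numPixels; p += numThreads)
        idx.push_back(p);
    return idx;
}
```

On the device, `tid` would be `blockIdx.x * blockDim.x + threadIdx.x` and the stride `gridDim.x * blockDim.x`, so the same kernel handles any resolution without relaunching.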
This basic implementation only provides 3x speedup over the original CPU implementation for VGA (640x480) video.
Kernel 1: Pinned (non-pageable) Memory and Constant Memory
Pinned memory is located in host storage and is not subject to paging by the virtual memory system. The CUDA runtime provides functions to allocate and free pinned host memory. Using pinned memory has several benefits:
- The memory transfers can be performed concurrently with kernel execution.
- Transfer bandwidth between host memory and device memory is higher when using pinned memory.
In our implementation, pinned memory is used to transfer the input and output image data between host and device. This not only improves memory throughput, but also allows launching the kernel asynchronously.
The algorithm parameters do not change during execution, hence we store them in constant memory. In CUDA, accesses to constant memory are cached and very efficient.
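A hypothetical host-side sketch of these two techniques (not the exact code from the project; buffer names and the parameter are illustrative) might look like this:

```cpp
__constant__ float d_alphaT;   // learning rate, read-only on the device

void setupAndUpload(size_t frameBytes, cudaStream_t stream)
{
    unsigned char *h_frame, *d_frame;
    // pinned host buffer instead of a regular malloc
    cudaHostAlloc((void**)&h_frame, frameBytes, cudaHostAllocDefault);
    cudaMalloc((void**)&d_frame, frameBytes);

    // parameters are uploaded once into cached constant memory
    float alphaT = 0.008f;
    cudaMemcpyToSymbol(d_alphaT, &alphaT, sizeof(float));

    // because the source buffer is pinned, this copy can run
    // asynchronously and overlap with kernel execution
    cudaMemcpyAsync(d_frame, h_frame, frameBytes,
                    cudaMemcpyHostToDevice, stream);
}
```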
Kernel 2: Memory Coalescing
In CUDA, global memory accesses performed by neighboring threads can be coalesced automatically into as few as one transaction when certain requirements are met. Although newer CUDA hardware coalesces memory accesses with far fewer restrictions, these techniques are still useful when the application has to run on older hardware.
The details of applying memory coalescing to this algorithm are messy enough that I will not present them here. I just want to emphasize that memory coalescing helps a great deal in improving the performance of our algorithm; to me, it is one of the most powerful techniques on CUDA. Memory coalescing requires a deep understanding of the data structures and memory organization of CUDA, but it is worth the effort.
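The core idea behind the SoA (structure-of-arrays) layout used in kernel 3 can be illustrated on the host. With an AoS layout, consecutive threads reading the same field of consecutive pixels touch addresses one struct apart; with SoA they touch adjacent addresses, which the hardware can coalesce into a single transaction. Field names here are illustrative, not the project's actual ones:

```cpp
#include <cassert>
#include <cstddef>

// AoS: all parameters of one Gaussian packed together per pixel.
struct GaussianAoS { float mean[3]; float var; float weight; };  // 20 bytes

// Byte offset of the `weight` field of pixel i under the AoS layout:
// consecutive pixels are sizeof(GaussianAoS) bytes apart.
size_t weightOffsetAoS(size_t i) {
    return i * sizeof(GaussianAoS) + offsetof(GaussianAoS, weight);
}

// SoA: each parameter lives in its own contiguous array. Here the weight
// array starts after 3 mean arrays and 1 variance array of numPixels floats,
// so consecutive pixels are only sizeof(float) bytes apart.
size_t weightOffsetSoA(size_t i, size_t numPixels) {
    return 4 * numPixels * sizeof(float) + i * sizeof(float);
}
```

When thread i reads `weight[i]`, the AoS stride is 20 bytes per thread, while the SoA stride is 4 bytes, so a warp's reads fall into one contiguous segment that old (compute 1.x) hardware can coalesce.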
Other specific optimization techniques are described in the paper.
"License" (Not Actually a License)
The following conditions are derived from Zivkovic's original CPU implementation:
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following:
- this notice
- an acknowledgment of the authors and individual contributions to the work
Copying, reproduction, or republishing for any other purpose shall require a license. Please contact the author in such cases. All the code is provided without any guarantee.
If you use this project in academic publications, please cite our publication.
- Jan 7th, 2011: Article submitted
- Vu Pham, Phong Vo, Hung Vu Thanh, Bac Le Hoai, "GPU Implementation of Extended Gaussian Mixture Model for Background Subtraction", IEEE-RIVF 2010 International Conference on Computing and Telecommunication Technologies, Vietnam National University, Nov. 01-04, 2010. DOI: 10.1109/RIVF.2010.5634007.
- Z. Zivkovic, "Improved adaptive Gaussian mixture model for background subtraction", International Conference on Pattern Recognition, vol. 2, pages 28-31, 2004.
- Z. Zivkovic, F. van der Heijden, "Efficient adaptive density estimation per image pixel for the task of background subtraction", Pattern Recognition Letters, vol. 27, no. 7, pages 773-780, 2006.