Moving on from Part II, this article presents a proper tracking algorithm. The Centroid tracker presented in Part II was not actually a tracking algorithm but a convenient trick: all the "tracker" did was find the centroid of the backprojected image and put a bounding box around it, which is not tracking by any definition. This article presents a very well-known tracking algorithm in the Computer Vision community called Meanshift Tracking. The article is based on a series of research papers by Comaniciu et al. [1, 2, 3].
Visual tracking can be described as identifying an object in a video frame and then following that object in subsequent frames. It does not include detection of the object; tracking assumes that the object has been selected, either manually or automatically. In this article, we'll use manual selection. Once the object is selected, a model is created from one or more of its attributes, and the object is then tracked in subsequent frames using that model. Meanshift, as described in the papers cited above, uses colour as the attribute and a histogram to model it. We have already covered these prerequisites in Part I.
Meanshift Tracking - High Level View
Given an object, we create a colour histogram - let's call this the target histogram. When we get the next frame, we try to find an object with a similar colour histogram in the vicinity of the object's location in the previous frame. Once we find this object, we stop looking and wait for the next frame. Repeating this process over and over results in tracking. At this point, there are two questions one must ask:
- How do we know that we have found our object?
- How do we search the vicinity of the object?
The answer to the first question is that we use some kind of histogram matching, such as the Bhattacharyya coefficient (see Part I). We create the histogram of the candidate object and match it against the target histogram. If the match is close enough, we assume it is the same object; if not, we move within the vicinity, create another histogram, and match again until we find a close enough object.
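The article's own code is not shown here, but as an illustration, histogram matching with the Bhattacharyya coefficient can be sketched in a few lines of Python (the function name is mine, not the article's):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalised histograms:
    1.0 means identical distributions, 0.0 means no overlap at all."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))

# identical histograms match perfectly
h = [0.25, 0.25, 0.25, 0.25]
print(bhattacharyya(h, h))            # 1.0

# completely disjoint histograms do not overlap
print(bhattacharyya([1, 0], [0, 1]))  # 0.0
```

In the tracker, "close enough" simply means this coefficient is above some experimentally chosen threshold.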
The way of searching the vicinity is the technique from which the algorithm gets its name: Meanshift. Meanshift is a mode-seeking algorithm. In other words, given a probability distribution (i.e., a normalised histogram), the Meanshift algorithm seeks the peak of that distribution.
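To see the mode-seeking behaviour in isolation, here is a minimal 1D sketch (my own illustration, not code from the article): starting from any bin, we repeatedly jump to the weighted mean of the nearby bins until the mean stops moving, which is by definition a local peak.

```python
import numpy as np

def meanshift_1d(hist, start, radius=2, max_iters=50):
    """Seek a mode of a 1D histogram: repeatedly jump to the weighted
    mean of the bins within `radius` of the current position."""
    x = start
    for _ in range(max_iters):
        lo, hi = max(0, x - radius), min(len(hist), x + radius + 1)
        idx = np.arange(lo, hi)
        w = hist[lo:hi]
        if w.sum() == 0:
            break
        new_x = int(round(float(idx @ w / w.sum())))
        if new_x == x:          # the mean stopped moving: we are at a mode
            break
        x = new_x
    return x

hist = np.array([1, 1, 2, 5, 9, 5, 2, 1, 1], dtype=float)
peak = meanshift_1d(hist, start=1)   # climbs from bin 1 up to the peak at bin 4
```

The 2D tracking case works the same way, except the "histogram" being climbed is the distribution of pixel weights in the image.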
Basic Meanshift Tracking
We first pick the attribute(s) of the object to model the object. We are using colour as the attribute and the normalised histogram as our model. So in the video frame, we select the target object by placing a bounding box, a.k.a. ROI (Region of Interest), around it. Assuming everything, including the background, in the bounding box belongs to the target, we create the normalised colour histogram.
Now we grab the next frame and place our bounding box at the same location as the previous frame and create a candidate normalised colour histogram.
Next comes the Meanshift bit. The Meanshift iterations consist of steps 3 to 5 below:
- Create a target model for the ROI in the form of a histogram based on a chosen feature - most commonly colour. Normalise the target histogram.
- In the next image frame, create a candidate model histogram at the location "x" of the target in the previous image. Normalise the candidate histogram.
- Each pixel x_i in the ROI around "x" provides a weighted sample. If its colour falls in bin u of the histogram, then the weight contributed by that pixel is w_i = sqrt(q_u / p_u), where q_u and p_u are the values of bin u in the target and candidate histograms, respectively.
- Update the Meanshift vector, i.e., move the centre to the weighted mean of the pixel positions: x_new = sum_i(w_i * x_i) / sum_i(w_i).
- Update the current position, x = x_new, and iterate until the position converges.
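As an illustration of steps 3 to 5, here is a Python sketch of a single Meanshift iteration (the function names, the flat RGB binning scheme, and `BINS` are my assumptions, not the article's code):

```python
import numpy as np

BINS = 4  # bins per colour channel (assumption; the article's default histogram is 4x4x4)

def bin_index(pixel):
    """Map an RGB pixel (0-255 per channel) to a flat histogram bin."""
    r, g, b = (int(v) * BINS // 256 for v in pixel)
    return (r * BINS + g) * BINS + b

def meanshift_step(frame, cx, cy, half_w, half_h, q):
    """One Meanshift iteration (steps 3-5): build the candidate
    histogram p around (cx, cy), weight each pixel by sqrt(q_u / p_u),
    and return the weighted mean position."""
    p = np.zeros_like(q)
    pixels = []
    for y in range(cy - half_h, cy + half_h + 1):
        for x in range(cx - half_w, cx + half_w + 1):
            u = bin_index(frame[y, x])
            p[u] += 1.0
            pixels.append((x, y, u))
    p /= p.sum()
    # w_i = sqrt(q_u / p_u); new centre = sum(w_i * x_i) / sum(w_i)
    wx = wy = ws = 0.0
    for x, y, u in pixels:
        w = np.sqrt(q[u] / p[u]) if p[u] > 0 else 0.0
        wx, wy, ws = wx + w * x, wy + w * y, ws + w
    if ws == 0:                      # no matching colours at all: stay put
        return cx, cy
    return int(round(wx / ws)), int(round(wy / ws))
```

Pixels whose colours are rare in the target model get near-zero weight, so the weighted mean is pulled towards regions that look like the target.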
The effect of the above steps is that an object is tracked across a sequence of frames. The weights are important: without them, the samples are simply the pixel locations and therefore uniformly distributed. The Meanshift algorithm works because the weight ratio sqrt(q_u / p_u) is biased towards the mode - colours are more likely to match in the correct direction, so the weights will on average be greater closer to the true location. Don't be put off by the mathematics - here is what is going on:
The table below shows the Meanshift calculations. X and Y are the pixel coordinates, RGB is the pixel colour value, Bin is the histogram index corresponding to the RGB value, q_u is the value of the target histogram bin, and p_u is the value of the candidate histogram bin:
The new centroid is found as shown below. We keep moving the centroid until we find the likely target, using the Bhattacharyya coefficient to match the histograms.
An improvement to this basic Meanshift tracking proposed by Comaniciu et al. [3] is that once we find the new centroid, we create a histogram of the image under the ROI and compute the Bhattacharyya coefficient (call it BC1). We then move the centroid in small increments, creating a histogram at each step, and as long as the Bhattacharyya coefficient at the moved centroid is less than BC1, we keep inching the centroid towards the target. We also keep track of how far the centroid has moved from its previous position (the Euclidean distance between the previous and current centroids) and of the number of Meanshift iterations. We stop iterating once the centroid movement falls below a threshold or the iteration count exceeds a limit. These thresholds are important in practice: if the tracker gets lost, we don't want to be stuck in an infinite loop!
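The driver loop with these stopping rules can be sketched as follows. This is my own illustration: `step` and `bc` are assumed callables standing in for the Meanshift step and the Bhattacharyya computation described above.

```python
import math

def run_meanshift(step, bc, y0, max_iters=20, eps=1.0):
    """Drive the Meanshift iterations with the stopping rules described
    above. `step(y)` returns the next centroid proposed by Meanshift and
    `bc(y)` the Bhattacharyya coefficient at y (both assumed callables).
    Capping the iteration count keeps a lost tracker from looping forever."""
    for i in range(1, max_iters + 1):
        y1 = step(y0)
        # only accept a move that does not lower the Bhattacharyya
        # coefficient; otherwise back up halfway towards the old centroid
        while bc(y1) < bc(y0) and math.dist(y0, y1) >= eps:
            y1 = ((y0[0] + y1[0]) / 2.0, (y0[1] + y1[1]) / 2.0)
        if math.dist(y0, y1) < eps:   # centroid barely moved: converged
            return y1, i
        y0 = y1
    return y0, max_iters
```

The halfway back-up is one reasonable reading of "inching the centroid towards the target"; the key point is that both the movement threshold `eps` and the cap `max_iters` bound the work per frame.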
Kernel-based Meanshift Tracking
In kernel-based Meanshift, a kernel profile is overlaid on the ROI and the model is created based on the kernel. The tracker proposed by Comaniciu et al. [3] uses the Epanechnikov kernel shown below. Because of the shape of this kernel, pixels towards the centre of the ROI carry greater weight than pixels towards the edges, which gives the tracker some ability to reject interfering background.
There is one key difference between kernel-based Meanshift and basic Meanshift. When the target or candidate model is created in a kernel-based tracker, a suitable kernel is laid over the ROI, and the kernel value at each pixel location is used when building the histogram instead of a simple count. The process of creating the model, i.e., the histogram, is the same as usual, except that instead of incrementing the corresponding bin by 1, the value of the kernel at that pixel's location is added to the bin. For example, if a pixel's colour value falls in bin 3 and the kernel value at that pixel is 0.843, then we add 0.843 to bin 3 instead of incrementing it by 1. Using the Epanechnikov kernel, the bin values will look something like this (the equations in 2b and 2c come from the Epanechnikov profile):
Using the Epanechnikov kernel for an ROI of 5x5 pixels might look like the example shown below. We can see that the pixels near the centre have a larger value than the pixels at the edges:
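A Python sketch of both ideas - sampling the Epanechnikov kernel over a 5x5 ROI and then building a kernel-weighted histogram - might look like this (my own illustration; the scaling of the kernel radius to the ROI size is an assumption):

```python
import numpy as np

def epanechnikov(h, w):
    """Epanechnikov kernel sampled over an h x w ROI. The normalisation
    constant is dropped because the histogram is normalised afterwards."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # squared distance from the ROI centre, scaled so the profile
    # reaches zero just outside the ROI
    r2 = ((ys - cy) / (h / 2.0)) ** 2 + ((xs - cx) / (w / 2.0)) ** 2
    return np.where(r2 <= 1.0, 1.0 - r2, 0.0)

k = epanechnikov(5, 5)          # centre pixel weighs most, corners are zero

# kernel-weighted histogram: add the kernel value, not 1, to each bin
BINS = 4
roi = np.random.randint(0, 256, (5, 5, 3))   # illustrative random ROI
hist = np.zeros(BINS ** 3)
for y in range(5):
    for x in range(5):
        r, g, b = (int(v) * BINS // 256 for v in roi[y, x])
        hist[(r * BINS + g) * BINS + b] += k[y, x]
hist /= hist.sum()
```

Printing `k` shows the behaviour described above: the centre entry is the largest and the corner entries are zero.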
Background Weighted Meanshift Tracking
The background is always a concern in visual tracking. When the target and background colours are similar, there is a high probability that the tracker will get lost. Comaniciu et al. [3] have proposed a variation of the Meanshift algorithm in which the histogram of the background area around the ROI is also calculated and fused with the histogram of the ROI to create a more reliable model. A histogram is created for an area around the ROI window; an appropriate area can be selected experimentally. If o_u is the normalised background histogram value for bin u and o* is its smallest non-zero entry, then a weight is derived for each bin according to v_u = min(o* / o_u, 1).
These weights reduce the importance of features that are prominent in the background, i.e., bins with a large o_u receive a small weight v_u. The procedure is given below:
- Get the smallest non-zero value o* from the background histogram.
- For each bin u in the target histogram, compute the weight v_u = min(o* / o_u, 1).
- Apply the weights to the target model, q_u = v_u * q_u, and renormalise the histogram.
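These three steps can be sketched in Python as follows (my own illustrative function names):

```python
import numpy as np

def background_weights(bg_hist):
    """Per-bin weights v_u = min(o_star / o_u, 1), where o_star is the
    smallest non-zero entry of the normalised background histogram.
    Bins prominent in the background end up with weights well below 1."""
    bg = np.asarray(bg_hist, dtype=float)
    o_star = bg[bg > 0].min()
    v = np.ones_like(bg)
    v[bg > 0] = np.minimum(o_star / bg[bg > 0], 1.0)
    return v

def reweight_target(q, v):
    """Down-weight target bins that are common in the background,
    then renormalise so the model is still a valid histogram."""
    q2 = np.asarray(q, dtype=float) * v
    return q2 / q2.sum()

bg = np.array([0.5, 0.3, 0.1, 0.1, 0.0])
v = background_weights(bg)        # roughly [0.2, 0.33, 1.0, 1.0, 1.0]
```

Note that bin 0, which dominates the background, keeps only a fifth of its contribution to the target model, while bins absent from the background are left untouched.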
Using the Code
Step 1: Select a histogram size - the default is 4x4x4.
Step 2: Select a tracker type.
Step 3: Click the "Load Sequence" button and select the folder containing JPG or BMP files.
Step 4: Select an area in the image using the mouse: right-click and drag. This creates a colour histogram of the selected area. The target histogram is shown in red and will not change during tracking. The red ROI is for the target and the yellow one is for the background.
Step 5: Click the "Start Tracking" button to see the tracking in action.
For Meanshift, the candidate histogram is shown in blue and changes during tracking, as it is the histogram of the area that the tracker "thinks" is the target. The centroid and the Bhattacharyya coefficient are displayed on the screen. For the Centroid tracker, only the centroid information is displayed.
I have not implemented the scale adaptation described in the paper. Scale adaptation means the ROI should shrink and expand as the object moves away from and towards the camera, respectively. The paper suggests evaluating the Bhattacharyya coefficient at a scale slightly smaller and slightly larger than the current ROI and then setting the ROI size to the one that gives the maximum coefficient. I would also encourage readers to implement the CAMShift tracker [4]. The series of articles so far has all the information required to implement it, plus the paper is easy to follow - good luck!
For any third party code or material used, please refer to "Points of Interest" in Part I.
1. D. Comaniciu and P. Meer, "Mean Shift Analysis and Applications", Proc. Seventh International Conference on Computer Vision, pp. 1197-1203, September 1999.
2. D. Comaniciu, V. Ramesh, and P. Meer, "Real-Time Tracking of Non-Rigid Objects Using Mean Shift", Proc. 2000 IEEE Conference on Computer Vision and Pattern Recognition, vol. II, pp. 142-149, June 2000.
3. D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-Based Object Tracking", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, May 2003.
4. G. Bradski, "Computer Vision Face Tracking For Use in a Perceptual User Interface", Proc. IEEE Workshop on Applications of Computer Vision, pp. 214-219, October 1998.