Tracking an object from a live video input

Paul Yi Tung, Ooi

Sep 24, 2004

Track an object based on its features, using the AVICap window class.

Sample screenshot

Introduction

As part of my research project, I had to implement a feature-tracking device that runs entirely on a hardware board. Designing anything useful in hardware takes time and effort. To avoid tedious on-board calibration and to make sure the algorithms were properly designed first, I wrote a Windows application that simulates the environment: it grabs frames from a web camera and tracks an object in them. Having benefited from open source projects myself, I am glad to spend some time contributing to a site such as The Code Project.

AVICap

In this demo application, I use the AVICap window class to track objects. AVICap is a window class that gives applications a convenient programming interface to video acquisition hardware, such as the web camera used in this demo.

To be able to track objects from a live video input, we obviously need to gain access to individual frames. To gain access to individual frames before they are previewed, use the capSetCallbackOnFrame macro.

BOOL capSetCallbackOnFrame(HWND hwnd, FrameCallback fpProc);

HWND hwnd: Handle to the capture window.
FrameCallback fpProc: Pointer to the preview callback function. Specify NULL for this parameter to disable a previously installed callback function.

typedef LRESULT (*FrameCallback)(HWND hWnd, LPVIDEOHDR lpVideoHdr);

LPVIDEOHDR is a pointer to a VIDEOHDR structure, which is declared as follows:

typedef struct videohdr_tag {
    LPBYTE lpData;          /* Pointer to buffer. */
    DWORD  dwBufferLength;  /* Length of buffer. */
    DWORD  dwBytesUsed;     /* Bytes actually used. */
    DWORD  dwTimeCaptured;  /* Time from start of stream. */
    DWORD  dwUser;
    DWORD  dwFlags;         /* Flags. */
    DWORD  dwReserved[4];
} VIDEOHDR, NEAR *PVIDEOHDR, FAR *LPVIDEOHDR;

#define VHDR_DONE       0x00000001
#define VHDR_PREPARED   0x00000002
#define VHDR_INQUEUE    0x00000004
#define VHDR_KEYFRAME   0x00000008

Once the frame callback procedure is associated with a capture window, we are all set to begin tracking.

Color Space

Before we start processing frames, it is important to understand the different representations for color spaces used in digitized video. There are many color spaces to choose from, and each of them has its own strengths and limitations. Choosing the right color space for a specific application simplifies computation significantly.

The feature used in this demo application is brightness: objects are tracked by how bright they are. It is therefore natural to work in a color space that has an explicit brightness component, and YUV is one such space (Y is the brightness, or luma, component). However, YUV is not necessarily one of the input formats available from the web camera, so a conversion from the typical RGB24 input format to YUV is required.

The relationship between RGB and YUV can be expressed as the following set of linear equations, written in homogeneous form so that the constant offsets fit into the same matrix.

[ Y ]   [  0.257  0.504  0.098  0.063 ][ R ]
[ U ] = [ -0.148 -0.291  0.439  0.500 ][ G ]
[ V ]   [  0.439 -0.368 -0.072  0.500 ][ B ]
[ 1 ]   [  0.000  0.000  0.000  1.000 ][ 1 ]

This matrix is a change of basis in the linear algebra sense. Here it corresponds to rotating the color cube so that one axis of the new basis, the brightness axis, lies along the line where R = G = B.
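As a rough illustration, the matrix can be applied to a single 8-bit pixel as sketched below. The names Yuv8, clampToByte and rgbToYuv are hypothetical and not part of the demo, and the offsets 16 and 128 are an assumption: they are taken to be the 8-bit equivalents of the 0.063 and 0.500 entries in the last column. Note that the tracking code in this article computes only Y, and with a different set of luma weights (0.299, 0.587, 0.114).

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper type: one 8-bit YUV sample.
struct Yuv8 {
    std::uint8_t y, u, v;
};

// Clamp to [0, 255] and round to the nearest integer.
static std::uint8_t clampToByte(double x) {
    if (x < 0.0)   return 0;
    if (x > 255.0) return 255;
    return static_cast<std::uint8_t>(std::floor(x + 0.5));
}

// Apply the rows of the matrix to one 8-bit RGB pixel.
// Assumption: 16 and 128 stand in for the normalized offsets 0.063 and 0.500.
Yuv8 rgbToYuv(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    Yuv8 out;
    out.y = clampToByte( 0.257 * r + 0.504 * g + 0.098 * b +  16.0);
    out.u = clampToByte(-0.148 * r - 0.291 * g + 0.439 * b + 128.0);
    out.v = clampToByte( 0.439 * r - 0.368 * g - 0.072 * b + 128.0);
    return out;
}
```

Under these assumptions, black maps to (16, 128, 128) and white to (235, 128, 128): any gray pixel keeps U and V at their midpoint, which is the R = G = B property described above.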

Feature Tracking

Now that we have direct access to the brightness of each pixel, a simple algorithm can be used to track a bright object. The one introduced here is called the "rectangle algorithm". In each frame, it keeps track of four points: the topmost, leftmost, rightmost and bottommost points where the brightness exceeds a certain threshold value.

If you use the following code, make sure you set the input format of your web camera to RGB24.

LRESULT CChildView::FrameCallbackProc(HWND hWnd, LPVIDEOHDR lpVideoHdr)
{
    ...
    ...

    for (int i=0; i<nHeight; ++i) {
        for (int j=0; j<nWidth; ++j) {
            /* Get the appropriate index into the buffer. */
            index = 3*(i*nWidth+j);
            /* Compute the Y (brightness) component. */
            Y = floor(0.299*lpData[index+2] + 0.587*lpData[index+1] + 
                                          0.114*lpData[index] + 0.5);
            /* If brightness exceeds threshold value. */
            if (Y > bThreshold) {
                /* Already initialized: extend the extreme points. */
                if (init) {
                    if (pLeft.x > j) {
                        pLeft.x = j;
                        pLeft.y = i;
                    }
                    if (pRight.x < j) {
                        pRight.x = j;
                        pRight.y = i;
                    }
                    pBottom.x = j;
                    pBottom.y = i;
                }
                /* First occurrence: initialize all four points. */
                else {
                    pTop.x = pBottom.x = pLeft.x = pRight.x = j;
                    pTop.y = pBottom.y = pLeft.y = pRight.y = i;
                    init = true;
                }
            }
        }
    }    
    
    ...
    ...

}
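The index computation 3*(i*nWidth+j) in the fragment above assumes that rows are tightly packed. In a Windows 24-bit DIB, each row is padded to a multiple of 4 bytes, so that assumption only holds when 3*nWidth is itself a multiple of 4. A self-contained sketch of the same bounding-box search that takes the row stride into account might look like this. BrightBox, rgb24Stride and findBrightBox are hypothetical names introduced for illustration, not part of the demo.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type; not part of the original demo.
struct BrightBox {
    bool found;
    int left, top, right, bottom;
};

// Bytes per row of a 24-bit DIB: rows are padded to a multiple of 4 bytes.
inline std::size_t rgb24Stride(int width) {
    return ((static_cast<std::size_t>(width) * 24 + 31) / 32) * 4;
}

// Bounding box of all pixels whose brightness exceeds threshold.
// data holds B, G, R bytes per pixel; consecutive rows are stride bytes apart.
BrightBox findBrightBox(const std::uint8_t* data, int width, int height,
                        std::size_t stride, std::uint8_t threshold) {
    BrightBox box = {false, 0, 0, 0, 0};
    for (int i = 0; i < height; ++i) {
        const std::uint8_t* row = data + static_cast<std::size_t>(i) * stride;
        for (int j = 0; j < width; ++j) {
            const std::uint8_t* px = row + 3 * j;
            // Same luma weights as the fragment above, rounded to nearest.
            int y = static_cast<int>(0.299 * px[2] + 0.587 * px[1] +
                                     0.114 * px[0] + 0.5);
            if (y <= threshold) continue;
            if (!box.found) {
                // First bright pixel: all four edges start here.
                box.found = true;
                box.left = box.right = j;
                box.top = box.bottom = i;
            } else {
                if (j < box.left)  box.left = j;
                if (j > box.right) box.right = j;
                box.bottom = i;  // rows are visited in increasing order
            }
        }
    }
    return box;
}
```

A call such as findBrightBox(lpVideoHdr->lpData, nWidth, nHeight, rgb24Stride(nWidth), bThreshold) would mirror the loop above. If the driver delivers bottom-up DIB rows, row 0 is the bottom scan line on screen, so "top" and "bottom" here refer to buffer order; the rectangle itself is the same either way.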

A rectangle can be constructed from these points, which tells us where the bright object is. The border of the rectangle is then simply replaced by a predefined color.

if (init) {
    /* Replace border pixels with predefined colour. */
    for (int i=pLeft.x; i<=pRight.x; ++i) {
        index = 3*((pTop.y)*nWidth + i);    /* Top */
        lpData[index]   = 0;   /* B */
        lpData[index+1] = 0;   /* G */
        lpData[index+2] = 255; /* R */
        index = 3*((pBottom.y)*nWidth + i); /* Bottom */
        lpData[index]   = 0;   /* B */
        lpData[index+1] = 0;   /* G */
        lpData[index+2] = 255; /* R */
    }
    for (int i=pTop.y; i<=pBottom.y; ++i) {
        index = 3*((i)*nWidth + pLeft.x);   /* Left */
        lpData[index]   = 0;   /* B */
        lpData[index+1] = 0;   /* G */
        lpData[index+2] = 255; /* R */
        index = 3*((i)*nWidth + pRight.x);  /* Right */
        lpData[index]   = 0;   /* B */
        lpData[index+1] = 0;   /* G */
        lpData[index+2] = 255; /* R */
    }
}

This algorithm has several obvious weaknesses:

It only gives the position of the object as a whole on the screen.
It does not keep any information about the shape of the object.
It does not tell where the middle of the object is.
It cannot track multiple objects.

An Improved Algorithm

Sample screenshot 2

This algorithm tracks objects by identifying the segments that make up the object on the screen. Each segment is described by its head (the position of its first bright pixel) and its length. The object is constructed by grouping the segments together.

BYTE Y; int index;
/* -- Variables used by the new tracking algorithm. -- */
QSEG segment;
std::list<QSEG> object;

for (int i=0; i<nHeight; ++i) {
    segment.length = 0;

    for (int j=0; j<nWidth; ++j) {
        index = 3*(i*nWidth+j);
        Y = floor(0.299*lpData[index+2] + 0.587*lpData[index+1] +
          0.114*lpData[index] + 0.5);

        if (Y > bThreshold) {
            if (segment.length == 0) {
                segment.head.x = j;
                segment.head.y = i;
            }
            ++segment.length;
        }
    }

    if (segment.length) {
        object.push_back(segment);
    }
}

/* -- Draw the shape of the object with a predefined colour.  -- */
for (std::list<QSEG>::iterator i=object.begin(); i!=object.end(); ++i) {
    index = 3*((*i).head.y*nWidth + (*i).head.x);
    lpData[index]   = 255;
    lpData[index+1] = 0;
    lpData[index+2] = 255;

    /* Last pixel of the segment; head.x + length would be one past it. */
    index = 3*((*i).head.y*nWidth + (*i).head.x + (*i).length - 1);
    lpData[index]   = 255;
    lpData[index+1] = 0;
    lpData[index+2] = 255;
}
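The scanning loop above keeps one segment per row, so two bright regions that share a row are counted as a single segment. A variant that closes a segment whenever the brightness drops below the threshold, and so keeps several runs per row apart, could be sketched as follows. The layout of QSEG is an assumption, since its definition is not shown in the article; QPOINT and findRuns are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>

// Assumed layout of QSEG; the original definition is not shown.
struct QPOINT { int x, y; };
struct QSEG   { QPOINT head; int length; };

// Collect every contiguous run of bright pixels, row by row.
// data holds B, G, R bytes per pixel; consecutive rows are stride bytes apart.
std::list<QSEG> findRuns(const std::uint8_t* data, int width, int height,
                         std::size_t stride, std::uint8_t threshold) {
    std::list<QSEG> runs;
    for (int i = 0; i < height; ++i) {
        const std::uint8_t* row = data + static_cast<std::size_t>(i) * stride;
        QSEG seg = {{0, 0}, 0};
        for (int j = 0; j < width; ++j) {
            const std::uint8_t* px = row + 3 * j;
            int y = static_cast<int>(0.299 * px[2] + 0.587 * px[1] +
                                     0.114 * px[0] + 0.5);
            if (y > threshold) {
                if (seg.length == 0) {   // a new run starts here
                    seg.head.x = j;
                    seg.head.y = i;
                }
                ++seg.length;
            } else if (seg.length > 0) {
                runs.push_back(seg);     // the run ended on a dark pixel
                seg.length = 0;
            }
        }
        if (seg.length > 0) {
            runs.push_back(seg);         // the run reached the end of the row
        }
    }
    return runs;
}
```

Grouping runs on adjacent rows that overlap horizontally would then separate one object from another; that step is left out of this sketch.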

This new tracking algorithm has a few extra advantages:

It can track multiple objects.
It can track the shape of the objects.
The number of pixels that make up the object on the screen can be easily calculated. With this piece of information and proper distance calibration, the position of the object in 3 dimensions can be determined.
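To make the pixel count concrete, here is a minimal sketch that sums the segment lengths and, as a by-product, estimates the middle of the object. It assumes each segment is a contiguous run of pixels starting at its head, and that QSEG has the layout shown; ObjectStats and measure are hypothetical names, and the distance calibration mentioned above is not covered.

```cpp
#include <list>

// Assumed layout of QSEG; the original definition is not shown.
struct QPOINT { int x, y; };
struct QSEG   { QPOINT head; int length; };

// Hypothetical summary of one object.
struct ObjectStats {
    long   pixels;  // total number of bright pixels
    double cx, cy;  // centroid in pixel coordinates
};

ObjectStats measure(const std::list<QSEG>& object) {
    ObjectStats s = {0, 0.0, 0.0};
    double sumX = 0.0, sumY = 0.0;
    for (std::list<QSEG>::const_iterator it = object.begin();
         it != object.end(); ++it) {
        const int n = it->length;
        s.pixels += n;
        // The run covers x = head.x .. head.x + n - 1, so its mean x is
        // head.x + (n - 1) / 2 and it contributes n times that to the sum.
        sumX += n * (it->head.x + (n - 1) / 2.0);
        sumY += static_cast<double>(n) * it->head.y;
    }
    if (s.pixels > 0) {
        s.cx = sumX / s.pixels;
        s.cy = sumY / s.pixels;
    }
    return s;
}
```

For two three-pixel segments starting at (2, 0) and (2, 1), this gives 6 pixels and a centroid of (3.0, 0.5).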