From glyph recognition to augmented reality

Andrew Kirillov

4.96/5 (81 votes)

Sep 23, 2011

GPL3

30 min read

122084

The article describes an algorithm for recognition of optical glyphs in still images and video and then shows its application in 3D augmented reality.

Augmented Reality

Index

Introduction;
Prerequisites;
Finding potential glyphs;
Glyph recognition;
Matching found glyph with database of glyphs;
2D Augmented Reality;
Pose estimation;
3D augmented realit:
Conclusion.

Introduction

Recognition of glyphs (or optical glyphs as they are called most frequently) is quite an intersection topic, which has applications in a range of different areas. The most popular application of optical glyphs is augmented reality, where computer vision algorithm finds them in a video stream and substitutes with artificially generated objects creating a view which is half real and half virtual - virtual objects in a real world. Another area of optical glyphs' application is robotics, where glyphs can be used to give commands to a robot or help robot to navigate within some environment, where glyphs can be used to give robot directions.

In this article we are going to discuss algorithms for optical glyph recognition, which is the first step towards all the applications based on optical glyphs. Then we are going to switch from glyph recognition to 2D and finally 3D augmented reality.

For those who prefer seeing first what’s this all about before reading all the details, here is a small video which summarizes the work being done:

From glyph recognition to augmented reality

Prerequisites

All the image processing algorithm discussed further in this article are based on the AForge.NET framework. A bit of its knowledge will not hurt, but is not a requirement since the framework provides documentation and samples anyway. For algorithm prototyping and testing I've used an IPPrototyper application, which is part of the framework. As usually it really simplified testing of the algorithm on many images and allowed to concentrate on the idea itself, but not on some other unwanted coding.

Below is a sample of some glyphs which we are aiming to recognize. All glyphs are represented with a square grid divided equally to the same number of rows and columns. Each cell of the grid is filled with either black or white color. The first and the last row/column of each glyph contains only black cells, which creates a black border around each glyph. Also we do an assumption that every row and column has at least one white cell, so there are no completely black rows and columns (except the first and the last). All such glyphs are printed on white paper in such a way, that there is white area around black borders of a glyph (the above picture of IPPrototypes shows how they look like when printed).

Finding potential glyphs

Before going into glyph recognition, there is another task which needs to be solved first - find potential glyphs in an image to recognize. The aim of this task is to find all quadrilateral areas which may look like a glyph - an area which is promising enough for further analysis and recognition. In other words, we need to find 4 corners of each glyph in the source image. It so happened that this task is the hardest one from the entire glyph searching-recognition part.

The first step is trivial - we'll do grayscaling of the original image, since it will reduce amount of data to process, plus we don't need color information for this task anyway.

What is next? As we can see all glyphs are quite contrast objects - a black bordered glyph on white paper. So most probably a good direction is to search for black quadrilaterals surrounded by white areas and do their analysis. However, how to find them? One of the ideas is to try doing thresholding and then blob analysis for finding black quadrilaterals. Of course we are not going to use regular thresholding with predefined threshold since it will give us nothing - we simply can not set one threshold value for all possible light and environment conditions. Trying Otsu thresholding may produce some good results:

As we can see on the pictures above, the Otsu thresholding did its work quite well - we got black quadrilaterals surrounded by white areas. Using blob counter it is possible to find all the black objects on the above binary images, perform some checks to make sure these objects are quadrilaterals, etc. etc. It is really possible to get everything working starting from this point, however it may have some issues. The problem is that Otsu thresholding worked for the above images and it actually works for many other images. But not for all of them. Here is one of the images, where it does not work as supposed and all the idea fails.

The above picture shows that global thresholding does not work very well for certain illumination/environment conditions. So we may need to find another idea.

As it was already mentioned, optical glyphs are quite contrast objects - black glyph surrounded by white area. Of course the contrast may change depending on light conditions and black areas may get lighter, but white areas may get darker. But still the difference should be considerable enough unless we have absolutely bad illumination. So instead of trying to find black or white quadrilaterals, we may try to find regions where image brightness changes sharply. This is the work for edge detector, for example Difference Edge Detector:

To get rid of the areas where image brightness changes insignificantly we will do thresholding. Here is how it looks like with the 3 samples we've started from:

As we can see on the pictures above, all the detected glyphs are represented by a stand alone blob forming quadrilateral. In the case if illumination condition is not completely bad, all these glyphs' quadrilaterals have a well connected edge, so they are really represented with a single blob, which will be easy to extract with a blob counting algorithms.

Below is an example of bad illumination conditions, where both Otsu thresholding and thresholded edge detection fail to produce any good result which could be used for further glyph location and recognition.

So we decide to go with edge detection and hence here is the beginning of our code (we will use UnmanagedImage to avoid extra locks/unlocks of .NET's managed image):

// 1 - grayscaling
UnmanagedImage grayImage = null;

if ( image.PixelFormat == PixelFormat.Format8bppIndexed )
{
    grayImage = image;
}
else
{
    grayImage = UnmanagedImage.Create( image.Width, image.Height,
        PixelFormat.Format8bppIndexed );
    Grayscale.CommonAlgorithms.BT709.Apply( image, grayImage );
}

// 2 - Edge detection
DifferenceEdgeDetector edgeDetector = new DifferenceEdgeDetector( );
UnmanagedImage edgesImage = edgeDetector.Apply( grayImage );

// 3 - Threshold edges
Threshold thresholdFilter = new Threshold( 40 );
thresholdFilter.ApplyInPlace( edgesImage );

Now, when we have a binary image containing significant edges of all objects, we need to process all the blobs formed by these edges and check if any of the blobs may represent an edge of a glyph. To go through all separate blobs we can use BlobCounter:

// create and configure blob counter
BlobCounter blobCounter = new BlobCounter( );

blobCounter.MinHeight    = 32;
blobCounter.MinWidth     = 32;
blobCounter.FilterBlobs  = true;
blobCounter.ObjectsOrder = ObjectsOrder.Size;

// 4 - find all stand alone blobs
blobCounter.ProcessImage( edgesImage );
Blob[] blobs = blobCounter.GetObjectsInformation( );

// 5 - check each blob
for ( int i = 0, n = blobs.Length; i < n; i++ )
{
    // ...
}

As we can see from the binary edge images we got, we have lots of edges. But not all of them form a quadrilateral looking object. We are interested only in quadrilateral looking blobs, because a glyph will be always represented by a quadrilateral regardless of how it is rotated. To make a check for quadrilateral, we can collect blob's edge points using GetBlobsEdgePoints() and then use IsQuadrilateral() method for checking if these points may form a quadrilateral. If not, then we skip the blob and go to then next one.

List<IntPoint> edgePoints = blobCounter.GetBlobsEdgePoints( blobs[i] );
List<IntPoint> corners = null;

// does it look like a quadrilateral ?
if ( shapeChecker.IsQuadrilateral( edgePoints, out corners ) )
{
    // ...
}

OK, now we have all blobs which look like quadrilaterals. However not every quadrilateral is a glyph. As we already mentioned, a glyph has a black border and it is printed on white paper. So we need to make a check that a blob we have is black inside, but white outside. Or, to be more correct, it should be much darker inside than outside (since illumination may vary and checking for perfect white/black will not work).

To perform a check if blob is darker inside than outside, we may get left and right edge points of the blob using GetBlobsLeftAndRightEdges() method and then calculate average difference of brightness between pixels just outside of the blob and inside. If the average difference is significant enough, then most likely we have a dark object surrounded by lighter area.

// get edge points on the left and on the right side
List<IntPoint> leftEdgePoints, rightEdgePoints;
blobCounter.GetBlobsLeftAndRightEdges( blobs[i],
    out leftEdgePoints, out rightEdgePoints );

// calculate average difference between pixel values from outside of the
// shape and from inside
float diff = CalculateAverageEdgesBrightnessDifference(
    leftEdgePoints, rightEdgePoints, grayImage );

// check average difference, which tells how much outside is lighter than
// inside on the average
if ( diff > 20 )
{
    // ...
}

To clarify the idea of calculating average difference between pixels outside and inside of a blob, lets take a closer look at the CalculateAverageEdgesBrightnessDifference() method. For both left and right edges of a blob, the method builds two lists of points - list of points which are a bit on the left from the edge and list of points which are a bit on the right from the edge (lets say 3 pixel away from the edge). For each of the lists of points it collects pixel values corresponding to these points using Collect8bppPixelValues() method. And then it calculates average difference - for left blob's edge it subtracts value of the pixel on the right side of the edge (inside of the blob) from value of the pixel on the left side of the edge (outside of the blob); for right blob's edge it does opposite difference. When calculation is done the method produces a value, which is an average difference between pixels outside and inside of a blob.

const int stepSize = 3;

// Calculate average brightness difference between pixels outside and
// inside of the object bounded by specified left and right edge
private float CalculateAverageEdgesBrightnessDifference(
    List<IntPoint> leftEdgePoints,
    List<IntPoint> rightEdgePoints,
    UnmanagedImage image )
{
    // create list of points, which are a bit on the left/right from edges
    List<IntPoint> leftEdgePoints1  = new List<IntPoint>( );
    List<IntPoint> leftEdgePoints2  = new List<IntPoint>( );
    List<IntPoint> rightEdgePoints1 = new List<IntPoint>( );
    List<IntPoint> rightEdgePoints2 = new List<IntPoint>( );

    int tx1, tx2, ty;
    int widthM1 = image.Width - 1;

    for ( int k = 0; k < leftEdgePoints.Count; k++ )
    {
        tx1 = leftEdgePoints[k].X - stepSize;
        tx2 = leftEdgePoints[k].X + stepSize;
        ty = leftEdgePoints[k].Y;

        leftEdgePoints1.Add( new IntPoint(
            ( tx1 < 0 ) ? 0 : tx1, ty ) );
        leftEdgePoints2.Add( new IntPoint(
            ( tx2 > widthM1 ) ? widthM1 : tx2, ty ) );

        tx1 = rightEdgePoints[k].X - stepSize;
        tx2 = rightEdgePoints[k].X + stepSize;
        ty = rightEdgePoints[k].Y;

        rightEdgePoints1.Add( new IntPoint(
            ( tx1 < 0 ) ? 0 : tx1, ty ) );
        rightEdgePoints2.Add( new IntPoint(
            ( tx2 > widthM1 ) ? widthM1 : tx2, ty ) );
    }

    // collect pixel values from specified points
    byte[] leftValues1  = image.Collect8bppPixelValues( leftEdgePoints1 );
    byte[] leftValues2  = image.Collect8bppPixelValues( leftEdgePoints2 );
    byte[] rightValues1 = image.Collect8bppPixelValues( rightEdgePoints1 );
    byte[] rightValues2 = image.Collect8bppPixelValues( rightEdgePoints2 );

    // calculate average difference between pixel values from outside of
    // the shape and from inside
    float diff = 0;
    int pixelCount = 0;
    
    for ( int k = 0; k <leftEdgePoints.Count; k++ )
    {
        if ( rightEdgePoints[k].X - leftEdgePoints[k].X > stepSize * 2 )
        {
            diff += ( leftValues1[k]  - leftValues2[k] );
            diff += ( rightValues2[k] - rightValues1[k] );
            pixelCount += 2;
        }
    }

    return diff / pixelCount;
}

Now it is time to take a look at the result of the two checks we made - for quadrilateral and for average difference between pixels inside and outside of a blob. Let's highlight edges of all the blobs which pass these checks and see if we get any closer to detection of glyphs' location.

Taking a look at the above pictures we can see that result of the two checks we made is really acceptable - only blobs containing optical glyphs were highlighted and nothing else. Potentially it may happen that some other objects may satisfy those checks and the algorithm may find some other dark quadrilaterals surrounded by white areas. However experiments show it does not happen so often. Even if happens sometimes, there is still further glyph recognition step involved, which may filter "false" glyphs. So, we decide that we have quite good glyph (or better say potential glyph) localization algorithm and can move further into recognition.

Glyph recognition

Now, when we have coordinates of potential glyphs (its quadrilaterals), we can do their actual recognition. It is possible to develop an algorithm, which does glyph recognition directly in the source image. However, let's simplify things a bit and extract glyphs from the source image, so we have a separate square image for each potential glyph, containing only glyph data. This can be done using QuadrilateralTransformation. Below are the few glyphs extracted from some of the previously processed images:

// 6 - do quadrilateral transformation
QuadrilateralTransformation quadrilateralTransformation =
    new QuadrilateralTransformation( quadrilateral, 100, 100 );
UnmanagedImage glyphImage = quadrilateralTransformation.Apply( image );

Some glyphs extracted from source images

As we can see from the pictures above, illumination conditions may vary quite a lot and some glyphs may not be so contrast as others. So we may use Otsu thresholing on this stage to binarize glyphs.

// otsu thresholding
OtsuThreshold otsuThresholdFilter = new OtsuThreshold( );
otsuThresholdFilter.ApplyInPlace( glyphImage );

At this stage we are ready to go into final glyph recognition. There are different possible ways exist to do this, like shape recognition, template matching, etc. Although there may be benefits of using things like shape recognition, I found them a bit too complex for such a simple task of recognizing a glyph satisfying constraints we made from there very beginning. As it was mentioned before, all our glyphs are represented by a square grid, where each cell is filled with black or white color. So a recognition algorithm can be made quite simple with this assumption - just divide glyph image into cells and check what is the average (most common) color of the cell.

Before we go into glyph recognition code, let's do some clarification to the way we divide glyph into cells. For example, let's take a look at the image below. Here we can see how glyph is divided by dark gray lines into 5x5 grid of cells having same width and height. So what we could do is just to count number of white pixels in each such cell and check if the number is greater than half of the cell's area. If it is greater, then we assume that the cell is filled by white color, which corresponds to "1" lets say. And if the number is less than half of the cell's area, then we have a black filled cell, which corresponds to "0". Also we may introduce confidence level for each cell - if the entire cell is filled with white or black pixels, then we are 100% confident about cell's color/type. However, if a cell has 60% of white pixels and 40% of black pixels, then recognition confidence drops to 60%. When a cell is half filled with white and half filled with black color, then confidence equals to 50%, which means we are not sure at all about cell color/type.

However, with the above described approach it will be hardly possible to find a cell, which may give 100% confidence level. As we can see from the picture above, all the process of glyph localization, extraction, thresholding, etc. may cause some imperfections - some edge cells may contain also parts of white areas surrounding a glyph, but some inner cells which are supposed to be black may contain white pixels caused by neighboring white cells, etc. So instead of calculating number of white pixels over entire cell's area, we may introduce small gap around cell's borders and exclude it from processing. The above picture demonstrates the idea with gaps - instead of scanning entire cell which is highlighted by dark gray lines, we scan smaller inner area which is highlighted with light gray lines.

Now, when the recognition idea seems to be clear, we can get to its implementation. First of all the code goes throw the provided image and calculates sum of pixels' values for each cell. Then these sums are used to calculate fullness of each cell - how full is a cell filled with white pixels. Finally cell's fullness is used to determine its type ("1" - white filled or "0" - black filled) and confidence level. Note: before using this function (method), user must set glyph size to recognize.

public byte[,] Recognize( UnmanagedImage image, Rectangle rect,
    out float confidence )
{
    int glyphStartX = rect.Left;
    int glyphStartY = rect.Top;

    int glyphWidth  = rect.Width;
    int glyphHeight = rect.Height;

    // glyph's cell size
    int cellWidth  = glyphWidth  / glyphSize;
    int cellHeight = glyphHeight / glyphSize;

    // allow some gap for each cell, which is not scanned
    int cellOffsetX = (int) ( cellWidth  * 0.2 );
    int cellOffsetY = (int) ( cellHeight * 0.2 );

    // cell's scan size
    int cellScanX = (int) ( cellWidth  * 0.6 );
    int cellScanY = (int) ( cellHeight * 0.6 );
    int cellScanArea = cellScanX * cellScanY;

    // summary intensity for each glyph's cell
    int[,] cellIntensity = new int[glyphSize, glyphSize];

    unsafe
    {
        int stride = image.Stride;

        byte* srcBase = (byte*) image.ImageData.ToPointer( ) +
            ( glyphStartY + cellOffsetY ) * stride +
            glyphStartX + cellOffsetX;
        byte* srcLine;
        byte* src;

        // for all glyph's rows
        for ( int gi = 0; gi < glyphSize; gi++ )
        {
            srcLine = srcBase + cellHeight * gi * stride;

            // for all lines in the row
            for ( int y = 0; y < cellScanY; y++ )
            {
                // for all glyph columns
                for ( int gj = 0; gj < glyphSize; gj++ )
                {
                    src = srcLine + cellWidth * gj;

                    // for all pixels in the column
                    for ( int x = 0; x < cellScanX; x++, src++ )
                    {
                        cellIntensity[gi, gj] += *src;
                    }
                }

                srcLine += stride;
            }
        }
    }

    // calculate value of each glyph's cell and set
    // glyphs' confidence to minim value of cell's confidence
    byte[,] glyphValues = new byte[glyphSize, glyphSize];
    confidence = 1f;

    for ( int gi = 0; gi < glyphSize; gi++ )
    {
        for ( int gj = 0; gj < glyphSize; gj++ )
        {
            float fullness = (float)
                ( cellIntensity[gi, gj] / 255 ) / cellScanArea;
            float conf = (float) System.Math.Abs( fullness - 0.5 ) + 0.5f;

            glyphValues[gi, gj] = (byte) ( ( fullness > 0.5f ) ? 1 : 0 );

            if ( conf < confidence )
                confidence = conf;
        }
    }

    return glyphValues;
}

With the function provided above, the next step after glyph's binarization looks quite simple:

// recognize raw glyph
float confidence;

byte[,] glyphValues = binaryGlyphRecognizer.Recognize( glyphImage,
    new Rectangle( 0, 0, glyphImage.Width, glyphImage.Height ), out confidence );

At this stage we have a 2D byte array containing "0" and "1" elements corresponding to black and white cells of a glyph's image. For example, the function should provide result shown below for the above shown glyph image:

Now, let's do some checks to make sure we processed a glyph image satisfying constraints we set in the beginning. First, let's check confidence level - if it is lower than certain limit (for example 0.6, which corresponds to 60%), then we skip the processed object. Also we skip it in the case if the glyph does not have a border made of black cells (if glyph data contains at least single "1" value in the first/last row or column) or if it does not have at least one white cell in any inner row or column.

if ( confidence >= minConfidenceLevel )
{
    if ( ( CheckIfGlyphHasBorder( glyphValues ) ) &&
         ( CheckIfEveryRowColumnHasValue( glyphValues ) ) ) 
    {
        // ...
        // further processing
    }
}

That is it about glyph data extraction/recognition. If a candidate image containing potential glyph has passed all these steps and checks, then it seems we really got a glyph.

Matching found glyph with database of glyphs

Although we did extraction of glyph data from an image, this is not the last step in glyph recognition task. Applications dealing with augmented reality or robotics usually have a database of glyphs, where each glyph may have its own meaning. For example, in augmented reality each glyph is associated with a virtual object to be shown instead of a glyph, but in robotics applications each glyph may represent a command or direction for a robot. So the last step is to match the extracted glyph data with a database of glyphs and retrieve information related with the glyph - it's ID, name, whatever else.

To complete glyph matching step successfully, we need to keep in mind that glyphs can be rotated, so comparing extracted glyph data one to one with glyphs stored in database will not work. Finding a matching glyph in glyphs' database we need to do 4 compares of extracted glyph data with every glyph in the database - compare 4 possible rotations of extracted glyph data with the database.

Another important thing to mention is that all glyphs in database should be rotation variant in order to be unique regardless of rotation. If a glyph may look the same after rotation then it is a rotation invariant glyph. For rotation invariant glyphs we cannot determine their rotation angle, which is very important for applications like augmented reality. Also it may not be possible to find correct matching glyph in a database, if it contains few rotation invariant glyphs, which may look the same if one of them is rotated.

Below picture demonstrates some rotation variant and invariant glyphs. Glyphs (1) and (2) are rotation variant - if they are rotated they will look always different. Glyphs (3), (4) and (5) are rotation invariant - if rotated they will look the same, so it is not possible to detect their rotation angle. Also we may see that glyph (4) is actually same as glyph (5), but just rotated, so glyph database should not contain them both.

public int CheckForMatching( byte[,] rawGlyphData )
{
    int size = rawGlyphData.GetLength( 0 );
    int sizeM1 = size - 1;

    bool match1 = true;
    bool match2 = true;
    bool match3 = true;
    bool match4 = true;

    for ( int i = 0; i < size; i++ )
    {
        for ( int j = 0; j < size; j++ )
        {
            byte value = rawGlyphData[i, j];

            // no rotation
            match1 &= ( value == data[i, j] );
            // 180 deg
            match2 &= ( value == data[sizeM1 - i, sizeM1 - j] );
            // 90 deg
            match3 &= ( value == data[sizeM1 - j, i] );
            // 270 deg
            match4 &= ( value == data[j, sizeM1 - i] );
        }
    }

    if ( match1 )
        return 0;
    else if ( match2 )
        return 180;
    else if ( match3 )
        return 90;
    else if ( match4 )
        return 270;

    return -1;
}

As we can see from the code above, the method returns -1 if provided glyph data don't match to the data kept in data variable (member of a glyph class). However, if the matching is found than it returns rotation angle (0, 90, 180 or 270 degrees in counter clockwise direction) which is used to get the specified glyph data from the original glyph we match to.

Now, all we need to do is to go through all glyphs in a database and check if the glyph data we extracted from image matches to any of the glyphs from the database. If matching is found, then we can get all the data associated with the matched glyph and use it for visualization, giving command to robot, etc.

That is all about glyph recognition. Now it is time for a small demo, which demonstrates all the above code applied to a video feed (the code highlights recognized glyphs with a border and displays their names).

2D Augmented Reality

Now, when we have glyph recognition working, it is time to move further and try some 2D augmented reality. It will not be hard to do it since we have all the things we need for this.

The first thing we need to do is to correct glyph's quadrilateral (the one we got from IsQuadrilateral() call on the glyph localization phase). As it was already mentioned the glyph we extract from the found quadrilateral may not look exactly the same as in glyphs' database, but may be rotated. So we need to rotate quadrilateral in such a way, that a glyph extracted from it would look exactly as in database. For this purpose we need to use rotation angle provided by CheckForMatching() call we did on glyph matching phase:

if ( rotation != -1 )
{
    foundGlyph.RecognizedQuadrilateral = foundGlyph.Quadrilateral;

    // rotate quadrilateral's corners
    while ( rotation > 0 )
    {
        foundGlyph.RecognizedQuadrilateral.Add( foundGlyph.RecognizedQuadrilateral[0] );
        foundGlyph.RecognizedQuadrilateral.RemoveAt( 0 );

        rotation -= 90;
    }
}

All we need to do now to complete 2D augmented reality is to put an image we want into the corrected quadrilateral. For this purpose we use BackwardQuadrilateralTransformation - same as QuadrilateralTransformation, but instead of extracting image from the specified quadrilateral it puts another image into it.

// put glyph's image onto the glyph using quadrilateral transformation
BackwardQuadrilateralTransformation quadrilateralTransformation =
    new BackwardQuadrilateralTransformation( );

quadrilateralTransformation.SourceImage = glyphImage;
quadrilateralTransformation.DestinationQuadrilateral = glyphData.RecognizedQuadrilateral;

quadrilateralTransformation.ApplyInPlace( sourceImage );

That was quick. Nothing else to say about 2D augmented reality after all mentioned before. So let's see another demo ...

Pose estimation

As it obviously turns out, 3D augmented reality is not as simple as 2D augmented reality. To place a 3D object on top of a glyph, it is not enough to know coordinates of 4 glyph's corners. Instead it is required to know 3D coordinates of the glyph's center in the real world (translation) and its rotation angles around X/Y/Z axes. So before going further into 3D augmented reality, we need to find the way how to determine glyph's real world 3D pose.

There are number of research papers published about 3D pose estimation describing different algorithms. The most popular of them seems to be POSIT algorithm, which is quite easy to follow and implement. The algorithm is described in "Model-Based Object Pose in 25 Lines of Code" paper by Daniel F. DeMenthon and Larry S. Davis.

The purpose of the POSIT algorithm is to estimate 3D pose of an object, which includes rotation over X/Y/Z axes and shift along X/Y/Z axes. To do this the algorithm requires image coordinates of some object's points (minimum 4 points – exactly the number of corners we have). Then it needs to know model coordinates of these points. This assumes that the model of the object we are estimating pose for is known, so we know coordinates of the corresponding points in the model (yes we know it). And finally the algorithm requires effective focal length of the camera used to picture the object.

We can easily collect all the required information for the POSIT algorithm to do its work. However the algorithm has one limitation which makes it a bit useless for us – the algorithm is designed for the non coplanar case. In other words, models' points which are used for pose estimation can not lie all in the same plane. Unfortunately this is exactly the case we have. Since glyphs are plane, it makes it impossible to estimate their pose with POSIT.

Luckily researchers did not stop on POSIT and came up with extension algorithm, which is Coplanar POSIT. Essentially it the same POSIT but for the coplanar case. The algorithm's description can be found in "Iterative Pose Estimation using Coplanar Feature Points" paper written by Oberkampf, Daniel F. DeMenthon and Larry S. Davis. As for implementation we are going to use CoplanarPOSIT class from the AForge.NET framework.

Suppose we want to estimate pose of the glyph like the one shown on the picture below (its corners are highlighted with different colors for further reference):

First, let's start with image coordinates of the points we are going to use for pose estimation. The image above shows 4 points colored in yellow, blue, red and green. The coordinates of the points are (all coordinates are relative to image's center; Y axis' positive direction is from center to top; original size of the image is 640x480):

(-77, 48) - yellow;
(44, 66) - blue;
(75, -36) - red;
(-61, -58) - green.

Now we need to get model's coordinates of these points. Let's suppose that we have coordinate system's center right in the center of the glyph, the glyph lies in the XZ plane and that we have left-handed coordinate system with Z axis going away from viewer, when X and Y go right and up correspondingly. So if our real glyph's size is 113 mm, for example, then its model definition should be something like this:

(-56.5, 0, 56.5) - yellow;
(56.5, 0, 56.5) - blue;
(56.5, 0, -56.5) - red;
(-56.5, 0, -56.5) - green.

The final thing we need is effective focal length. Image width can be taken as a good approximation of it. Since size of the example source image is 640x480, we take effective focal length equal to 640. Now we are ready to estimate pose of the glyph using the following code:

// define model of glyph with side length equal to 113 mm
Vector3[] modelPoints = new Vector3[]
{
    new Vector3( -56.5f, 0,  56.5f ),
    new Vector3(  56.5f, 0,  56.5f ),
    new Vector3(  56.5f, 0, -56.5f ),
    new Vector3( -56.5f, 0, -56.5f ),
};

// define image points
AForge.Point[] imagePoints = new AForge.Point[]
{
    new AForge.Point( -77,  48 ),
    new AForge.Point(  44,  66 ),
    new AForge.Point(  75, -36 ),
    new AForge.Point( -61, -58 ),
};

// create instance of pose estimation algorithm
CoplanarPosit coposit = new CoplanarPosit( modelPoints, 640 );

// estimate pose of the object
Matrix3x3 rotationMatrix;
Vector3 translationVector;

coposit.EstimatePose( imagePoints, out rotationMatrix, out translationVector );

Since the topic of the article does not cover 3D transformation matrices, perspective projection, etc., we will not go into details about how interpret the calculated transformation matrix. Instead we'll just have a look into small code, which uses the obtained rotation matrix and transformation vector – we’ll put X/Y/Z axes on top of the glyph to see how accurate the 3D pose estimation is:

// model used to draw coordinate system's axes
private Vector3[] axesModel = new Vector3[]
{
    new Vector3( 0, 0, 0 ),
    new Vector3( 1, 0, 0 ),
    new Vector3( 0, 1, 0 ),
    new Vector3( 0, 0, 1 ),
};

// transform the model and perform perspective projection
AForge.Point[] projectedAxes = PerformProjection( axesModel,
    // create tranformation matrix
    Matrix4x4.CreateTranslation( translationVector ) *        // 3: translate
    Matrix4x4.CreateFromRotation( rotationMatrix ) *          // 2: rotate
    Matrix4x4.CreateDiagonal( new Vector4( 56, 56, 56, 1 ) ), // 1: scale
    imageSize.Width );

...

private AForge.Point[] PerformProjection( Vector3[] model,
                        Matrix4x4 transformationMatrix, int viewSize )
{
    AForge.Point[] projectedPoints = new AForge.Point[model.Length];

    for ( int i = 0; i < model.Length; i++ )
    {
        Vector3 scenePoint = ( transformationMatrix *
            model[i].ToVector4( ) ).ToVector3( );

        projectedPoints[i] = new AForge.Point(
            (int) ( scenePoint.X / scenePoint.Z * viewSize ),
            (int) ( scenePoint.Y / scenePoint.Z * viewSize ) );
    }

    return projectedPoints;
}

When we have projected points of our 3D model, we just need to draw it:

// cx and cy are coordinates of image's centre

using ( Pen pen = new Pen( Color.Blue, 5 ) )
{
    g.DrawLine( pen,
        cx + projectedAxes[0].X, cy - projectedAxes[0].Y,
        cx + projectedAxes[1].X, cy - projectedAxes[1].Y );
}

using ( Pen pen = new Pen( Color.Red, 5 ) )
{
    g.DrawLine( pen,
        cx + projectedAxes[0].X, cy - projectedAxes[0].Y,
        cx + projectedAxes[2].X, cy - projectedAxes[2].Y );
}

using ( Pen pen = new Pen( Color.Lime, 5 ) )
{
    g.DrawLine( pen,
        cx + projectedAxes[0].X, cy - projectedAxes[0].Y,
        cx + projectedAxes[3].X, cy - projectedAxes[3].Y );
}

The only difference between POSIT and Coplanar POSIT algorithms for user is the fact that Coplanar POSIT algorithm provides 2 estimations of object's pose - there are two solutions of equations system for coplanar version of the algorithm. The only way to check which pose estimation is better is to apply both estimated transformations to the model, perform perspective projection and compare with provided image points. The pose estimation which leads to similar image points is supposed to be best. Note: all this is done by Coplanar POSIT algorithm's implementation automatically, so it provides the best estimation. However, if user needs it, the alternate estimation is also available (see documentation to CoplanarPosit class). But we'll get back to it ...

3D augmented realit

Now when we have all the required bits of knowledge, it is time to put them all together in order to get 3D augmented reality, where a virtual 3D object is put on top of the real glyph.

3D rendering

One of the first things to start from is to decide which library/framework to use for 3D rendering. For this augmented reality project I decided to try Microsoft's XNA framework. Note: since the main topic of this article is not related to XNA, a beginners' introduction into XNA will not be part of it.

Since XNA framework is targeted to games development mostly, its integration with WinForms applications was not something straight forward from its very first release. The idea was that XNA manages entire game's window, graphics and input/output. However things have improved since that time and there are official samples exist showing integration of XNA into WinForms applications. Following some of those XNA samples and tutorials, it will become clear at some point in time that a simple code for rendering a small model may look something like this:

protected override void Draw( )
{
    GraphicsDevice.Clear( Color.Black );

    // draw simple models for now with single mesh
    if ( ( model != null ) && ( model.Meshes.Count == 1 ) )
    {
        ModelMesh mesh = model.Meshes[0];

        // spin the object according to how much time has passed
        float time = (float) timer.Elapsed.TotalSeconds;

        // object's rotation and transformation matrices
        Matrix rotation = Matrix.CreateFromYawPitchRoll(
            time * 0.5f, time * 0.6f, time * 0.7f );
        Matrix translation = Matrix.CreateTranslation( 0, 0, 0 );

        // create transform matrices
        Matrix viewMatrix = Matrix.CreateLookAt(
            new Vector3( 0, 0, 3 ), Vector3.Zero, Vector3.Up );
        Matrix projectionMatrix = Matrix.CreatePerspective(
            1, 1 / GraphicsDevice.Viewport.AspectRatio, 1f, 10000 );
        Matrix world = Matrix.CreateScale( 1 / mesh.BoundingSphere.Radius ) *
            rotation * translation;

        foreach ( Effect effect in mesh.Effects )
        {
            if ( effect is BasicEffect )
            {
                ( (BasicEffect) effect ).EnableDefaultLighting( );
            }

            effect.Parameters["World"].SetValue( world );
            effect.Parameters["View"].SetValue( viewMatrix );
            effect.Parameters["Projection"].SetValue( projectionMatrix );
        }

        mesh.Draw( );
    }
}

How much will the above code differ from the complete AR rendering? It will not be different too much actually. The above code is missing only 2 things to get some augmented reality out of it: 1) draw real scene instead of filling it with black color; 2) use proper world transformation matrix (scaling, rotation and transformation) for the virtual object to put onto a glyph. That's it - just 2 things.

For the augmented reality scene we need to render pictures of real world - video coming from camera, file or any other source and containing some optical glyphs to recognize. Without going into video acquisition/reading details, we can just assume that every new video frame is provided as .NET's Bitmap. Apparently XNA framework does not care too much about GDI+ bitmaps and does not provide means for rendering those. So we need a tool method, which allows converting Bitmap into XNA's 2D texture to render:

// Convert GDI+ bitmap to XNA texture
public static Texture2D XNATextureFromBitmap( Bitmap bitmap, GraphicsDevice device )
{
    int width  = bitmap.Width;
    int height = bitmap.Height;

    Texture2D texture = new Texture2D( device, width, height,
        1, TextureUsage.None, SurfaceFormat.Color );

    BitmapData data = bitmap.LockBits( new Rectangle( 0, 0, width, height ),
        ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb );

    int bufferSize = data.Height * data.Stride;

    // copy bitmap data into texture
    byte[] bytes = new byte[bufferSize];
    Marshal.Copy( data.Scan0, bytes, 0, bytes.Length );
    texture.SetData( bytes );

    bitmap.UnlockBits( data );

    return texture;
}

Once a bitmap containing current video frame is converted to XNA's texture, it can be rendered before rendering 3D models, so those sit on top of some real world picture instead of black background. The only important thing to note is that after doing some 2D rendering it is required to restore some states of the XNA graphics device, which are shared between 2D and 3D graphics, but changed by texture rendering for its purposes.

// draw texture containing video frame
mainSpriteBatch.Begin( SpriteBlendMode.None );
mainSpriteBatch.Draw( texture, new Vector2( 0, 0 ), Color.White );
mainSpriteBatch.End( );

// restore state of some graphics device's properties after 2D graphics,
// so 3D rendering will work fine
GraphicsDevice.RenderState.DepthBufferEnable = true;
GraphicsDevice.RenderState.AlphaBlendEnable  = false;
GraphicsDevice.RenderState.AlphaTestEnable   = false;

GraphicsDevice.SamplerStates[0].AddressU = TextureAddressMode.Wrap;
GraphicsDevice.SamplerStates[0].AddressV = TextureAddressMode.Wrap;

The last and the most important part is to make sure that size, position and rotation of the rendered model correspond to the pose and position of a glyph existing in the real world. All this is not complex at this point, since it was already described in previous chapter. Now we just need to combine that all together.

Bringing optical glyph from real to virtual world

As it was mentioned above, Coplanar POSIT algorithm provides estimated rotation matrix and translation vector. Something like this:

// estimate pose of the object
Matrix3x3 rotationMatrix;
Vector3 translationVector;

coposit.EstimatePose( imagePoints, out rotationMatrix, out translationVector );

When we have glyph's rotation and translation known, we can update the XNA part to use this information in order to put 3D model into correct place and use proper rotation and size for it. Here is the part of the code (copied from initial XNA code sample) which calculates model's world matrix for XNA rendering - we will need to change this part only to complete augmented reality scene, since we already have all the rest:

...
Matrix world = Matrix.CreateScale( 1 / mesh.BoundingSphere.Radius ) *
    rotation * translation;
...

Someone potentially may think that converting AForge.NET framework's matrices/vectors to XNA's matrices should be enough to get everything working. However it is not. Although XNA uses column wise matrix representation, but AForge.NET framework uses row wise it is not the major difference to take care of. What we need to take care is the fact that XNA uses different coordinate system from the one used by pose estimation code. XNA uses right-handed coordinate system, where Z axis is directed from origin to viewer when X and Y axes are directed to right and up respectively. In such coordinates system increasing Z coordinate of an object makes it closer to viewer (camera), which makes it look bigger on projected screen. However in real world we have the opposite case - larger Z coordinate of an object means it is further away from viewer. This is known as left-handed coordinate system, when Z axis points away from viewer and X/Y axes have the same direction (right/up). So we need to convert glyph's estimated pose coordinates from left-handed to right-handed system.

The first part of conversion real world's coordinates to XNA's is to negate object's Z coordinate, so the further away an object in real world - the deeper it is in XNA scene. And the second part is to convert object's rotation angles - negate rotation around X and Y axes.

One more important thing - we need to scale XNA's 3D model. As we've seen above, we described glyph's model in millimeters. So pose estimation algorithm estimated glyph's translation also in millimeters. This will result in model's Z coordinate set to ~ -200, when a glyph is about 20 centimeters away from camera, which will make 3D model look tiny on XNA scene if model's original size is small. So all we need to do is just to scale 3D model, so it has "comparable" size to the glyph's size.

Putting all this together will replace the above mentioned line of code (which computes XNA object's world matrix) with the next code:

float modelSize = 113;

// extract rotation angles from the estimated rotation
float yaw, pitch, roll;
positRotation.ExtractYawPitchRoll( out yaw, out pitch, out roll );

// create XNA's rotation matrix
Matrix rotation = Matrix.CreateFromYawPitchRoll( -yaw, -pitch, roll );

// create XNA's translation matrix
Matrix translation = Matrix.CreateTranslation(
    positTranslation.X, positTranslation.Y, -positTranslation.Z );

// create scaling matrix, so model fits its glyph
Matrix scaling = Matrix.CreateScale( modelSize );

// finally compute XNA object's world matrix
Matrix world = Matrix.CreateScale( 1 / mesh.BoundingSphere.Radius ) *
    scaling * rotation * translation;

Well, that is it - augmented reality is done. With all the above code put together we should get an XNA screen like this:

Few things behind the scene

Although all the above is enough to get 3D augmented reality, there are few things which may be worth of mentioning. One thing is related to "noise" in glyph's corners detection. If you take a closer look at one of the videos shown above (glyph recognition and 2D augmented reality), you may notice that in some cases corners of some glyphs may do kind of shaking (moving one-two pixels) although the entire glyph is supposed to be static. This glyph shaking effect can be caused by different factors - noise in video stream, noise in illumination, artifacts of video compression, etc. All these factors lead to small errors in detection of glyphs' corners, which may vary by few pixels between consequent video frames.

This type of glyph's shaking is not an issue for those applications which require glyph detection/recognition only. But in augmented reality applications small errors like this may cause some unwanted visual effects which don't look nice. As it can be seen on the previous videos, the one-pixel change in glyph's coordinates already makes a shaking picture in 2D augmented reality. In 3D augmented reality this would be even worse, since a small change in few pixels will lead to a bit different estimation of 3D pose, which will make 3D model to shake even more.

To eliminate the above described noise in corners detection leading to AR model shaking, it is possible to implement glyphs' coordinates tracking. For example, if maximum change in all 8 coordinates of glyph's corners is 2 pixels or more, than the glyph is supposed to be moving. Otherwise, when maximum change is 1 pixel only, it is treated as noise and glyph's previous coordinates are used. One more check which can be done is to count number of corners, which changed its position by more than 1 pixel. If it is only one such corner, then it is also treated as noise. This rule is caused by the assumption that it is hardly possible to rotate a glyph in such way, that after perspective projection only one corner will change its position.

Another issue which may cause some 3D augmented realty artifacts is related to 3D pose estimation using Coplanar POSIT algorithm. As it is said in description of the algorithm, its math may come up with two valid estimations of 3D pose (valid from the math point of view). Of course both estimations are examined to find how well they are and error value is calculated for each estimation. However error values for both estimations may be quite small and potentially a wrong estimation may get lower error (again due to noise and imperfection in corners detection) on one of the video frames. This may produce bad looking effect in augmented reality, when most of the time a 3D model is displayed correctly, but from time to time its pose changes to something completely different.

The above mentioned 3D pose estimation errors also can be handled by tracking glyph's pose. For example, if best estimated pose has error value which is twice (or more) less than the error of alternate pose, then such pose is always believed to be correct. However, if difference in error values for both poses is small, then the tracking algorithm selects the pose, which seems to be closer to the glyph's pose detected on the previous video frame.

(Note: code samples for the above described tracking routines are skipped in the article and can be found in complete source code of the GRATF project)

The final result

And now it is time for the final video of 3D augmented reality with all the noise suppression and 3D pose corrections ...

Conclusion

It took me a while to complete the project from its very first stage, when a glyph recognition algorithm was prototyped, till the final result, which is the 3D augmented reality. But I must admit I enjoyed doing it and learned a lot, especially taking into account that most of it was done from scratch - just brainstorming about the algorithms, looking for bits of knowledge around the Internet, etc. Could it be done quicker? Sure. For me it was just a hobby project driven when time permits.

Although it was done a lot to get it working, there is still more to continue in order to improve it. For example, one of the crucial areas is glyph detection/recognition. At this point the algorithm may fail to detect a glyph if it moves too fast for current illumination conditions and camera's exposure time. In this case glyph's image gets blurred making it hard to do any recognition with it. Further improvements could be done in 3D pose estimation algorithms. And of course there is a lot can be done about tracking glyphs. For example, it could be possible to calculate glyph's movement/rotation velocity and acceleration along 3 axes, which could be used for making some nice 3D games and effects.

At this point all the work being done was published as an open source project. The GRATF project consists of the 2 main parts: 1) a glyph localization, recognition and pose estimation library and 2) a Glyph Recognition Studio application, which shows all in action including 2D/3D augmented reality. Since the core algorithms are put outside into a library, it makes them easy to integrate and use in another application, which requires either glyph recognition only or something more like augmented reality.

I really hope this article will find its readers and the project will find its users, so the work could be reused and extended to bring new cool applications. Or at least is could be somehow useful to all those, who start projects related to glyph recognition or just learn about computer vision.