Efficient Face Detection Algorithm using Viola Jones method

Ali Tarhini

4.82/5 (15 votes)

Jun 3, 2010

CPOL

32 min read

146328

Efficient Face Detection Algorithm using Viola Jones method Introduction Object detection is detecting a specified object class such as cars, faces, plates ext.

Efficient Face Detection Algorithm using Viola Jones method

Article is outdated. View updated one here

Introduction

Object detection is detecting a specified object class such as cars, faces, plates ext. in a given image or a video sequence. Object detection has many applications in computer based vision such as object tracking, object recognition, and scene surveillance.

Face detection has been regarded as the most complex and challenging problem in the field of computer vision, due to the large intra-class variations caused by the changes in facial appearance, lighting, and expression. Such variations result in the face distribution to be highly nonlinear and complex in any space which is linear to the original image space. Moreover, in the applications of real life surveillance and biometric, the camera limitations and pose variations make the distribution of human faces in feature space more dispersed and complicated than that of frontal faces. It further complicates the problem of robust face detection.

Face detection techniques have been researched for years and much progress has been proposed in literature. Most of the face detection methods focus on detecting frontal faces with good lighting conditions. According to Yang’s survey, these methods can be categorized into four types: knowledge-based, feature invariant, template matching and appearance-based.

Knowledge-based methods use human-coded rules to model facial features, such as two symmetric eyes, a nose in the middle and a mouth underneath the nose.
Feature invariant methods try to find facial features which are invariant to pose, lighting condition or rotation. Skin colors, edges and shapes fall into this category.
Template matching methods calculate the correlation between a test image and pre-selected facial templates.
Appearance-based, adopts machine learning techniques to extract discriminative features from a pre-labeled training set. The Eigenface method is the most fundamental method in this category. Recently proposed face detection algorithms such as support vector machines, neural networks (Rowley-Balaujam, 1998), statistical classifiers (Schneiderman-Kanade, 2000) and AdaBoost-based face detection also belong to this class.

Any of the methods can involve color segmentation, pattern matching, statistical analysis and complex transforms, where the common goal is classification with least amount of error. Bounds on the classification accuracy change from method to method yet the best techniques are found in areas where the models or rules for classification are dynamic and produced from machine learning processes.

Paul Viola and Michael Jones presented a fast and robust method for face detection which is 15 times quicker than any technique at the time of release with 95% accuracy at around 17 fps. Viola Jones Faces

The technique relies on the use of simple Haar-like features that are evaluated quickly through the use of a new image representation. Based on the concept of an “Integral Image” it generates a large set of features and uses the boosting algorithm AdaBoost to reduce the over-complete set and the introduction of a degenerative tree of the boosted classifiers provides for robust and fast interferences. The detector is applied in a scanning fashion and used on gray-scale images, the scanned window that is applied can also be scaled, as well as the features evaluated.

In the technique only simple rectangular (Haar-like) features are used, reminiscent to Haar basis functions. These features are equivalent to intensity difference readings and are quite easy to compute. There are three feature types used with varying numbers of sub-rectangles, two, two rectangles, one three and one four rectangle feature types. Using rectangular features instead of the pixels in an image provides a number of benefits, namely a sort of a ad-hoc domain knowledge is implied as well as a speed increase over pixel based systems. The calculation of the features is facilitated with the use of an “integral image”. With the introduction of a integral image Viola and Jones are able to calculate in one pass of the sample image, and is one of the keys to the speed of the system. An integral image is similar to a “summed are table”, used in computer graphics but its use is applied in pixel area evaluation.

It was outlined that the implementation of a system that used such features would provide a feature set that was far too large, hence the feature set must be only restricted to a small number of critical features. This is done with the use of boosting algorithm, AdaBoost. Interference is enhanced with the use of AdaBoost where a small set of features is selected from a large set, and in doing so a strong hypothesis is formed, in this case resulting in a strong classifier. Simply having a reduced set of features was not enough to reduce the vast amounts of computation in a detector task, since it is naturally a probabilistic one, hence Viola and Jones proposed the use of degenerative tree of classifiers.

Described by Viola and Jones as a degenerative tree, and sometimes referred to as a decision stump, its use also speeds the detection process. A degenerative tree is the daisy chaining of general to specific classifiers, whereby the first few classifiers are general enough to discount an image sub window and so on the time of further observations by the more specific classifiers down the chain, this can save a large degree of computation.

Features

In Viola-Jones system a simple feature is used, with relation to the feature sets.

Viola and Jones make note that the fact the choice of features instead of a statistical pixel based system is important due to the benefit of ad-hoc domain encoding. In the case of face detection this is particularly important. Features can be used to represent both the statistically close facial information and sparsely related background data in a sample image.

In its simplest form the features can be thought of as pixel intensity set evaluations. This is where the sum of the luminance of the pixels in the while region of the feature is subtracted from the sum of the luminance in the remaining gray section. This difference value is used as the feature value, and can be combined to form a weak hypothesis on regions of the image. Within the implementation four of the Haar-like features are chosen, the first with horizontal division, the second a vertical, the third containing two vertical divisions and the last containing both the horizontal and vertical division. The features are called Haar-like because of their resemblance to Haar-basis functions.

Integral Image

In order to be successful a face detection algorithm must possess two key features, accuracy and speed. There is generally a trade-off between the two. Through the use of a new image representation, termed integral images, Viola and Jones describe a means for fast feature evaluation, and this proves to be an effective means to speed up the classification task of the system.

Integral images are easy to understand, they are constructed by simply taking the sum of the luminance values above and to the left of a pixel in an image. Viola and Jones make note of the fact that the integral image is effectively the double integral of the sample image, first along the rows then along the columns. Integral images are equivalent to summed-area tables, yet their use is not texture mapping, being so, their implementation us quite well documented.

1	1	1
1	1	1
1	1	1

1	2	3
2	4	6
3	6	9

The brilliance in using an integral image to speed up a feature extraction lies in the fact that any rectangle in an image can be calculated from that images integral image, in only four indexes to the integral image. This makes the otherwise exhaustive process of summing luminances quite rapid. In fact the calculation of an images integral image can be calculated in only one pass of the image, and Matlab experiments have shown that a large set of images (12000) can be calculated within less than 2 seconds.

Integral application

Given a rectangle specified as four coordinates A(x1,y1) upper left and D(x4,y4) lower right, evaluating the area of the rectangle is done in four image references to the integral image, this represents a huge performance increase in terms of feature extraction.

Sum of grey rectangle = D - (B + C) + A

Since both rectangle B and C include rectangle A the sum of A has to be added to the calculation.

Adaboost

The AdaBoost algorithm was introduced in 1995 by Freund and Schapire. AdaBoost solved many of the practical problems that existed in the previous boosting Algorithms. The algorithm takes a feature set and a training set of positive and negative images, any number of machine learning approaches could be used to learn a classification function. Sung and Poggio use a mixture of Gaussian model. Rowley, Baluja, and Kanade use a small set of simple image features and a neural network.

Osuna, et al. used a support vector machine. More recently Roth et al. have proposed a new and unusual image representation and have used the Winnow learning procedure.

Recall that there are 45,396 rectangle features associated with each image sub-window, a number far larger than the number of pixels. Even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive. The hypothesis, which is borne out by experiment, is that a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features.

In this system a variant of AdaBoost is used both to select the features and to train the classifier.

In its original form, the AdaBoost learning algorithm is used to boost the classification performance of a simple learning algorithm (e.g., it might be used to boost the performance of a simple perceptron). It does this by combining a collection of weak classification functions to form a stronger classifier. In the language of boosting the simple learning algorithm is called a weak learner. So, for example the perceptron learning algorithm searches over the set of possible perceptrons and returns the perceptron with the lowest classification error. The learner is called weak because we do not expect even the best classification function to classify the training data well (i.e. for a given problem the best perceptron may only classify the training data correctly 51% of the time). In order for the weak learner to be boosted, it is called upon to solve a sequence of learning problems. After the first round of learning, the examples are re-weighted in order to emphasize those which were incorrectly classified by the previous weak classifier. The final strong classifier takes the form of a perceptron, a weighted combination of weak classifiers followed by a threshold.

The formal guarantees provided by the AdaBoost learning procedure are quite strong. Freund and Schapire proved that the training error of the strong classifier approaches zero exponentially in the number of rounds. More importantly a number of results were later proved about generalization performance. The key insight is that generalization performance is related to the margin of the examples, and that AdaBoost achieves large margins rapidly.

The conventional AdaBoost procedure can be easily interpreted as a greedy feature selection process.

Consider the general problem of boosting, in which a large set of classification functions are combined using a weighted majority vote. The challenge is to associate a large weight with each good classification function and a smaller weight with poor functions. AdaBoost is an aggressive mechanism for selecting a small set of good classification functions which nevertheless have significant variety. Drawing an analogy between weak classifiers and features, AdaBoost is an effective procedure for searching out a small number of good “features” which nevertheless have significant variety.

Weak Classifier

The core inference provided in the system is brought through the use of “weak classifiers” where a weak classifier hj is a simple structure containing a feature vector fj, threshold ?j and parity pj. Where the output of the classifier is dependent on whether or not the feature value if less than the threshold, hj result is binary. A weak classifier set is essentially produced when the complete feature set has been derived. From the feature set it is possible to evaluate the individual classifier thresholds (across a large image dataset) and also define the parity of each classifier.
Note: Here “x” is a 24x24 pixel sub-window of an image.

Thresholding

A specific method for classifier thresholding is not defined explicitly by Viola and Jones; one can only assume they would favor the less error prediction of the threshold. The implementation described in this book use the simple mean average, and yielded adequate results in doing so, similar to published results.

Boosting

In practice no single feature can perform the classification task with low error. Features which are selected early in the process yield error rates between 0.1 and 0.3. Features selected in later rounds, as the task becomes more difficult, yield error rates between 0.4 and 0.5. The table below shows the learning algorithm.

The boosting algorithm for learning a query online. T hypotheses are constructed each using a single feature. The final hypothesis is a weighted linear combination of the T hypotheses where the weights are inversely proportional to the training errors.

Learning Discussion

To understand boosting we will use the horse-racing gambler example. A horse-racing gambler, hoping to maximize his winnings, decides to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks a highly successful expert gambler to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a ”rule of thumb” for that set of races (such as, ”Bet on the horse that has recently won the most races” or ”Bet on the horse with the most favored odds”). Although such a rule of thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert’s opinion on different collections of races, the gambler is able to extract many rules of thumb.

In order to use these rules of thumb to maximum advantage, there are two problems faced by the gambler: first, how should he choose the collections of races presented to the expert so as to extract rules of thumb from the expert that will be the most useful? Second, once he has collected many rules of thumb, how can they be combined into a single, highly accurate prediction rule?

Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above.

Boosting algorithm framework

Many general feature selection procedures have been proposed. Our final application demanded a very aggressive process which would discard the vast majority of features. For a similar recognition problem Papageorgiou et al. proposed a scheme for feature selection based on feature variance. They demonstrated good results selecting 37 features out of a total 1734 features. While this is a significant reduction, the number of features evaluated for every image sub-window is still reasonably large.

Roth et al. propose a feature selection process based on the Winnow exponential perceptron learning rule. These authors use a very large and unusual feature set, where each pixel is mapped into a binary vector of d dimensions (when a particular pixel takes on the value x, in the range [0, d-1], the xth dimension is set to 1 and the other dimensions to 0). The binary vectors for each pixel are concatenated to form a single binary vector with “nd” dimensions (n is the number of pixels). The classification rule is a perceptron, which assigns one weight to each dimension of the input vector. The Winnow learning process converges to a solution where many of these weights are zero. Nevertheless a very large number of features are retained (perhaps a few hundred or thousand).

Learning Results

Initial experiments demonstrated that a classifier constructed from 200 features would yield reasonable results. Given detection rate of 95% the classifier yielded a false positive rate of 1 in 14084 on a testing dataset.

For the task of face detection, the initial rectangle features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks. This feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive to size and location of the face. The second feature selected relies on the property that the eyes are darker than the bridge of the nose.

In summary the 200-feature classifier provides initial evidence that a boosted classifier constructed from rectangle features is an effective technique for object detection. In terms of detection, these results are compelling but not sufficient for many real-world tasks. In terms of computation, this classifier is probably faster than any other published system, requiring 0.7 seconds to scan a 384 by 288 pixel image. Unfortunately, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time.

The Attentional Cascade

This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances. Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.

Stages in the cascade are constructed by training classifiers using AdaBoost. Starting with a two-feature strong classifier, an effective face filter can be obtained by adjusting the strong classifier threshold to minimize false negatives. The initial AdaBoost threshold, , is designed to yield a low error rate on the training data. A lower threshold yields higher detection rates and higher false positive rates. Based on performance measured using a validation training set, the two-feature classifier can be adjusted to detect 100% of the faces with a false positive rate of 40%.

The detection performance of the two-feature classifier is far from acceptable as an object detection system. Nevertheless the classifier can significantly reduce the number sub-windows that need further processing with very few operations:

1. Evaluate the rectangle features (requires between 6 and 9 array references per feature).
2. Compute the weak classifier for each feature (requires one threshold operation per feature).
3. Combine the weak classifiers (requires one multiply per feature, an addition, and finally a threshold).

A two feature classifier amounts to about 60 microprocessor instructions. It seems hard to imagine that any simpler filter could achieve higher rejection rates. By comparison, scanning a simple image template, or a single layer perceptron, would require at least 20 times as many operations per sub-window.

The overall form of the detection process is that of a degenerate decision tree, what we call a “cascade”. A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve very high detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window.

Schematic depiction of a the detection cascade. A series of classifiers are applied to every subwindow.

The initial classifier eliminates a large number of negative examples with very little processing.

Subsequent layers eliminate additional negatives but require additional computation. After several stages of processing the number of sub-windows have been reduced radically. Further processing can take any form such as additional stages of the cascade (as in our detection system) or an alternative detection system.

The structure of the cascade reflects the fact that within any single image an overwhelming majority of sub-windows are negative. As such, the cascade attempts to reject as many negatives as possible at the earliest stage possible. While a positive instance will trigger the evaluation of every classifier in the cascade, this is an exceedingly rare event.

Much like a decision tree, subsequent classifiers are trained using those examples which pass through all the previous stages. As a result, the second classifier faces a more difficult task than the first. The examples which make it through the first stage are “harder” than typical examples. The more difficult examples faced by deeper classifiers push the entire receiver operating characteristic (ROC) curve downward. At a given detection rate, deeper classifiers have correspondingly higher false positive rates.

Training a Cascade of Classifiers

The cascade design process is driven from a set of detection and performance goals. For the face detection task, past systems have achieved good detection rates (between 85 and 95 percent) and extremely low false positive rates (on the order of 10-5 or 10-6). The number of cascade stages and the size of each stage must be sufficient to achieve similar detection performance while minimizing computation.

Given a trained cascade of classifiers, the false positive rate of the cascade is

Where F is the false positive rate of the cascaded classifier, K is the number of classifiers, and fi is the false positive rate of the ith classifier on the examples that get through to it. The detection rate is

Where D is the detection rate of the cascaded classifier, K is the number of classifiers, and di is the detection rate of the ith classifier on the examples that get through to it.

Given concrete goals for overall false positive and detection rates, target rates can be determined for each stage in the cascade process. For example a detection rate of 0.9 can be achieved by a 10 stage classifier if each stage has a detection rate of 0.99 (since 0.9 ˜ 0.99¹⁰). While achieving this detection rate may sound like a daunting task, it is made significantly easier by the fact that each stage need only achieve a false positive rate of about 30% (0.3 ˜ 6* 10^-6).

The number of features evaluated when scanning real images is necessarily a probabilistic process. Any given sub-window will progress down through the cascade, one classifier at a time, until it is decided that the window is negative or, in rare circumstances, the window succeeds in each test and is labeled positive.

The expected behavior of this process is determined by the distribution of image windows in a typical test set. The key measure of each classifier is its “positive rate”, the proportion of windows which are labeled as potentially containing the object of interest. The expected number of features which are evaluated is:

Where N is the expected number of features evaluated, K is the number of classifiers, pi is the positive rate of the ith classifier, and ni are the number of features in the ith classifier. Interestingly, since objects are extremely rare the “positive rate” is effectively equal to the false positive rate.

The process by which each element of the cascade is trained requires some care. The AdaBoost learning procedure presented in Section 3 attempts only to minimize errors, and is not specifically designed to achieve high detection rates at the expense of large false positive rates. One simple, and very conventional, scheme for trading off these errors is to adjust the threshold of the perceptron produced by AdaBoost. Higher thresholds yield classifiers with fewer false positives and a lower detection rate. Lower thresholds yield classifiers with more false positives and a higher detection rate. It is not clear, at this point, whether adjusting the threshold in this way preserves the training and generalization guarantees provided by AdaBoost.

The overall training process involves two types of tradeoffs. In most cases classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time classifiers with more features require more time to compute. In principle one could define an optimization framework in which

the number of classifier stages
the number of features, ni, of each stage
threshold of each stage

are traded off in order to minimize the expected number of features N given a target for F and D. Unfortunately finding this optimum is a tremendously difficult problem.

In practice a very simple framework is used to produce an effective classifier which is highly efficient. The user selects the minimum acceptable rates for fi and di. Each layer of the cascade is trained by AdaBoost with the number of features used being increased until the target detection and false positives rates are met for this level. The rates are determined by testing the current detector on a validation set. If the overall target false positive rate is not yet met then another layer is added to the cascade. The negative set for training subsequent layers is obtained by collecting all false detections found by running the current detector on a set of images which do not contain any instances of the object.

Simple Experiment

In order to explore the feasibility of the cascade approach two simple detectors were trained: a monolithic 200-feature classifier and a cascade of ten 20-feature classifiers. The first stage classifier in the cascade was trained using 5000 faces and 10000 non-face sub-windows randomly chosen from non-face images. The second stage classifier was trained on the same 5000 faces plus 5000 false positives of the first classifier. This process continued so that subsequent stages were trained using the false positives of the previous stage. The monolithic 200-feature classifier was trained on the union of all examples used to train all the stages of the cascaded classifier. Note that without reference to the cascaded classifier, it might be difficult to select a set of non-face training examples to train the monolithic classifier. We could of course use all possible sub-windows from all of our non-face images, but this would make the training time impractically long.

The sequential way in which the cascaded classifier is trained effectively reduces the non-face training set by throwing out easy examples and focusing on the “hard” ones.

Detector Cascade Discussion

A notion similar to the cascade appears in the face detection system described by Rowley et al. trained two neural networks. One network was moderately complex, focused on a small region of the image, and detected faces with a low false positive rate. They also trained a second neural network which was much faster, focused on larger regions of the image, and detected faces with a higher false positive rate. Rowley et al. used the faster second network to prescreen the image in order to find candidate regions for the slower more accurate network. Though it is difficult to determine exactly, it appears that Rowley et al.’s two network face system is the fastest existing face detector. Our system uses a similar approach, but it extends this two stage cascade to include 32 stages.

The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Amit and Geman. Unlike techniques which use a fixed detector, Amit and Geman propose an alternative point of view where unusual co-occurrences of simple image features are used to trigger the evaluation of a more complex detection process. In this way the full detection process need not be evaluated at many of the potential image locations and scales. While this basic insight is very valuable, in their implementation it is necessary to first evaluate some feature detector at every location. These features are then grouped to find unusual co-occurrences. In practice, since the form of our detector and the features that it uses are extremely efficient, the amortized cost of evaluating our detector at every scale and location is much faster than finding and grouping edges throughout the image.

In recent work Fleuret and Geman have presented a face detection technique which relies on a “chain” of tests in order to signify the presence of a face at a particular scale and location. The image properties measured by Fleuret and Geman, disjunctions of fine scale edges, are quite different than rectangle features which are simple, exist at all scales, and are somewhat interpretable. The two approaches also differ radically in their learning philosophy. The motivation for Fleuret and Geman’s learning process is density estimation and density discrimination, while our detector is purely discriminative. Finally the false positive rate of Fleuret and Geman’s approach appears to be higher than that of previous approaches like Rowley et al. and this approach. Unfortunately the paper does not report quantitative results of this kind. The included example images each have between 2 and 10 false positives.

Training Dataset

The face training set consisted of 4916 hand labeled faces scaled and aligned to a base resolution of 24 by 24 pixels. The faces were extracted from images downloaded during a random crawl of the World Wide Web. Some typical face examples are shown in Figure 8. Notice that these examples contain more of the head than the examples used by Rowley or et al. or Sung. Initial experiments also used 16 by 16 pixel training images in which the faces were more tightly cropped, but got slightly worse results. Presumably the 24 by 24 examples include extra visual information such as the contours of the chin and cheeks and the hair line which help to improve accuracy. Because of the nature of the features used, the larger sized sub windows do not slow performance. In fact, the additional information contained in the larger sub-windows could be used to reject non-faces earlier in the detection cascade.

Structure of the Detector Cascade

The final detector is a 32 layer cascade of classifiers which included a total of 4297 features.

The first classifier in the cascade is constructed using two features and rejects about 60% of non-faces while correctly detecting close to 100% of faces. The next classifier has five features and rejects 80% of non-faces while detecting almost 100% of faces. The next three layers are 20-feature classifiers followed by two 50-feature classifiers followed by five 100-feature classifiers and then twenty 200-feature classifiers.

The particular choices of number of features per layer were driven through a trial and error process in which the number of features was increased until a significant reduction in the false positive rate could be achieved. More levels were added until the false positive rate on the validation set was nearly zero while still maintaining a high correct detection rate. The final number of layers, and the size of each layer, is not critical to the final system performance.

The two, five and first twenty-feature classifiers were trained with the 4916 faces and 10,000 non-face sub-windows (also of size 24 by 24 pixels) using the Adaboost training procedure. The non-face sub-windows were collected by selecting random sub-windows from a set of 9500 images which did not contain faces. Different sets of non-face sub-windows were used for training the different classifiers to ensure that they were somewhat independent and didn’t use the same features.

The non-face examples used to train subsequent layers were obtained by scanning the partial cascade across large non-face images and collecting false positives. A maximum of 6000 such non-face sub-windows were collected for each layer. There are approximately 350 million non-face sub-windows contained in the 9500 non-face images.

Training time for the entire 32 layer detector was on the order of weeks on a single 466 MHz Alpha Station XP900. During this laborious training process several improvements to the learning algorithm were discovered. These improvements, which will be described elsewhere, yield a 100 fold decrease in training time.

Speed of the Final Detector

The speed of the cascaded detector is directly related to the number of features evaluated per scanned sub window. The number of features evaluated depends on the images being scanned. Evaluated on the MIT+CMU test set, an average of 8 features out of a total of 4297 is evaluated per sub-window. This is possible because a large majority of sub-windows are rejected by the first or second layer in the cascade.

Scanning the Detector

The final detector is scanned across the image at multiple scales and locations. Scaling is achieved by scaling the detector itself, rather than scaling the image. This process makes sense because the features can be evaluated at any scale with the same cost. Good results were obtained using a set of scales a factor of 1.25 apart.

The detector is also scanned across location. Subsequent locations are obtained by shifting the window some number of pixels ?. This shifting process is affected by the scale of the detector: if the current scale is “s” the window is shifted by [s?], where [] is the rounding operation.

The choice of ? affects both the speed of the detector as well as accuracy. Since the training images have some translational variability the learned detector achieves good detection performance in spite of small shifts in the image. As a result the detector sub-window can be shifted more than one pixel at a time. However, a step size of more than one pixel tends to decrease the detection rate slightly while also decreasing the number of false positives. We present results for two different step sizes.

Integration of Multiple Detections

Since the final detector is insensitive to small changes in translation and scale, multiple detections will usually occur around each face in a scanned image. The same is often true of some types of false positives. In practice it often makes sense to return one final detection per face. Toward this end it is useful to postprocess the detected sub-windows in order to combine overlapping detections into a single detection. In these experiments detections are combined in a very simple fashion. The set of detections are first partitioned into disjoint subsets. Two detections are in the same subset if their bounding regions overlap.

Each partition yields a single final detection. The corners of the final bounding region are the average of the corners of all detections in the set.

In some cases this postprocessing decreases the number of false positives since an overlapping subset of false positives is reduced to a single detection.

Conclusion

Paul Viola and Michael Jones presented an approach for object detection which minimizes computation time while achieving high detection accuracy. The approach was used to construct a face detection system which is approximately 15 faster than any previous approach. Preliminary experiments, which will be described elsewhere, show that highly efficient detectors for other objects, such as pedestrians, can also be constructed in this way.

New algorithms, representations, and insights where presented which are quite generic and may have broader application in computer vision and image processing.

The first contribution is a new a technique for computing a rich set of image features using the integral image. In order to achieve true scale invariance, almost all object detection systems must operate on multiple image scales. The integral image, by eliminating the need to compute a multi-scale image pyramid, reduces the initial image processing required for object detection significantly. In the domain of face detection the advantage is quite dramatic. Using the integral image, face detection is completed before an image pyramid can be computed.

While the integral image should also have immediate use for other systems which have used Harr-like features, it can foreseeably have impact on any task where Harr-like features may be of value. Initial experiments have shown that a similar feature set is also effective for the task of parameter estimation, where the expression of a face, the position of a head, or the pose of an object is determined.

The second contribution is a technique for feature selection based on AdaBoost. An aggressive and effective technique for feature selection will have impact on a wide variety of learning tasks. Given an effective tool for feature selection, the system designer is free to define a very large and very complex set of features as input for the learning process. The resulting classifier is nevertheless computationally efficient, since only a small number of features need to be evaluated during run time. Frequently the resulting classifier is also quite simple; within a large set of complex features it is more likely that a few critical features can be found which capture the structure of the classification problem in a straightforward fashion.

The third contribution is a technique for constructing a cascade of classifiers which radically reduce computation time while improving detection accuracy. Early stages of the cascade are designed to reject a majority of the image in order to focus subsequent processing on promising regions. One key point is that the cascade presented is quite simple and homogeneous in structure. Previous approaches for attentive filtering, propose a more complex and heterogeneous mechanism for filtering. Similarly a hierarchical structure for detection is proposed in which the stages are quite different in structure and processing. A homogenous system, besides being easy to implement and understand, has the advantage that simple tradeoffs can be made between processing time and detection performance.

Face detection is the essential first step towards many advanced computer vision, biometrics recognition and multimedia applications, such as face tracking, face recognition, and video surveillance.

An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing. This increase in speed will enable real-time face detection applications on systems where they were previously infeasible. In applications where rapid frame-rates are not necessary, our system will allow for significant additional post-processing and analysis. In addition our system can be implemented on a wide range of small low power devices, including hand-held’s and embedded processors.