Viola–Jones object detection framework

The Viola–Jones object detection framework, proposed in 2001 by Paul Viola and Michael Jones, was the first object detection framework to provide competitive object detection rates in real time. Although it can be trained to detect a variety of object classes, it was motivated primarily by the problem of face detection.

Problem description

The problem to be solved is detection of faces in an image. A human can do this easily, but a computer needs precise instructions and constraints. To make the task more manageable, Viola–Jones requires full-view, frontal, upright faces. Thus, in order to be detected, the entire face must point towards the camera and should not be tilted to either side. These constraints might seem to diminish the algorithm's utility somewhat, but because the detection step is most often followed by a recognition step, in practice these limits on pose are quite acceptable.

Components of the framework

Feature types and evaluation

The characteristics of the Viola–Jones algorithm which make it a good detection algorithm are:

Robust – very high detection rate and very low false-positive rate.
Real time – For practical applications at least 2 frames per second must be processed.
Face detection only – The goal is to distinguish faces from non-faces.

The algorithm has four stages:

Haar Feature Selection
Creating an Integral Image
AdaBoost Training
Cascading Classifiers

The features sought by the detection framework universally involve the sums of image pixels within rectangular areas. As such, they bear some resemblance to Haar basis functions, which have been used previously in the realm of image-based object detection. However, since the features used by Viola and Jones all rely on more than one rectangular area, they are generally more complex. The figure on the right illustrates the four different types of features used in the framework. The value of any given feature is the sum of the pixels within clear rectangles subtracted from the sum of the pixels within shaded rectangles. Rectangular features of this sort are primitive when compared to alternatives such as steerable filters. Although they are sensitive to vertical and horizontal features, their feedback is considerably coarser.

Haar Features

All human faces share some similar properties. These regularities may be matched using Haar Features.
A few properties common to human faces:

The eye region is darker than the upper-cheeks.
The nose bridge region is brighter than the eyes.

Composition of properties forming matchable facial features:

Location and size: eyes, mouth, bridge of nose
Value: oriented gradients of pixel intensities

The four features matched by this algorithm are then sought in the image of a face.
Rectangle features:

Value = Σ (pixels in black area) - Σ (pixels in white area)
Three types: two-, three-, and four-rectangle features; Viola & Jones used two-rectangle features
For example: the difference in brightness between the white and black rectangles over a specific area
Each feature is tied to a specific location in the sub-window

Summed area table

An image representation called the integral image allows rectangular features to be evaluated in constant time, which gives them a considerable speed advantage over more sophisticated alternative features. Because each feature's rectangular area is always adjacent to at least one other rectangle, neighbouring rectangles share corner points, and it follows that any two-rectangle feature can be computed in six array references, any three-rectangle feature in eight, and any four-rectangle feature in nine.
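The constant-time evaluation can be illustrated with a small sketch in plain Python. The function names, the zero-padded table layout, and the choice of the right half as the shaded rectangle are assumptions made for this example, not details taken from the original paper.

```python
def integral_image(img):
    """Build a (h+1) x (w+1) summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle whose top-left pixel is (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Side-by-side two-rectangle feature: shaded (right) half minus clear (left) half.

    The two w x h halves share an edge, so only six distinct table entries are read.
    """
    a, b, c = ii[y][x], ii[y][x + w], ii[y][x + 2 * w]
    d, e, f = ii[y + h][x], ii[y + h][x + w], ii[y + h][x + 2 * w]
    clear = e - b - d + a
    shaded = f - c - e + b
    return shaded - clear


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
ii = integral_image(image)
print(rect_sum(ii, 1, 1, 2, 2))          # 6 + 7 + 10 + 11 = 34
print(two_rect_feature(ii, 0, 0, 2, 3))  # 45 - 33 = 12
```

Once the table is built in a single pass over the image, the cost of a feature no longer depends on the size of its rectangles.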

Learning algorithm

The speed with which features may be evaluated does not adequately compensate for their number, however. For example, in a standard 24x24 pixel sub-window, there are a total of 162,336 possible features, and it would be prohibitively expensive to evaluate them all when testing an image. Thus, the object detection framework employs a variant of the learning algorithm AdaBoost both to select the best features and to train classifiers that use them. This algorithm constructs a “strong” classifier as a linear combination of weighted simple “weak” classifiers.
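The number of rectangle features in a 24x24 sub-window can be counted by brute force. The sketch below assumes five base shapes (two-rectangle features in both orientations, three-rectangle features in both orientations, and one four-rectangle feature), each scaled by whole multiples of its base size and placed at every position that fits. The function name and this parameterisation are choices made for the example.

```python
def count_features(window=24, base_shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count every scaled and translated placement of each base shape in the window."""
    total = 0
    for base_w, base_h in base_shapes:
        # Widths and heights grow in steps of the base size so the
        # sub-rectangles of a feature always stay equal in size.
        for w in range(base_w, window + 1, base_w):
            for h in range(base_h, window + 1, base_h):
                total += (window - w + 1) * (window - h + 1)
    return total


print(count_features())  # 162336
```

Under these assumptions the count is far larger than the 576 pixels of the sub-window itself, which is why feature selection is needed.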
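The strong classifier has the form
$$h(x) = \operatorname{sgn}\left(\sum_{j=1}^{M} \alpha_j h_j(x)\right)$$
where $M$ is the number of features. Each weak classifier is a threshold function based on the feature $f_j$:
$$h_j(x) = \begin{cases} -s_j & \text{if } f_j < \theta_j \\ s_j & \text{otherwise} \end{cases}$$
The threshold value $\theta_j$ and the polarity $s_j \in \{-1, +1\}$ are determined in the training, as are the coefficients $\alpha_j$.
Here a simplified version of the learning algorithm is reported:
Input: Set of $N$ positive and negative training images with their labels $(x^i, y^i)$. If image $i$ is a face, $y^i = 1$; if not, $y^i = -1$.

Initialization: assign a weight $w_1^i = \frac{1}{N}$ to each image $i$.
For each feature $f_j$ with $j = 1, \dots, M$:
1. Renormalize the weights such that they sum to one.
2. Apply the feature to each image in the training set, then find the optimal threshold and polarity $\theta_j, s_j$ that minimize the weighted classification error. That is, $\theta_j, s_j = \arg\min_{\theta, s} \sum_{i=1}^{N} w_j^i \varepsilon_j^i$, where $\varepsilon_j^i = 0$ if $y^i = h_j(x^i, \theta_j, s_j)$ and $1$ otherwise.
3. Assign a weight $\alpha_j$ to $h_j$ that is inversely proportional to the error rate. In this way the best classifiers are given more weight.
4. The weights for the next iteration, i.e. $w_{j+1}^i$, are reduced for the images $i$ that were correctly classified.
Set the final classifier to $h(x) = \operatorname{sgn}\left(\sum_{j=1}^{M} \alpha_j h_j(x)\right)$.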
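A minimal sketch of boosting with threshold weak classifiers is given here in plain Python. It is not the authors' implementation: it works on precomputed feature values, picks the lowest-error feature in each round (the usual discrete AdaBoost loop) rather than visiting every feature once, and uses the standard AdaBoost coefficient 0.5 ln((1 - error) / error). All function names and the toy data are assumptions for illustration.

```python
import math


def stump_predict(value, theta, s):
    """Threshold weak classifier: -s if the feature value is below theta, else s."""
    return -s if value < theta else s


def best_stump(values, labels, weights):
    """Return (error, theta, s) with the lowest weighted error for one feature."""
    candidates = sorted(set(values))
    candidates.append(candidates[-1] + 1)  # threshold above every value
    best = (float("inf"), None, None)
    for theta in candidates:
        for s in (1, -1):
            err = sum(w for v, y, w in zip(values, labels, weights)
                      if stump_predict(v, theta, s) != y)
            if err < best[0]:
                best = (err, theta, s)
    return best


def train_adaboost(features, labels, rounds):
    """features[i][j] is feature j on sample i; labels are +1 or -1."""
    n, m = len(features), len(features[0])
    weights = [1.0 / n] * n
    strong = []  # entries are (alpha, feature index, theta, s)
    for _ in range(rounds):
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize to sum to one
        best = None
        for j in range(m):
            column = [row[j] for row in features]
            err, theta, s = best_stump(column, labels, weights)
            if best is None or err < best[0]:
                best = (err, j, theta, s)
        err, j, theta, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        strong.append((alpha, j, theta, s))
        # Correctly classified samples (y * h = 1) are scaled down.
        weights = [w * math.exp(-alpha * y * stump_predict(row[j], theta, s))
                   for w, row, y in zip(weights, features, labels)]
    return strong


def classify(strong, row):
    """Sign of the weighted vote of all selected weak classifiers."""
    score = sum(a * stump_predict(row[j], theta, s) for a, j, theta, s in strong)
    return 1 if score >= 0 else -1


# Toy data: feature 0 separates the classes, feature 1 does not.
features = [[1, 5], [2, 1], [3, 4], [6, 2], [7, 6], [8, 3]]
labels = [-1, -1, -1, 1, 1, 1]
strong = train_adaboost(features, labels, rounds=2)
print([classify(strong, row) for row in features])
```

In a real detector the feature columns would hold Haar feature values computed from the integral image of each training sub-window.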

Cascade architecture

On average only 0.01% of all sub-windows are positive (faces)
Equal computation time would otherwise be spent on all sub-windows
Most time should be spent only on potentially positive sub-windows
A simple 2-feature classifier can achieve almost 100% detection rate with 50% FP rate
That classifier can act as the 1st layer of a series to filter out most negative windows
A 2nd layer with 10 features can tackle “harder” negative windows which survived the 1st layer, and so on
A cascade of gradually more complex classifiers achieves even better detection rates

The evaluation of the strong classifiers generated by the learning process can be done quickly, but it is not fast enough to run in real time. For this reason, the strong classifiers are arranged in a cascade in order of complexity, where each successive classifier is trained only on those selected samples which pass through the preceding classifiers. If at any stage in the cascade a classifier rejects the sub-window under inspection, no further processing is performed and the search continues with the next sub-window. The cascade therefore has the form of a degenerate tree. In the case of faces, the first classifier in the cascade, called the attentional operator, uses only two features to achieve a false negative rate of approximately 0% and a false positive rate of 40%. The effect of this single classifier is to reduce by roughly half the number of times the entire cascade is evaluated.
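The early-rejection behaviour of a cascade can be sketched as follows. Each stage is modelled as a hypothetical callable that returns True ("may be a face") or False ("definitely not a face"); the feature names and thresholds in the toy stages are invented for the example.

```python
def run_cascade(stages, window):
    """Return (accepted, stages_evaluated) for one sub-window."""
    for index, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, index  # rejected: later, costlier stages are skipped
    return True, len(stages)


# Toy stages operating on precomputed (hypothetical) feature values.
stages = [
    lambda w: w["eye_cheek_contrast"] > 10,   # cheap first stage
    lambda w: w["nose_bridge_contrast"] > 5,  # later stage
]

print(run_cascade(stages, {"eye_cheek_contrast": 3, "nose_bridge_contrast": 9}))
print(run_cascade(stages, {"eye_cheek_contrast": 20, "nose_bridge_contrast": 9}))
```

Because most sub-windows fail an early stage, the average cost per sub-window stays close to the cost of the first stage alone.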

In cascading, each stage consists of a strong classifier. All the features are therefore grouped into several stages, where each stage has a certain number of features.
The job of each stage is to determine whether a given sub-window is definitely not a face or may be a face. A given sub-window is immediately discarded as not a face if it fails in any of the stages.
A simple framework for cascade training is given below:

f = the maximum acceptable false positive rate per layer.
d = the minimum acceptable detection rate per layer.
Ftarget = target overall false positive rate.
P = set of positive examples.
N = set of negative examples.

F(0) = 1.0; D(0) = 1.0; i = 0
while F(i) > Ftarget
    increase i
    n(i) = 0; F(i) = F(i-1)
    while F(i) > f × F(i-1)
        increase n(i)
        use P and N to train a classifier with n(i) features using AdaBoost
        evaluate current cascaded classifier on validation set to determine F(i) and D(i)
        decrease threshold for the ith classifier
            until the current cascaded classifier has a detection rate of at least d × D(i-1)
    N = ∅
    if F(i) > Ftarget then
        evaluate the current cascaded detector on the set of non-face images
        and put any false detections into the set N.

The cascade architecture has interesting implications for the performance of the individual classifiers. Because the activation of each classifier depends entirely on the behavior of its predecessor, the false positive rate for an entire cascade is:
$$F = \prod_{i=1}^{K} f_i$$
Similarly, the detection rate is:
$$D = \prod_{i=1}^{K} d_i$$
where $K$ is the number of classifiers in the cascade and $f_i$ and $d_i$ are the false positive rate and detection rate of the $i$-th classifier. Thus, to match the false positive rates typically achieved by other detectors, each classifier can get away with having surprisingly poor performance. For example, for a 32-stage cascade to achieve a false positive rate of $10^{-6}$, each classifier need only achieve a false positive rate of about 65%. At the same time, however, each classifier needs to be exceptionally capable if it is to achieve adequate detection rates. For example, to achieve a detection rate of about 90%, each classifier in the aforementioned cascade needs to achieve a detection rate of approximately 99.7%.
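The figures in this example follow from multiplying per-stage rates. The short check below assumes, for simplicity, that every stage has the same rate; the helper names are hypothetical.

```python
import math


def cascade_rate(stage_rates):
    """Overall rate of a cascade: the product of the per-stage rates."""
    return math.prod(stage_rates)


def required_stage_rate(target, stages):
    """Per-stage rate needed to reach a target when all stages are equal."""
    return target ** (1.0 / stages)


overall_fp = cascade_rate([0.65] * 32)    # roughly 1e-6
overall_det = cascade_rate([0.997] * 32)  # roughly 0.91
print(overall_fp, overall_det)
print(required_stage_rate(1e-6, 32))      # roughly 0.65
print(required_stage_rate(0.9, 32))       # roughly 0.997
```

The asymmetry is visible here: a stage may pass about two thirds of non-faces, yet it can afford to lose only about three faces in a thousand.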

Using Viola–Jones for object tracking

In videos of moving objects, one need not apply object detection to each frame. Instead, one can use tracking algorithms like the KLT algorithm to detect salient features within the detection bounding boxes and track their movement between frames. Not only does this improve tracking speed by removing the need to re-detect objects in each frame, but it improves the robustness as well, as the salient features are more resilient than the Viola–Jones detection framework to rotation and photometric changes.
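One way to organise such a pipeline is sketched below. The detect and track callables are hypothetical stand-ins (for example, a Viola–Jones detector and a KLT-style tracker), and the policy of re-detecting every fixed number of frames, or whenever no boxes remain, is an assumption of this example rather than something prescribed by the framework.

```python
def process_video(frames, detect, track, redetect_every=10):
    """Run detection occasionally and tracking in between.

    detect(frame) -> list of boxes
    track(prev_frame, frame, boxes) -> list of boxes moved to the new frame
    """
    boxes, prev, results = [], None, []
    for index, frame in enumerate(frames):
        if index % redetect_every == 0 or not boxes:
            boxes = detect(frame)              # full (expensive) detection
        else:
            boxes = track(prev, frame, boxes)  # cheap update from the last frame
        results.append(boxes)
        prev = frame
    return results
```

With redetect_every=10 and boxes that are never lost, the detector runs on only one frame in ten.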

Implementations

by Ole Helvig Jensen
MATLAB: ,
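OpenCV: implemented as cvHaarDetectObjects in the legacy C API; the C++ API exposes the same detector as cv::CascadeClassifier::detectMultiScale.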