Face detection: the Viola-Jones approach

Face detection is a core task in computer vision, with applications ranging from surveillance to facial recognition systems. The Viola-Jones approach, introduced by Paul Viola and Michael Jones in 2001, made face detection fast enough for real-time use while keeping accuracy high. It applies a cascade of simple classifiers trained on Haar-like features, which is why it has been widely adopted in image processing and computer vision.

Detecting faces

To detect faces we cannot exhaustively scan the whole picture and evaluate every feature at every location, as that would be computationally expensive and would quickly become unsuitable for real-time tasks. The Viola-Jones approach tackles the problem differently: since most regions of an image do not contain a face, it is designed to reject non-face candidates very quickly while keeping the false positive rate very low.

The starting point of this algorithm is the Haar-like feature, which considers adjacent rectangular regions at a specific location in a detection window, sums up the pixel intensities in each region and calculates the difference between these sums. This difference is then used to categorize subsections of an image. For example, in human faces the region of the eyes is commonly darker than the region of the cheeks. Therefore, a common Haar feature for face detection is a set of two adjacent rectangles that lie over the eye and cheek regions. The position of these rectangles is defined relative to a detection window that acts like a bounding box for the target object (the face in this case).
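As a rough illustration of the eye/cheek example, the sketch below computes one two-rectangle feature by summing pixel intensities directly. The function names `region_sum` and `two_rect_feature`, the toy 4x4 window and the sign convention (lower rectangle minus upper rectangle) are assumptions made for this example, not details taken from the original method.

```python
def region_sum(image, top, left, height, width):
    """Sum the pixel intensities inside one rectangular region."""
    return sum(
        image[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    )


def two_rect_feature(image, top, left, height, width):
    """Two vertically stacked rectangles of equal size.

    Returns (sum of lower rectangle) - (sum of upper rectangle), so a dark
    band above a bright band gives a large positive value.
    """
    upper = region_sum(image, top, left, height, width)
    lower = region_sum(image, top + height, left, height, width)
    return lower - upper


# Toy grayscale window: dark "eye" rows on top, bright "cheek" rows below.
window = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]
feature_value = two_rect_feature(window, top=0, left=0, height=2, width=4)
```

A uniform window would give a value of zero, so the magnitude of the difference is what carries the information about the dark-over-bright pattern.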

Haar features

For the Viola-Jones algorithm the detection window is 24\times 24 pixels, and rather than evaluating every possible feature inside it, the algorithm relies on weak learners. A weak learner decides whether a single feature indicates a face by comparing the feature value against a threshold. On its own it is not very accurate, only somewhat better than random guessing, but it can roughly tell whether there is a face in that area or not. Mathematically, it outputs 1 for inputs that meet the threshold and -1 for those that do not.
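A weak learner of this kind can be sketched as a one-line decision rule. The `polarity` argument is an assumption added for this example so that the same rule can express either "at or above the threshold" or "at or below the threshold"; the feature values and the threshold of 500 are made up.

```python
def weak_learner(feature_value, threshold, polarity=1):
    """Decision stump on a single feature value.

    Returns +1 when polarity * feature_value >= polarity * threshold and -1
    otherwise. `polarity` (1 or -1) is an illustrative assumption.
    """
    if polarity * feature_value >= polarity * threshold:
        return 1
    return -1


# Hypothetical feature values measured on four candidate windows.
feature_values = [1520, 300, -40, 900]
labels = [weak_learner(v, threshold=500) for v in feature_values]
```

With these numbers the first and last windows are marked 1 and the other two -1.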

Setting a threshold

The next step in the algorithm is boosting, which combines several weak learners in a weighted sum.

h({\bf x}) = \text{sign}\left[\sum_{j=0}^{m-1} a_j h_j({\bf x})\right]
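The weighted vote in the formula above can be written out in a few lines. This is only a sketch: the three weak learners, their thresholds and the weights in `alphas` are invented for illustration, and mapping a sum of exactly zero to +1 is an arbitrary choice made here.

```python
def strong_classifier(x, weak_learners, alphas):
    """Return sign(sum_j a_j * h_j(x)), where each h_j(x) is -1 or +1."""
    total = sum(a * h(x) for a, h in zip(alphas, weak_learners))
    return 1 if total >= 0 else -1  # a tie is mapped to +1 (assumption)


# x holds three hypothetical feature values for one detection window.
x = [1520, 300, 900]
weak_learners = [
    lambda v: 1 if v[0] >= 500 else -1,
    lambda v: 1 if v[1] >= 500 else -1,
    lambda v: 1 if v[2] >= 1000 else -1,
]

confident_first = strong_classifier(x, weak_learners, alphas=[0.9, 0.4, 0.3])
evenly_weighted = strong_classifier(x, weak_learners, alphas=[0.3, 0.4, 0.3])
```

Only the first weak learner votes +1 here, so the outcome depends on how much weight it carries: it wins the vote with a weight of 0.9 but loses it with a weight of 0.3.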

The weights a_j are chosen with AdaBoost, which starts by assigning equal weights to all training examples. In each subsequent round of boosting it:

  1. Evaluates each rectangle filter on each example
  2. Selects the best threshold for each filter
  3. Selects the best filter-threshold combination
  4. Reweights the examples

Reweighting samples
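The four steps of a boosting round can be sketched as follows. This uses the textbook AdaBoost update with labels in {-1, +1}, which is an assumption of this example and may differ in detail from the update used in the original method. The names `best_stump_for_feature` and `boosting_round`, the precomputed `features` table and the toy data are all hypothetical.

```python
import math


def best_stump_for_feature(values, labels, weights):
    """Step 2: try each observed value as a threshold, with both polarities."""
    best = (float("inf"), None, None)  # (weighted error, threshold, polarity)
    for threshold in sorted(set(values)):
        for polarity in (1, -1):
            error = sum(
                w
                for v, y, w in zip(values, labels, weights)
                if (1 if polarity * v >= polarity * threshold else -1) != y
            )
            if error < best[0]:
                best = (error, threshold, polarity)
    return best


def boosting_round(features, labels, weights):
    """One round: pick the best filter-threshold pair, then reweight examples.

    features[i][k] is the value of rectangle filter k on example i
    (step 1, assumed to be precomputed for this sketch).
    """
    best = None
    for k in range(len(features[0])):
        column = [row[k] for row in features]
        error, threshold, polarity = best_stump_for_feature(column, labels, weights)
        if best is None or error < best[0]:
            best = (error, k, threshold, polarity)  # step 3
    error, k, threshold, polarity = best

    # Classifier weight; the clamp avoids a division by zero for a perfect stump.
    error = min(max(error, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - error) / error)

    # Step 4: raise the weight of misclassified examples, lower the others.
    new_weights = []
    for row, y, w in zip(features, labels, weights):
        prediction = 1 if polarity * row[k] >= polarity * threshold else -1
        new_weights.append(w * math.exp(-alpha * y * prediction))
    total = sum(new_weights)
    return (k, threshold, polarity, alpha), [w / total for w in new_weights]


# Four toy examples, two filters; the second filter carries no information.
features = [[5, 4], [1, 2], [2, 4], [0, 2]]
labels = [1, 1, -1, -1]
weights = [0.25, 0.25, 0.25, 0.25]
stump, new_weights = boosting_round(features, labels, weights)
```

In this toy run the selected stump uses the first filter and misclassifies only the third example, whose weight grows so that the next round concentrates on it.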

Each round evaluates K features on N examples to find the best classifier, so the full training over M rounds has computational complexity

\begin{aligned}
&O(MNK)\\
&M=\text{rounds}\\
&N=\text{examples}\\
&K=\text{features}
\end{aligned}

The last step of the algorithm arranges multiple boosted classifiers into a sequence of stages, known as a cascade. Each stage in the sequence eliminates more non-faces, narrowing down the candidates until, ideally, only true faces remain.

Each stage basically acts as a filter that discards non-faces as quickly as possible.

Discarding at an early stage

Such classifiers become progressively more complex (and computationally expensive), but at the same time have fewer and fewer candidate windows to check, because most non-faces have already been discarded by the earlier stages.
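The early-rejection idea can be sketched as a loop that stops at the first failing stage. The layout of a stage as `(stumps, alphas, stage_threshold)`, the per-stage threshold itself and all the numbers below are assumptions made for this illustration.

```python
def cascade_detect(window_features, stages):
    """Run one window through the stages; reject at the first failing stage.

    Each stage is (stumps, alphas, stage_threshold), and every stump is
    (feature_index, threshold, polarity). Returns (is_face, stages_evaluated).
    """
    for stage_number, (stumps, alphas, stage_threshold) in enumerate(stages, start=1):
        score = 0.0
        for (k, threshold, polarity), alpha in zip(stumps, alphas):
            vote = 1 if polarity * window_features[k] >= polarity * threshold else -1
            score += alpha * vote
        if score < stage_threshold:
            return False, stage_number  # rejected early; later stages are skipped
    return True, len(stages)


# A cheap one-feature stage followed by a slightly larger two-feature stage.
stages = [
    ([(0, 500, 1)], [1.0], 0.0),
    ([(1, 100, 1), (2, 50, -1)], [0.6, 0.4], 0.0),
]

face_like = cascade_detect([1520, 300, 20], stages)
background = cascade_detect([40, 300, 20], stages)
```

The background window is rejected after a single feature evaluation, while the face-like window has to pass both stages, which is where the speed-up on mostly empty images comes from.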

The cascade described by Viola and Jones contains 32 stages and 4297 features in total.

Discarding at an early stage

Specifically, each stage focuses on a different number of features.

Reject window

The original dataset that was used by Viola and Jones contained five thousand faces rescaled to 24\times 24 pixels and ten thousand non-faces. Different pictures contain different individuals, illumination and poses.

An example run of the algorithm is shown below:

Detecting a face on Lena

However, the algorithm can also yield false positives or false negatives.

Detecting faces in pictures