Adaboost in Image Classification

An Introduction to Computer Vision, Part 5

BenMauss
Level Up Coding


Image Source: SocialPilot.co

Last time, we talked about the intuition behind training an image classifier. We discussed its similarity to how infants learn to recognize objects and associate names/labels with them as they develop their ability to speak. We then applied that intuition to better understand how the Viola-Jones algorithm was trained. One thing that we haven’t talked about, however, is the math involved in this stage. Let’s take a brief look at that! If you’re scared, don’t worry too much. We’re simply taking what we’ve already discussed and seeing how the algorithm creates a trained model using only numbers!

Let’s take a look at the equation!

F(x) = α₀f₀(x) + α₁f₁(x) + α₂f₂(x) + … + αₙfₙ(x)

This is actually our classifier! Time to break this down so that each of the terms actually means something to us. F(x) represents the classifier itself. It is also referred to as the “strong classifier” because it is the sum of many “weak classifiers”. Each term αₙfₙ(x) is a weak classifier. They’re considered “weak” because, on their own, they aren’t enough to classify an image, but together they’re strong.

The x represents the input. In the case of our object-detection task, it is the current image that the algorithm is trying to classify.

The fₙ represents a specific Haar-like feature. Now, let’s say f₂ represented an edge-feature. It doesn’t represent all edge-features in the image. Depending on the resolution of an image, there could be hundreds of thousands of edge-features of different sizes! To treat them all the same would lead to a lot of wasted time and an inaccurate model. Thus, f₂ must represent a specific edge-feature within the image.

Alpha, written as αₙ, represents the weight of the associated feature. As we discussed in the last article, the “weight” of a feature is just another way of saying how important it is. If α₅₉ is zero (α₅₉ = 0), then f₅₉ has no importance when classifying the target image. An example of this would be the background in a face-detection algorithm. Most (if not all) of the features detected in the background would have a weight of zero, since backgrounds change from picture to picture. This is a key aspect that we’ll elaborate on when we discuss Adaboost.

The subscript n simply denotes which specific feature we’re referring to and its associated weight. It’s only there for organizational purposes. Since I decided to start with α₀f₀(x) as the first term, all of the features are zero-indexed (the first item is labeled “0”, the second as “1”, and so forth). For people unaccustomed to zero-indexing, just add 1 to the number in your head and you’ll know that f₅₉ is the 60th feature that was found and stored by the algorithm.
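To make the equation concrete, here’s a minimal Python sketch of the weighted-vote idea. The weak classifiers and alphas below are hypothetical stand-ins, not actual Haar-like features:

```python
# A minimal sketch: the strong classifier is just a weighted vote.
# The weak classifiers and alphas here are hypothetical stand-ins.

def strong_classifier(x, weak_classifiers, alphas, threshold=0.0):
    """Sum the weighted votes of every weak classifier f_n(x).

    weak_classifiers: functions returning +1 ("target") or -1 ("non-target")
    alphas:           the weight alpha_n attached to each weak classifier
    """
    score = sum(a * f(x) for a, f in zip(alphas, weak_classifiers))
    return 1 if score >= threshold else -1

# Hypothetical usage with two toy weak classifiers:
fs = [lambda x: 1 if x > 0 else -1, lambda x: 1 if x > 5 else -1]
print(strong_classifier(3, fs, alphas=[0.7, 0.4]))  # 0.7 - 0.4 = 0.3 -> 1
```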

Now that we understand the equation, let’s talk about Adaboost: what it is, what it does, and why it’s so important!

Adaboost

Adaptive boosting (or Adaboost) is another optimization that was utilized in the creation of the Viola-Jones algorithm. Remember that when we learned about the Integral Image, we spoke about Big O notation, which describes how the time and resources a task consumes grow as the size of its input grows. For an exhaustive feature search, that growth gets out of hand quickly. Since there are over 180,000 features in a single 24x24 pixel image, our equation would look something like this:

F(x) = α₀f₀(x) + α₁f₁(x) + α₂f₂(x) + … + α₁₇₉,₉₉₉f₁₇₉,₉₉₉(x) + …

As we mentioned last time, if it took 1ms to perform the computations necessary for a single feature (again, this is EXTREMELY slow), then the time to compute ALL of the features in a 24x24 pixel image would skyrocket to over three minutes. To address this, the Integral Image is created so that potential Haar-like features can be evaluated in a fraction of the time.
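As a quick refresher, here’s a sketch of that trick. The 4x4 “image” below is a hypothetical toy example, but the lookup pattern is the standard one: once the integral image is built, the pixel sum of any rectangle (and therefore any rectangle-based Haar-like feature) costs only four array reads, no matter how large the rectangle is:

```python
import numpy as np

image = np.arange(16, dtype=np.int64).reshape(4, 4)  # toy 4x4 "image"

# Integral image, padded with a zero row/column on the top-left so that
# ii[y, x] equals the sum of image[:y, :x].
ii = np.zeros((5, 5), dtype=np.int64)
ii[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of image[top:bottom, left:right] via four lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

assert rect_sum(ii, 1, 1, 3, 3) == image[1:3, 1:3].sum()
```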

Adaboost performs a similar role for our equation. It reduces the complexity of image classification even further by finding the most important features of the target object and, essentially, getting rid of the rest.

Now, how does it “get rid of” unimportant features? Well, remember from earlier that any feature where αₙ = 0 is considered unimportant. Then take into account the basic multiplication rule n × 0 = 0. Therefore, any term αₙfₙ(x) in our equation where αₙ = 0 calculates to 0 (since 0 · fₙ(x) = 0) and is dropped!
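In code, pruning the model is as simple as it sounds. The weights and feature names below are made up purely to illustrate the filtering step:

```python
# Hypothetical weights after training: any feature whose alpha hit zero
# contributes nothing to the sum, so it can be dropped from the model.
alphas   = [0.8, 0.0, 0.3, 0.0]
features = ["f0", "f1", "f2", "f3"]

kept = [(a, f) for a, f in zip(alphas, features) if a != 0]
print(kept)  # [(0.8, 'f0'), (0.3, 'f2')] -- f1 and f3 are gone
```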

So Adaboost’s goal is to update the weights of specific features to maximize the likelihood of classifying the target object correctly while also eliminating Haar-like features that don’t aid in the detection process. Features that give the classifier a higher number of true-positives (target image classified as “target”) and true-negatives (non-target classified as “non-target”) receive larger weights, as they’re more likely to be unique to the target object. On the other hand, features that lead to a high number of false-positives (non-targets classified as “target”) and false-negatives (targets classified as “non-target”) receive smaller weights with a possibility of being assigned a weight of zero.
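Here’s a minimal sketch of one round of this process in the standard Adaboost formulation (Viola-Jones uses a closely related variant). The “weak classifiers” are hypothetical stand-ins for Haar-like feature evaluations. Each round picks the feature with the lowest weighted error, assigns it a large α if that error is low, and re-weights the training samples so the next round focuses on the ones it got wrong:

```python
import numpy as np

def adaboost_round(X, y, weak_classifiers, sample_weights):
    """One boosting round.

    X:              iterable of training inputs
    y:              NumPy array of labels, +1 (target) or -1 (non-target)
    sample_weights: NumPy array, one weight per training sample
    """
    best_eps, best_idx, best_preds = np.inf, None, None
    for idx, h in enumerate(weak_classifiers):
        preds = np.array([h(x) for x in X])
        # Weighted error: the total weight of the misclassified samples.
        eps = sample_weights[preds != y].sum()
        if eps < best_eps:
            best_eps, best_idx, best_preds = eps, idx, preds

    # A feature that classifies well (low error) earns a large alpha;
    # one no better than a coin flip (error ~ 0.5) earns an alpha near 0.
    alpha = 0.5 * np.log((1 - best_eps) / max(best_eps, 1e-10))

    # The "adaptive" part: raise the weights of the samples this round
    # misclassified, so the next round is forced to focus on them.
    sample_weights = sample_weights * np.exp(-alpha * y * best_preds)
    sample_weights /= sample_weights.sum()
    return best_idx, alpha, sample_weights
```

Any feature that never wins a round keeps an implicit weight of zero and never enters the final sum, which is how the 180,000-term equation collapses.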

That was a lot of words and numbers, so let’s see it in action and tie it in with the intuition we developed in the last article.

Where’s Walmart?

Let’s pretend we’re training an image classifier that can detect Walmart stores. We scour the internet for various images of different Walmarts, as well as non-Walmart stores, to train and test our model with.

Image Source: TTNews.com

Above is just one of the pictures of the Walmarts from the training dataset. Below is also just a single sample of the non-Walmart storefronts from the training set.

Image Source: CookingLight.com

For simplicity’s sake, we’re going to just refer to these two images (I’m only one man, after all). In reality, the algorithm would scour the training images looking for commonalities that are unique to Walmarts and test its observations against the validation set (sometimes referred to as the “test set”).

Our first step is to convert all of the images (in both training and validation sets) to grayscale.

The next step would be to resize all of the images to the same resolution, scaling them down in the process. We’re going to skip that for now, but remember that it’s a crucial step in practice.
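For reference, both steps might look something like this with OpenCV. The file name and the 24x24 target size are just illustrative; Viola-Jones trains on 24x24 windows, but any consistent size works:

```python
import cv2

def preprocess(path, size=(24, 24)):
    image = cv2.imread(path)                        # load the color image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # step 1: grayscale
    return cv2.resize(gray, size)                   # step 2: same resolution

# window = preprocess("walmart_001.jpg")  # hypothetical file name
```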

Below is our validation set split into “Walmart” and “Non-Walmart” groups for ease of interpretation. Again, all of the transformations that we applied to the training set would be applied to this as well, but we’re going to just skip that.

Walmart Validation Images
Not Walmart Validation Images

Training

Now that our images are in the proper format, we start training our classifier. If you remember from last time, this is very similar to an infant learning to differentiate their parents from other people and objects. The algorithm is narrowing down the number of features in an image until it finds the ones that are most unique to the target object. Let’s take a look at our training image.

Let’s say that the yellow line represents 11 separate edge-features. The model looked at several Walmart images and determined that these edges were important, but that the vertical edges were the most important. These happened to form sharp angles that it “feels” are unique to Walmarts. So it sets the weights of the vertical edges to be much larger than those of the horizontal ones. It then assigns smaller weights to all of the other features in the image and tests this against the validation set.

2 False-Negatives!

As we can see, the model was able to see the vertical edges in the three images on the left (the top-left image was a little iffy, but it still predicted correctly). The top-right image was misclassified because it had no vertical edges along the roof. Similarly, the bottom-right image had more Four-rectangle features (remember that those detect diagonal line features, usually on a small scale), so it was misclassified as well.

Let’s see how it did on the non-Walmart images.

So the Lowe’s picture was classified correctly for the same reason the bottom-right Walmart was misclassified: a lot of diagonal four-rectangle features. The Best Buy in the bottom-left was classified correctly because its vertical edge features were too big. Why were the others false-positives? Well, look at the Target below.

That outline is simply too general and clearly isn’t unique to just Walmart. The algorithm takes this into account and adapts (this is where the “adaptive” part of Adaboost comes into play). It lowers the weights of the vertical edges and raises the weights of the horizontal edges. It re-runs the validation set and is able to re-classify some images correctly, but some images that were classified correctly before are now false-positives/-negatives. The algorithm iterates on this process over and over. It manages to get accuracy up a little, but determines that this simply isn’t enough to go on. On the plus side, during all of those iterations, it discovered that adjusting the weights of the features found in the sky did nothing to help with classifying the image. So it sets the weights of all of those to zero! Our model just got a little simpler!

What has our model learned so far?

  • Horizontal and vertical edges are important, but not defining features.
  • Four-rectangle features are more important than it originally estimated, but not the most important.
  • The features above that boundary (i.e., in the sky) have no effect on the model’s predictive power.

Moving on, in its search for additional features that might help with classification, the algorithm notices that the overhang above the doors is lighter than the recessed doors themselves. This creates a large edge-feature, as seen below.

Again, the algorithm adjusts the weight of this particular edge-feature and tests the new equation against the validation set.

Alright! It managed to classify the bottom-right Walmart correctly this time around. We’re still getting false-negatives, though, and the change in parameters caused one that was previously a true-positive to now be a false-negative. (Again, the bottom-left one is iffy, but you can still make out an overhang that is lighter than the doors)

And the non-Walmarts?

So this edge-feature actually increased the probability of obtaining false-positives. As a result, the algorithm, again, adapts to this newfound knowledge. Increasing the weight of this feature did make a small difference in helping to classify some Walmarts, so it’s not insignificant. However, creating a higher false-positive rate means that it shouldn’t be considered a defining feature. The algorithm, again, refines the model by iterating over and over to find an appropriate weight for this feature, and it simultaneously finds more features that have little to no effect on the model’s accuracy (trees, people, cars, carts, bollards, stop signs, etc.) and gives them weights of zero as well! At this point, it’s dropped a LOT of features from the model, which, in turn, reduces the number of terms in our equation drastically.

What additional knowledge did the model gain this time?

  • While this edge-feature is somewhat important in helping to identify some Walmarts, overall accuracy decreases when it is given larger weights.
  • Additional, smaller features (listed above) can be removed without affecting performance.

The algorithm continues to iterate, adjusting weights as it goes. Finally, it notices these six line- and four-rectangle features:

The accuracy of the model increased as the weights of these features increased. This held true even when reducing the weights of other Haar-like features that were once considered important. Finally, it reduced the weights of everything in the model that wasn’t one of these six features to zero. The result?

We have a single false-negative above. This is simply because it is an older Walmart, built before the corporation came up with the minimalist logo. What about the non-Walmarts?

No false-positives this time around! So the six Haar-like features have the most effect on our model’s performance. What’s the implication here? Well, let’s look at how our model has changed.

We started out with an equation like this:

F(x) = α₀f₀(x) + α₁f₁(x) + α₂f₂(x) + … + α₁₇₉,₉₉₉f₁₇₉,₉₉₉(x) + …

During training, Adaboost reduced the complexity of the model until it arrived at a highly accurate one. The resulting equation would look something like the one below:

F(x) = α₀f₀(x) + α₁f₁(x) + α₂f₂(x) + … + α₅f₅(x)

(Remember that the terms above are zero-indexed, so the count starts at 0)

Now, it’s highly unlikely that these six features would happen to correspond with the first six terms of our equation, but you get the point. We went from a model that had over 180,000 features to just six!

The Takeaway

While this exercise has been simplified, it does show just how powerful Adaboost is. It’s yet another elegant solution to hardware limitations, as well as the scaling limitations described by Big O.
