High level computer vision and convolutional neural networks

Object recognition

Object recognition is a general term to describe a collection of related computer vision tasks that involve identifying objects in images or videos.

It differs from image classification, which associates one or more categories with a whole image: object recognition may only involve finding the location of the objects of interest, without assigning them a category tag.

Object detection/localization involves finding the bounding boxes of the objects within an image.

Classification vs detection

In the case of segmentation, each pixel is associated with a specific class, which makes it possible to find the contours of each object.

object detection vs instance segmentation

Each object detection task must be able to deal with:

  • Different camera positions
  • Perspective deformations
  • Illumination changes
  • Intra-class variance

Template matching

A template is something that can be used as a model to match against other images. The task of template matching is to find instances of the template in the image according to some similarity measure.

template matching example

Some of the challenges involved with this technique are the high template variability and the possible deformations that the objects can have, as well as the application of affine transformations or the presence of noise or illumination differences.

Moreover, even simple differencing techniques don’t necessarily provide reliable results.

Template matching downsides

Correlation based

In correlation-based template matching, the presence of an object is detected by placing a template T at every possible position in the image. The template slides over the image, is aligned with each region in turn, and is compared with the corresponding pixels. Only the basic approach is considered here, without scaling or rotating the template.

Ultimately, a comparison method is used to determine the similarity between the template and each region of the image. These methods include:

  • Sum of Squared Differences (SSD): Measures the squared difference between corresponding pixel values in the template and image region
    \phi(x, \, y) = \sum_{u, \, v \, \in \, T} (I(x+u, \, y+v) - T(u, \, v))^2
  • Sum of Absolute Differences (SAD): Measures the absolute difference between corresponding pixel values
    \phi(x, \, y) = \sum_{u, \, v \, \in \, T} \lvert I(x+u, \, y+v)-T(u, \, v)\rvert
  • Zero-Normalized Cross-Correlation (ZNCC): Measures the normalized cross-correlation between the template and image region, taking into account variations in brightness and contrast
    \phi(x, \, y) = \dfrac{\displaystyle\sum_{u, \, v \, \in \, T}\left(I(x+u, \, y+v) - \bar{I}(x, \, y)\right)\cdot \left(T(u, \, v)-\overline{T}\right)}{\sigma_I(x, \, y)\cdot \sigma_T}

Note:

  • \bar{I}(x, \, y) represents the average of the image window located at (x, y)
  • \overline{T} represents the template average
  • \sigma_I(x, \, y) is the standard deviation on the image window
  • \sigma_T is the standard deviation on the template
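The three measures above can be sketched in a few lines of dependency-free Python. This is only an illustration, not an optimized implementation: images are plain nested lists, the helper names (`window`, `match_template`) are made up for this example, and the ZNCC helper additionally divides the sum by the number of template pixels, a normalization convention assumed here so that the score falls in [-1, 1] when the standard deviations are computed in the usual way. Note that SSD and SAD are lowest at the best match, while ZNCC is highest.

```python
import math

def window(image, x, y, h, w):
    """Return the h x w region whose top-left corner is at column x, row y."""
    return [row[x:x + w] for row in image[y:y + h]]

def ssd(region, template):
    """Sum of squared differences (0 means a perfect match)."""
    return sum((i - t) ** 2 for ri, rt in zip(region, template) for i, t in zip(ri, rt))

def sad(region, template):
    """Sum of absolute differences (0 means a perfect match)."""
    return sum(abs(i - t) for ri, rt in zip(region, template) for i, t in zip(ri, rt))

def zncc(region, template):
    """Zero-normalized cross-correlation (1 means a perfect match)."""
    flat_i = [v for row in region for v in row]
    flat_t = [v for row in template for v in row]
    n = len(flat_t)
    mean_i, mean_t = sum(flat_i) / n, sum(flat_t) / n
    std_i = math.sqrt(sum((v - mean_i) ** 2 for v in flat_i) / n)
    std_t = math.sqrt(sum((v - mean_t) ** 2 for v in flat_t) / n)
    if std_i == 0 or std_t == 0:
        return 0.0  # flat window or template: treated as "no match" (assumption)
    cov = sum((a - mean_i) * (b - mean_t) for a, b in zip(flat_i, flat_t)) / n
    return cov / (std_i * std_t)

def match_template(image, template, score, best=min):
    """Slide the template over every position and return the best (x, y)."""
    h, w = len(template), len(template[0])
    positions = [(x, y)
                 for y in range(len(image) - h + 1)
                 for x in range(len(image[0]) - w + 1)]
    return best(positions, key=lambda p: score(window(image, p[0], p[1], h, w), template))

image = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 2, 0],
    [0, 0, 3, 4, 0],
    [0, 0, 0, 0, 0],
]
template = [[1, 2], [3, 4]]

# Same scene with a different gain and offset (illumination change).
brighter = [[2 * v + 10 for v in row] for row in image]

best_ssd = match_template(image, template, ssd)                  # (2, 1)
best_sad = match_template(image, template, sad)                  # (2, 1)
best_zncc = match_template(brighter, template, zncc, best=max)   # (2, 1)
```

Because ZNCC subtracts the means and divides by the standard deviations, it still finds the template in the brightened image, which is the robustness to brightness and contrast mentioned above.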

Generalized Hough transform

The generalized Hough transform has already been covered in a previous section.

Bag of words (BoW)

The bag of words technique, originally inspired by document analysis, has been adapted and widely utilized in image and object classification tasks. This method is designed to be invariant to various factors, particularly viewpoint changes and deformations, making it robust in diverse scenarios.

At its core, the bag of words technique decomposes complex patterns in images into (semi) independent features, allowing for efficient analysis and classification. Just as in document analysis where words are treated as independent features, in image classification, visual features are extracted from local image regions and aggregated into a “bag of words” representation.

Bag of words

The bag of words technique enables the classification of images based on their content, regardless of specific geometric transformations or distortions.

Bag of words example

Once a dictionary containing all the features has been built, the features of a new image can be looked up in it and matched to recognize the image content.

Bag of words example

The first step, before any matching, is the extraction of meaningful keypoints and descriptors.

Extracting meaningful descriptors

The extracted features are then clustered.

Feature clustering

Each cluster is then summarized by a representative sample, for example its centroid, which acts as one "word" of the dictionary.

Centroids
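The pipeline above (cluster the descriptors, keep the centroids, describe an image by how often each centroid is hit) can be sketched as follows. This is a minimal illustration under several assumptions: toy 2-D vectors stand in for real keypoint descriptors, the clustering is a plain k-means with hand-picked initial centroids, and all function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(descriptor, centroids):
    """Index of the centroid (visual word) closest to the descriptor."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(descriptor, centroids[k]))

def kmeans(descriptors, centroids, iterations=10):
    """Plain k-means: assign each descriptor to a cluster, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for d in descriptors:
            clusters[nearest(d, centroids)].append(d)
        centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids

def bow_histogram(descriptors, centroids):
    """Normalized count of how many descriptors fall into each visual word."""
    counts = [0] * len(centroids)
    for d in descriptors:
        counts[nearest(d, centroids)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Descriptors collected from the training images (toy data).
training = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
vocabulary = kmeans(training, centroids=[[0, 0], [10, 10]])

# Descriptors extracted from a new image, turned into a bag-of-words vector.
new_image = [[0.2, 0.1], [9.5, 10.2], [10.4, 9.9]]
histogram = bow_histogram(new_image, vocabulary)  # one third / two thirds
```

The histogram ignores where in the image each feature was found, which is what makes the representation tolerant to viewpoint changes and deformations.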

Convolutional neural networks (CNNs)

When applying deep learning to images, we can treat the image as a 2D matrix of pixel values.

Images as a 2D matrix

In convolutional neural networks (CNNs), local connectivity refers to the concept that each neuron in a convolutional layer is connected to only a local region, known as its receptive field, in the previous layer. This arrangement allows the network to capture spatial information effectively, as each neuron is responsible for processing a specific region of the input.

Local connectivity

Weight sharing complements local connectivity: every neuron in a feature map applies the same set of weights to its own receptive field, so a feature is detected in the same way wherever it appears in the image. Sliding one set of shared weights over the input in this way is exactly a convolution operation, and it greatly reduces the number of parameters to learn.

shared weights
All green units share the same parameters but operate on a different input window
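Local connectivity and shared weights can be illustrated with a small, dependency-free sketch. The function name `conv2d` and the two hand-written kernels are assumptions made for this example; in a real network the kernel values are learned. As in most deep learning frameworks, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide one shared kernel over every window of the image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Each output unit only sees a kh x kw receptive field,
            # and every unit reuses the same kernel weights.
            sum(image[y + u][x + v] * kernel[u][v]
                for u in range(kh) for v in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

# An image with a vertical edge between the second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

vertical_edge = [[-1, 1], [-1, 1]]
horizontal_edge = [[-1, -1], [1, 1]]

# One feature map per kernel: different kernels compute different functions.
feature_maps = [conv2d(image, k) for k in (vertical_edge, horizontal_edge)]
# feature_maps[0] == [[0, 2, 0], [0, 2, 0]]  (responds to the vertical edge)
# feature_maps[1] == [[0, 0, 0], [0, 0, 0]]  (no horizontal edge present)
```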

Additionally, CNNs typically employ multiple feature maps in each convolutional layer, enabling the network to learn diverse sets of features from the input data.

using different feature maps
Brown and green units compute different functions

Finally, subsampling, often achieved through pooling layers, is used to reduce the dimensionality of feature maps while preserving their essential features.

Example of max pooling
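A minimal sketch of max pooling, assuming a 2x2 window with stride 2 (a common choice, not something fixed by the text above) and a hypothetical helper name:

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[y * stride + u][x * stride + v]
                for u in range(size) for v in range(size))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
pooled = max_pool2d(feature_map)  # [[6, 5], [7, 9]]
```

The 4x4 map shrinks to 2x2 while the strongest response of each region is preserved.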

Often, the pooling operation is coupled with a convolution operation to extract meaningful features from the image.

Pooling and convolution

An example is shown below.

Pooling and convolution example

Combining all the steps above, we get:

CNN example

Regularization strategies

Regularization strategies are essential techniques to prevent overfitting and improve the generalization ability of models. Dropout is a widely used regularization method in neural networks, where units are randomly dropped out, or deactivated, during training with a fixed probability p. This encourages the network to learn more robust and distributed representations by preventing co-adaptation of neurons.
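Dropout can be sketched as below. The rescaling of the surviving activations by 1 / (1 - p) is the "inverted dropout" variant, which is an assumption of this example rather than something stated above; it keeps the expected activation unchanged so that nothing needs to be rescaled at test time. The function name and signature are hypothetical.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Zero each unit with probability p during training (requires 0 <= p < 1)."""
    if not training or p == 0.0:
        return list(activations)  # at test time every unit is kept
    keep = 1.0 - p
    # Survivors are scaled up so the expected value of each unit stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 1000

train_out = dropout(activations, p=0.5, rng=rng)       # roughly half are 0.0, the rest 2.0
test_out = dropout(activations, p=0.5, training=False)  # unchanged
```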

L2 weight decay, another common regularization technique, penalizes large weights by adding a regularization term to the loss function proportional to the squared magnitude of the weights. The regularization strength, controlled by an adjustable decay value, determines the degree of penalization for large weights, thereby encouraging simpler models.
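As a rough sketch, the penalty and its effect on a gradient step can be written as follows. The names `l2_penalty` and `sgd_step`, and the convention of using `decay * sum(w^2)` without a factor of 1/2, are assumptions of this example.

```python
def l2_penalty(weights, decay):
    """Regularization term added to the loss: decay * sum of squared weights."""
    return decay * sum(w * w for w in weights)

def sgd_step(weights, grads, lr, decay):
    """One gradient step on (data loss + L2 penalty)."""
    # The gradient of decay * w^2 with respect to w is 2 * decay * w,
    # so every step pulls each weight slightly towards zero.
    return [w - lr * (g + 2.0 * decay * w) for w, g in zip(weights, grads)]

weights = [1.0, -2.0]
penalty = l2_penalty(weights, decay=0.5)  # 0.5 * (1 + 4) = 2.5

# With a zero data gradient, only the decay acts: each weight shrinks by 10%.
new_weights = sgd_step(weights, grads=[0.0, 0.0], lr=0.1, decay=0.5)
```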

Early stopping is a straightforward regularization approach that halts the training process when the performance on a validation dataset has not improved for a certain number of consecutive epochs; this number is referred to as the patience parameter. By preventing the model from continuing to learn on noisy or irrelevant information, early stopping helps prevent overfitting and results in models that generalize better to unseen data.
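The patience logic can be sketched on a pre-recorded list of validation losses. In a real training loop the losses would be computed epoch by epoch; the function name and return convention here are assumptions made for illustration.

```python
def early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
stop_epoch, best_epoch = early_stopping(val_losses, patience=3)  # (5, 2)
```

Training stops at epoch 5 because epochs 3, 4 and 5 all fail to beat the loss of epoch 2, so the model saved at epoch 2 would be kept. The later improvement at epoch 6 is never seen, which is the trade-off a small patience value implies.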

Preprocessing strategies

Data augmentation is used to enhance the diversity and variability of training data by introducing variations such as noise, rotation, flipping, and adjustments to hue, color, and saturation. The augmented data enriches the training dataset and helps improve the robustness and generalization ability of models.
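A few of these variations can be sketched on a grayscale image stored as nested lists with values in [0, 1]. The helper names, the 50% probabilities and the noise level are arbitrary assumptions for this example.

```python
import random

def horizontal_flip(image):
    return [row[::-1] for row in image]

def rotate90(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def add_noise(image, sigma, rng):
    """Add Gaussian noise and clip the result back to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in image]

def augment(image, rng, sigma=0.05):
    """Return a randomly perturbed copy of the image."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    if rng.random() < 0.5:
        image = rotate90(image)
    return add_noise(image, sigma, rng)

rng = random.Random(0)
image = [
    [0.0, 0.5, 1.0],
    [0.2, 0.4, 0.6],
]

# Each call produces a different variant of the same training sample.
variants = [augment(image, rng) for _ in range(4)]
```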

Batch processing

The mini-batch gradient descent technique involves using a small batch of data samples for each iteration of the optimizer instead of the entire dataset. This not only accelerates the gradient computation process but also helps alleviate memory constraints by processing only a subset of the data at a time.
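A minimal sketch of mini-batch gradient descent, using a one-dimensional linear model fitted with a mean squared error loss. The model, the toy data and the hyperparameters (`batch_size`, `lr`, `epochs`) are all assumptions chosen to keep the example small.

```python
import random

def minibatches(data, batch_size, rng):
    """Shuffle the dataset and yield it in chunks of batch_size samples."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

def train(data, batch_size=4, lr=0.05, epochs=300, seed=0):
    """Fit y = w * x + b, updating the parameters once per mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            # Gradients of the mean squared error, averaged over the batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free samples of y = 3x + 1.
data = [(k / 10, 3.0 * (k / 10) + 1.0) for k in range(20)]
w, b = train(data)  # w approaches 3 and b approaches 1
```

Each update only touches four samples, so the cost of one step does not grow with the size of the dataset.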
