September 29, 2023
Last updated: August 13, 2024
Computers can do many things, but can they see? Can they understand what they see? Can they help us with tasks that depend on vision, such as security, healthcare, entertainment, and education?
Computer vision is the field of AI that teaches systems to process visual input and extract meaningful information from it. In short, it aims to make computers see the way humans do.
You can picture computer vision as giving eyes to computers. With it, computers can help us in many ways that require vision, such as detecting diseases from medical images, enhancing photos and videos, creating new worlds with virtual and augmented reality, and even driving cars on their own.
However, implementing computer vision algorithms is far from easy. Humans see and interpret scenes effortlessly, but computers need large amounts of data and carefully designed models to learn the same skills.
Computer vision also has to cope with many complications, such as varying illumination, occlusion, noise, distortion, and changes in perspective. These factors can make computer vision systems less robust and less accurate, which is why researchers are constantly looking for new ways to overcome them.
In this blog, we will take a closer look at computer vision: how it works, the main techniques, common use cases, and some interesting computer vision projects.
Let us learn more about computer vision and some of the things it can do.
Image processing is the first and most fundamental step in computer vision. It involves applying operations to images to enhance them, reduce noise, correct distortion, adjust contrast, and so on. Image processing can also be used to extract useful information from images, such as edges, corners, and shapes.
Some of the common terms in image processing are:
Filtering means removing unwanted components from an image, such as noise or blur. Filters can be either linear or nonlinear.
Linear filters compute each new pixel value as a weighted average of the pixel values in its neighborhood. Nonlinear filters compute the new value with a nonlinear function of those neighboring values. Common examples include the Gaussian filter, the median filter, and the Laplacian filter.
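To make this concrete, here is a minimal sketch of linear and nonlinear filtering with OpenCV; the input filename is a placeholder and the kernel sizes are arbitrary choices.

```python
import cv2

# Load a grayscale image (placeholder filename)
img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Linear filter: Gaussian blur, a weighted average of neighboring pixels
gaussian = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)

# Nonlinear filter: median blur, replaces each pixel with its neighborhood median
median = cv2.medianBlur(img, 5)

# Laplacian filter: highlights regions of rapid intensity change
laplacian = cv2.Laplacian(img, cv2.CV_64F)
```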
Thresholding means converting a grayscale image into a binary image by assigning each pixel a value of 0 or 1 depending on whether it falls below or above a threshold. Thresholding can be used to separate foreground from background or to create masks for further processing.
Popular thresholding methods include global thresholding, adaptive thresholding, and Otsu's method.
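A minimal sketch of these three thresholding variants with OpenCV (filename and parameter values are placeholders):

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Global thresholding with a fixed cutoff of 127
_, global_th = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# Otsu's method chooses the threshold automatically from the histogram
_, otsu_th = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive thresholding computes a separate threshold for each local neighborhood
adaptive_th = cv2.adaptiveThreshold(
    img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
)
```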
Morphological operations apply simple shapes called structuring elements to an image to modify the shape and structure of its regions. The two basic operations are dilation and erosion.
Dilation expands the boundaries of foreground objects by adding pixels to them, while erosion shrinks those boundaries by removing pixels. Compound operations built from these include opening, closing, and skeletonization.
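A short OpenCV sketch of these operations on a binary mask (the filename and the 3x3 structuring element are arbitrary choices):

```python
import cv2
import numpy as np

binary = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)  # placeholder binary image
kernel = np.ones((3, 3), np.uint8)                     # structuring element

dilated = cv2.dilate(binary, kernel)                        # grows foreground regions
eroded = cv2.erode(binary, kernel)                          # shrinks foreground regions
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # erosion followed by dilation
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # dilation followed by erosion
```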
Histogram equalization adjusts the contrast of an image by remapping its pixel values so that they are spread more uniformly across the intensity range. This can make the details and features of an image more visible.
Common variants include global histogram equalization, adaptive histogram equalization, and contrast limited adaptive histogram equalization (CLAHE).
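For illustration, a minimal OpenCV sketch of global equalization and CLAHE (the parameter values are typical defaults, not prescriptions):

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Global histogram equalization
equalized = cv2.equalizeHist(img)

# CLAHE: equalization applied per tile, with contrast clipping to limit noise amplification
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
clahe_img = clahe.apply(img)
```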
Feature extraction is the process of deriving meaningful and distinctive information from images that can be used for further analysis and recognition. Features can be either local or global.
Local features are small patches or keypoints that remain relatively stable when the image changes in scale, rotation, or illumination. Global features describe the whole image or large regions of it, capturing its overall appearance.
Some of the common feature extraction techniques are:
Edge detection finds the boundaries or discontinuities in an image that separate different objects or regions. Edges can be located with methods that measure changes in pixel intensity along different directions (first-derivative operators), or with methods that look for places where the second derivative of intensity changes sign (zero crossings).
Examples of edge detection methods include the Canny edge detector, the Sobel operator, and the Laplacian of Gaussian (LoG) operator.
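Here is a minimal sketch of first-derivative (Sobel) and Canny edge detection with OpenCV; the thresholds are illustrative values:

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Sobel operator: first-derivative gradients along x and y
sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)

# Canny: gradient computation, non-maximum suppression, and hysteresis thresholding
edges = cv2.Canny(img, threshold1=100, threshold2=200)
```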
Corner detection finds points in an image where the intensity changes sharply in more than one direction. Corners can be located with methods that measure local changes in pixel values along different directions, or with methods that examine the eigenvalues of a matrix describing the local image structure (the structure tensor).
Examples of corner detection methods include the Harris corner detector, the Shi-Tomasi corner detector, and the FAST corner detector.
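A minimal OpenCV sketch of the three detectors mentioned above (parameter values are illustrative):

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Harris detector: scores corners from the eigenvalues of the local structure matrix
harris = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)

# Shi-Tomasi ("good features to track")
corners = cv2.goodFeaturesToTrack(img, maxCorners=100, qualityLevel=0.01, minDistance=10)

# FAST detector: compares pixels on a circle around each candidate point
fast = cv2.FastFeatureDetector_create()
keypoints = fast.detect(img, None)
```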
Blob detection finds regions in an image whose pixels share similar values or features. Blobs can be found by grouping pixels based on their intensity or color, or by locating local maxima and minima in a scale-space representation of the image.
Examples of blob detection methods include simple thresholding, connected components labeling, the Laplacian of Gaussian (LoG) blob detector, and the blob detector used in SIFT.
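As an illustration, a short OpenCV sketch combining connected components labeling with the built-in blob detector (parameter values are placeholders):

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Connected components labeling on an Otsu-thresholded binary image
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
num_labels, labels = cv2.connectedComponents(binary)

# SimpleBlobDetector groups regions of similar intensity into blobs
params = cv2.SimpleBlobDetector_Params()
params.filterByArea = True
params.minArea = 50
detector = cv2.SimpleBlobDetector_create(params)
keypoints = detector.detect(img)
```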
Feature descriptors represent the features extracted from an image as numerical vectors that encode their characteristics. Descriptors can be either handcrafted or learned. Handcrafted descriptors use predefined rules or formulas to compute the vectors from feature points or regions.
Learned descriptors use machine learning algorithms to derive the vectors from training data. Examples of feature descriptors include the SIFT, SURF, ORB, HOG, and LBP descriptors.
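A brief sketch of computing SIFT, ORB, and HOG descriptors with OpenCV (the 64x128 resize matches the default HOG window size):

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT: keypoints plus 128-dimensional descriptors
sift = cv2.SIFT_create()
kp_sift, desc_sift = sift.detectAndCompute(img, None)

# ORB: a fast binary descriptor
orb = cv2.ORB_create(nfeatures=500)
kp_orb, desc_orb = orb.detectAndCompute(img, None)

# HOG: a dense descriptor of gradient orientations
hog = cv2.HOGDescriptor()
hog_vector = hog.compute(cv2.resize(img, (64, 128)))
```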
Object detection is the task of locating and classifying objects of interest in an image or a video sequence. It can involve either single-object detection or multiple-object detection.
Single-object detection finds the location and class of one object in an image or video frame, while multiple-object detection finds the locations and classes of several objects at once.
Some of the common object detection techniques are:
Template matching finds the regions in an image that match a given template or model of an object. It works by measuring the similarity or difference between the template and image regions based on pixel values or feature vectors.
Examples of template matching measures include cross-correlation, the sum of squared differences (SSD), and normalized cross-correlation (NCC).
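A minimal OpenCV sketch of template matching with normalized cross-correlation (both filenames are placeholders):

```python
import cv2

scene = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.jpg", cv2.IMREAD_GRAYSCALE)

# Normalized cross-correlation: the best match is at the maximum of the score map
scores = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(scores)

h, w = template.shape
top_left = max_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
print(f"Best match at {top_left} with score {max_val:.3f}")
```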
The Viola-Jones algorithm is a fast and reliable method for face detection. It uses a cascade of simple classifiers built from Haar-like features to quickly reject non-face regions and accept face regions in an image. Haar-like features are rectangular features that measure the difference in pixel intensity between adjacent regions.
The algorithm uses a machine learning technique called AdaBoost to select and combine the most discriminative Haar-like features into a strong classifier that can detect faces with high accuracy and speed.
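OpenCV ships pretrained Haar cascades, so a minimal detection sketch looks like this (the image filename is a placeholder):

```python
import cv2

# Pretrained frontal-face cascade bundled with OpenCV
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("people.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Each detection is returned as an (x, y, width, height) rectangle
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```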
Histogram of oriented gradients (HOG) is a feature extraction technique that builds histograms of gradient directions in local cells or blocks of an image. HOG features capture the shape and appearance of an object by describing its edges and contours. For object detection, the HOG feature vectors are typically fed to a support vector machine (SVM) trained to classify objects.
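OpenCV also bundles a HOG descriptor with a pretrained linear SVM for pedestrian detection, which gives a compact illustration of the HOG + SVM pipeline (the image filename is a placeholder):

```python
import cv2

# HOG descriptor paired with OpenCV's pretrained people-detector SVM
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street.jpg")
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8))
for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)
```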
Convolutional neural networks (CNNs) are deep learning models composed of many layers of artificial neurons that apply convolution, pooling, activation, and fully connected operations to input images or feature maps. CNNs learn complex, hierarchical features from images using large amounts of labeled data and the backpropagation algorithm.
CNNs can be used for object detection through different architectures, such as region-based CNN (R-CNN), Fast R-CNN, Faster R-CNN, the single shot multibox detector (SSD), and you only look once (YOLO).
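As a sketch of CNN-based detection, here is inference with a pretrained Faster R-CNN from torchvision; this assumes a recent torchvision release with the `weights` API, and the image filename is a placeholder.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pretrained Faster R-CNN with a ResNet-50 FPN backbone
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("street.jpg").convert("RGB"))
with torch.no_grad():
    pred = model([img])[0]  # dict with "boxes", "labels", "scores"

# Keep only confident detections
for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.8:
        print(label.item(), [round(v, 1) for v in box.tolist()], round(score.item(), 2))
```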
Face recognition is the process of identifying or verifying a person's identity from an image or video of their face. It can be either face identification or face verification. Face identification finds the name or label of a person by matching their face against a database of known faces. Face verification confirms or rejects a claimed identity by comparing a face against a reference image or video.
Some of the common face recognition techniques are:
Eigenfaces is a face recognition method that uses principal component analysis (PCA) to reduce dimensionality and extract the most significant variations from face images. Eigenfaces are the eigenvectors of the covariance matrix of the face images, and they capture the main modes of variation in facial appearance.
Eigenfaces can be used for recognition by projecting face images onto the low-dimensional subspace they span and measuring the distance between the projected images.
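A minimal eigenfaces sketch with scikit-learn, using random placeholder data in place of real face images:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: 100 flattened 64x64 "face" images (replace with real data)
rng = np.random.default_rng(0)
faces = rng.random((100, 64 * 64))

# PCA learns the eigenfaces; each row of `projected` is a face in eigenface space
pca = PCA(n_components=20, whiten=True)
projected = pca.fit_transform(faces)

# Recognize a query face by nearest neighbour in the projected subspace
query = pca.transform(faces[:1])
distances = np.linalg.norm(projected - query, axis=1)
best_match = int(np.argmin(distances))
```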
Fisherfaces is a face recognition method that uses linear discriminant analysis (LDA) to reduce dimensionality while extracting the features that best separate face images of different people. Fisherfaces are the eigenvectors of a matrix that maximizes the ratio of between-class scatter to within-class scatter of the face images.
Fisherfaces can be used for recognition by projecting face images onto the low-dimensional subspace they span and measuring the distance between the projected images.
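The fisherfaces idea can be sketched the same way with scikit-learn's LDA; in practice PCA is usually applied first, and the data below is again a random placeholder:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder data: flattened face images with identity labels for 5 people
rng = np.random.default_rng(0)
faces = rng.random((100, 64 * 64))
labels = rng.integers(0, 5, size=100)

# LDA projects faces onto directions that best separate the identities
lda = LinearDiscriminantAnalysis(n_components=4)
projected = lda.fit_transform(faces, labels)

# Match a query face to the identity of its nearest projected neighbour
query = lda.transform(faces[:1])
distances = np.linalg.norm(projected - query, axis=1)
best_match_identity = int(labels[np.argmin(distances)])
```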
Local binary patterns (LBP) is a feature extraction technique that computes binary codes from local neighborhoods of pixels. Each neighbor is assigned a 0 or 1 depending on whether it is darker or brighter than the center pixel, so the codes capture local texture and pattern information.
For face recognition, histograms of LBP codes are computed over different regions of a face image and compared using measures such as the chi-square distance or histogram intersection.
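A short sketch of LBP texture histograms with scikit-image, using its built-in test image:

```python
import numpy as np
from skimage import data
from skimage.feature import local_binary_pattern

img = data.camera()  # built-in grayscale test image

# LBP codes over 8 neighbours at radius 1, using the rotation-invariant "uniform" variant
lbp = local_binary_pattern(img, P=8, R=1, method="uniform")

# Normalized histogram of LBP codes, usable as a texture descriptor
n_bins = int(lbp.max()) + 1
hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
```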
DeepFace is a deep learning model for face recognition developed by Facebook. It uses a convolutional neural network to learn features from aligned face images and compares faces by measuring the similarity of those learned features.
DeepFace achieves near-human performance on face recognition by training on roughly four million labeled face images and applying techniques such as 3D face alignment, data augmentation, and dropout.
Segmentation is the task of dividing an image into meaningful or homogeneous regions based on properties such as pixel intensity, color, texture, or shape. It can be either semantic segmentation or instance segmentation. Semantic segmentation assigns a class label to every pixel in the image, while instance segmentation additionally separates individual objects of the same class.
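A minimal semantic segmentation sketch using a pretrained DeepLabV3 model from torchvision (again assuming a recent torchvision release with the `weights` API; the filename is a placeholder):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pretrained DeepLabV3 with a ResNet-50 backbone
model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("scene.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    out = model(img)["out"]        # shape: (1, num_classes, H, W)

class_map = out.argmax(dim=1)      # per-pixel class labels
```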
Here are some applications of computer vision that have become popular with the general public. They have been on the market for many years and have refined their methods repeatedly based on user feedback and reception.
Google Photos is a photo and video service offered by Google. It uses computer vision to sort, search, and edit photos and videos uploaded to it. Google Photos can automatically find faces, objects, scenes, landmarks, animals, and more in photos and videos, and group them into albums, stories, collages, animations, etc.
Users can search for photos and videos using words or phrases such as “beach”, “dog”, or “sunset”. The platform can also enhance photos and videos with filters, effects, and adjustments.
Snapchat is a popular social media app among the younger generation. It lets users send and receive photos and videos, and it uses computer vision to create fun and interactive features such as lenses, filters, stickers, and Bitmojis. Lenses create augmented reality effects that can transform a face or its surroundings into different characters, animals, or objects.
Filters are overlays that add text, graphics, or colors to photos and videos. Stickers are images or animations that users can place on their snaps. Bitmojis are personalized avatars that users can create and share with each other.
FaceApp is a photo editing app that uses computer vision to change the look of subjects in different ways. FaceApp can make users look older or younger, change their gender or hairstyle, add or remove facial features, swap or merge faces with someone else, etc. FaceApp uses artificial neural networks to make realistic and high-quality changes in faces.
Tesla Autopilot is a driver assistance system that uses computer vision to enable semi-autonomous driving in Tesla vehicles. It can do various tasks, such as steering, accelerating, braking, changing lanes, parking, etc., by using cameras, radars, ultrasonic sensors, and neural networks to see the environment and control the vehicle.
DeepMind AlphaFold is a deep learning system that predicts the three-dimensional structure of proteins from their amino acid sequences, building on techniques originally developed for computer vision. Proteins are essential molecules of life that perform a wide range of functions in living organisms, and a protein's structure determines its function and its interactions with other molecules.
Predicting protein structure is a hard and important problem in biology and medicine. AlphaFold can predict protein structures with high accuracy and speed by using deep neural networks to model the spatial relationships between amino acids.
Computer vision is a field full of opportunities, but it also has its fair share of challenges. As we see more applications of computer vision in artificial intelligence, we can expect further improvements in the near future.