Perhaps you have already heard terms like “#deeplearning”, “#machinelearning” or “convolutional neural network” in the news or in your daily work. Over the last ten years, these words have spread quickly by word of mouth, mainly due to the appearance of Tesla and semi-autonomous driving vehicles.

However, the truth is, deep learning was there long before Tesla revolutionized the automotive industry. In the middle of the 20th century, machine learning techniques started being applied in research work focused mainly on statistics and probability calculations. These methods proposed mathematical ways of *learning* trends and data distributions by observing data samples over a period of time, so that they could *predict* how the data would behave at future moments. These were the first steps of what machine learning and deep learning are today.

It was not until the end of the 80s and the beginning of the 90s that one of the so-called “fathers of #AI”, Yann LeCun, came up with the term *Convolutional Neural Network (CNN)*, which is the basis of deep learning for image recognition.

A neural network can be understood as a set of layers of neurons, each connected to neurons within its own layer and in the adjacent layers. Each neuron aims to learn features or patterns from the input data in order to understand its distribution and predict how it will evolve in the future.
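To make the layer idea concrete, here is a minimal sketch of data flowing through a small fully connected network. The layer sizes and the ReLU non-linearity are illustrative choices, not something prescribed by the article:

```python
import numpy as np

def relu(x):
    """Elementwise non-linearity: a neuron only 'fires' for positive input."""
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    """Pass an input vector through each layer in sequence.

    Every neuron in a layer is connected to every neuron in the
    previous layer via one row of that layer's weight matrix.
    """
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)
    return x

# A toy network: 4 inputs -> 3 hidden neurons -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
biases = [np.zeros(3), np.zeros(2)]

output = forward(rng.standard_normal(4), weights, biases)
print(output.shape)  # (2,): one value per output neuron
```

Each weight matrix encodes all the connections between two adjacent layers; learning the network means adjusting those numbers.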

But *what does this really mean, and how is it applied to, for example, autonomous driving?* Let’s see how machine learning and deep learning are applied to image understanding. As mentioned above, CNNs appeared in the late 80s and proposed several types of layers that aim to learn features from the input image, features that characterize the objects we want to find in those images. The following image illustrates how CNNs work [1]. If, for example, we look for pedestrians, some important key features that we would like to locate in our images are human joints, since they characterize the shape of the pedestrian and, furthermore, give us information about its pose.

CNNs propose a set of convolutional layers applied in sequence on the image in order to find the features belonging to our target human joints. Images are nothing but matrices, typically formed by three color channels (red-green-blue, RGB), where each matrix element represents a pixel with a value between 0 and 255.
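The image-as-matrix idea can be shown in a few lines. The tiny 4×4 image below is made up for illustration:

```python
import numpy as np

# A tiny 4x4 RGB "image": a height x width x 3 array, one slice per
# color channel, each pixel an integer between 0 (dark) and 255 (bright).
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:, :, 0] = 255               # fill the red channel everywhere
image[1:3, 1:3, :] = [0, 255, 0]   # paint a green 2x2 square in the middle

print(image.shape)   # (4, 4, 3): height, width, channels
print(image[0, 0])   # [255 0 0] -> the top-left pixel is pure red
```

A real photo works exactly the same way, just with millions of pixels instead of sixteen.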

Convolutional layers are mathematical operations consisting of a convolution of a *filter* or *kernel* with the input image. These filters are also matrices, but with smaller dimensions than the input image, and they are applied iteratively across the whole image. The content of a filter determines the type of feature we look for in the image: horizontal lines, circle patterns, etc. Each kernel is made up of weights, and we aim to **learn** these weights over a large set of images so that we know which filters are most relevant for finding what we search for in the images we have.
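Sliding a filter across an image can be written out by hand. This is a minimal sketch of the operation (no padding, stride 1), using a classic hand-crafted horizontal-edge kernel as an example of the kind of filter a CNN would instead learn:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image, summing the elementwise
    products at every valid position ('valid' mode, no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 5x5 grayscale image with one bright horizontal band in the middle.
img = np.zeros((5, 5))
img[2, :] = 1.0

# A horizontal-edge filter: it responds where brightness changes between rows.
kernel = np.array([[-1.0, -1.0, -1.0],
                   [ 0.0,  0.0,  0.0],
                   [ 1.0,  1.0,  1.0]])

response = convolve2d(img, kernel)
print(response)  # strong positive/negative values just above/below the band
```

In a CNN, the numbers inside `kernel` are not fixed by hand like this; they are exactly the weights that training adjusts.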

In the above-mentioned example of pedestrian detection, our CNN would be a set of several convolutional layers with several filters and their corresponding weights. In order to learn these weights, the first image is passed through all of the layers so that, at the end of the sequence, we compare the features we have “filtered” out with the CNN against the actual labeled pedestrian in the image. This comparison is what is called the “**loss**”. The loss is typically the difference between where the model expected the human joint to be and where it actually is. What we want is to minimize this loss; once again, our pedestrian detection boils down to a mathematical optimization problem. We then use this loss to update the weights, which were randomly initialized, and start all over with the second image. Repeating this process over thousands of images, we end up with proper weights that form a CNN able to accurately detect human joints and, with them, the shape of the humans present in our images.
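The loop described above can be sketched in its simplest possible form. This is a heavily simplified stand-in, assuming a single linear layer instead of a full CNN: the model predicts a value from some features, the squared error against the label plays the role of the loss, and gradient descent nudges the weights after every sample:

```python
import numpy as np

rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0, 0.5])  # the unknown mapping we want to learn
w = rng.standard_normal(3)           # weights start out random

lr = 0.02                            # learning rate: how big each update is
for step in range(1000):
    x = rng.standard_normal(3)       # features from one "image"
    prediction = w @ x               # where the model thinks the target is
    target = true_w @ x              # where the label says it actually is
    loss = (prediction - target) ** 2        # the "loss" for this sample
    grad = 2 * (prediction - target) * x     # gradient of the loss w.r.t. w
    w -= lr * grad                   # update the weights, move to the next image

print(np.round(w, 2))  # should land close to true_w = [2., -1., 0.5]
```

A real CNN repeats exactly this cycle, just with millions of weights and with the gradients propagated back through every layer (backpropagation) instead of computed in one line.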

This principle of weight learning and image understanding has evolved over the last decades, resulting in new types of CNNs able to classify, detect or segment objects in images faster, more accurately and in fewer steps, which makes them ever more suited for industry.

REFERENCES:
