Artificial Intelligence / Autonomous Vehicle Driving


   

 Artificial Intelligence / Basic Understanding / Data Analysis 

 

Companies such as Nvidia, Amazon, Google, Facebook and Apple have started designing artificial intelligence software, under various names, capable of searching the internet and powering special features on your phone.

All of these AI systems are trained by feeding them large amounts of data.

Example: take a mango. We know it is a mango and not some other fruit or vegetable through our senses, but we cannot simply give those senses to a computer, so we have to train it. Let us see how.
 
1) We use the shape to identify the type of object. We import images into the system and feed them into its memory.

Whenever an object is placed in front of it, it starts to scan the object using whichever type of scanning has been enabled, but that is not as easy as it sounds.

2) We then train the AI further by feeding it more data on the different types of mangoes available around the world.


  Autonomous Vehicle Driving By NVIDIA 


A self-driving car (also known as an autonomous car or a driverless car) is a vehicle that is capable of sensing its environment and navigating without much human input.


The potential benefits of autonomous cars include reduced mobility and infrastructure costs, increased safety, increased mobility, increased customer satisfaction, and reduced crime. They also include a potentially significant reduction in traffic collisions and the resulting injuries and related costs, including a reduced need for insurance.

How Does AI Work in Autonomous Vehicles?


AI has become a popular buzzword these days, but how does it actually work in autonomous vehicles?

Let us first look at the human perspective of driving a car with the use of sensory functions such as vision and sound to watch the road and the other cars on the road. When we stop at a red light or wait for a pedestrian to cross the road, we are using our memory to make these quick decisions. The years of driving experience habituate us to look for the little things that we encounter often on the roads — it could be a better route to the office or just a big bump in the road.


We are building autonomous vehicles that drive themselves, but we want them to drive like human drivers do. That means we need to provide these vehicles with the sensory functions, cognitive functions (memory, logical thinking, decision-making and learning) and executive capabilities that humans use to drive vehicles.

AI Perception Action Cycle In Autonomous Vehicles



A repetitive loop, called the perception-action cycle, is created when the autonomous vehicle gathers data from its surrounding environment and feeds it to the intelligent agent, which in turn makes decisions and enables the vehicle to perform specific actions in that same environment. The figure below illustrates the data flow in autonomous vehicles:
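In code, the cycle is just a loop. The sketch below is a minimal illustration in Python; the sensor, agent and actuator objects and their method names are hypothetical placeholders, not any real autonomous-driving API:

    # Minimal sketch of a perception-action cycle (hypothetical objects,
    # not a real autonomous-driving API).
    def perception_action_cycle(sensors, agent, actuators):
        while True:
            observation = sensors.read()         # perceive the environment (camera, lidar, radar)
            action = agent.decide(observation)   # intelligent agent maps perception to a decision
            actuators.apply(action)              # steering/throttle/brake commands act on the environment
            # the changed environment is sensed again on the next pass, closing the loop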



Face Detection through Deep Neural Networks


Face detection is still an ongoing research topic and one of the most popular computer vision problems in industry. As of 2014, roughly 245 million surveillance cameras had been installed, and in the era of big data the amount of data captured through video is enormous. The problem with surveillance cameras is that a person cannot sit and monitor the video feed around the clock, so automatic face monitoring or detection is a necessary step towards increasing security. Thanks to the continuing research in deep learning and machine learning, reasonably accurate face detection is now possible. Although the face detection problem has largely been addressed by the classical Haar cascade and LBP algorithms, a lot of research is still going on to perfect these systems. Here we will first cover the classical approaches and then move on to newer CNN and MMOD techniques for solving the face detection problem.

Classical Approach:

Face detection using Haar feature-based cascade classifiers is a very effective object detection method proposed by Paul Viola and Michael Jones in their 2001 paper, "Rapid Object Detection using a Boosted Cascade of Simple Features". It is a machine learning approach in which a cascade function is trained from a large number of positive and negative images. The algorithm initially requires many positive and negative images to train the classifier; the resulting model is then used to detect objects in other images.

Feature extraction is done with Haar features, as shown. These Haar features are simply convolutional kernels, and each feature is a single value obtained by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle. The four-rectangle feature computes the difference between diagonal pairs of rectangles (Viola P. & Jones M., 2001).
All possible sizes and locations of each kernel are used to calculate features, and there are a huge number of them: even a 24 x 24 window can produce more than 160,000 features.

Algorithm Working: 

The Viola-Jones Haar cascade algorithm can be broken into four stages:
  1. Haar feature detection
  2. Integral images
  3. AdaBoost training
  4. Cascade classifiers
 1. Haar Features:
All human faces share some common properties, such as "the eye region is darker than the upper cheeks" and "the nose bridge region is brighter than the eyes". These similarities can be matched using Haar features and consequently used to detect whether a face is present in a particular frame or image.

One of the example images shows a feature that matches "the eye region is darker than the upper cheeks"; the other shows a feature that matches "the nose bridge region is brighter than the eyes".


The value of a rectangular feature is calculated by the equation:
Σ (pixels in black area) - Σ (pixels in white area)
In practice, though, this computation is done differently: since the features are built from rectangular regions, they can be calculated very quickly using the integral image approach.

2. Integral Images
A summed area table is a data structure and algorithm for quickly and efficiently generating the sum of values in a rectangular subset of a grid. In the image processing domain it is also known as an integral image. The value of the integral image at any point (x, y) is the sum of all the pixels above and to the left of (x, y), inclusive, in the original image:
ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')
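As a quick illustration of why this speeds things up, here is a small NumPy sketch (not tied to any face-detection library): the integral image is built with cumulative sums, after which any rectangular sum costs only four array lookups.

    import numpy as np

    def integral_image(img):
        """Summed-area table: ii[y, x] = sum of img[:y+1, :x+1]."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, top, left, bottom, right):
        """Sum of pixels in the (inclusive) rectangle, using four lookups."""
        total = ii[bottom, right]
        if top > 0:
            total -= ii[top - 1, right]
        if left > 0:
            total -= ii[bottom, left - 1]
        if top > 0 and left > 0:
            total += ii[top - 1, left - 1]
        return total

    img = np.arange(16, dtype=np.float64).reshape(4, 4)
    ii = integral_image(img)
    print(rect_sum(ii, 1, 1, 2, 2))   # same value as img[1:3, 1:3].sum()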
 3. AdaBoost training
As noted, the Haar cascade is a classification learning process that requires a set of positive and negative training images; the set of Haar features to use is selected by AdaBoost while training the classifier. AdaBoost is used to increase the learning performance of the base algorithm (which is sometimes called a weak learner). AdaBoost is a boosting algorithm from the machine learning domain, and boosting refers to the family of algorithms that convert weak classifiers into strong classifiers.

A very simple explanation of boosting: suppose we need to classify, on the basis of height alone, whether a person is a man or a woman. We could estimate that people taller than 5'8'' are men and the rest are women. This guess will be wrong fairly often, but it will still be right more often than not. This is a weak classifier, and AdaBoost focuses on combining such weak classifiers into a strong one.

The process of boosting starts by learning a single simple classifier ("height above 5'8''" is one such simple classifier) and then re-weighting the training data so that examples on which errors were made receive higher weights. A second simple classifier is then learned on the re-weighted data, the data is re-weighted again based on the combination of the first and second classifiers, and so on until the final classifier is learned. The final classifier is therefore a combination of all the previous n classifiers.
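The height example can be reproduced with an off-the-shelf boosting implementation. The sketch below uses scikit-learn's AdaBoostClassifier on a small made-up height dataset (assuming scikit-learn is installed); its default weak learner is exactly the kind of single-threshold decision stump described above.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    # Made-up heights (in inches); label 1 = man, 0 = woman.
    heights = np.array([[62], [64], [66], [67], [68], [69], [70], [72], [63], [71]])
    labels  = np.array([  0,    0,    0,    1,    0,    1,    1,    1,    0,    1])

    # The default weak learner is a depth-1 decision tree (a single height threshold);
    # boosting re-weights misclassified samples and combines many such stumps
    # into one stronger classifier.
    clf = AdaBoostClassifier(n_estimators=10)
    clf.fit(heights, labels)
    print(clf.predict([[65], [73]]))   # prints something like [0 1]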

 4. Cascade Classifiers:
In practice there is no single strong classifier of this kind. Instead, a series of weak classifiers is trained to form a cascade. The simplest classifiers come earliest in the cascade and can reject the majority of sub-windows that are unlikely to contain a face, while retaining the regions that have a greater chance of containing one. The next set of classifiers then runs only on the remaining regions, which need more complex analysis, and this is where the later stages of the cascade prove useful.
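For reference, a trained Viola-Jones cascade ships with OpenCV, so a minimal detection script looks roughly like the sketch below (assuming the opencv-python package is installed; the input file name is hypothetical):

    import cv2

    # Load OpenCV's pre-trained frontal-face Haar cascade.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    face_cascade = cv2.CascadeClassifier(cascade_path)

    img = cv2.imread("photo.jpg")                  # hypothetical input image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # the cascade works on grayscale

    # scaleFactor controls the image pyramid step; minNeighbors filters weak detections.
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("faces.jpg", img)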


There are other approaches, such as LBP (local binary patterns), used for face detection; LBP works in a similar way but is much faster and is therefore used on development boards like the Raspberry Pi and BeagleBone. However, as computing power and our understanding of neural networks and deep neural networks have increased, the way we approach such computer vision problems has also changed. We will now discuss in more detail the deep learning approach to the classical face detection problem.

  

Convolution Neural Network for Computer Vision



Convolutional neural networks are similar to ordinary neural networks and are made of neurons that have learnable weights and biases. Like all neural networks, a CNN receives some inputs, performs a dot product and follows it with a non-linearity. Like ordinary neural networks, a CNN maps raw image pixels at one end to class scores at the other; it still has a loss function (such as SVM or Softmax) on the last layer, and all the tips and tricks used for learning ordinary neural networks still apply.
So the questions arise: why a CNN, why convolution, and why does an architecture require convolution at all? Let us handle these queries one by one and then gradually move on to a famous CNN architecture, AlexNet, and understand why it became famous for computer vision.

The Correlation of the Brain with Artificial Neural Networks

Many researchers have been trying to understand the brain and how it functions, especially the visual cortex. The brain is a highly dense network of neurons, and the activation (firing) of these neurons is responsible for our actions, thoughts and memories. Each time we perform a certain task, the relevant connections become stronger, and that is what a good memory relies on. Our objective here is to relate this to CNNs and to how our eyes and visual cortex interpret information. The neural structure of the brain contains millions of such networks. These neurons share parameters, and over the years our brain has evolved so much that it takes only a fraction of a second to recognize the object lying beside you. We will discuss some important concepts such as local connectivity and parameter sharing in relation to the brain's neural architecture.

Why convolution?

Strictly speaking, convolution is a mathematical operation on two matrices, but the multiplication involved is not traditional matrix multiplication, despite being similarly denoted by *.
Convolution is the process of flipping both the rows and the columns of one matrix and then multiplying locationally similar entries and summing them up. Since our point of discussion is images and their convolution, we will call one of the matrices the kernel (or filter) and the other the image.
An image is simply a 2D signal from which you can extract valuable information such as edges, shape, object features, color segmentation and a lot more. Typically we do frequency analysis of a signal to extract useful information, using transforms such as the famous Fourier transform, the cosine transform and wavelet transforms (which give information in both the frequency and time domains).
Convolution is slightly different from the transforms mentioned above, in that we try to find a correlation between two signals. Lost?
Here is an example of convolving a grayscale image with a small matrix. The result is an image showing edges: we have convolved the image to extract its edges. In other words, we have found the correlation of the image with the matrix, and this matrix is called a convolution kernel. It follows that if we have several such kernels, we can extract many such features.
This is the basic idea of convolution: with more and more complex kernels we can extract complex features such as shapes and corners, and one can ultimately extract enough features to identify what the image contains and finally predict which class it belongs to. The math behind convolution is out of scope here, so for now just remember that convolution is the process of filtering the image to find a certain pattern. A simple example of finding face-like features is the popular Haar cascade algorithm, which slides certain Haar-like features over the entire image.
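To make the idea concrete, the sketch below convolves a tiny synthetic grayscale image with a Laplacian-like kernel using SciPy; this is just one possible edge-detecting kernel among many, and the image here is made up for illustration:

    import numpy as np
    from scipy.signal import convolve2d

    # A simple edge-detection (Laplacian-like) kernel.
    kernel = np.array([[ 0, -1,  0],
                       [-1,  4, -1],
                       [ 0, -1,  0]], dtype=np.float64)

    # A toy 6x6 "image": a bright square on a dark background.
    image = np.zeros((6, 6))
    image[2:4, 2:4] = 1.0

    # Convolution slides the kernel over the image; large responses mark edges.
    edges = convolve2d(image, kernel, mode="same")
    print(edges)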

Why Computer Vision is Not so Easy ?

Well, the one-line answer to this question is that we have not yet fully understood the vision process of our own eyes, so building vision for a computer and making it understand what an object is, is genuinely difficult.
Object detection is considered the most basic application of computer vision. In real life, every time we humans open our eyes we unconsciously detect objects. The image captured by the eye is transmitted to the brain; the brain processes the information and figures out what the object is using its own trained network of sensory neurons, and all of this happens implicitly and in almost no time.
Since this is so intuitive for us, we fail to appreciate the key challenges involved when we try to design systems similar to our eyes. Let us start by looking at some of the key roadblocks:
  • Viewpoint variations: the same object can appear at various angles and in different poses, depending on the relative position of the observer and the object. It seems intuitively obvious that it is the same object, but teaching this to a computer is still a challenge.
  • Illumination: a major roadblock in computer vision is illumination. The problem is that illumination degrades the information in the signal, and different lighting conditions can make the same object look very different.
  • Background clutter: another major issue is background clutter. In a cluttered scene it may be easy for a human to spot a person, but it is far harder to train a machine to do the same.

 

Introduction to Convolution Neural Network

CNNs were first pioneered by Yann LeCun of New York University, who later became the founding director of Facebook's AI Research group, which means Facebook uses CNNs for AI operations such as auto-tagging and facial recognition. Before getting deep into convolutional neural networks, let us first talk about why deep neural networks are popular and why they work so well.
The short answer is that traditional approaches to computer vision, based on classical machine learning techniques, are unable to cater for the vast amount of variation present in images. The motive is to capture these non-linear variations and learn them, and of course deep learning does this well. A CNN does it by consecutively modeling small pieces of information and combining them deeper in the network. From these bits and pieces of information it keeps learning new features and new variations in the image, and subsequently learns the shape of the object itself.




The importance lies not only in building such networks but also in understanding what kind of parameters are learned by each layer of neurons and what features have been learned. So it becomes important to understand how current DNNs work and to build intuition for how to improve them. The image shows how a CNN sees the world: it runs filters and extracts features from the image, starting with simple features such as edges and color components, then certain shapes in subsequent layers, and finally it segregates the image and predicts the class to which it belongs.

A CNN typically consists of 3 types of layers:

1. Convolution Layer
2. Pooling Layer
3. Fully Connected Layer

 

Architecture Overview

Before digging deep into the CNN, let us visualize it at the architecture level and see how it works.
To visualize the first layer, imagine a set of evenly spaced flashlights all shining directly at a wall, where the wall is an image. Every flashlight is looking for exactly the same pattern, through a process called convolution. As discussed earlier, a CNN has filters, and its weights are nothing but these filters, which slide across the entire image looking for a specific pattern. A typical filter on the first layer is small spatially but extends through the full depth of the input volume; depth here refers to the three RGB channels, which is why images in the context of CNNs are shown as volumes. During the forward pass, we convolve each filter across the width and height of the input volume and compute the dot product between the entries of the filter and the input at each position, generating a 2D activation map (feature map) that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some typical visual feature.

Convolution layer   

This layer is the main crux of the network. It is visualized as a cuboid having height, width and channels. Every image is represented as a 3D matrix of width * height * depth (channels); we generally skip the depth, but here we will consider it. Depth is nothing but the RGB channels: every color image has three components (red, green and blue), which also carry additional (non-linear) color information, technically expressed as hue.

The figure will help you visualize the image. A convolution layer is formed by running a convolution kernel over it, since we need to find patterns, and more specifically non-linear patterns. As discussed earlier, a filter (convolution kernel) is another block or cuboid of smaller height and width but the same depth, which is swept over this base block.
A filter of size 5 x 5 x 3 starts at the top-left corner of the image and sweeps down to the bottom-right. The filter is applied over the entire image and also convolved along the channels (RGB), because we mostly work with the volumetric representation of the image, a cuboid in this case.


So with an entire set of filters (e.g. 12 filters), each producing a separate 2D activation or feature map, we stack these activation maps along the depth dimension and produce the output volume.

 

Detailed Notation:

The image produced after convolution is called an activation map or feature map. Such feature maps are obtained by repeated application of a filter across sub-regions of the entire image: the input image is convolved with a linear filter, a bias term is added, and a non-linear function is then applied, which can be tanh or ReLU (mostly ReLU is used). If we denote the k-th feature map at a given layer as h^k, whose filter is determined by the weights W^k and bias b_k, then the feature map is obtained as h^k_{ij} = f((W^k * x)_{ij} + b_k), where f is the chosen non-linearity.



You can see the size shrank to 28 x 28. Why so? Let us look at a simpler case.
Suppose the initial image has size 6x6xd and the filter has size 3x3xd. We write the depth as d because it is the same for both (three for an RGB image). Since the depth is the same, we can look at the front view of how the filter works.

Here we can see that the result is a 4x4x1 volume block. Each time, the window moves forward and we skip one block; the number of blocks moved in each step is called the stride, one of the hyper-parameters we will discuss shortly. So we have a simple formula:
output size = (N – F)/S + 1
 where 'N' is the image size, 'F' the filter size and 'S' the stride; hence (32-5)/1 + 1 = 28. One caution: some values of S result in a non-integer output size. We generally avoid such values, and we do so by padding the image with zeros, giving it a border, so that non-integral sizes do not arise.
Now have a look at this image and try to consolidate what we have learned so far. We see that the size keeps shrinking as we stack more filters and use larger strides.




This can lead to an undesirable situation in deep networks, where the size of the image becomes very small too early. It would also restrict the use of large filters, as they would shrink the size even faster.
Hence, to avoid this, we again pad the image to keep its size the same as the original; how much padding to apply is given by the formula (F-1)/2, where 'F' is the filter size.
So, to keep the size of the image constant we pad by 2, the padded image becomes 36 x 36, and the general formula becomes:

output size = (N – F +2P)/S + 1
 where N=32, F=5, P=2 and S=1, so the output size is again (32-5+2*2)/1 + 1 = 32.
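The two calculations above are easy to verify programmatically; the helper below is a small illustration of the formula, not part of any library:

    def conv_output_size(n, f, s=1, p=0):
        """Spatial output size of a convolution: (N - F + 2P)/S + 1."""
        out = (n - f + 2 * p) / s + 1
        assert out.is_integer(), "choose stride/padding so the size is an integer"
        return int(out)

    print(conv_output_size(32, 5))        # 28: no padding, stride 1
    print(conv_output_size(32, 5, p=2))   # 32: padding (F-1)/2 = 2 keeps the size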


Parameters of the of the Convolution layer

There are two kinds of parameters involved here: the hyper-parameters and the (learnable) parameters. Hyper-parameters are fixed choices that do not change throughout the convolution layer, such as the number of filters, the stride and the padding, whereas the parameters are the filter weights and biases, which keep changing as the network is trained. Have a look (a small numeric check of the parameter count follows the lists below):
  • Hyper-parameters:
    • K: #filters
    • F: filter size (FxF)
    • S: stride
    • P: amount of padding
  • parameters = (F.F.D).K + K
    • F.F.D : Number of parameters for each filter (analogous to volume of the cuboid)
    • (F.F.D).K : Volume of each filter multiplied by the number of filters
    • +K: adding K parameters for the bias term.
Some additional points to be taken into consideration:
  • K should be set to a power of 2 for computational efficiency
  • F is generally taken to be an odd number
  • F = 1 is sometimes used, and it makes sense because there is still a depth component involved
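As a concrete check of the parameter formula, the snippet below counts the parameters of a hypothetical layer with 5x5 filters over an RGB (depth 3) input and K = 12 filters:

    def conv_layer_params(f, d, k):
        """Parameters of a conv layer: (F*F*D) weights per filter, K filters, plus K biases."""
        return (f * f * d) * k + k

    # Example: 12 filters of size 5x5 over an RGB (depth 3) input.
    print(conv_layer_params(5, 3, 12))   # (5*5*3)*12 + 12 = 912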

Pooling Layer

When we use padding in the convolution layer, the image size stays the same, so pooling layers are used to reduce the size of the image. The type of down-sampling most often used in CNN architectures is max pooling, which is a form of non-linear down-sampling. It works by sliding a filter over each feature map: max pooling partitions the image into a set of non-overlapping rectangles and, for each sub-region, outputs the maximum value.
Consider a 4×4 layer. If we use a 2×2 filter with stride 2 and max pooling, we get the following response:
Here you can see that four 2×2 blocks are each reduced to a single value, their maximum. Generally max pooling is used, but other options such as average pooling can also be considered. Overall, the pooling layer is used to progressively reduce the spatial size of the representation and thereby reduce the number of parameters and the amount of computation in the network.
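A 2x2 max-pool with stride 2 can be written in a few lines of NumPy by reshaping the feature map into 2x2 blocks and taking the maximum of each block; the sketch below uses a made-up 4x4 feature map and assumes even height and width:

    import numpy as np

    def max_pool_2x2(fmap):
        """2x2 max pooling with stride 2; assumes even height and width."""
        h, w = fmap.shape
        return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    fmap = np.array([[1, 3, 2, 4],
                     [5, 6, 1, 2],
                     [7, 2, 9, 1],
                     [3, 4, 6, 8]])
    print(max_pool_2x2(fmap))
    # [[6 4]
    #  [7 9]]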

Apart from dimensionality reduction, why is pooling used?
 Max-pooling is useful in vision for two reasons:
  1. By eliminating non-maximal values, it reduces computation for upper layers.
  2. It provides a form of translation invariance. Imagine cascading a max-pooling layer with a convolutional layer. There are 8 directions in which one can translate the input image by a single pixel. If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5/8.
    Hence it is a “smart” way of reducing the dimensionality of intermediate representations.

Fully Connected Layer

At the end of the convolution and pooling layers, networks generally use fully-connected layers, in which each input value is treated as a separate neuron, just like in a regular neural network. The last fully-connected layer contains as many neurons as the number of classes to be predicted. The only difference between an FC and a CONV layer is that the neurons in a CONV layer are connected to a local region of the input and share parameters, but both layers still compute dot products, so it is possible to convert between the two forms.
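A minimal NumPy sketch of this final step, with made-up sizes and random weights, is shown below: the pooled feature maps are flattened into a vector, and each class score is one dot product.

    import numpy as np

    rng = np.random.default_rng(0)

    # Suppose the last pooling layer produced 12 feature maps of size 4x4.
    pooled = rng.standard_normal((12, 4, 4))

    x = pooled.reshape(-1)                   # flatten to a 192-dimensional vector
    W = rng.standard_normal((10, x.size))    # 10 classes, fully-connected weights
    b = np.zeros(10)

    scores = W @ x + b                       # one dot product per class neuron
    print(scores.argmax())                   # index of the predicted class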

The tutorials presented here will introduce you to simple CNN algorithms and will also show you how to run them using Theano. Theano is a Python library that makes writing deep learning models easy and gives the option of training them on a GPU.

Training Complex CNN architectures

Now that we have seen the various building blocks of a CNN and how to implement their forward pass, we will build and train a convolutional network on the CIFAR-10 dataset, using the popular LeNet architecture.

Load all necessary Packages
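As a starting point, the import block below shows the packages such a Theano-based script typically needs; this is a sketch only, and the exact modules depend on the Theano version and on how the layers and data loading are implemented:

    # Hypothetical import block for a Theano LeNet-style script; adjust to your
    # Theano version and to how the dataset loading and layers are implemented.
    import numpy as np
    import theano
    import theano.tensor as T
    from theano.tensor.nnet import conv2d
    from theano.tensor.signal.pool import pool_2d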