Convolutional Neural Networks for Computer Vision



Convolutional Neural Networks (CNNs) are very similar to ordinary neural networks: they are made up of neurons with learnable weights and biases. Like any neural network, a CNN receives some inputs, performs a dot product and applies a non-linearity. CNNs also follow the same journey from raw image pixels on one end to class scores on the other. They still have a loss function (such as SVM or Softmax) on the last layer, and all the tips and tricks for training ordinary neural networks still apply.
So the questions arise: Why CNNs? Why convolution? Why does an architecture require convolution at all? Let's handle these queries one by one, and we will gradually work our way to a famous CNN architecture, AlexNet, and understand why it became so celebrated in computer vision.

The Correlation of the Brain with Artificial Neural Networks

Many researchers have been trying to understand the brain and its functioning, especially the visual cortex. The brain is a highly dense network of neurons, and the activation or firing of these neurons is responsible for our actions, thoughts and memory. Each time we perform a certain task, the relevant connections become stronger, and that strengthening is what underlies a good memory. Our objective here is to correlate this with CNNs and with how our eyes and visual cortex interpret information. The brain contains millions of such neural structures. These neurons share parameters, and over the years our brain has evolved so much that it takes only a fraction of a second for you to recognize the object kept beside you. We will discuss some important concepts, local connectivity and parameter sharing, in relation to this neural architecture of the brain.

Why convolution?

Strictly speaking, convolution is a mathematical operation on two matrices, but it is not traditional matrix multiplication, despite being similarly denoted by *.
Convolution is the process of flipping both the rows and columns of one matrix and then multiplying locationally similar entries and summing them up. Since our discussion focuses on images and their convolution, we will call one of the matrices the kernel (or filter) and the other the image.
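To make the definition concrete, here is a minimal NumPy sketch of this flip-and-sum process (a 'valid' convolution with no padding and unit stride; the function name is ours, purely for illustration):

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 2D convolution: flip the kernel's rows and columns, then
    slide it over the image, multiplying overlapping entries and
    summing them up."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]      # flip both rows and columns
    out_h = image.shape[0] - kh + 1   # 'valid' output height
    out_w = image.shape[1] - kw + 1   # 'valid' output width
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * flipped)
    return out
```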
An image is nothing but a 2D signal from which you can extract valuable information such as edges, shapes, object features, colour segments and a lot more. Typically we do frequency analysis of a signal to extract useful information, using transforms such as the famous Fourier transform, the cosine transform and wavelet transforms (which give information in both the frequency and time domains).
Convolution is slightly different from the transforms mentioned above: here we try to find a correlation between two signals. Lost?
Here is an example of convolving a grayscale image with a small matrix. The result is an image showing edges: we have convolved the image to extract edges. In other words, we have found the correlation of the image with the matrix, and this matrix is called a convolution kernel. It follows that if we have several such kernels, we can extract many such features.
This is the basic idea of convolution: with more and more complex kernels we can extract complex features like shapes and corners, and one can ultimately extract enough features to identify what the image contains and predict which class it belongs to. The deeper math of convolution is out of scope, so for now just remember that convolution is the process of filtering an image to find a certain pattern. A simple example of finding face-like features is the popular Haar Cascade algorithm, which slides certain Haar-like features over the entire image. A small edge-extraction sketch follows below.
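For instance, reusing the convolve2d sketch above with a Laplacian kernel (a classic edge detector) on a tiny synthetic image: the response is zero over flat regions and non-zero only around the intensity jump, which is exactly the edge information described above.

```python
# Synthetic 8x8 image: dark left half, bright right half,
# i.e. a single vertical edge down the middle.
gray = np.zeros((8, 8))
gray[:, 4:] = 1.0

# Laplacian kernel: its entries sum to zero, so flat regions
# produce zero response; intensity jumps produce strong responses.
laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

edges = convolve2d(gray, laplacian)
print(edges)   # non-zero only in the columns straddling the edge
```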

Why is Computer Vision Not So Easy?

Well, the one-line answer to this question is: "We have not even been able to understand the vision process of our own eyes perfectly", so building vision for a computer, and making it understand what an object is, is really difficult.
Object detection is considered the most basic application of computer vision. In real life, every time we humans open our eyes, we unconsciously detect objects. The image captured by our eyes is transmitted to the brain; the brain processes the information and figures out what the object is using its own trained network of sensory nerves, and all this happens implicitly and in no time.
Since this is super-intuitive for us, we fail to appreciate the key challenges involved when we try to design systems similar to our eye. Let's start by looking at some of the key roadblocks:
  • Viewpoint variations: The same object can appear at various angles and in different poses, depending on the relative position of the observer and the object. It seems intuitively easy to see that it is the same object, but teaching this to a computer is still a challenge.
  • Illumination: A great roadblock in the field of computer vision is illumination. The problem is that poor lighting degrades the information in the signal, and different illumination environments can create very different appearances of the same image.
  • Background clutter: Another major issue in computer vision is background clutter. In a cluttered example image, look carefully and you may find a person: it seems easy for us to identify, but it is a lot harder to train a machine to understand what it is.

 

Introduction to Convolutional Neural Networks

CNNs were pioneered by Yann LeCun of New York University, who is also the director of Facebook's AI Research group, which means Facebook uses CNNs for its AI operations like auto-tagging and facial recognition. Before getting deep into Convolutional Neural Networks, let's first talk about why deep neural networks are popular and why they work so well.
The formal answer to this query is that traditional approaches to computer vision, including classical machine-learning techniques, are unable to cater to the vast amount of variation in images. The motive is to capture these non-linear variations and learn them, and of course deep learning does this well. A CNN does it by consecutively modelling small pieces of information and combining them deeper in the network: from the bits and pieces of information gathered, it keeps learning new features and new variations in the image, and subsequently learns the shape of the object.




The importance lies not only in building such a network but also in understanding what kind of parameters are learned by each layer of neurons and seeing what features have been learned. So it is important to understand how current DNNs work, in order to fuel intuitions for how to improve them. The image shows how a CNN looks at the world: running filters and extracting features from the image, starting from simple features like edges and colour components, then certain shapes in subsequent layers, and finally segmenting the image and ultimately predicting the class to which it belongs.

A CNN typically consists of 3 types of layers:

1. Convolution Layer
2. Pooling Layer
3. Fully Connected Layer

 

Architecture Overview

Before digging deep into the CNN, let's visualize it at the architecture level and see how it works.
To visualize this layer, imagine a set of evenly spaced flashlights all shining directly at a wall, where the wall is an image. Every flashlight is looking for the exact same pattern through a process called convolution. As discussed earlier, a CNN has filters, and its weights are nothing but these filters, which slide through the entire image looking for a specific pattern. A typical filter on the first layer is small spatially but extends through the full depth of the input volume; depth here refers to the three RGB channels, and hence images in the context of CNNs will be shown as volumes. During the forward pass, we convolve each filter across the width and height of the input volume, computing dot products between the entries of the filter and the input at every position, which generates a 2D activation map (or feature map) giving the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some typical visual feature.

Convolution Layer

This layer is the main crux of the network. It is visualized in the form of a cuboid having height, width and channels. Every image is represented as a 3D matrix of width * height * depth (channels); we generally skip the depth, but here we will consider it. Depth is nothing but the RGB channels: every colour image has three components (red, green and blue), which together carry the colour (hue) information.

The figure will help you visualize the image. A convolution layer is formed by running a convolution kernel over this volume, since we need to find patterns, and more specifically non-linear patterns. As discussed earlier, a filter or convolution kernel is another block or cuboid of smaller height and width but the same depth, which is swept over this base block.
A filter of size 5 x 5 x 3 starts from the top-left corner of the image and sweeps down to the lower-right bottom. The filter is applied over the entire image and convolved along the channels (i.e. RGB), which is why we mostly see the volumetric representation of the image, a cuboid in this case.


So with the entire set of filters (e.g. 12 filters), each filter producing a separate 2D activation map (feature map), we stack these activation maps along the depth dimension and produce the output volume. A toy forward pass illustrating this is sketched below.
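Here is a hedged, purely illustrative NumPy sketch of a convolution-layer forward pass (stride 1, no padding; following the usual CNN convention, the filter is slid without flipping, and the image and filter values are random stand-ins):

```python
import numpy as np

def conv_layer_forward(volume, filters, biases):
    """Naive convolution-layer forward pass (stride 1, no padding).
    volume  : H x W x D input (e.g. a 32 x 32 x 3 RGB image)
    filters : K x F x F x D filter bank
    biases  : K biases, one per filter
    Returns an (H-F+1) x (W-F+1) x K output volume: one 2D
    activation map per filter, stacked along the depth dimension."""
    H, W, D = volume.shape
    K, F, _, _ = filters.shape
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):                   # one activation map per filter
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                patch = volume[i:i + F, j:j + F, :]
                out[i, j, k] = np.sum(patch * filters[k]) + biases[k]
    return np.maximum(out, 0)            # ReLU non-linearity

image = np.random.rand(32, 32, 3)        # toy RGB input
filters = np.random.randn(12, 5, 5, 3)   # 12 filters of size 5x5x3
out = conv_layer_forward(image, filters, np.zeros(12))
print(out.shape)                         # (28, 28, 12)
```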

 

Detailed Notation:

The image produced after convolution is called an activation map or feature map. Such feature maps are obtained by the repeated application of a filter across sub-regions of the entire image: we convolve the input image with a linear filter, add a bias term, and then apply a non-linear function, which can be tanh or ReLU (mostly ReLU is used). If we denote the k-th feature map at a given layer as h^k, whose filter is determined by the weights W^k and bias b_k, then (for a ReLU non-linearity) the feature map is obtained as h^k_{ij} = max(0, (W^k * x)_{ij} + b_k), where x is the input and * denotes convolution.



You see the size shrank to 28 * 28. Why so? Let's look at a simpler case.
Suppose the initial image has size 6x6xd and the filter has size 3x3xd. We write the depth as d because it is the same for both (here it is three). Since the depth matches, we can look at the front view of how the filter would work.

Here we can see that the result would be a 4x4x1 volume block. Each time, the window moves forward by one block; this step size is called the stride, the number of blocks to move in each step, and it is one of the hyper-parameters which we will discuss shortly. So we have a simple formula:
output size = (N – F)/S + 1
 where 'N' is the image size, 'F' is the filter size and 'S' is the stride; hence (32-5)/1 + 1 = 28. One caution: some values of S can result in a non-integer output size. We generally avoid such values, and we do so by padding the image with zeros, giving it a border, to avoid such non-integral sizes.
Now have a look at this image and try to consolidate what we have learned so far. We see that the size keeps shrinking as we increase the filter size and stride.




This can create an undesirable situation in deep networks, where the size of the image becomes very small too early. It would also restrict the use of large filters, as they would shrink the size even faster.
Hence, to avoid this, we again pad the image to keep its size the same as the original. How much padding is needed is given by the formula (F-1)/2, where 'F' is the filter size.
So to keep the size of the image constant we pad by 2; the padded image size is now 36*36 and the general formula becomes:

output size = (N – F +2P)/S + 1
 where N=32, F=5, P=2 and S=1, so the output size is again (32-5+2*2)/1 + 1 = 32.
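As a quick sanity check, here is a tiny helper that evaluates this formula (the function name is ours, purely illustrative):

```python
def conv_output_size(N, F, S=1, P=0):
    """Spatial output size of a convolution: (N - F + 2P)/S + 1.
    Raises if the filter/stride/padding combination does not
    divide the image evenly."""
    size, rem = divmod(N - F + 2 * P, S)
    if rem != 0:
        raise ValueError("non-integer output size; adjust the padding")
    return size + 1

print(conv_output_size(32, 5))         # (32-5)/1 + 1 = 28
print(conv_output_size(32, 5, P=2))    # (32-5+4)/1 + 1 = 32
```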


Parameters of the Convolution Layer

There are two kinds of parameters involved here: hyper-parameters and plain parameters. Hyper-parameters are design choices that do not change during training, like the number of filters, the stride and the padding, whereas the parameters (the filter weights and biases) are the values that keep changing as the network learns. Have a look:
  • Hyper-parameters:
    • K: #filters
    • F: filter size (FxF)
    • S: stride
    • P: amount of padding
  • parameters = (F.F.D).K + K
    • F.F.D: number of weights in each filter (analogous to the volume of the cuboid)
    • (F.F.D).K: weights per filter multiplied by the number of filters
    • +K: K additional parameters for the bias terms.
Some additional points to take into consideration (a quick parameter count is sketched below):
  • K should be set as a power of 2 for computational efficiency
  • F is generally taken as an odd number
  • F=1 is sometimes used, and it makes sense because there is a depth component involved
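As a quick hedged computation, here is the parameter count for the running example of 12 filters of size 5x5 over a 3-channel input:

```python
def conv_layer_params(F, D, K):
    """Learnable parameters of a convolution layer:
    (F*F*D)*K weights plus K biases."""
    return (F * F * D) * K + K

print(conv_layer_params(F=5, D=3, K=12))   # (5*5*3)*12 + 12 = 912
```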

Pooling Layer

When we use padding in the convolution layer, the image size remains the same, so pooling layers are used to reduce the size of the image. The type of down-sampling used most often in CNN architectures is max pooling, which is a form of non-linear down-sampling: it partitions the image into a set of non-overlapping rectangles and, for each sub-region, outputs the maximum value.
Consider a 4×4 feature map. If we use a 2×2 filter with stride 2 and max-pooling, we get the response shown below: four 2×2 blocks are each collapsed into one value, their maximum. Generally max pooling is used, but other options like average pooling can be considered. Overall, the pooling layer is used to progressively reduce the spatial size of the representation and to reduce the number of parameters and the amount of computation in the network.
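Here is a minimal NumPy sketch of 2x2 max-pooling with stride 2 on a 4x4 map (the values are made up for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max-pooling: partition the map into non-overlapping size x size
    blocks and keep only the maximum of each block."""
    H, W = feature_map.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(0, H - size + 1, stride):
        for j in range(0, W - size + 1, stride):
            out[i // stride, j // stride] = feature_map[i:i + size,
                                                        j:j + size].max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 2, 9, 1],
              [3, 1, 4, 8]], dtype=float)
print(max_pool(x))   # [[6. 5.]
                     #  [7. 9.]]
```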

Apart from dimensionality reduction, why is pooling used?
 Max-pooling is useful in vision for two reasons:
  1. By eliminating non-maximal values, it reduces computation for upper layers.
  2. It provides a form of translation invariance. Imagine cascading a max-pooling layer with a convolutional layer. There are 8 directions in which one can translate the input image by a single pixel. If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5/8.
    Hence it is a “smart” way of reducing the dimensionality of intermediate representations.

Fully Connected Layer

At the end of the convolution and pooling layers, networks generally use fully-connected layers, in which each pixel is treated as a separate neuron, just like in a regular neural network. The last fully-connected layer contains as many neurons as there are classes to predict. The only difference between the FC and CONV layers is that the neurons in a CONV layer are connected to a local region of the input and share parameters, but both layers compute dot products, so it is possible to convert them back and forth. A minimal sketch of this hand-off follows.
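As a hedged illustration (all shapes and values are made up for illustration), the final pooled volume is flattened into a vector, where each pixel becomes one neuron, and fed through an ordinary dense layer:

```python
import numpy as np

volume = np.random.rand(4, 4, 12)   # final pooled volume (toy shape)
x = volume.reshape(-1)              # flatten: one neuron per pixel (192 values)
W = np.random.randn(10, x.size)     # fully connected weights, 10 classes
b = np.zeros(10)
scores = W @ x + b                  # one score per class
```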

The tutorials presented here will introduce you to simple CNN algorithms and will also show you how to run them using Theano. Theano is a Python library that makes writing deep-learning models easy, and gives the option of training them on a GPU.

Training Complex CNN architectures

Now that we have seen the various building blocks of a CNN and how to implement their forward pass, we will build and train a convolutional network on the CIFAR-10 dataset, using the popular LeNet architecture.

Load all necessary Packages
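The original code listing does not appear to have survived here. As a hedged starting point, assuming a recent Theano release (where the convolution op lives in theano.tensor.nnet and pooling in theano.tensor.signal.pool), the imports for a LeNet-style implementation might look like this:

```python
import numpy as np

import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d          # convolution op
from theano.tensor.signal.pool import pool_2d  # max-pooling op
```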