Convolutional Neural Networks (CNNs) are similar to ordinary neural networks: they are made of neurons that have learnable weights and biases. Like any neural network, a CNN receives some inputs, performs a dot product, and follows it with a non-linearity. Like normal neural networks, CNNs also follow the same journey from raw image pixels on one end to class scores on the other. They still have a loss function (like SVM / Softmax) on the last layer, and all the tips and tricks involved in training a normal neural network still apply.
So the questions arise: why CNNs, and why convolution? Why does an architecture require convolution at all? Let's handle these queries one by one, and then move on to a famous CNN architecture, AlexNet, to understand why it became so influential in computer vision.
The Correlation of the Brain with Artificial Neural Networks
Researchers have long been trying to understand the brain and how it functions, especially the visual cortex. The brain is a highly dense network of neurons, and the activation, or firing, of these neurons is responsible for our actions, thoughts, and memory. Each time we perform a certain task, the connections involved become stronger, and that is what underlies a good memory. Our objective is to correlate this with CNNs and with how our eyes and visual cortex interpret information. The neural structure of the brain contains millions of such networks. These neurons share parameters, and over the years our brain has evolved so much that it takes a fraction of a second to recognize the object sitting beside you. We will discuss the important concepts of local connectivity and parameter sharing in relation to this neural architecture.
Why Convolution?
Strictly speaking, convolution is a mathematical operation on two matrices, but the multiplication involved is not traditional matrix multiplication, despite being similarly denoted by *.
Convolution is the process of flipping both the rows and columns of one matrix and then multiplying locationally similar entries and summing them up. Since our discussion focuses on images and their convolution, we will call one of the matrices the kernel (or filter) and the other the image.
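To make this concrete, here is a minimal sketch of such a convolution in Python with NumPy (the implementation is ours, for illustration only):

    import numpy as np

    def convolve2d_naive(image, kernel):
        # 'Valid' 2D convolution: flip the kernel's rows and columns,
        # then slide it over the image, multiplying locationally
        # similar entries and summing them up.
        kernel = np.flipud(np.fliplr(kernel))
        kh, kw = kernel.shape
        ih, iw = image.shape
        out = np.zeros((ih - kh + 1, iw - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out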
An image is nothing but a 2D signal from which you can extract valuable information such as edges, shapes, object features, color segmentation, and a lot more. Typically we do frequency analysis of a signal to extract useful information, using transforms such as the famous Fourier transform, the cosine transform, and wavelet transforms (which give information in both the frequency and time domains).
Convolution is slightly different from the transforms mentioned above, in that we are trying to find a correlation between two signals. Lost?
Here is an example of convolving a grayscale image with a matrix. The result is an image showing edges: we have convolved the image to extract its edges. In other words, we have found the correlation of the image with the matrix, and this matrix is called a convolution kernel. It follows that if we have several such kernels, we can extract many such features.
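For instance, a hedged sketch with SciPy and a common Laplacian-style edge kernel (the kernel and image from the article's figure are not shown, so both are illustrative):

    import numpy as np
    from scipy.signal import convolve2d

    # A common 3x3 edge-detection (Laplacian-style) kernel.
    edge_kernel = np.array([[-1, -1, -1],
                            [-1,  8, -1],
                            [-1, -1, -1]])

    gray = np.random.rand(32, 32)                        # stand-in for a real grayscale image
    edges = convolve2d(gray, edge_kernel, mode="valid")  # image showing edges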
This is the basic idea of convolution: with more and more complex kernels we can extract complex features like shapes and corners, and one can extract multiple features to ultimately identify what the image contains and finally predict which class it belongs to. The deeper math of convolution is out of scope, so for now just remember that convolution is the process of filtering through the image to find a certain pattern. A simple example of finding face-like features is the popular Haar Cascade algorithm, which slides a set of Haar-like features over the entire image.
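As a rough illustration of that idea (assuming OpenCV is installed; the cascade file below is the standard one bundled with opencv-python, and photo.jpg is a hypothetical input):

    import cv2

    # Load OpenCV's bundled frontal-face Haar cascade.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("photo.jpg")                     # hypothetical input file
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:                        # one box per detected face
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)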
Why Is Computer Vision Not So Easy?
Well... the one-line answer to this question is: "We have not even understood the vision process of our own eye perfectly". Building vision for a computer, and making it understand what an object is, is therefore really difficult.
Object detection is considered the most basic application of computer vision. In real life, every time we humans open our eyes, we unconsciously detect objects. The image captured by our eye is transmitted to the brain; the brain processes the information and figures out what the object is using its own trained network of sensory nerves, and all this happens implicitly and in almost no time.
Since this is so intuitive for us, we fail to appreciate the key challenges involved when we try to design systems similar to our eye. Let's start by looking at some of the key roadblocks:
- Viewpoint variations: The same object can appear at various angles and in different poses, all depending on the relative position of the observer and the object. It seems intuitively easy that it is the same object, but teaching this to a computer is still a challenge.
- Illumination: A great roadblock in the field of computer vision is illumination. The problem is that illumination degrades the information in the signal, and different illumination environments can create very different appearances for the same scene.
- Background clutter: Another major issue in computer vision is background clutter. A person standing in a cluttered scene may be easy for you to spot if you look carefully, but it is a lot harder to train a machine to understand what it is seeing.
Introduction to Convolutional Neural Networks
CNNs were pioneered by Yann LeCun of New York University, who is also the founding director of Facebook's AI Research group, which means Facebook uses CNNs for AI operations like auto-tagging and facial recognition. Before getting deep into Convolutional Neural Networks, let's first talk about why deep neural networks are popular and why they work so well.
The formal answer to this query is that traditional approaches to computer vision, built on classical machine learning techniques, are unable to cater to the vast amount of variation in images. The motive is to capture these non-linear variations and learn them, and of course deep learning does this well. A CNN does it by consecutively modeling small pieces of information and combining them deeper in the network: from the bits and pieces of information gathered, it keeps learning new features and new variations in the image, and thereby the shape of the object.
The importance lies not only in building such a network but also in understanding what kind of parameters are learned by each layer of neurons and seeing what features have been learned. It is therefore important to understand how current DNNs work, to fuel intuitions for how to improve them. A typical visualization shows how a CNN sees the world: it runs filters and extracts features from the image, starting with simple features like edges and color components, then certain shapes in subsequent layers, and finally segments the image and predicts the class to which it belongs.
A CNN typically consists of 3 types of layers:
1. Convolution Layer
2. Pooling Layer
3. Fully Connected Layer
Architecture Overview
Before digging deep into the CNN, let's visualize it at an architectural level and see how it works.
To visualize this layer, imagine a set of evenly spaced flashlights all shining directly at a wall, where the wall is an image. Every flashlight is looking for the exact same pattern through a process called convolution. As discussed earlier, a CNN has filters, and its weights are nothing but these filters, which slide through the entire image looking for a specific pattern. A typical filter on the first layer is small spatially but extends through the full depth of the input volume; depth here refers to the 3 RGB channels, which is why images in the context of CNNs are shown as volumes. During the forward pass, we convolve each filter across the width and height of the input volume, computing dot products between the entries of the filter and the input at every position. This generates a 2D activation map (or feature map) that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some typical visual feature.
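A rough NumPy sketch of this forward pass (stride 1, no padding; the shapes are illustrative, not taken from the article):

    import numpy as np

    def conv_forward(volume, filters, biases):
        # volume: H x W x D input; filters: K x F x F x D; biases: K.
        # Convolve each filter across the width and height of the input
        # volume and stack the K resulting 2D activation maps.
        H, W, D = volume.shape
        K, F, _, _ = filters.shape
        out = np.zeros((H - F + 1, W - F + 1, K))
        for k in range(K):                      # one activation map per filter
            for i in range(out.shape[0]):
                for j in range(out.shape[1]):
                    patch = volume[i:i + F, j:j + F, :]
                    out[i, j, k] = np.sum(patch * filters[k]) + biases[k]
        return out

    # e.g. a 32x32x3 image with twelve 5x5x3 filters -> a 28x28x12 output volume
    maps = conv_forward(np.random.rand(32, 32, 3),
                        np.random.rand(12, 5, 5, 3), np.zeros(12))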
Convolution Layer
This layer is the main crux of the network. It is visualized as a cuboid having a height, a width, and channels. Every image is represented as a 3D matrix of width × height × depth (channels). We generally skip the depth, but here we will consider it: depth is nothing but the RGB channels. Every color image has three components (red, green, and blue), which also carry non-linear information, expressed technically as hue, or color information.
The figure helps you visualize the image as a volume. A convolution layer is formed by running a convolution kernel over it, since we need to find patterns, or more specifically non-linear patterns. As discussed earlier, a filter (convolution kernel) is another block, or cuboid, of smaller height and width but the same depth, which is swept over this base block.
A filter of size 5 × 5 × 3 starts from the top-left corner of the image and sweeps down to the bottom-right corner. The filter is applied over the entire image and is convolved along the channels (RGB) as well, because we mostly use the volumetric representation of the image, a cuboid in this case.
So with an entire set of filters (e.g. 12 filters), each producing a separate 2D activation (feature) map, we stack these activation maps along the depth dimension to produce the output volume.
Detailed Notation:
The image produced after convolution is called an activation map or feature map. Such feature maps are obtained by repeated application of a filter across sub-regions of the entire image: the input image is convolved with a linear filter, a bias term is added, and then a non-linear function such as tanh or ReLU is applied (mostly ReLU is used). If we denote the k-th feature map at a given layer as h^k, whose filters are determined by the weights W^k and bias b_k, then the feature map is obtained (for a tanh non-linearity) as:

h^k_{ij} = tanh((W^k * x)_{ij} + b_k)
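Since this article later runs its examples with Theano, here is a minimal Theano sketch of exactly this computation (the filter count and size are illustrative):

    import numpy as np
    import theano
    import theano.tensor as T
    from theano.tensor.nnet import conv2d

    x = T.tensor4("x")                      # input: (batch, channels, height, width)
    rng = np.random.RandomState(0)
    # 12 filters of size 5x5 spanning 3 input channels
    W = theano.shared(rng.randn(12, 3, 5, 5).astype(theano.config.floatX), name="W")
    b = theano.shared(np.zeros(12, dtype=theano.config.floatX), name="b")
    # h^k = tanh((W^k * x) + b_k): convolve, add the bias, apply the non-linearity
    h = T.tanh(conv2d(x, W) + b.dimshuffle("x", 0, "x", "x"))
    feature_maps = theano.function([x], h)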
You see the size shrank to 28 × 28. Why so? Let's look at a simpler case.
Suppose the initial image has size 6 × 6 × d and the filter has size 3 × 3 × d (we write the depth as d; for an RGB input it is three). Since the depths are the same, we can look at the front view of how the filter moves.
Here we can see that the result is a 4 × 4 × 1 volume block. Each time, the window moves forward by one block. This step size is called the stride: the number of blocks to move in each step. It is one of the hyperparameters, which we will discuss shortly.
So we have a simple formula:

output size = (N – F)/S + 1

where N is the image size, F is the filter size, and S is the stride; hence (32 – 5)/1 + 1 = 28. One caution: some values of S result in a non-integer output size. We generally avoid such values, and we do so by padding the image with zeros, giving it a border, so that no non-integral sizes occur.
Now have a look at this image and try to consolidate what we have learned so far. We see that the size keeps shrinking as we stack more filters and use larger strides. This can create an undesirable situation in deep networks, where the size of the image becomes very small too early. It would also restrict the use of large filters, as they would shrink the size even faster.
Hence, to avoid this, we again pad the image to keep its size the same as the original. How much padding is needed is decided by the formula (F – 1)/2, where F is the filter size.
So to keep the size of the image constant we pad by 2, making the padded image 36 × 36, and the general formula becomes:

output size = (N – F + 2P)/S + 1

where N = 32, F = 5, P = 2, and S = 1, so the output size is again (32 – 5 + 2×2)/1 + 1 = 32.
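Both formulas are easy to check in a few lines of Python (our own helper, for illustration):

    def conv_output_size(N, F, S=1, P=0):
        # Spatial output size of a convolution: (N - F + 2P)/S + 1.
        # Raises if the stride does not tile the padded image evenly
        # (the non-integer case the text warns about).
        size, rem = divmod(N - F + 2 * P, S)
        if rem:
            raise ValueError("filter/stride do not tile the padded image")
        return size + 1

    print(conv_output_size(32, 5))        # (32-5)/1 + 1 = 28
    print(conv_output_size(32, 5, P=2))   # (32-5+4)/1 + 1 = 32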
Parameters of the Convolution Layer
There are two kinds of parameters involved here: hyperparameters and plain parameters. Hyperparameters are choices that stay fixed throughout a convolution layer, like the number of filters, the stride, and the padding, whereas the parameters (the filter weights and biases) keep changing throughout the network as it learns.
Pooling Layer
When we use padding in the convolution layers, the image size stays the same, so pooling layers are used to reduce the size of the image. The type of down-sampling mostly used in CNN architectures is max pooling, which is a form of non-linear down-sampling: max pooling partitions the image into a set of non-overlapping rectangles and, for each sub-region, outputs the maximum value.
Consider the following 4 × 4 layer. If we use a 2 × 2 filter with stride 2 and max pooling, we get the following response:
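Since the figure is not reproduced here, an equivalent NumPy sketch with a made-up 4 × 4 input:

    import numpy as np

    def max_pool_2x2(layer):
        # 2x2 max pooling with stride 2: take the maximum of each
        # non-overlapping 2x2 block.
        H, W = layer.shape
        return layer.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    layer = np.array([[1, 3, 2, 1],
                      [4, 6, 5, 2],
                      [7, 2, 1, 0],
                      [3, 8, 9, 4]])
    print(max_pool_2x2(layer))
    # [[6 5]
    #  [8 9]]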
Here you can see that each 2 × 2 block is reduced to its maximum value. Generally max pooling is used, but other options like average pooling can be considered. So overall, the pooling layer is used to progressively reduce the spatial size of the representation and thereby reduce the number of parameters and the amount of computation in the network.
Apart from dimensionality reduction, why is pooling used?
Max-pooling is useful in vision for two reasons:
1. By eliminating non-maximal values, it reduces computation for the upper layers.
2. It provides a form of translation invariance. Imagine cascading a max-pooling layer with a convolutional layer. There are 8 directions in which one can translate the input image by a single pixel. If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5 out of 8.
Hence it is a "smart" way of reducing the dimensionality of intermediate representations.
Fully Connected Layer
At the end of the convolution and pooling layers, networks generally use fully-connected layers, in which each pixel is treated as a separate neuron, just like in a regular neural network. The last fully-connected layer contains as many neurons as there are classes to be predicted. The only difference between an FC and a CONV layer is that the neurons in a CONV layer are connected to a local region in the input and share parameters; both layers still compute dot products, so it is possible to convert between the two forms.
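As a rough sketch (all shapes are illustrative), the transition from the last pooled volume to class scores is just a flatten followed by an ordinary dense layer:

    import numpy as np

    pooled = np.random.rand(4, 4, 12)     # hypothetical final pooled volume
    x = pooled.reshape(-1)                # every entry becomes a neuron (192,)
    W = np.random.rand(10, x.size)        # 10 classes -> 10 output neurons
    b = np.zeros(10)
    class_scores = W @ x + b              # the same dot product a CONV layer computes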
The tutorials presented here will introduce you to simple CNN algorithms and will also show you how to run them using Theano. Theano is a Python library that makes writing deep learning models easy and gives the option of training them on a GPU.
Training Complex CNN Architectures
Now that we have seen the various building blocks of a CNN and how to implement their forward pass, we will build and train a convolutional network on the CIFAR-10 dataset. We will use the popular LeNet architecture.
Load all necessary packages:
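A plausible set of imports for the Theano/LeNet code that follows (the exact list depends on the rest of the tutorial):

    import numpy as np

    import theano
    import theano.tensor as T
    from theano.tensor.nnet import conv2d           # convolution layer
    # In older Theano versions max pooling lives in
    # theano.tensor.signal.downsample instead.
    from theano.tensor.signal.pool import pool_2d   # max-pooling layer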