
Conv Nets: A Modular Perspective

In the last few years, deep neural networks have led to breakthrough results on a variety of pattern recognition problems, such as computer vision and voice recognition.

One of the essential components leading to these results has been a special kind of neural network called a convolutional neural network.

At its most basic, a convolutional neural network can be thought of as a neural network that uses many identical copies of the same neuron. This allows the network to have lots of neurons and express computationally large models while keeping the number of actual parameters – the values describing how neurons behave – that need to be learned fairly small.

This trick of having multiple copies of the same neuron is roughly analogous to the abstraction of functions in mathematics and computer science.

When programming, we write a function once and use it in many places – not writing the same code a hundred times in different places makes it faster to program, and results in fewer bugs.

Similarly, a convolutional neural network can learn a neuron once and use it in many places, making it easier to learn the model and reducing error.

A more sophisticated approach notices a kind of symmetry in the properties it’s useful to look for in the data.

It’s useful to know the frequencies at the beginning, it’s useful to know the frequencies in the middle, and it’s also useful to know the frequencies at the end.

So, we can create a group of neurons, \(A\), that look at small time segments of our data. \(A\) looks at all such segments, computing certain features.

They allow later convolutional layers to work on larger sections of the data, because a small patch after the pooling layer corresponds to a much larger patch before it.

In fact, the most famous successes of convolutional neural networks are applying 2D convolutional neural networks to recognizing images.

For example, in a 2-dimensional convolutional layer, one neuron might detect horizontal edges, another might detect vertical edges, and another might detect green-red color contrasts.

In the paper, the model achieves some very impressive results, setting new state of the art on a number of benchmark datasets.

Neurons on the other side specialize on color and texture, detecting color contrasts and patterns. Remember that the neurons are randomly initialized.

They were quickly followed by a lot of other work testing modified approaches and gradually improving the results, or applying them to other areas.

And, in addition to the neural networks community, many in the computer vision community have adopted deep convolutional neural networks.

Consider a 1-dimensional convolutional layer with inputs \(\{x_n\}\) and outputs \(\{y_n\}\). It’s relatively easy to describe the outputs in terms of the inputs: \[y_n = A(x_{n}, x_{n+1}, ...)\] For example, in the above: \[y_0 = A(x_0, x_1)\] \[y_1 = A(x_1, x_2)\]

Similarly, if we consider a 2-dimensional convolutional layer, with inputs \(\{x_{n,m}\}\) and outputs \(\{y_{n,m}\}\), we can, again, write down the outputs in terms of the inputs: \[y_{n,m} = A\left(\begin{array}{cc} x_{n,~m}, & x_{n+1,~m},\\ x_{n,~m+1}, & x_{n+1,~m+1}\\ \end{array}\right)\] For example: \[y_{1,0} = A\left(\begin{array}{cc} x_{1,~0}, & x_{2,~0},\\ x_{1,~1}, & x_{2,~1}\\ \end{array}\right)\] If one combines this with the equation for \(A(x)\), \[A(x) = \sigma(Wx + b)\] one has everything they need to implement a convolutional neural network, at least in theory.
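As a concrete illustration, here is a minimal numpy sketch of such a 1-dimensional convolutional layer built directly from these two equations. The window size of two, the number of features, and the particular choice of sigmoid for \(\sigma\) are assumptions made only for this example, and the function names are ours:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_layer(x, W, b):
    # Apply A(x) = sigmoid(W x + b) to every window of two consecutive inputs.
    # x: 1-D array of inputs {x_n}; W: (num_features, 2) shared weights; b: (num_features,) biases.
    windows = np.stack([x[:-1], x[1:]], axis=1)   # each row is (x_n, x_{n+1})
    return sigmoid(windows @ W.T + b)             # the same W and b are reused at every position

x = np.array([0.1, 0.5, -0.3, 0.8, 0.2])
W = np.random.randn(3, 2)                         # 3 learned features per window
b = np.zeros(3)
y = conv1d_layer(x, W, b)                         # y[0] = A(x[0], x[1]), y[1] = A(x[1], x[2]), ...
print(y.shape)                                    # (4, 3)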

Secondly, it will remove a lot of messiness from our formulation, handling all the bookkeeping presently showing up in the indexing of \(x\)s – the present formulation may not seem messy yet, but that’s only because we haven’t got into the tricky cases yet.

Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections.

In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32*32*3 = 3072 weights.

(Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively).

Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension.

During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position.

Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network.

If you’re a fan of the brain/neuron analogies, every entry in the 3D output volume can also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with all neurons to the left and right spatially (since these numbers all result from applying the same filter).

It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.

If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter).

We discuss these next: We can compute the spatial size of the output volume as a function of the input volume size (\(W\)), the receptive field size of the Conv Layer neurons (\(F\)), the stride with which they are applied (\(S\)), and the amount of zero padding used (\(P\)) on the border.

In general, setting zero padding to be \(P = (F - 1)/2\) when the stride is \(S = 1\) ensures that the input volume and output volume will have the same size spatially.

For example, when the input has size \(W = 10\), no zero-padding is used \(P = 0\), and the filter size is \(F = 3\), then it would be impossible to use stride \(S = 2\), since \((W - F + 2P)/S + 1 = (10 - 3 + 0)/2 + 1 = 4.5\), i.e. not an integer, indicating that the neurons don't "fit" neatly and symmetrically across the input.

Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of \(K = 96\), the Conv layer output volume had size [55x55x96].
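As a small sanity check, a helper along these lines (a sketch of ours, not code from the notes) evaluates \((W - F + 2P)/S + 1\) and flags strides that do not fit:

def conv_output_size(W, F, S, P):
    # Spatial output size of a conv layer: (W - F + 2P)/S + 1.
    numer = W - F + 2 * P
    if numer % S != 0:
        raise ValueError("stride does not fit: (W - F + 2P)/S + 1 is not an integer")
    return numer // S + 1

print(conv_output_size(227, 11, 4, 0))   # 55, matching the [55x55x96] example above
# conv_output_size(10, 3, 2, 0) raises, matching the (10 - 3 + 0)/2 + 1 = 4.5 case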

As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer.

It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: That if one feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2).

With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases).

In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice.

Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a convolution of the neuron’s weights with the input volume (Hence the name: Convolutional Layer).

The activation map in the output volume (call it V) would then be filled in position by position; only some of the elements need to be written out to see the pattern, and a sketch appears below. Remember that in numpy, the operation * denotes elementwise multiplication between the arrays.

To construct a second activation map in the output volume, we would index into the second depth dimension of V (at index 1), because we are computing the second activation map, and a different set of parameters (W1) is now used.
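The original numpy snippets are not reproduced here, but a minimal sketch of what such a computation could look like, using the example configuration described a little further below (a 5x5x3 input zero-padded with \(P = 1\) to 7x7x3, \(K = 2\) filters of size 3x3x3, stride \(S = 2\)) and made-up values, is:

import numpy as np

X = np.random.randn(7, 7, 3)      # a 5x5x3 input volume, zero-padded with P = 1 (values made up)
W0 = np.random.randn(3, 3, 3)     # first filter and its bias
b0 = 0.5
W1 = np.random.randn(3, 3, 3)     # second filter and its bias
b1 = -0.3
S = 2

V = np.zeros((3, 3, 2))           # output volume: (7 - 3)/2 + 1 = 3 spatially, depth K = 2
for i in range(3):
    for j in range(3):
        patch = X[i*S:i*S+3, j*S:j*S+3, :]        # the local 3x3x3 region of the input
        V[i, j, 0] = np.sum(patch * W0) + b0      # first activation map (depth slice 0)
        V[i, j, 1] = np.sum(patch * W1) + b1      # second activation map (depth slice 1)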

Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows.

The input volume is of size \(W_1 = 5, H_1 = 5, D_1 = 3\), and the CONV layer parameters are \(K = 2, F = 3, S = 2, P = 1\).

The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.

A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows: This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times in X_col.
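A rough numpy sketch of this im2col idea, restricted to a single-channel input and written with assumed names and shapes (it is not the notes' actual implementation), might look like:

import numpy as np

def conv_forward_im2col(X, filters, stride=1):
    # X: (H, W) input; filters: (K, F, F); returns an output volume of shape (K, H_out, W_out).
    H, W = X.shape
    K, F, _ = filters.shape
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1

    # Build X_col: one column per output position; note how input values get duplicated.
    cols = []
    for i in range(H_out):
        for j in range(W_out):
            patch = X[i*stride:i*stride+F, j*stride:j*stride+F]
            cols.append(patch.reshape(-1))
    X_col = np.stack(cols, axis=1)       # shape (F*F, H_out*W_out)

    W_row = filters.reshape(K, -1)       # each filter stretched into a row, shape (K, F*F)
    out = W_row @ X_col                  # the one big matrix multiply
    return out.reshape(K, H_out, W_out)

out = conv_forward_im2col(np.random.randn(7, 7), np.random.randn(2, 3, 3))
print(out.shape)                         # (2, 5, 5)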

For example, if you stack two 3x3 CONV layers on top of each other then you can convince yourself that the neurons on the 2nd layer are a function of a 5x5 patch of the input (we would say that the effective receptive field of these neurons is 5x5).
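A tiny helper (ours, and valid only when every layer uses stride 1) makes this effective receptive field arithmetic explicit:

def effective_receptive_field(filter_sizes):
    # With stride 1 everywhere, each conv layer grows the receptive field by F - 1.
    rf = 1
    for F in filter_sizes:
        rf += F - 1
    return rf

print(effective_receptive_field([3, 3]))      # 5: two stacked 3x3 layers see a 5x5 patch
print(effective_receptive_field([3, 3, 3]))   # 7: three stacked 3x3 layers see a 7x7 patch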

The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.

More generally, the pooling layer accepts a volume of some width, height and depth and produces a spatially downsampled volume of the same depth. It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: a pooling layer with \(F = 3, S = 2\) (also called overlapping pooling), and more commonly \(F = 2, S = 2\).

Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
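A compact sketch of 2x2, stride-2 max pooling on one depth slice, which also records the argmax 'switches' mentioned above (the names and shapes here are illustrative assumptions):

import numpy as np

def maxpool_2x2(X):
    # X: (H, W) depth slice with even H and W. Returns the pooled slice and the switches.
    H, W = X.shape
    out = np.zeros((H // 2, W // 2))
    switches = np.zeros((H // 2, W // 2), dtype=int)   # flat index of the max within each 2x2 window
    for i in range(H // 2):
        for j in range(W // 2):
            window = X[2*i:2*i+2, 2*j:2*j+2]
            out[i, j] = window.max()
            switches[i, j] = window.argmax()           # remembered so gradients can be routed back
    return out, switches

pooled, sw = maxpool_2x2(np.random.randn(4, 4))
print(pooled.shape)   # (2, 2): 75% of the activations have been discarded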

Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an AlexNet architecture that we’ll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7).

Note that instead of a single vector of class scores of size [1x1x1000], we’re now getting an entire 6x6 array of class scores across the 384x384 image.

Evaluating the original ConvNet (with FC layers) independently across 224x224 crops of the 384x384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time.

This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores.

For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height.

Second, if we suppose that all the volumes have \(C\) channels, then it can be seen that the single 7x7 CONV layer would contain \(C \times (7 \times 7 \times C) = 49 C^2\) parameters, while the three 3x3 CONV layers would only contain \(3 \times (C \times (3 \times 3 \times C)) = 27 C^2\) parameters.

I like to summarize this point as “don’t be a hero”: Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pretrained model and finetune it on your data.

The CONV layers should use small filters (e.g. 3x3 or at most 5x5), a stride of \(S = 1\), and, crucially, zero-padding of the input volume in such a way that the conv layer does not alter the spatial dimensions of the input.

In an alternative scheme where we use strides greater than 1 or don’t zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters “work out”, and that the ConvNet architecture is nicely and symmetrically wired.

If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be “washed away” too quickly.

For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 would create three activation volumes of size [224x224x64].

The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding).

We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights: As is common with Convolutional Networks, notice that most of the memory (and also compute time) is used in the early CONV layers, and that most of the parameters are in the last FC layers.

There are three major sources of memory to keep track of: the intermediate volume sizes (the activations and their gradients), the parameter sizes (the parameters, their gradients, and any optimizer caches), and miscellaneous memory such as the image data batches themselves. Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB.

Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB.
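A back-of-the-envelope helper following exactly this recipe (the 4-bytes-per-float assumption comes from the text above; the function name and the example count are ours):

def values_to_memory(num_values, bytes_per_value=4):
    # Convert a count of stored floats into KB, MB and GB.
    raw_bytes = num_values * bytes_per_value
    kb = raw_bytes / 1024
    mb = kb / 1024
    gb = mb / 1024
    return kb, mb, gb

print(values_to_memory(24_000_000))   # roughly 93,750 KB, about 92 MB, about 0.09 GB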

Convolutional neural network

In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks, most commonly applied to analyzing visual imagery.

CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.[1] They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.[2][3] Convolutional networks were inspired by biological processes[4] in that the connectivity pattern between neurons resembles the organization of the animal visual cortex.

A very high number of neurons would be necessary, even in a shallow (opposite of deep) architecture, due to the very large input sizes associated with images, where each pixel is a relevant variable.

The convolution operation brings a solution to this problem as it reduces the number of free parameters, allowing the network to be deeper with fewer parameters.[8] For instance, regardless of image size, tiling regions of size 5 x 5, each with the same shared weights, requires only 25 learnable parameters.

In this way, it resolves the vanishing or exploding gradients problem in training traditional multi-layer neural networks with many layers by using backpropagation.

Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters at one layer into a single neuron in the next layer.[9][10] For example, max pooling uses the maximum value from each of a cluster of neurons at the prior layer.[11] Another example is average pooling, which uses the average value from each of a cluster of neurons at the prior layer.

Work by Hubel and Wiesel in the 1950s and 1960s showed that cat and monkey visual cortexes contain neurons that individually respond to small regions of the visual field.

Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field.

Their 1968 paper[12] identified two basic visual cell types in the brain: simple cells and complex cells. The neocognitron[13] was introduced in 1980.[11][14] The neocognitron does not require units located at multiple network positions to have the same trainable weights.

This idea appears in 1986 in the book version of the original backpropagation paper.[15]:Figure 14 Neocognitrons were developed in 1988 for temporal signals.[16] Their design was improved in 1998,[17] generalized in 2003[18] and simplified in the same year.[19] LeNet-5, a pioneering 7-level convolutional network by LeCun et al. from 1998 that classifies digits, was applied by several banks to recognize hand-written numbers on checks.

Similarly, a shift invariant neural network was proposed for image character recognition in 1988.[2][3] The architecture and training algorithm were modified in 1991[20] and applied for medical image processing[21] and automatic detection of breast cancer in mammograms.[22] A different convolution-based design was proposed in 1988 for decomposing one-dimensional electromyography signals via de-convolution.

This design was modified in 1989 to other de-convolution-based designs.[24][25] The feed-forward architecture of convolutional neural networks was extended in the neural abstraction pyramid[26] by lateral and feedback connections.

Following the 2005 paper that established the value of GPGPU for machine learning,[27] several publications described more efficient ways to train convolutional neural networks using GPUs.[28][29][30][31] In 2011, they were refined and implemented on a GPU, with impressive results.[9] In 2012, Ciresan et al.

significantly improved on the best performance in the literature for multiple image databases, including the MNIST database, the NORB database, the HWDB1.0 dataset (Chinese characters), the CIFAR10 dataset (dataset of 60000 32x32 labeled RGB images),[11] and the ImageNet dataset.[32] While traditional multilayer perceptron (MLP) models were successfully used for image recognition, due to the full connectivity between nodes they suffer from the curse of dimensionality, and thus do not scale well to higher resolution images.

For example, in CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully connected neuron in a first hidden layer of a regular neural network would have 32*32*3 = 3,072 weights.

Also, such network architecture does not take into account the spatial structure of data, treating input pixels which are far apart in the same way as pixels that are close together.

Weight sharing dramatically reduces the number of free parameters learned, thus lowering the memory requirements for running the network and allowing the training of larger, more powerful networks.

The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume.

During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter.

Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.

When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume because such a network architecture does not take the spatial structure of the data into account.

Convolutional networks exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume.

Since all neurons in a single depth slice share the same parameters, then the forward pass in each depth slice of the CONV layer can be computed as a convolution of the neuron's weights with the input volume (hence the name: convolutional layer).

The result of this convolution is an activation map, and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume.

The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters and amount of computation in the network, and hence to also control overfitting.

The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.

Average pooling was often used historically but has recently fallen out of favor compared to max pooling, which works better in practice.[34] Due to the aggressive reduction in the size of the representation, the trend is towards using smaller filters[35] or discarding the pooling layer altogether.[36] Region of Interest pooling (also known as RoI pooling) is a variant of max pooling, in which output size is fixed and input rectangle is a parameter.[37] Pooling is an important component of convolutional neural networks for object detection based on Fast R-CNN[38] architecture.

Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.

In stochastic pooling,[43] the conventional deterministic pooling operations are replaced with a stochastic procedure, where the activation within each pooling region is picked randomly according to a multinomial distribution, given by the activities within the pooling region.

Since the degree of model overfitting is determined by both its power and the amount of training it receives, providing a convolutional network with more training examples can reduce overfitting.

Since these networks are usually trained with all available data, one approach is to either generate new data from scratch (if possible) or perturb existing data to create new ones.

For example, input images could be asymmetrically cropped by a few percent to create new examples with the same label as the original.[45] One of the simplest methods to prevent overfitting of a network is to simply stop the training before overfitting has had a chance to occur.

Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth.

Limiting the number of parameters restricts the predictive power of the network directly, reducing the complexity of the function that it can perform on the data, and thus limits the amount of overfitting.

A simple form of added regularizer is weight decay, which simply adds an additional error, proportional to the sum of weights (L1 norm) or squared magnitude (L2 norm) of the weight vector, to the error at each node.

Humans, by contrast, after seeing a new shape once can recognize it from a different viewpoint.[47] Currently, the common way to deal with this problem is to train the network on transformed data in different orientations, scales, lighting, etc.

The pose relative to retina is the relationship between the coordinate frame of the retina and the intrinsic features' coordinate frame.[48] Thus, one way of representing something is to embed the coordinate frame within it.

The vectors of neuronal activity that represent pose ('pose vectors') allow spatial transformations modeled as linear operations that make it easier for the network to learn the hierarchy of visual entities and generalize across viewpoints.

In 2012 an error rate of 0.23 percent on the MNIST database was reported.[11] Another paper on using CNN for image classification reported that the learning process was 'surprisingly fast';

in the same paper, the best published results as of 2011 were achieved in the MNIST database and the NORB database.[9] When applied to facial recognition, CNNs achieved a large decrease in error rate.[50] Another paper reported a 97.6 percent recognition rate on '5,600 still images of more than 10 subjects'.[4] CNNs were used to assess video quality in an objective way after manual training;

the resulting system had a very low root mean square error.[51] The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object classification and detection, with millions of images and hundreds of object classes.

The winner GoogLeNet[53] (the foundation of DeepDream) increased the mean average precision of object detection to 0.439329, and reduced classification error to 0.06656, the best result to date.

That performance of convolutional neural networks on the ImageNet tests was close to that of humans.[54] The best algorithms still struggle with objects that are small or thin, such as a small ant on a stem of a flower or a person holding a quill in their hand.

For example, humans are not good at classifying objects into fine-grained categories such as the particular breed of dog or species of bird, whereas convolutional neural networks handle this well.

One approach is to treat space and time as equivalent dimensions of the input and perform convolutions in both time and space.[56][57] Another way is to fuse the features of two convolutional neural networks, one for the spatial and one for the temporal stream [58][59][60].

CNN models are effective for various NLP problems and achieved excellent results in semantic parsing,[63] search query retrieval,[64] sentence modeling,[65] classification,[66] prediction[67] and other traditional NLP tasks.[68] CNNs have been used in drug discovery.

In 2015, Atomwise introduced AtomNet, the first deep learning neural network for structure-based rational drug design.[69] The system trains directly on 3-dimensional representations of chemical interactions.

Similar to how image recognition networks learn to compose smaller, spatially proximate features into larger, complex structures,[70] AtomNet discovers chemical features, such as aromaticity, sp3 carbons and hydrogen bonding.

Subsequently, AtomNet was used to predict novel candidate biomolecules for multiple disease targets, most notably treatments for the Ebola virus[71] and multiple sclerosis.[72] CNNs have been used in the game of checkers.

The learning process did not use prior human professional games, but rather focused on a minimal set of information contained in the checkerboard: the location and type of pieces, and the piece differential.

Ultimately, the program (Blondie24) was tested on 165 games against players and ranked in the highest 0.4%.[73][74] It also earned a win against the program Chinook at its 'expert' level of play.[75] CNNs have been used in computer Go.

In December 2014, Clark and Storkey published a paper showing that a CNN trained by supervised learning from a database of human professional games could outperform GNU Go and win some games against Monte Carlo tree search Fuego 1.1 in a fraction of the time it took Fuego to play.[76] Later it was announced that a large 12-layer convolutional neural network had correctly predicted the professional move in 55% of positions, equalling the accuracy of a 6 dan human player.

When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GNU Go in 97% of games, and matched the performance of the Monte Carlo tree search program Fuego simulating ten thousand playouts (about a million positions) per move.[77]

A couple of CNNs for choosing moves to try ('policy network') and evaluating positions ('value network') driving MCTS were used by AlphaGo, the first program to beat the best human player at the time.[78] For many applications, little training data is available.

Other deep reinforcement learning models preceded it.[81] Convolutional deep belief networks (CDBN) have structure very similar to convolutional neural networks and are trained similarly to deep belief networks.

A time delay neural network allows speech signals to be processed time-invariantly, analogous to the translation invariance offered by CNNs.[84] They were introduced in the early 1980s.

In the section on linear classification we computed scores for different visual categories given the image using the formula \( s = W x \), where \(W\) was a matrix and \(x\) was an input column vector containing all pixel data of the image.

In the case of CIFAR-10, \(x\) is a [3072x1] column vector, and \(W\) is a [10x3072] matrix, so that the output scores is a vector of 10 class scores.

There are several choices we could make for the non-linearity (which we’ll study below), but this one is a common choice and simply thresholds all activations that are below zero to zero.

Notice that the non-linearity is critical computationally - if we left it out, the two matrices could be collapsed to a single matrix, and therefore the predicted class scores would again be a linear function of the input.

A three-layer neural network could analogously look like \( s = W_3 \max(0, W_2 \max(0, W_1 x)) \), where all of \(W_3, W_2, W_1\) are parameters to be learned.
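A quick numerical sketch of both points, with made-up small layer sizes: without the \(\max(0, \cdot)\) non-linearity the two matrices collapse into one, and with it we obtain the two- and three-layer score functions just described.

import numpy as np

np.random.seed(0)
x = np.random.randn(3072, 1)              # a CIFAR-10-sized input column vector
W1 = np.random.randn(100, 3072) * 0.01    # hidden and output sizes here are illustrative
W2 = np.random.randn(10, 100) * 0.01

# Without the non-linearity, the two layers collapse into a single linear map:
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# Two-layer network: s = W2 max(0, W1 x)
s2 = W2 @ np.maximum(0, W1 @ x)

# Three-layer network: s = W3 max(0, W2 max(0, W1 x)), with fresh intermediate matrices
W2b = np.random.randn(50, 100) * 0.01
W3 = np.random.randn(10, 50) * 0.01
s3 = W3 @ np.maximum(0, W2b @ np.maximum(0, W1 @ x))
print(s2.shape, s3.shape)                 # (10, 1) (10, 1): ten class scores each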

The area of Neural Networks was originally inspired primarily by the goal of modeling biological neural systems, but has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks.

Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 synapses.

The idea is that the synaptic strengths (the weights \(w\)) are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another.

Based on this rate code interpretation, we model the firing rate of the neuron with an activation function \(f\), which represents the frequency of the spikes along the axon.

Historically, a common choice of activation function is the sigmoid function \(\sigma\), since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1.

Example code for forward-propagating a single neuron is sketched below: each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function), in this case the sigmoid \(\sigma(x) = 1/(1+e^{-x})\).
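A minimal numpy sketch of such a neuron, assuming 1-D arrays for the inputs and weights and a scalar bias (the class and attribute names here are ours):

import numpy as np

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights            # 1-D array of synaptic strengths w
        self.bias = bias                  # scalar bias b

    def forward(self, inputs):
        # dot product of inputs and weights, plus bias, then the sigmoid non-linearity
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))
        return firing_rate

n = Neuron(weights=np.array([0.2, -0.5, 0.1]), bias=0.05)
print(n.forward(np.array([1.0, 2.0, 3.0])))   # a single activation between 0 and 1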

As we saw with linear classifiers, a neuron has the capacity to “like” (activation near one) or “dislike” (activation near zero) certain linear regions of its input space.

With this interpretation, we can formulate the cross-entropy loss as we have seen in the Linear Classification section, and optimizing it would lead to a binary Softmax classifier (also known as logistic regression).

The regularization loss in both SVM/Softmax cases could in this biological view be interpreted as gradual forgetting, since it would have the effect of driving all synaptic weights \(w\) towards zero after every parameter update.

The sigmoid non-linearity has the mathematical form \(\sigma(x) = 1 / (1 + e^{-x})\) and is shown in the image above on the left.

The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1).

Also note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: \( \tanh(x) = 2 \sigma(2x) -1 \).

Other types of units have been proposed that do not have the functional form \(f(w^Tx + b)\) where a non-linearity is applied on the dot product between the weights and the data.

TLDR: “What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network.

For regular neural networks, the most common layer type is the fully-connected layer in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections.

Working with the two example networks in the above picture: To give you some context, modern Convolutional Networks contain on the order of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning).

The full forward pass of this 3-layer neural network is then simply three matrix multiplications, interwoven with the application of the activation function, as sketched in the code below, where W1,W2,W3,b1,b2,b3 are the learnable parameters of the network.
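A minimal numpy sketch of this forward pass; the layer sizes and the sigmoid choice for the activation function are assumptions made only for illustration:

import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))    # activation function (sigmoid here)

# assumed layer sizes: 3072 inputs -> 100 -> 50 -> 10 class scores
W1, b1 = np.random.randn(100, 3072) * 0.01, np.zeros((100, 1))
W2, b2 = np.random.randn(50, 100) * 0.01, np.zeros((50, 1))
W3, b3 = np.random.randn(10, 50) * 0.01, np.zeros((10, 1))

x = np.random.randn(3072, 1)              # an input column vector (or a whole batch as columns)
h1 = f(W1 @ x + b1)                       # first hidden layer activations
h2 = f(W2 @ h1 + b2)                      # second hidden layer activations
scores = W3 @ h2 + b3                     # output scores: no activation function on the last layer
print(scores.shape)                       # (10, 1)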

Notice also that instead of having a single input column vector, the variable x could hold an entire batch of training data (where each input example would be a column of x) and then all examples would be efficiently evaluated in parallel.

Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and are also easy to learn using our optimization algorithms (e.g. gradient descent).

Similarly, the fact that deeper networks (with multiple hidden layers) can work better than single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.

As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more.

We could train three separate neural networks, each with one hidden layer of some size and obtain the following classifiers: In the diagram above, we can see that Neural Networks with more neurons can express more complicated functions.

For example, the model with 20 hidden neurons fits all the training data but at the cost of segmenting the space into many disjoint red and green decision regions.

The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It’s clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. they have high loss).

Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss.

In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima.

The best explanation of Convolutional Neural Networks on the Internet!

CNNs have wide applications in image and video recognition, recommender systems and natural language processing.

Each neuron receives several inputs, takes a weighted sum over them, passes it through an activation function and responds with an output.

Unlike regular neural networks, where the input is a vector, here the input is a multi-channel image (3 channels in this case).

We take the 5*5*3 filter and slide it over the complete image and along the way take the dot product between the filter and chunks of the input image.

(Hint: There are 28*28 unique positions where the filter can be put on the image.) The convolution layer is the main building block of a convolutional neural network.

For a particular feature map (the output received on convolving the image with a particular filter is called a feature map), each neuron is connected only to a small chunk of the input image and all the neurons have the same connection weights.

Local connectivity is the concept of each neuron being connected only to a subset of the input image (unlike a regular neural network where all the neurons are fully connected). This helps to reduce the number of parameters in the whole system and makes the computation more efficient.

Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks and work in a similar way.


Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification.

In Figure 1 above, a ConvNet is able to recognize scenes and the system is able to suggest relevant captions (“a soccer player is kicking a soccer ball”) while Figure 2 shows an example of ConvNets being used for recognizing everyday objects, humans and animals.

There have been several new architectures proposed in the recent years which are improvements over the LeNet, but they all use the main concepts from the LeNet and are relatively easier to understand if you have a clear understanding of the former.

The Convolutional Neural Network in Figure 3 is similar in architecture to the original LeNet and classifies an input image into four categories: dog, cat, boat or bird (the original LeNet was used mainly for character recognition tasks).

As evident from the figure above, on receiving a boat image as input, the network correctly assigns the highest probability for boat (0.94) among all four categories.

There are four main operations in the ConvNet shown in Figure 3 above: These operations are the basic building blocks of every Convolutional Neural Network, so understanding how these work is an important step to developing a sound understanding of ConvNets.

The primary purpose of Convolution in case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. We will not go into the mathematical details of Convolution here, but will try to understand how it works over images.

Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255, the green matrix below is a special case where pixel values are only 0 and 1):

Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation in Figure 5 below: Take a moment to understand how the computation above is being done.

We slide the orange matrix over our original image (green) by 1 pixel (also called ‘stride’) and for every position, we compute element wise multiplication (between the two matrices) and add the multiplication outputs to get the final integer which forms a single element of the output matrix (pink).
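A small numpy sketch of exactly this sliding computation; the 5x5 binary image and the 3x3 filter values below are made up for illustration:

import numpy as np

image = np.array([[1, 1, 1, 0, 0],        # a 5x5 image with 0/1 pixel values
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],             # the 3x3 filter slid over the image with stride 1
                   [0, 1, 0],
                   [1, 0, 1]])

out = np.zeros((3, 3), dtype=int)         # (5 - 3)/1 + 1 = 3 valid positions per dimension
for i in range(3):
    for j in range(3):
        window = image[i:i+3, j:j+3]
        out[i, j] = np.sum(window * kernel)   # element-wise multiply, then sum
print(out)                                # the 3x3 convolved feature (feature map)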

In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as the number of filters, filter size, architecture of the network, etc.).

In case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window.

In the network shown in Figure 11, pooling operation is applied separately to each feature map (notice that, due to this, we get three output maps from three input maps).

Together these layers extract the useful features from the images, introduce non-linearity in our network and reduce feature dimension while aiming to make the features somewhat equivariant to scale and translation [18].

The output from the convolutional and pooling layers represent high-level features of the input image. The purpose of the Fully Connected layer is to use these features for classifying the input image into various classes based on the training dataset.

For example, the image classification task we set out to perform has four possible outputs as shown in Figure 14 below (note that Figure 14 does not show connections between the nodes in the fully connected layer)

This is ensured by using the Softmax as the activation function in the output layer of the Fully Connected Layer. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
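A small sketch of the Softmax function described here (subtracting the maximum first is a standard numerical-stability trick, not something required by the definition):

import numpy as np

def softmax(scores):
    # Squash arbitrary real-valued scores into values in (0, 1) that sum to one.
    shifted = scores - np.max(scores)     # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, 0.1, -1.0])))   # four probabilities summing to 1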

When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples).

In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize. For example, in Image Classification a ConvNet may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes in higher layers [14].

These features were learnt using a Convolutional Deep Belief Network and the figure is included here just for demonstrating the idea (this is only an example: real life convolution filters may detect objects that have no meaning to humans).

The input image contains 1024 pixels (32 x 32 image) and the first Convolution layer (Convolution Layer 1) is formed by convolution of six unique 5 × 5 (stride 1) filters with the input image.

For a more thorough understanding of some of these concepts, I would encourage you to go through the notes from Stanford’s course on ConvNets as well as other excellent resources mentioned under References below. If you face any issues understanding any of the above concepts or have questions / suggestions, feel free to leave a comment below.

Convolutional Neural Networks - Ep. 8 (Deep Learning SIMPLIFIED)

Out of all the current Deep Learning applications, machine vision remains one of the most popular. Since Convolutional Neural Nets (CNN) are one of the best ...

Neural Networks 8: hidden units = features

How Convolutional Neural Networks work

A gentle guided tour of Convolutional Neural Networks. Come lift the curtain and see how the magic is done. For slides and text, check out the accompanying ...

Layers in a Neural Network explained

In this video, we explain the concept of layers in a neural network and show how to create and specify layers in code with Keras. Follow deeplizard on Twitter: ...

Deep Learning - Choosing Network Size

How many nodes and layers do we need? We combine elements of scikit learn and Keras Neural Nets in this lesson.

CS231n Lecture 7 - Convolutional Neural Networks

Convolutional Neural Networks: architectures, convolution / pooling layers Case study of ImageNet challenge winning ConvNets.

Recurrent Neural Networks - Ep. 9 (Deep Learning SIMPLIFIED)

Our previous discussions of deep net applications were limited to static patterns, but how can a net decipher and label patterns that change with time?

10.4: Neural Networks: Multilayer Perceptron Part 1 - The Nature of Code

In this video, I move beyond the Simple Perceptron and discuss what happens when you build multiple layers of interconnected perceptrons ("fully-connected ...

Lecture 9 | CNN Architectures

In Lecture 9 we discuss some common architectures for convolutional neural networks. We discuss architectures which performed well in the ImageNet ...

Neural networks [9.5] : Computer vision - pooling and subsampling