# AI News, Must Know Tips/Tricks in Deep Neural Networks

## Must Know Tips/Tricks in Deep Neural Networks

Deep Neural Networks, especially Convolutional Neural Networks (CNNs), allow computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.

These methods have dramatically improved the state of the art in visual object recognition, object detection, text recognition and many other domains such as drug discovery and genomics.

Thus, they collected and summarized many implementation details for DCNNs. Here they introduce these implementation details, i.e., tricks or tips, for building and training your own deep networks.

## CS231n Convolutional Neural Networks for Visual Recognition

Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections.

In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32*32*3 = 3072 weights.

(Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively).

Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension.

During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position.

Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network.

If you’re a fan of the brain/neuron analogies, every entry in the 3D output volume can also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with all neurons to the left and right spatially (since these numbers all result from applying the same filter).

It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.

If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter).

We discuss these next: We can compute the spatial size of the output volume as a function of the input volume size ($$W$$), the receptive field size of the Conv Layer neurons ($$F$$), the stride with which they are applied ($$S$$), and the amount of zero padding used ($$P$$) on the border.

In general, setting zero padding to be $$P = (F - 1)/2$$ when the stride is $$S = 1$$ ensures that the input volume and output volume will have the same size spatially.

For example, when the input has size $$W = 10$$, no zero-padding is used ($$P = 0$$), and the filter size is $$F = 3$$, then it would be impossible to use stride $$S = 2$$, since $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$, i.e. not an integer.

Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of $$K = 96$$, the Conv layer output volume had size [55x55x96].

As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer.
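The output-size arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper names `conv_output_size` and `same_padding` are made up for illustration and do not come from the original notes.

```python
from fractions import Fraction

def conv_output_size(W, F, S=1, P=0):
    """Spatial output size (W - F + 2P)/S + 1, or None if it is not an integer."""
    size = Fraction(W - F + 2 * P, S) + 1
    return int(size) if size.denominator == 1 else None

def same_padding(F):
    """Zero padding P = (F - 1)/2 that preserves spatial size at stride 1 (odd F)."""
    return (F - 1) // 2

print(conv_output_size(10, 3, S=2, P=0))    # None: 4.5 is not an integer
print(conv_output_size(227, 11, S=4, P=0))  # 55
print(conv_output_size(224, 11, S=4, P=0))  # None: 54.25 is not an integer
print(conv_output_size(32, 5, S=1, P=same_padding(5)))  # 32
```

Returning `None` for non-integer results makes an invalid combination of stride, filter size and padding visible immediately instead of silently rounding.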

It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: That if one feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2).

With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases).
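The parameter counts can be reproduced directly. The sketch below is only arithmetic; `conv_params` is a hypothetical helper, and the unshared count assumes the [55x55x96] output volume mentioned earlier, with one private filter and bias per neuron.

```python
def conv_params(num_filters, F, depth):
    """Weights and biases of a CONV layer with parameter sharing."""
    weights = num_filters * F * F * depth
    biases = num_filters
    return weights, biases

weights, biases = conv_params(num_filters=96, F=11, depth=3)
print(weights, weights + biases)  # 34848 34944

# Without sharing, every neuron in the 55x55x96 volume keeps its own 11*11*3 weights + 1 bias.
unshared = 55 * 55 * 96 * (11 * 11 * 3 + 1)
print(unshared)  # 105705600
```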

In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice.

Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a convolution of the neuron’s weights with the input volume (Hence the name: Convolutional Layer).

The activation map in the output volume (call it V), would then look as follows (only some of the elements are computed in this example): Remember that in numpy, the operation * above denotes elementwise multiplication between the arrays.

To construct a second activation map in the output volume, we would have: where we see that we are indexing into the second depth dimension in V (at index 1) because we are computing the second activation map, and that a different set of parameters (W1) is now used.

Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows.

The input volume is of size $$W_1 = 5, H_1 = 5, D_1 = 3$$, and the CONV layer parameters are $$K = 2, F = 3, S = 2, P = 1$$.

The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.
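As a rough illustration of what that visualization computes, here is a naive NumPy forward pass using the same sizes ($$W_1 = H_1 = 5, D_1 = 3$$, $$K = 2, F = 3, S = 2, P = 1$$). The function name, the (height, width, depth) array layout and the random values are assumptions made for this sketch.

```python
import numpy as np

def conv_forward_naive(X, W, b, stride, pad):
    """X: (H, W_in, D) input; W: (K, F, F, D) filters; b: (K,) biases.
    Returns the output volume of shape (H_out, W_out, K)."""
    H, W_in, D = X.shape
    K, F = W.shape[0], W.shape[1]
    # Zero-pad width and height only; depth is never padded.
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    H_out = (H - F + 2 * pad) // stride + 1
    W_out = (W_in - F + 2 * pad) // stride + 1
    V = np.zeros((H_out, W_out, K))
    for k in range(K):
        for i in range(H_out):
            for j in range(W_out):
                patch = Xp[i * stride:i * stride + F, j * stride:j * stride + F, :]
                # Elementwise multiply, sum, then offset by the bias.
                V[i, j, k] = np.sum(patch * W[k]) + b[k]
    return V

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(5, 5, 3)).astype(float)
W = rng.integers(-1, 2, size=(2, 3, 3, 3)).astype(float)
b = np.array([1.0, 0.0])
V = conv_forward_naive(X, W, b, stride=2, pad=1)
print(V.shape)  # (3, 3, 2)
```

The triple loop is deliberately explicit; it mirrors the description rather than aiming for speed.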

A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows: This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times in X_col.
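The matrix-multiply formulation can be sketched as follows. `X_col` is the name used in the text; the helper names `im2col` and `conv_forward_im2col`, and the array layouts, are assumptions for illustration rather than any library's API.

```python
import numpy as np

def im2col(X, F, stride, pad):
    """Stretch every F x F x D receptive field of X (H, W, D) into one column."""
    H, W, D = X.shape
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    H_out = (H - F + 2 * pad) // stride + 1
    W_out = (W - F + 2 * pad) // stride + 1
    X_col = np.empty((F * F * D, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = Xp[i * stride:i * stride + F, j * stride:j * stride + F, :]
            X_col[:, i * W_out + j] = patch.ravel()
    return X_col, H_out, W_out

def conv_forward_im2col(X, W, b, stride, pad):
    """W: (K, F, F, D) filters, b: (K,). One big matrix multiply does the convolution."""
    K, F = W.shape[0], W.shape[1]
    X_col, H_out, W_out = im2col(X, F, stride, pad)
    W_row = W.reshape(K, -1)              # (K, F*F*D), same flattening order as the patches
    out = W_row @ X_col + b[:, None]      # (K, H_out*W_out)
    return out.reshape(K, H_out, W_out).transpose(1, 2, 0)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5, 3))
W = rng.standard_normal((2, 3, 3, 3))
b = rng.standard_normal(2)
X_col, _, _ = im2col(X, F=3, stride=2, pad=1)
out = conv_forward_im2col(X, W, b, stride=2, pad=1)
print(X_col.shape, out.shape)  # (27, 9) (3, 3, 2)
```

With overlapping receptive fields the same input value is copied into several columns, which is where the extra memory goes.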

For example, if you stack two 3x3 CONV layers on top of each other then you can convince yourself that the neurons on the 2nd layer are a function of a 5x5 patch of the input (we would say that the effective receptive field of these neurons is 5x5).
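The effective receptive field of a stack of CONV layers follows a simple recurrence, sketched here with a hypothetical helper (not from the original notes). Each layer widens the field by (F - 1) times the product of the strides of the layers below it.

```python
def effective_receptive_field(filter_sizes, strides=None):
    """Receptive field, in input pixels, of one neuron at the top of a CONV stack."""
    strides = strides or [1] * len(filter_sizes)
    rf, jump = 1, 1
    for F, S in zip(filter_sizes, strides):
        rf += (F - 1) * jump   # each layer widens the field by (F - 1) input-space steps
        jump *= S              # distance in input pixels between adjacent neurons
    return rf

print(effective_receptive_field([3, 3]))     # 5
print(effective_receptive_field([3, 3, 3]))  # 7
```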

The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.

It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: a pooling layer with $$F = 3, S = 2$$ (also called overlapping pooling), and more commonly $$F = 2, S = 2$$.

Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
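A minimal NumPy sketch of this bookkeeping, for a single depth slice, might look as follows. The function names and the choice to store switches as flat indices into the input are assumptions for illustration.

```python
import numpy as np

def maxpool_forward(x, F=2, S=2):
    """x: (H, W) depth slice. Returns the pooled output and the switches
    (flat index into x of the max element of each window)."""
    H, W = x.shape
    H_out, W_out = (H - F) // S + 1, (W - F) // S + 1
    out = np.zeros((H_out, W_out))
    switches = np.zeros((H_out, W_out), dtype=int)
    for i in range(H_out):
        for j in range(W_out):
            window = x[i * S:i * S + F, j * S:j * S + F]
            idx = int(np.argmax(window))      # position inside the window, row-major
            r, c = divmod(idx, F)
            out[i, j] = window[r, c]
            switches[i, j] = (i * S + r) * W + (j * S + c)
    return out, switches

def maxpool_backward(dout, switches, input_shape):
    """Route each upstream gradient to the input position that produced the max."""
    dx = np.zeros(input_shape)
    # add.at accumulates correctly even when windows overlap (e.g. F=3, S=2).
    np.add.at(dx.reshape(-1), switches.ravel(), dout.ravel())
    return dx

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 1., 2., 3.],
              [0., 5., 4., 1.]])
out, switches = maxpool_forward(x)
dx = maxpool_backward(np.ones_like(out), switches, x.shape)
print(out)       # [[4. 8.] [9. 4.]]
print(switches)  # [[ 5  7] [ 8 14]]
```

Only 4 of the 16 input positions receive gradient here, which matches the 75% of activations discarded by 2x2 pooling with stride 2.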

Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an AlexNet architecture that we’ll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7).

Note that instead of a single vector of class scores of size [1x1x1000], we’re now getting an entire 6x6 array of class scores across the 384x384 image.

Evaluating the original ConvNet (with FC layers) independently across 224x224 crops of the 384x384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time.
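The 6x6 figure can be sanity-checked from two directions. The sketch below assumes, as the text implies, a total downsampling factor of 32 and a first FC layer converted into a 7x7 CONV filter applied with stride 1; the helper name is made up.

```python
def num_positions(size, window, stride):
    """How many times a window fits along one dimension at a given stride."""
    return (size - window) // stride + 1

# View 1: slide a 224x224 crop over the 384x384 image in strides of 32 pixels.
crops_per_side = num_positions(384, 224, 32)

# View 2: forward the whole image once. 384/32 = 12, so the final CONV volume
# is 12x12x512, and the 7x7 "FC" filter slides over it with stride 1.
feature_side = 384 // 32
scores_per_side = num_positions(feature_side, 7, 1)

print(crops_per_side, feature_side, scores_per_side)  # 6 12 6
```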

This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores.

For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height.

Second, if we suppose that all the volumes have $$C$$ channels, then it can be seen that the single 7x7 CONV layer would contain $$C \times (7 \times 7 \times C) = 49 C^2$$ parameters, while the three 3x3 CONV layers would only contain $$3 \times (C \times (3 \times 3 \times C)) = 27 C^2$$ parameters.
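A quick numeric check of that comparison, ignoring biases as the text does. The value $$C = 64$$ and the helper name are arbitrary choices for illustration.

```python
def conv_stack_params(F, C, num_layers=1):
    """Weights in num_layers stacked F x F CONV layers with C input and C output channels."""
    return num_layers * (C * (F * F * C))

C = 64
single_7x7 = conv_stack_params(7, C)                # 49 * C^2
three_3x3 = conv_stack_params(3, C, num_layers=3)   # 27 * C^2
print(single_7x7, three_3x3)  # 200704 110592
```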

I like to summarize this point as “don’t be a hero”: Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pretrained model and finetune it on your data.

3x3 or at most 5x5), using a stride of $$S = 1$$, and crucially, padding the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input.

In an alternative scheme where we use strides greater than 1 or don’t zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters “work out”, and that the ConvNet architecture is nicely and symmetrically wired.

If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be “washed away” too quickly.

For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 would create three activation volumes of size [224x224x64].

The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding).

We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights: As is common with Convolutional Networks, notice that most of the memory (and also compute time) is used in the early CONV layers, and that most of the parameters are in the last FC layers.

There are three major sources of memory to keep track of: Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB.

Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB.
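The conversion is easy to script. As a worked example, the sketch below uses the three [224x224x64] activation volumes mentioned earlier; the function name is hypothetical.

```python
def memory_footprint(num_values, bytes_per_value=4):
    """Convert a count of stored numbers into bytes, KB, MB and GB.
    Use bytes_per_value=8 for double precision."""
    raw = num_values * bytes_per_value
    return {"bytes": raw, "KB": raw / 1024, "MB": raw / 1024**2, "GB": raw / 1024**3}

num_values = 3 * 224 * 224 * 64          # three activation volumes
usage = memory_footprint(num_values)
print(num_values, usage["bytes"], usage["MB"])  # 9633792 38535168 36.75
```

This counts one copy of the activations for a single image; gradients and minibatches multiply the figure accordingly.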

## Must Know Tips/Tricks in Deep Neural Networks (by Xiu-Shen Wei)

We assume you already have basic knowledge of deep learning, and here we present the implementation details (tricks or tips) in Deep Neural Networks, especially CNN for image-related tasks, mainly in eight aspects: 1) data augmentation;

Since deep networks need to be trained on a huge number of training images to achieve satisfactory performance, if the original image data set contains limited training images, it is better to do data augmentation to boost the performance.

It only makes sense to apply this pre-processing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm.

In case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional pre-processing step.

Then, you can compute the covariance matrix, which tells us about the correlation structure in the data. After that, you decorrelate the data by projecting the original (but zero-centered) data into the eigenbasis. The last transformation is whitening, which takes the data in the eigenbasis and divides every dimension by the square root of its eigenvalue to normalize the scale. Note that a small constant (e.g. 1e-5) is added to prevent division by zero.

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.
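The PCA and whitening steps can be sketched in NumPy as follows. The synthetic data and variable names are assumptions made for illustration. Note that the division is by the square root of each eigenvalue (the standard deviation along that direction), which is what brings every dimension to unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))   # N x D toy data, one example per row
X[:, 1] += 0.8 * X[:, 0]            # introduce correlation between features
X[:, 2] *= 3.0                      # and a different scale

X = X - X.mean(axis=0)              # zero-center every feature
cov = X.T @ X / X.shape[0]          # D x D covariance matrix
U, S, _ = np.linalg.svd(cov)        # columns of U: eigenvectors, S: eigenvalues
Xrot = X @ U                        # decorrelate: project onto the eigenbasis
Xwhite = Xrot / np.sqrt(S + 1e-5)   # whiten; 1e-5 guards against division by zero

print(np.round(Xwhite.T @ Xwhite / X.shape[0], 3))  # approximately the identity matrix
```

Directions with tiny eigenvalues get divided by tiny numbers, so any noise living in them is amplified. A larger smoothing constant than 1e-5 is one common way to soften that effect.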

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

It turns out that you can normalize the variance of each neuron's output to 1 by dividing its weight vector by the square root of its fan-in (i.e., its number of inputs), that is, w = randn(n) / sqrt(n), where “randn” is the aforementioned Gaussian and “n” is the number of its inputs.

[4] derives an initialization specifically for ReLUs, reaching the conclusion that the variance of neurons in the network should be $$2.0/n$$, which is the current recommendation for use in practice.
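Both initializations can be sketched in a few lines of NumPy. The function names are hypothetical; the ReLU variant assumes the commonly cited $$2.0/n$$ variance (He et al.), with n the fan-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_calibrated(n_in, n_out):
    """Gaussian weights divided by sqrt(fan-in): unit-variance pre-activations."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def init_relu(n_in, n_out):
    """Gaussian weights with variance 2.0 / fan-in, intended for ReLU units."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

n_in, n_out = 512, 256
x = rng.standard_normal((1000, n_in))      # unit-variance inputs
s_calibrated = x @ init_calibrated(n_in, n_out)
s_relu = x @ init_relu(n_in, n_out)
print(s_calibrated.var(), s_relu.var())    # roughly 1.0 and 2.0
```

The factor of 2 compensates for a ReLU zeroing roughly half of its inputs, so the signal keeps a comparable scale from layer to layer.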

There are several ways of controlling the capacity of Neural Networks to prevent overfitting: Finally, from the tips above, you can get the satisfactory settings (e.g., data processing, architectures choices and details, etc.) for your own deep networks.

It is well known that an ensemble is usually significantly more accurate than a single learner, and ensemble methods have already achieved great success in many real-world tasks.

As discussed in a recent technical report [10], when deep CNNs are trained on these imbalanced training sets, the results show that imbalanced training data can potentially have a severely negative impact on overall performance in deep networks.

Because the original cultural event images are imbalanced, we extract crops only from the classes which have a small number of training images, which on one hand supplies diverse data sources, and on the other hand addresses the class-imbalance problem.

At the beginning of fine-tuning on your data set, you first fine-tune on the classes which have a large number of training samples (images/crops), and then continue to fine-tune on the classes with a limited number of samples.

## Applied Deep Learning - Part 4: Convolutional Neural Networks

In Part 2 we applied deep learning to real-world datasets, covering the 3 most commonly encountered problems as case studies: binary classification, multiclass classification and regression.

The recent surge of interest in deep learning is due to the immense popularity and effectiveness of convnets.

The main advantage of CNN compared to its predecessors is that it automatically detects the important features without any human supervision.

We are dealing with a very powerful and efficient model which performs automatic feature extraction to achieve superhuman accuracy (yes CNN models now do image classification better than humans).

We perform a series of convolution + pooling operations, followed by a number of fully connected layers.

In our case the convolution is applied on the input data using a convolution filter to produce a feature map.

Here the filter is at the top left; the output of the convolution operation, “4”, is shown in the resulting feature map.

In reality an image is represented as a 3D matrix with dimensions of height, width and depth, where depth corresponds to color channels (RGB).

A convolution filter has a specific height and width, like 3x3 or 5x5, and by design it covers the entire depth of its input so it needs to be 3D as well.

We perform multiple convolutions on an input, each using a different filter and resulting in a distinct feature map.

Let’s say we have a 32x32x3 image and we use a filter of size 5x5x3 (note that the depth of the convolution filter matches the depth of the image, both being 3).

When the filter is at a particular location it covers a small volume of the input, and we perform the convolution operation described above.

We slide the filter over the input like above and perform the convolution at every location aggregating the result in a feature map.

If we used 10 different filters we would have 10 feature maps of size 32x32x1 and stacking them along the depth dimension would give us the final output of the convolution layer: a volume of size 32x32x10, shown as the large blue box on the right.

Note that the height and width of the feature map are unchanged and still 32. This is due to padding, and we will elaborate on that shortly.

The animation shows the sliding operation at 4 locations, but in reality it’s performed over the entire input.

But keep in mind that any type of convolution is followed by a ReLU operation; without it the network won’t achieve its true potential.

We see that the size of the feature map is smaller than the input, because the convolution filter needs to be contained in the input.

Padding is commonly used in CNN to preserve the size of the feature maps, otherwise they would shrink at each layer, which is not desirable.

The 3D convolution figures we saw above used padding, which is why the height and width of the feature map were the same as the input (both 32x32), and only the depth changed.

Pooling layers downsample each feature map independently, reducing the height and width, keeping the depth intact.

The most common type of pooling is max pooling which just takes the max value in the pooling window.

If the input to the pooling layer has the dimensionality 32x32x10, using the same pooling parameters described above, the result will be a 16x16x10 feature map.

Both the height and width of the feature map are halved, but the depth doesn’t change because pooling works independently on each depth slice of the input.

We have 4 important hyperparameters to decide on: After the convolution + pooling layers we add a couple of fully connected layers to wrap up the CNN architecture.

Remember that the output of both convolution and pooling layers are 3D volumes, but a fully connected layer expects a 1D vector of numbers.

So we flatten the output of the final pooling layer to a vector and that becomes the input to the fully connected layer.

For example given an image, the convolution layer detects features such as two eyes, long ears, four legs, a short tail and so on.

The fully connected layers then act as a classifier on top of these features, and assign a probability for the input image being a dog.

The first layers detect edges, the next layers combine them to detect shapes, and the following layers merge this information to infer that this is a nose.

The fully connected layers learn how to use these features produced by convolutions in order to correctly classify the images.

We will use the following architecture: 4 convolution + pooling layers, followed by 2 fully connected layers.

There are 4 new methods we haven’t seen before: Dropout is by far the most popular regularization technique for deep neural networks.

Even the state-of-the-art models which have 95% accuracy get a 2% accuracy boost just by adding dropout, which is a fairly substantial gain at that level.

The dropped-out neurons are resampled with probability p at every training step, so a dropped out neuron at one step can be active at the next one.

The hyperparameter p is called the dropout-rate and it’s typically a number around 0.5, corresponding to 50% of the neurons being dropped out.

The reason is that dropout prevents the network from being too dependent on a small number of neurons, and forces every neuron to be able to operate independently.
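Here is a minimal NumPy sketch of dropout with rate p, written in the "inverted" form that rescales the surviving activations during training so that nothing needs to change at test time. The inverted scaling and the function name are assumptions of this sketch, not something the text specifies.

```python
import numpy as np

def dropout_forward(x, rate, rng, training=True):
    """Zero each activation with probability `rate`; rescale the rest by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return x                       # at test time every neuron stays active
    keep = 1.0 - rate
    # A fresh mask is sampled on every call, i.e. at every training step.
    mask = (rng.random(x.shape) < keep) / keep
    return x * mask

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout_forward(activations, rate=0.5, rng=rng)
print(np.unique(dropped))   # [0. 2.]
print(dropped.mean())       # close to 1.0, so the expected activation is preserved
```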

This might sound familiar from constraining the code size of the autoencoder in Part 3, in order to learn more intelligent representations.

But if every morning you tossed a coin to decide whether you would go to work or not, then your coworkers would need to adapt.

We will take a look at loss and accuracy curves, comparing training set performance against the validation set.

Training loss keeps going down but the validation loss starts increasing after around epoch 10.

The model is memorizing the training data, but it’s failing to generalize to new instances, and that’s why the validation performance gets worse.

But fortunately there is a solution to this problem which enables us to train deep models on small datasets, and it’s called data augmentation.

The common case in most machine learning applications, especially in image classification tasks is that obtaining new training data is not easy.

It enriches or “augments” the training data by generating new examples via random transformation of existing ones.

We need to generate realistic images, and the transformations should be learnable; simply adding noise won’t help.

Also, data augmentation is only performed on the training data, we don’t touch the validation or test set.

These are new training instances, applying transformations on the original image doesn’t change the fact that this is still a cat image.
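A small NumPy sketch of label-preserving augmentation is shown below, using a random horizontal flip and a random shift. The function name, the zero fill at the borders and the shift range are assumptions; the original post may well use a library generator instead.

```python
import numpy as np

def random_augment(img, rng, max_shift=4):
    """img: (H, W, C). Returns a randomly flipped and shifted copy of the same size."""
    H, W, _ = img.shape
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                        # horizontal flip
    # Pad with zeros, then crop an H x W window at a random offset: a random shift.
    padded = np.pad(img, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)))
    top, left = rng.integers(0, 2 * max_shift + 1, size=2)
    return padded[top:top + H, left:left + W, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))      # stands in for one training image
augmented = [random_augment(image, rng) for _ in range(8)]
print(len(augmented), augmented[0].shape)  # 8 (32, 32, 3)
```

The label is reused unchanged for every augmented copy, and the function is applied to training images only.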

The code for the model definition will not change at all, since we’re not changing the architecture of our model.

And validation accuracy jumped from 73% with no data augmentation to 81% with data augmentation, an 11% relative improvement.

Second, we made the model transformation invariant, meaning the model saw a lot of shifted/rotated/scaled images so it’s able to recognize them better.

There are much more complicated models which perform better, for example Microsoft’s ResNet model was the winner of 2015 ImageNet challenge with 3.6% error rate, but the model has 152 layers!

As we discussed above the height and width correspond to the dimensions of the feature map, and each depth channel is a distinct feature map encoding independent features.

Instead of looking at a single feature map, it would be more interesting to visualize multiple feature maps from a convolution layer.

So let’s visualize the feature maps corresponding to the first convolution of each block, the red arrows in the figure below.

There is one catch though, we won’t actually visualize the filters themselves, but instead we will display the patterns each filter maximally responds to.

Lower layers encode/detect simple structures, as we go deeper the layers build on top of each other and learn to encode more complex patterns.

A very thorough free online book about deep learning can be found here, with the CNN section available here.

## Deep learning

In the last chapter we learned that deep neural networks are often much harder to train than shallow neural networks.

We'll also look at the broader picture, briefly reviewing recent progress on using deep nets for image recognition, speech recognition, and other applications.

We'll work through a detailed example - code and all - of using convolutional nets to solve the problem of classifying handwritten digits from the MNIST data set:

As we go we'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of networks, and others.

We conclude our discussion of image recognition with a survey of some of the spectacular recent progress using networks (particularly convolutional nets) to do image recognition.

We'll briefly survey other models of neural networks, such as recurrent neural nets and long short-term memory units, and how such models can be applied to problems in speech recognition, natural language processing, and other areas.

And we'll speculate about the future of neural networks and deep learning, ranging from ideas like intention-driven user interfaces, to the role of deep learning in artificial intelligence.

For the $28 \times 28$ pixel images we've been using, this means our network has $784$ ($= 28 \times 28$) input neurons.

Our earlier networks work pretty well: we've obtained a classification accuracy better than 98 percent, using training and test data from the MNIST handwritten digit data set.

But the seminal paper establishing the modern subject of convolutional networks was a 1998 paper, 'Gradient-based learning applied to document recognition', by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.

LeCun has since made an interesting remark on the terminology for convolutional nets: 'The [biological] neural inspiration in models like convolutional nets is very tenuous.

That's why I call them 'convolutional nets' not 'convolutional neural nets', and why we call the nodes 'units' and not 'neurons''.

Despite this remark, convolutional nets use many of the same ideas as the neural networks we've studied up to now: ideas such as backpropagation, gradient descent, regularization, non-linear activation functions, and so on.

In a convolutional net, it'll help to think instead of the inputs as a $28 \times 28$ square of neurons, whose values correspond to the $28 \times 28$ pixel intensities we're using as inputs:

To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a $5 \times 5$ region, corresponding to $25$ input pixels.

So, for a particular hidden neuron, we might have connections that look like this: That region in the input image is called the local receptive field for the hidden neuron.

To illustrate this concretely, let's start with a local receptive field in the top-left corner: Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:

Note that if we have a $28 \times 28$ input image, and $5 \times 5$ local receptive fields, then there will be $24 \times 24$ neurons in the hidden layer.

This is because we can only move the local receptive field $23$ neurons across (or $23$ neurons down), before colliding with the right-hand side (or bottom) of the input image.

In this chapter we'll mostly stick with stride length $1$, but it's worth knowing that people sometimes experiment with different stride lengths. (As was done in earlier chapters, if we're interested in trying different stride lengths then we can use validation data to pick out the stride length which gives the best performance. The same approach may also be used to choose the size of the local receptive field - there is, of course, nothing special about using a $5 \times 5$ local receptive field. In general, larger local receptive fields tend to be helpful when the input images are significantly larger than the $28 \times 28$ pixel MNIST images.)

In other words, for the $j, k$th hidden neuron, the output is: \begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right). \end{eqnarray}
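The formula translates almost literally into NumPy. In the sketch below, `feature_map` is a hypothetical helper, $\sigma$ is taken to be the sigmoid, and the weights and inputs are random stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a, w, b):
    """a: input activations (e.g. 28x28); w: shared weights (e.g. 5x5); b: shared bias.
    Entry [j, k] is sigma(b + sum_{l,m} w[l, m] * a[j + l, k + m])."""
    fh, fw = w.shape
    out = np.empty((a.shape[0] - fh + 1, a.shape[1] - fw + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            out[j, k] = sigmoid(b + np.sum(w * a[j:j + fh, k:k + fw]))
    return out

rng = np.random.default_rng(0)
a = rng.random((28, 28))            # stand-in for one MNIST image
w = rng.standard_normal((5, 5))     # the same 5x5 weights are used at every position
b = 0.1
hidden = feature_map(a, w, b)
print(hidden.shape)  # (24, 24)
```

Every hidden neuron uses the same `w` and `b`, which is exactly the weight sharing the text describes.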

Informally, think of the feature detected by a hidden neuron as the kind of input pattern that will cause the neuron to activate: it might be an edge in the image, for instance, or maybe some other type of shape.

To see why this makes sense, suppose the weights and bias are such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive field.

To put it in slightly more abstract terms, convolutional networks are well adapted to the translation invariance of images: move a picture of a cat (say) a little ways, and it's still an image of a cat. (In fact, for the MNIST digit classification problem we've been studying, the images are centered and size-normalized.)

One of the early convolutional networks, LeNet-5, used $6$ feature maps, each associated to a $5 \times 5$ local receptive field, to recognize MNIST digits.

Let's take a quick peek at some of the features which are learned (the feature maps illustrated come from the final convolutional network we train):

Each map is represented as a $5 \times 5$ block image, corresponding to the $5 \times 5$ weights in the local receptive field.

By comparison, suppose we had a fully connected first layer, with $784 = 28 \times 28$ input neurons, and a relatively modest $30$ hidden neurons, as we used in many of the examples earlier in the book.

That, in turn, will result in faster training for the convolutional model, and, ultimately, will help us build deep networks using convolutional layers.
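The parameter saving is easy to quantify. The sketch assumes a convolutional layer with 20 feature maps, each with a $5 \times 5$ set of shared weights and one shared bias (20 is an assumption here, chosen to match the `filter_shape=(20, 1, 5, 5)` used later in the chapter), and compares it with the fully connected layer just described.

```python
# Convolutional layer: each feature map has 5*5 shared weights plus 1 shared bias.
num_feature_maps = 20          # assumed for illustration
conv_params = num_feature_maps * (5 * 5 + 1)

# Fully connected first layer: 784 inputs feeding 30 hidden neurons, plus 30 biases.
fc_params = 784 * 30 + 30

print(conv_params, fc_params)  # 520 23550
```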

Incidentally, the name convolutional comes from the fact that the operation in Equation (125) \begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right) \nonumber\end{eqnarray} is sometimes known as a convolution.

A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is called a convolution operation.

In particular, I'm using 'feature map' to mean not the function computed by the convolutional layer, but rather the activation of the hidden neurons output from the layer.

In max-pooling, a pooling unit simply outputs the maximum activation in the $2 \times 2$ input region, as illustrated in the following diagram:

Note that since we have $24 \times 24$ neurons output from the convolutional layer, after pooling we have $12 \times 12$ neurons.

So if there were three feature maps, the combined convolutional and max-pooling layers would look like:

Here, instead of taking the maximum activation of a $2 \times 2$ region of neurons, we take the square root of the sum of the squares of the activations in the $2 \times 2$ region.
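Both pooling variants can be written with one reshape in NumPy. This is a sketch under the assumption of non-overlapping $2 \times 2$ regions; the function name is made up.

```python
import numpy as np

def pool2x2(a, mode="max"):
    """Pool non-overlapping 2x2 regions of a 2D activation array."""
    H, W = a.shape
    # Element [i, r, j, c] of `blocks` is a[2*i + r, 2*j + c].
    blocks = a[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "l2":
        return np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    raise ValueError(f"unknown pooling mode: {mode}")

a = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(a, "max"))                            # [[ 5.  7.] [13. 15.]]
print(pool2x2(np.array([[3., 4.], [0., 0.]]), "l2"))  # [[5.]]
print(pool2x2(np.zeros((24, 24))).shape)            # (12, 12)
```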

It's similar to the architecture we were just looking at, but has the addition of a layer of $10$ output neurons, corresponding to the $10$ possible values for MNIST digits ('0', '1', '2', etc):

Problem: Backpropagation in a convolutional network. The core equations of backpropagation in a network with fully-connected layers are (BP1) \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} through (BP4) \begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}.

Suppose we have a network containing a convolutional layer, a max-pooling layer, and a fully-connected output layer, as in the network discussed above.

The program we'll use to do this is called network3.py, and it's an improved version of the programs network.py and network2.py developed in earlier chapters. (Note also that network3.py incorporates ideas from the Theano library's documentation on convolutional neural nets (notably the implementation of LeNet-5), from Misha Denil's implementation of dropout, and from Chris Olah.)

But now that we understand those details, for network3.py we're going to use a machine learning library known as Theano. (See Theano: A CPU and GPU Math Expression Compiler in Python, by James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio (2010).)

The examples which follow were run using Theano 0.6. (As I release this chapter, the current version of Theano has changed to version 0.7.)

Note that the code in the script simply duplicates and parallels the discussion in this section. Note also that throughout the section I've explicitly specified the number of training epochs.

In practice, it's worth using early stopping, that is, tracking accuracy on the validation set, and stopping training when we are confident the validation accuracy has stopped improving.
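One simple way to turn "confident the validation accuracy has stopped improving" into code is a patience rule, sketched below. This is not what network3.py does; the function, its callback arguments and the patience value are all assumptions for illustration.

```python
def train_with_early_stopping(train_one_epoch, validation_accuracy, max_epochs, patience=10):
    """Stop once validation accuracy has not improved for `patience` consecutive epochs.
    Returns the best epoch and its validation accuracy."""
    best_acc, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        acc = validation_accuracy()
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_acc

# Toy run with a canned sequence of validation accuracies instead of a real network.
accs = iter([0.90, 0.95, 0.97, 0.968, 0.969, 0.967, 0.966])
best_epoch, best_acc = train_with_early_stopping(
    train_one_epoch=lambda epoch: None,
    validation_accuracy=lambda: next(accs),
    max_epochs=7,
    patience=3,
)
print(best_epoch, best_acc)  # 2 0.97
```

In a real run the test accuracy would be read off at `best_epoch`, so the test set plays no part in deciding when to stop.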

Using the validation data to decide when to evaluate the test accuracy helps avoid overfitting to the test data (see this earlier discussion of the use of validation data).
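The idea in the last two paragraphs can be sketched in a few lines of plain Python. This is only an illustration, not the logic used in network3.py: the function name `early_stopping`, the `patience` rule, and the accuracy figures are all assumptions made up for the example. It stops once the validation accuracy has failed to improve for `patience` epochs, and reports the test accuracy from the epoch with the best validation accuracy.

```python
def early_stopping(validation_accuracies, test_accuracies, patience=3):
    """Return (best_epoch, best_validation, test_at_best, stopped_epoch).

    Training is treated as finished once `patience` epochs have passed
    since the last new best validation accuracy.
    """
    best_epoch, best_val, test_at_best = -1, float("-inf"), None
    pairs = zip(validation_accuracies, test_accuracies)
    for epoch, (val, test) in enumerate(pairs):
        if val > best_val:
            # New best on the validation set: remember the matching test accuracy.
            best_epoch, best_val, test_at_best = epoch, val, test
        elif epoch - best_epoch >= patience:
            return best_epoch, best_val, test_at_best, epoch
    return best_epoch, best_val, test_at_best, len(validation_accuracies) - 1


# Made-up per-epoch accuracies, purely for illustration.
val = [0.950, 0.962, 0.971, 0.970, 0.969, 0.968, 0.972]
test = [0.948, 0.960, 0.968, 0.969, 0.967, 0.966, 0.970]
print(early_stopping(val, test, patience=3))
```

Note that the test accuracy is only read off at the epoch chosen by the validation data; it never influences when training stops.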

Your results may vary slightly, since the network's weights and biases are randomly initialized. (In fact, in this experiment I actually did three separate runs training a network with this architecture.)

This $97.80$ percent accuracy is close to the $98.04$ percent accuracy obtained back in Chapter 3, using a similar network architecture and learning hyper-parameters.

Second, while the final layer in the earlier network used sigmoid activations and the cross-entropy cost function, the current network uses a softmax final layer, and the log-likelihood cost function.

I haven't made this switch for any particularly deep reason - mostly, I've done it because softmax plus log-likelihood cost is more common in modern image classification networks.
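For readers who want to see the softmax plus log-likelihood combination concretely, here is a minimal plain-Python sketch. It is not the Theano code from network3.py; the function names and the sample weighted inputs are assumptions chosen for illustration.

```python
import math


def softmax(z):
    """Softmax of a list of weighted inputs, shifted for numerical stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood_cost(activations, y):
    """Cost -ln(a_y) for a single example whose correct class is y."""
    return -math.log(activations[y])


z = [1.0, 2.0, 0.5]   # hypothetical weighted inputs to the output layer
a = softmax(z)
print(sum(a))                       # the activations form a probability distribution
print(log_likelihood_cost(a, 1))    # small cost when the correct class has high probability
print(log_likelihood_cost(a, 2))    # larger cost when it has low probability
```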

In this architecture, we can think of the convolutional and pooling layers as learning about local spatial structure in the input training image, while the later, fully-connected layer learns at a more abstract level, integrating global information from across the entire image.

    filter_shape=(20, 1, 5, 5),
    poolsize=(2, 2)),
    validation_data, test_data)

Can we improve on the $98.78$ percent classification accuracy?

    filter_shape=(20, 1, 5, 5),
    poolsize=(2, 2)),
    filter_shape=(40, 20, 5, 5),
    poolsize=(2, 2)),
    validation_data, test_data)

In fact, you can think of the second convolutional-pooling layer as having as input $12 \times 12$ 'images', whose 'pixels' represent the presence (or absence) of particular localized features in the original input image.

The output from the previous layer involves $20$ separate feature maps, and so there are $20 \times 12 \times 12$ inputs to the second convolutional-pooling layer.

In fact, we'll allow each neuron in this layer to learn from all $20 \times 5 \times 5$ input neurons in its local receptive field.

More informally: the feature detectors in the second convolutional-pooling layer have access to all the features from the previous layer, but only within their particular local receptive field. (This issue would have arisen in the first layer if the input images were in color. In that case we'd have 3 input features for each pixel, corresponding to red, green and blue channels in the input image. So we'd allow the feature detectors to have access to all color information, but only within a given local receptive field.)
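The shape arithmetic behind these paragraphs can be checked with a short sketch. The helper `conv_pool_output` is hypothetical (it is not part of network3.py), and it assumes a stride-1 convolution with no padding, non-overlapping pooling, and $28 \times 28$ MNIST inputs.

```python
def conv_pool_output(in_maps, in_size, filter_shape, poolsize):
    """Shape after a stride-1, unpadded convolution followed by non-overlapping pooling.

    filter_shape = (num_filters, num_input_maps, filter_h, filter_w),
    matching the filter_shape tuples quoted in the text.
    """
    num_filters, num_input_maps, fh, fw = filter_shape
    assert num_input_maps == in_maps, "filter depth must match the input feature maps"
    conv_h, conv_w = in_size[0] - fh + 1, in_size[1] - fw + 1
    out_size = (conv_h // poolsize[0], conv_w // poolsize[1])
    weights_per_filter = num_input_maps * fh * fw
    return num_filters, out_size, weights_per_filter


# First layer: one 28x28 input map, 20 filters of 5x5, 2x2 pooling.
maps1, size1, w1 = conv_pool_output(1, (28, 28), (20, 1, 5, 5), (2, 2))
# Second layer: 20 input maps of 12x12, 40 filters of 5x5, 2x2 pooling.
maps2, size2, w2 = conv_pool_output(maps1, size1, (40, 20, 5, 5), (2, 2))
print(maps1, size1, w1)   # 20 (12, 12) 25
print(maps2, size2, w2)   # 40 (4, 4) 500
```

The second line of output reflects the $20 \times 5 \times 5 = 500$ inputs each second-layer neuron sees in its local receptive field.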

Problem: Using the tanh activation function. Several times earlier in the book I've mentioned arguments that the tanh function may be a better activation function than the sigmoid function.

Try training the network with tanh activations in the convolutional and fully-connected layers. (Note that you can pass activation_fn=tanh as a parameter to the ConvPoolLayer and FullyConnectedLayer classes.)

Try plotting the per-epoch validation accuracies for both tanh- and sigmoid-based networks, all the way out to $60$ epochs.

If your results are similar to mine, you'll find the tanh networks train a little faster, but the final accuracies are very similar.

Can you get a similar training speed with the sigmoid, perhaps by changing the learning rate, or doing some rescaling? (You may perhaps find inspiration in recalling that $\sigma(z) = (1+\tanh(z/2))/2$.)
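The identity relating the sigmoid to tanh is easy to confirm numerically. The following is just a sanity check in plain Python, with sample points chosen arbitrarily.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


for z in [-4.0, -1.0, 0.0, 0.5, 3.0]:
    rescaled_tanh = (1.0 + math.tanh(z / 2.0)) / 2.0
    # The two expressions agree up to floating-point rounding.
    assert abs(sigmoid(z) - rescaled_tanh) < 1e-12
    print(z, sigmoid(z), rescaled_tanh)
```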

Try a half-dozen iterations on the learning hyper-parameters or network architecture, searching for ways that tanh may be superior to the sigmoid.

Personally, I did not find much advantage in switching to tanh, although I haven't experimented exhaustively, and perhaps you may find a way.

In any case, in a moment we will find an advantage in switching to the rectified linear activation function, and so we won't go any deeper into the use of tanh.

Using rectified linear units: The network we've developed at this point is actually a variant of one of the networks used in the seminal 1998 paper 'Gradient-based learning applied to document recognition', by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998).

    filter_shape=(20, 1, 5, 5),
    poolsize=(2, 2),
    activation_fn=ReLU),
    filter_shape=(40, 20, 5, 5),
    poolsize=(2, 2),
    activation_fn=ReLU),

However, across all my experiments I found that networks based on rectified linear units consistently outperformed networks based on sigmoid activation functions.

The reason for that recent adoption is empirical: a few people tried rectified linear units, often on the basis of hunches or heuristic arguments. (A common justification is that $\max(0, z)$ doesn't saturate in the limit of large $z$, unlike sigmoid neurons, and this helps rectified linear units continue learning.)
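The saturation argument can be made concrete by comparing derivatives as $z$ grows. This is a small illustrative sketch in plain Python; the function names and sample values of $z$ are assumptions, not code from the book.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_prime(z):
    """Derivative of max(0, z); the value at z == 0 is set to 0 by convention."""
    return 1.0 if z > 0 else 0.0


# The sigmoid's slope collapses towards zero for large z; the ReLU's stays at 1.
for z in [1.0, 5.0, 10.0, 20.0]:
    print(z, sigmoid_prime(z), relu_prime(z))
```

The flip side, visible in `relu_prime`, is that the gradient is exactly zero for negative weighted inputs.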

A simple way of expanding the training data is to displace each training image by a single pixel, either up one pixel, down one pixel, left one pixel, or right one pixel.
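A minimal sketch of this expansion, assuming each image is stored as a list of rows and that pixels shifted in from outside the frame are filled with zeros (both assumptions for illustration; a real pipeline would also copy each image's label alongside its displaced copies):

```python
def shift_image(image, dy, dx, fill=0.0):
    """Return a copy of a 2D image (list of rows) displaced by (dy, dx) pixels.

    Pixels shifted in from outside the frame are set to `fill`.
    """
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                shifted[ny][nx] = image[y][x]
    return shifted


def expand(images):
    """Original images plus copies moved one pixel up, down, left and right."""
    displacements = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    return [shift_image(img, dy, dx) for img in images for dy, dx in displacements]


tiny = [[0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0]]
expanded = expand([tiny])
print(len(expanded))   # 5
```

Each input image yields five images, so the expanded data set is five times the size of the original.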

    filter_shape=(20, 1, 5, 5),
    poolsize=(2, 2),
    activation_fn=ReLU),
    filter_shape=(40, 20, 5, 5),
    poolsize=(2, 2),
    activation_fn=ReLU),

Just to remind you of the flavour of some of the results in that earlier discussion: in 2003 Simard, Steinkraus and Platt improved their MNIST performance to $99.6$ percent using a neural network otherwise very similar to ours, using two convolutional-pooling layers, followed by a hidden fully-connected layer with $100$ neurons. (See Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, by Patrice Simard, Dave Steinkraus, and John Platt (2003).)

There were a few differences of detail in their architecture - they didn't have the advantage of using rectified linear units, for instance - but the key to their improved performance was expanding the training data.

    filter_shape=(20, 1, 5, 5),
    poolsize=(2, 2),
    activation_fn=ReLU),
    filter_shape=(40, 20, 5, 5),
    poolsize=(2, 2),
    activation_fn=ReLU),


Using this, we obtain an accuracy of $99.60$ percent, which is a substantial improvement over our earlier results, especially our main benchmark, the network with $100$ hidden neurons, where we achieved $99.37$ percent.

In fact, I tried experiments with both $300$ and $1,000$ hidden neurons, and obtained (very slightly) better validation performance with $1,000$ hidden neurons.

Why we only applied dropout to the fully-connected layers: If you look carefully at the code above, you'll notice that we applied dropout only to the fully-connected section of the network, not to the convolutional layers.

But apart from that, they used few other tricks, including no convolutional layers: it was a plain, vanilla network, of the kind that, with enough patience, could have been trained in the 1980s (if the MNIST data set had existed), given enough computing power.

In particular, we saw that the gradient tends to be quite unstable: as we move from the output layer to earlier layers the gradient tends to either vanish (the vanishing gradient problem) or explode (the exploding gradient problem).
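A toy calculation shows both behaviours. It assumes the simplified picture of a chain of single-neuron sigmoid layers, where the gradient reaching an early layer picks up one factor $w_j \sigma'(z_j)$ per layer; the weights and weighted inputs below are made up for illustration.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_factor(weights, zs):
    """Product of w_j * sigma'(z_j) along a chain of single-neuron layers."""
    factor = 1.0
    for w, z in zip(weights, zs):
        factor *= w * sigmoid_prime(z)
    return factor


depth = 10
# sigma'(0) = 0.25, so unit weights shrink the gradient like 0.25**depth ...
print(gradient_factor([1.0] * depth, [0.0] * depth))
# ... while weights of 8 make each factor equal to 2, so it grows like 2**depth.
print(gradient_factor([8.0] * depth, [0.0] * depth))
```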

In particular, in our final experiments we trained for $40$ epochs using a data set $5$ times larger than the raw MNIST training data.

I've occasionally heard people adopt a deeper-than-thou attitude, holding that if you're not keeping-up-with-the-Joneses in terms of number of hidden layers, then you're not really doing deep learning.

To speed that process up you may find it helpful to revisit Chapter 3's discussion of how to choose a neural network's hyper-parameters, and perhaps also to look at some of the further reading suggested in that section.

Here's the code (discussion below). (Note added November 2016: several readers have noted that in the line initializing self.w, I set scale=np.sqrt(1.0/n_out), when the arguments of Chapter 3 suggest a better initialization may be scale=np.sqrt(1.0/n_in).)

    np.random.normal(
        loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
    dtype=theano.config.floatX),
    dtype=theano.config.floatX),
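To see why the choice between the two scales matters, consider the spread of a neuron's weighted input. The sketch below is a back-of-the-envelope calculation, not part of network3.py: it assumes all `n_in` inputs equal 1 and independent Gaussian weights with the given standard deviation, and the layer sizes are hypothetical.

```python
import math


def weighted_input_std(n_in, scale):
    """Std of z = sum_i w_i * x_i when all n_in inputs equal 1 and w_i ~ N(0, scale**2)."""
    return math.sqrt(n_in) * scale


n_in, n_out = 784, 100     # hypothetical layer sizes
print(weighted_input_std(n_in, math.sqrt(1.0 / n_out)))   # about 2.8
print(weighted_input_std(n_in, math.sqrt(1.0 / n_in)))    # about 1.0
```

With scale `sqrt(1/n_in)` the weighted input has unit spread regardless of layer width, which is the property the Chapter 3 argument is after.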

I use the name inpt rather than input because input is a built-in function in Python, and messing with built-ins tends to cause unpredictable behavior and difficult-to-diagnose bugs.

So self.inpt_dropout and self.output_dropout are used during training, while self.inpt and self.output are used for all other purposes, e.g., evaluating accuracy on the validation and test data.
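The split between a training path and an evaluation path can be sketched for a single neuron as follows. This is plain Python rather than Theano, and the details are assumptions for illustration: inputs are dropped independently with probability `p_dropout` during training, and at evaluation time the weighted sum is scaled by `1 - p_dropout` to compensate.

```python
import random


def dropout_mask(n, p_dropout, rng):
    """1 keeps a neuron, 0 drops it; each neuron is dropped with probability p_dropout."""
    return [0 if rng.random() < p_dropout else 1 for _ in range(n)]


def weighted_input(inpt, w, b, scale=1.0):
    """z = scale * (w . inpt) + b for a single output neuron."""
    return scale * sum(wi * xi for wi, xi in zip(w, inpt)) + b


rng = random.Random(0)          # fixed seed so the sketch is repeatable
p_dropout = 0.5
inpt = [0.2, 0.9, 0.4, 0.7]     # hypothetical activations from the previous layer
w, b = [0.5, -0.3, 0.8, 0.1], 0.05

# Training path: randomly silence inputs, use the weights as they are.
mask = dropout_mask(len(inpt), p_dropout, rng)
inpt_dropout = [m * x for m, x in zip(mask, inpt)]
z_train = weighted_input(inpt_dropout, w, b)

# Evaluation path: keep every input, scale the weighted sum by (1 - p_dropout).
z_eval = weighted_input(inpt, w, b, scale=1.0 - p_dropout)
print(mask, z_train, z_eval)
```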


Now, this isn't a Theano tutorial, and so we won't get too deeply into what it means that these are symbolic variables. (The Theano documentation provides a good introduction to Theano.)


In these lines we symbolically set up the regularized log-likelihood cost function, compute the corresponding derivatives in the gradient function, as well as the corresponding parameter updates.
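Stripped of the symbolic machinery, the regularized cost and the parameter update amount to the following. This is a numeric sketch in plain Python, not the Theano code: the function names, the sample weights, and the gradients are all made up for illustration, and only the expression `0.5*lmbda*l2_norm_squared/num_training_batches` is taken from the listing.

```python
def regularized_cost(data_cost, weights, lmbda, num_training_batches):
    """Mini-batch cost plus the L2 penalty 0.5*lmbda*sum(w**2)/num_training_batches."""
    l2_norm_squared = sum(w ** 2 for layer in weights for w in layer)
    return data_cost + 0.5 * lmbda * l2_norm_squared / num_training_batches


def sgd_step(params, grads, eta):
    """Plain gradient-descent update: param <- param - eta * grad."""
    return [p - eta * g for p, g in zip(params, grads)]


weights = [[0.5, -1.0], [2.0]]   # hypothetical flattened weights, one list per layer
print(regularized_cost(0.30, weights, lmbda=0.1, num_training_batches=5))
print(sgd_step([1.0, -2.0], [0.5, -0.5], eta=0.1))
```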

With all these things defined, the stage is set to define the train_mb function, a Theano symbolic function which uses the updates to update the Network parameters, given a mini-batch index.

The remainder of the SGD method is self-explanatory - we simply iterate over the epochs, repeatedly training the network on mini-batches of training data, and computing the validation and test accuracies.
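The bookkeeping in that loop (mini-batch slices and the running `iteration` counter) can be sketched on its own. The helper names below are hypothetical, and the sketch assumes any trailing partial mini-batch is dropped.

```python
def minibatch_slices(num_examples, mini_batch_size):
    """(start, stop) index pairs; a trailing partial batch is dropped."""
    num_batches = num_examples // mini_batch_size
    return [(i * mini_batch_size, (i + 1) * mini_batch_size)
            for i in range(num_batches)]


def run_epochs(num_examples, mini_batch_size, epochs):
    """Yield (epoch, iteration, start, stop) in the order SGD would visit them."""
    slices = minibatch_slices(num_examples, mini_batch_size)
    num_training_batches = len(slices)
    for epoch in range(epochs):
        for minibatch_index, (start, stop) in enumerate(slices):
            iteration = num_training_batches * epoch + minibatch_index
            yield epoch, iteration, start, stop


schedule = list(run_epochs(num_examples=50, mini_batch_size=10, epochs=2))
print(schedule[0], schedule[-1])
# An epoch ends whenever (iteration + 1) is a multiple of the number of batches.
print([it for _, it, _, _ in schedule if (it + 1) % 5 == 0])
```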

    prev_layer, layer = self.layers[j-1], self.layers[j]
    prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)

    0.5*lmbda*l2_norm_squared/num_training_batches

    self.x:
    training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
    self.y:
    training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

    self.x:
    validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
    self.y:
    validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

    self.x:
    test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
    self.y:
    test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

    self.x:
    test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

    iteration = num_training_batches*epoch+minibatch_index
    if iteration % 1000 == 0:
        print("Training mini-batch number {0}".format(iteration))
    cost_ij = train_mb(minibatch_index)
    if (iteration+1) % num_training_batches == 0:
        validation_accuracy = np.mean(
            [validate_mb_accuracy(j) for j in xrange(num_validation_batches)])
        print("Epoch {0}: validation accuracy {1:.2%}".format(
            epoch, validation_accuracy))
        if validation_accuracy >= best_validation_accuracy:
            print("This is the best validation accuracy to date.")
            best_validation_accuracy = validation_accuracy
            best_iteration = iteration
            if test_data:
                test_accuracy = np.mean(
                    [test_mb_accuracy(j) for j in xrange(num_test_batches)])
                print('The corresponding test accuracy is {0:.2%}'.format(
                    test_accuracy))

    activation_fn=sigmoid):
    of filters, the number of input feature maps, the filter height, and the
    poolsize is a tuple of length 2, whose entries are the y and
    np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),
    dtype=theano.config.floatX),
    np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),
    dtype=theano.config.floatX),
    pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))


Earlier in the book we discussed an automated way of selecting the number of epochs to train for, known as early stopping.

Hint: After working on this problem for a while, you may find it useful to see the discussion at this link.

Earlier in the chapter I described a technique for expanding the training data by applying (small) rotations, skewing, and translation.

Note: Unless you have a tremendous amount of memory, it is not practical to explicitly generate the entire expanded data set.

Show that rescaling all the weights in the network by a constant factor $c > 0$ simply rescales the outputs by a factor $c^{L-1}$, where $L$ is the number of layers.

Still, considering the problem will help you better understand networks containing rectified linear units.
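Before attempting a proof, it can help to check the rescaling claim numerically. The sketch below assumes a tiny bias-free network of rectified linear units (zero biases are an assumption made here so that the scaling is exact); the weights and input are made up.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def feedforward(weight_matrices, x):
    """Bias-free ReLU network: a <- relu(W a) for each weight matrix in turn."""
    a = x
    for w in weight_matrices:
        a = relu(matvec(w, a))
    return a


def scale(weight_matrices, c):
    return [[[c * wij for wij in row] for row in w] for w in weight_matrices]


# Hypothetical network with L = 3 layers of neurons, hence L - 1 = 2 weight matrices.
weights = [[[1.0, -2.0], [0.5, 1.0]],
           [[2.0, 1.0]]]
x = [1.0, 0.25]
c = 4.0
base = feedforward(weights, x)
rescaled = feedforward(scale(weights, c), x)
print(base, rescaled)   # [1.75] [28.0], a factor of c**2 = 16
```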

Note: The word good in the second part of this makes the problem a research problem.

In 1998, the year MNIST was introduced, it took weeks of training on a state-of-the-art workstation to achieve accuracies substantially worse than those we can achieve using a GPU and less than an hour of training.

With that said, the past few years have seen extraordinary improvements using deep nets to attack extremely difficult image recognition tasks.

They will identify the years 2011 to 2015 (and probably a few years beyond) as a time of huge breakthroughs, driven by deep convolutional nets.

The 2012 LRMD paper: Let me start with a 2012 paper, Building high-level features using large scale unsupervised learning, by Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng (2012).

Note that the detailed architecture of the network used in the paper differed in many details from the deep convolutional networks we've been studying.

Details about ImageNet are available in the original ImageNet paper, ImageNet: a large-scale hierarchical image database, by Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei (2009).

If you're looking for a challenge, I encourage you to visit ImageNet's list of hand tools, which distinguishes between beading planes, block planes, chamfer planes, and about a dozen other types of plane, amongst other categories.

The 2012 KSH paper: The work of LRMD was followed by a 2012 paper of Krizhevsky, Sutskever and Hinton (KSH). (See ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012).)

By this top-$5$ criterion, KSH's deep convolutional network achieved an accuracy of $84.7$ percent, vastly better than the next-best contest entry, which achieved an accuracy of $73.8$ percent.
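The top-$5$ criterion is simple to state in code. Here is a minimal sketch in plain Python; the function name and the toy scores are assumptions for illustration, and nothing here reproduces the actual contest evaluation scripts.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for example_scores, label in zip(scores, labels):
        ranked = sorted(range(len(example_scores)),
                        key=lambda c: example_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Made-up scores over 6 classes for 2 images.
scores = [[0.1, 0.3, 0.05, 0.2, 0.15, 0.2],
          [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]]
labels = [4, 0]
print(top_k_accuracy(scores, labels, k=5))   # 1.0
print(top_k_accuracy(scores, labels, k=1))   # 0.5
```

The first image is counted as correct under top-$5$ even though its true class is only the fourth-ranked guess, which is why top-$5$ accuracies run higher than top-$1$.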

The input layer contains $3 \times 224 \times 224$ neurons, representing the RGB values for a $224 \times 224$ image.

The feature maps are split into two groups of $48$ each, with the first $48$ feature maps residing on one GPU, and the second $48$ feature maps residing on the other GPU.

Their respective parameters are: (3) $384$ feature maps, with $3 \times 3$ local receptive fields, and $256$ input channels;
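To get a feel for the size of such a layer, one can count its shared weights and biases. The helper below is a hypothetical sketch that assumes one bias per feature map and every feature map connected to every input channel (the two-GPU split described above may reduce the true count for some layers).

```python
def conv_layer_parameters(num_feature_maps, input_channels, field_h, field_w):
    """Shared weights plus one bias per feature map."""
    weights = num_feature_maps * input_channels * field_h * field_w
    biases = num_feature_maps
    return weights, biases


# Layer (3): 384 feature maps, 3x3 local receptive fields, 256 input channels.
print(conv_layer_parameters(384, 256, 3, 3))   # (884736, 384)
```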

A Theano-based implementation has also been developed (see Theano-based large-scale visual recognition with multiple GPUs, by Weiguang Ding, Ruoyan Wang, Fei Mao, and Graham Taylor (2014)), with the code publicly available.

As in 2012, it involved a training set of $1.2$ million images, in $1,000$ categories, and the figure of merit was whether the top $5$ predictions included the correct category.

The winning team, based primarily at Google, used a deep convolutional network with $22$ layers of neurons. (See Going deeper with convolutions, by Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich (2014).)

GoogLeNet achieved a top-5 accuracy of $93.33$ percent, a giant improvement over the 2013 winner (Clarifai, with $88.3$ percent), and the 2012 winner (KSH, with $84.7$ percent).

In 2014 a team of researchers wrote a survey paper about the ILSVRC competition (ImageNet large scale visual recognition challenge, by Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei (2014)).

...the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while.

In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself...

Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort.

In other words, an expert human, working painstakingly, was with great effort able to narrowly beat the deep neural network.

In fact, Karpathy reports that a second human expert, trained on a smaller sample of images, was only able to attain a $12.0$ percent top-5 error rate, significantly below GoogLeNet's performance.

One encouraging practical set of results comes from a team at Google, who applied deep convolutional networks to the problem of recognizing street numbers in Google's Street View imagery. (See Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks, by Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet (2013).)

And they go on to make the broader claim: 'We believe with this model we have solved [optical character recognition] for short sequences [of characters] for many applications.'

For instance, a 2013 paper (Intriguing properties of neural networks, by Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus (2013)) showed that deep networks may suffer from what are effectively blind spots.

The existence of the adversarial negatives appears to be in contradiction with the network’s ability to achieve high generalization performance.

The explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found near virtually every test case.

For example, one recent paper (Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, by Anh Nguyen, Jason Yosinski, and Jeff Clune (2014)) shows that given a trained network it's possible to generate images which look to a human like white noise, but which the network classifies as being in a known category with a very high degree of confidence.

If you read the neural networks literature, you'll run into many ideas we haven't discussed: recurrent neural networks, Boltzmann machines, generative models, transfer learning, reinforcement learning, and so on, on and on $\ldots$ and on!

One way RNNs are currently being used is to connect neural networks more closely to traditional ways of thinking about algorithms, ways of thinking based on concepts such as Turing machines and (conventional) programming languages.

A 2014 paper developed an RNN which could take as input a character-by-character description of a (very, very simple!) Python program, and use that description to predict the output.

For example, an approach based on deep nets has achieved outstanding results on large vocabulary continuous speech recognition.

And another system based on deep nets has been deployed in Google's Android operating system (for related technical work, see Vincent Vanhoucke's 2012-2015 papers).

Many other ideas used in feedforward nets, ranging from regularization techniques to convolutions to the activation and cost functions used, are also useful in recurrent nets.

Deep belief nets, generative models, and Boltzmann machines: Modern interest in deep learning began in 2006, with papers explaining how to train a type of neural network known as a deep belief network (DBN). (See A fast learning algorithm for deep belief nets, by Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh (2006), as well as the related work in Reducing the dimensionality of data with neural networks, by Geoffrey Hinton and Ruslan Salakhutdinov (2006).)

A generative model like a DBN can be used in a similar way, but it's also possible to specify the values of some of the feature neurons and then 'run the network backward', generating values for the input activations.

And the ability to do unsupervised learning is extremely interesting both for fundamental scientific reasons, and - if it can be made to work well enough - for practical applications.

Active areas of research include using neural networks to do natural language processing (see also this informative review paper), machine translation, as well as perhaps more surprising applications such as music informatics.

In many cases, having read this book you should be able to begin following recent work, although (of course) you'll need to fill in gaps in presumed background knowledge.

It combines deep convolutional networks with a technique known as reinforcement learning in order to learn to play video games well (see also this followup).

The idea is to use the convolutional network to simplify the pixel data from the game screen, turning it into a simpler set of features, which can be used to decide which action to take: 'go left', 'go down', 'fire', and so on.

What is particularly interesting is that a single network learned to play seven different classic video games pretty well, outperforming human experts on three of the games.

But looking past the surface gloss, consider that this system is taking raw pixel data - it doesn't even know the game rules!

Google CEO Larry Page once described the perfect search engine as 'understanding exactly what [your queries] mean and giving you back exactly what you want'.

In this vision, instead of responding to users' literal queries, search will use machine learning to take vague user input, discern precisely what was meant, and take action on the basis of those insights.

Over the next few decades, thousands of companies will build products which use machine learning to make user interfaces that can tolerate imprecision, while discerning and acting on the user's true intent.

Inspired user interface design is hard, and I expect many companies will take powerful machine learning technology and use it to build insipid user interfaces.

Machine learning, data science, and the virtuous circle of innovation: Of course, machine learning isn't just being used to build intention-driven interfaces.

But I do want to mention one consequence of this fashion that is not so often remarked: over the long run it's possible the biggest breakthrough in machine learning won't be any single conceptual breakthrough.

If a company can invest 1 dollar in machine learning research and get 1 dollar and 10 cents back reasonably rapidly, then a lot of money will end up in machine learning research.

So, for example, Conway's law suggests that the design of a Boeing 747 aircraft will mirror the extended organizational structure of Boeing and its contractors at the time the 747 was designed.

If the application's dashboard is supposed to be integrated with some machine learning algorithm, the person building the dashboard better be talking to the company's machine learning expert.

I won't define 'deep ideas' precisely, but loosely I mean the kind of idea which is the basis for a rich field of enquiry.

The backpropagation algorithm and the germ theory of disease are both good examples. Think of things like the germ theory of disease, for instance, or the understanding of how antibodies work, or the understanding that the heart, lungs, veins and arteries form a complete cardiovascular system.

Instead of a monolith, we have fields within fields within fields, a complex, recursive, self-referential social structure, whose organization mirrors the connections between our deepest insights.

Deep learning is the latest super-special weapon I've heard used in such arguments. (Interestingly, often not by leading experts in deep learning, who have been quite restrained.)

And there is paper after paper leveraging the same basic set of ideas: using stochastic gradient descent (or a close variation) to optimize a cost function.


