# AI News, Must Know Tips/Tricks in Deep Neural Networks

## Must Know Tips/Tricks in Deep Neural Networks

Deep Neural Networks, especially Convolutional Neural Networks (CNN), allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.

These methods have dramatically improved the state-of-the-arts in visual object recognition, object detection, text recognition and many other domains such as drug discovery and genomics.

Thus, they collected and concluded many implementation details for DCNNs.  Here they will introduce these extensive implementation details, i.e., tricks or tips, for building and training your own deep networks.

## CS231n Convolutional Neural Networks for Visual Recognition

Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections.

In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32*32*3 = 3072 weights.

(Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively).

Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension.

During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position.

Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network.

If you’re a fan of the brain/neuron analogies, every entry in the 3D output volume can also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with all neurons to the left and right spatially (since these numbers all result from applying the same filter).

It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.

If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter).

We discuss these next: We can compute the spatial size of the output volume as a function of the input volume size ($$W$$), the receptive field size of the Conv Layer neurons ($$F$$), the stride with which they are applied ($$S$$), and the amount of zero padding used ($$P$$) on the border.

In general, setting zero padding to be $$P = (F - 1)/2$$ when the stride is $$S = 1$$ ensures that the input volume and output volume will have the same size spatially.

For example, when the input has size $$W = 10$$, no zero-padding is used $$P = 0$$, and the filter size is $$F = 3$$, then it would be impossible to use stride $$S = 2$$, since $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$, i.e.

Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of $$K = 96$$, the Conv layer output volume had size [55x55x96].

As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer.

It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: That if one feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2).

With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique set of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases).

In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice.

Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a convolution of the neuron’s weights with the input volume (Hence the name: Convolutional Layer).

The activation map in the output volume (call it V), would then look as follows (only some of the elements are computed in this example): Remember that in numpy, the operation * above denotes elementwise multiplication between the arrays.

To construct a second activation map in the output volume, we would have: where we see that we are indexing into the second depth dimension in V (at index 1) because we are computing the second activation map, and that a different set of parameters (W1) is now used.

Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows.

The input volume is of size $$W_1 = 5, H_1 = 5, D_1 = 3$$, and the CONV layer parameters are $$K = 2, F = 3, S = 2, P = 1$$.

The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.

A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows: This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times in X_col.

For example, if you stack two 3x3 CONV layers on top of each other then you can convince yourself that the neurons on the 2nd layer are a function of a 5x5 patch of the input (we would say that the effective receptive field of these neurons is 5x5).

The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2 downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.

More generally, the pooling layer: It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: A pooling layer with $$F = 3, S = 2$$ (also called overlapping pooling), and more commonly $$F = 2, S = 2$$.

Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.

Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an AlexNet architecture that we’ll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7).

Note that instead of a single vector of class scores of size [1x1x1000], we’re now getting an entire 6x6 array of class scores across the 384x384 image.

Evaluating the original ConvNet (with FC layers) independently across 224x224 crops of the 384x384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time.

This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores.

For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height.

Second, if we suppose that all the volumes have $$C$$ channels, then it can be seen that the single 7x7 CONV layer would contain $$C \times (7 \times 7 \times C) = 49 C^2$$ parameters, while the three 3x3 CONV layers would only contain $$3 \times (C \times (3 \times 3 \times C)) = 27 C^2$$ parameters.

I like to summarize this point as “don’t be a hero”: Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pretrained model and finetune it on your data.

3x3 or at most 5x5), using a stride of $$S = 1$$, and crucially, padding the input volume with zeros in such way that the conv layer does not alter the spatial dimensions of the input.

In an alternative scheme where we use strides greater than 1 or don’t zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters “work out”, and that the ConvNet architecture is nicely and symmetrically wired.

If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be “washed away” too quickly.

For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 would create three activation volumes of size [224x224x64].

The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding).

We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights: As is common with Convolutional Networks, notice that most of the memory (and also compute time) is used in the early CONV layers, and that most of the parameters are in the last FC layers.

There are three major sources of memory to keep track of: Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB.

Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB.

## Must Know Tips/Tricks in Deep Neural Networks (by Xiu-Shen Wei)

We assume you already know the basic knowledge of deep learning, and here we will present the implementation details (tricks or tips) in Deep Neural Networks, especially CNN for image-related tasks, mainly in eight aspects: 1) data augmentation;

Since deep networks need to be trained on a huge number of training images to achieve satisfactory performance, if the original image data set contains limited training images, it is better to do data augmentation to boost the performance.

It only makes sense to apply this pre-processing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm.

In case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional pre-processing step.

Then, you can compute the covariance matrix that tells us about the correlation structure in the data: After that, you decorrelate the data by projecting the original (but zero-centered) data into the eigenbasis: The last transformation is whitening, which takes the data in the eigenbasis and divides every dimension by the eigenvalue to normalize the scale: Note that here it adds 1e-5 (or a small constant) to prevent division by zero.

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

It turns out that you can normalize the variance of each neuron's output to 1 by scaling its weight vector by the square root of its fan-in (i.e., its number of inputs), which is as follows: where “randn” is the aforementioned Gaussian and “n” is the number of its inputs.

[4] derives an initialization specifically for ReLUs, reaching the conclusion that the variance of neurons in the network should be as: which is the current recommendation for use in practice, as discussed in [4].

There are several ways of controlling the capacity of Neural Networks to prevent overfitting: Finally, from the tips above, you can get the satisfactory settings (e.g., data processing, architectures choices and details, etc.) for your own deep networks.

It is well known that an ensemble is usually significantly more accurate than a single learner, and ensemble methods have already achieved great success in many real-world tasks.

As discussed in a recent technique report [10], when deep CNNs are trained on these imbalanced training sets, the results show that imbalanced training data can potentially have a severely negative impact on overall performance in deep networks.

Because the original cultural event images are imbalanced, we merely extract crops from the classes which have a small number of training images, which on one hand can supply diverse data sources, and on the other hand can solve the class-imbalanced problem.

At the beginning of fine-tuning on your data set, you firstly fine-tune on the classes which have a large number of training samples (images/crops), and secondly, continue to fine-tune but on the classes with limited number samples.

## Applied Deep Learning - Part 4: Convolutional Neural Networks

In Part 2 we applied deep learning to real-world datasets, covering the 3 most commonly encountered problems as case studies: binary classification, multiclass classification and regression.

The code for this article is available here as a Jupyter notebook, feel free to download and try it out yourself.

The recent surge of interest in deep learning is due to the immense popularity and effectiveness of convnets.

The main advantage of CNN compared to its predecessors is that it automatically detects the important features without any human supervision.

We are dealing with a very powerful and efficient model which performs automatic feature extraction to achieve superhuman accuracy (yes CNN models now do image classification better than humans).

We perform a series convolution + pooling operations, followed by a number of fully connected layers.

In our case the convolution is applied on the input data using a convolution filter to produce a feature map.

Here the filter is at the top left, the output of the convolution operation “4” is shown in the resulting feature map.

In reality an image is represented as a 3D matrix with dimensions of height, width and depth, where depth corresponds to color channels (RGB).

A convolution filter has a specific height and width, like 3x3 or 5x5, and by design it covers the entire depth of its input so it needs to be 3D as well.

We perform multiple convolutions on an input, each using a different filter and resulting in a distinct feature map.

Let’s say we have a 32x32x3 image and we use a filter of size 5x5x3 (note that the depth of the convolution filter matches the depth of the image, both being 3).

When the filter is at a particular location it covers a small volume of the input, and we perform the convolution operation described above.

We slide the filter over the input like above and perform the convolution at every location aggregating the result in a feature map.

If we used 10 different filters we would have 10 feature maps of size 32x32x1 and stacking them along the depth dimension would give us the final output of the convolution layer: a volume of size 32x32x10, shown as the large blue box on the right.

Note that the height and width of the feature map are unchanged and still 32, it’s due to padding and we will elaborate on that shortly.

The animation shows the sliding operation at 4 locations, but in reality it’s performed over the entire input.

But keep in mind that any type of convolution involves a relu operation, without that the network won’t achieve its true potential.

We see that the size of the feature map is smaller than the input, because the convolution filter needs to be contained in the input.

Padding is commonly used in CNN to preserve the size of the feature maps, otherwise they would shrink at each layer, which is not desirable.

The 3D convolution figures we saw above used padding, that’s why the height and width of the feature map was the same as the input (both 32x32), and only the depth changed.

Pooling layers downsample each feature map independently, reducing the height and width, keeping the depth intact.

The most common type of pooling is max pooling which just takes the max value in the pooling window.

If the input to the pooling layer has the dimensionality 32x32x10, using the same pooling parameters described above, the result will be a 16x16x10 feature map.

Both the height and width of the feature map are halved, but the depth doesn’t change because pooling works independently on each depth slice the input.

We have 4 important hyperparameters to decide on: After the convolution + pooling layers we add a couple of fully connected layers to wrap up the CNN architecture.

Remember that the output of both convolution and pooling layers are 3D volumes, but a fully connected layer expects a 1D vector of numbers.

So we flatten the output of the final pooling layer to a vector and that becomes the input to the fully connected layer.

For example given an image, the convolution layer detects features such as two eyes, long ears, four legs, a short tail and so on.

The fully connected layers then act as a classifier on top of these features, and assign a probability for the input image being a dog.

The first layers detect edges, the next layers combine them to detect shapes, to following layers merge this information to infer that this is a nose.

The fully connected layers learn how to use these features produced by convolutions in order to correctly classify the images.

We will use the following architecture: 4 convolution + pooling layers, followed by 2 fully connected layers.

There are 4 new methods we haven’t seen before: Dropout is by far the most popular regularization technique for deep neural networks.

Even the state-of-the-art models which have 95% accuracy get a 2% accuracy boost just by adding dropout, which is a fairly substantial gain at that level.

The dropped-out neurons are resampled with probability p at every training step, so a dropped out neuron at one step can be active at the next one.

The hyperparameter p is called the dropout-rate and it’s typically a number around 0.5, corresponding to 50% of the neurons being dropped out.

The reason is that dropout prevents the network to be too dependent on a small number of neurons, and forces every neuron to be able to operate independently.

This might sound familiar from constraining the code size of the autoencoder in Part 3, in order to learn more intelligent representations.

But if every morning you tossed a coin to decide whether you will go to work or not, then your coworkers will need to adapt.

We will take a look at loss and accuracy curves, comparing training set performance against the validation set.

Training loss keeps going down but the validation loss starts increasing after around epoch 10.

The model is memorizing the training data, but it’s failing to generalize to new instances, and that’s why the validation performance goes worse.

But fortunately there is a solution to this problem which enables us to train deep models on small datasets, and it’s called data augmentation.

The common case in most machine learning applications, especially in image classification tasks is that obtaining new training data is not easy.

It enriches or “augments” the training data by generating new examples via random transformation of existing ones.

We need to generate realistic images, and the transformations should be learnable, simply adding noise won’t help.

Also, data augmentation is only performed on the training data, we don’t touch the validation or test set.

These are new training instances, applying transformations on the original image doesn’t change the fact that this is still a cat image.

The code for the model definition will not change at all, since we’re not changing the architecture of our model.

And validation accuracy jumped from 73% with no data augmentation to 81% with data augmentation, 11% improvement.

Second, we made the model transformation invariant, meaning the model saw a lot of shifted/rotated/scaled images so it’s able to recognize them better.

There are much more complicated models which perform better, for example Microsoft’s ResNet model was the winner of 2015 ImageNet challenge with 3.6% error rate, but the model has 152 layers!

As we discussed above the height and width correspond to the dimensions of the feature map, and each depth channel is a distinct feature map encoding independent features.

Instead of looking at a single feature map, it would be more interesting to visualize multiple feature maps from a convolution layer.

So let’s visualize the feature maps corresponding to the first convolution of each block, the red arrows in the figure below.

There is one catch though, we won’t actually visualize the filters themselves, but instead we will display the patterns each filter maximally responds to.

Lower layers encode/detect simple structures, as we go deeper the layers build on top of each other and learn to encode more complex patterns.

very thorough online free book about deep learning can be found here, with the CNN section available here.

## Convolutional Neural Networks (CNN): Softmax &#038; Cross-Entropy

That being said, learning about the softmax and cross-entropy functions can give you a tighter grasp of this section's topic.

You can forget about all the mathematical jargon in that definition for now, but what we learn from this is that only by including the softmax function are the values of both classes processed and made to add up to 1.

We used this function in order to assess the performance of our network, and by working to minimize this mean squared error we would practically be optimizing the network.

The mean squared error function can be used with convolutional neural networks, but an even better option would be applying the cross-entropy function after you had entered the softmax function.

As you can see here, for the first two images, both neural networks came up with similar end predictions, yet the probabilities were calculated with much less accuracy by the second neural network.

For the third image which shows that bizarre lion-looking creature, both neural networks were wrong in their predictions since this animal actually happens to be a dog, but the first neural network was more leaning towards the right prediction than the second one.

Mean Squared Error The way you derive the mean squared error is by calculating the squared values of the network's errors and then get the average of these across the table.

We'll examine here one of the core advantages, and if you want to learn about the remaining reasons for using cross-entropy, you can do so from the material you'll find mentioned at the end of this tutorial.

In both cases, the degree of change is similar in absolute terms, but in relative terms, the cross-entropy function makes your network much more capable of using this change as its guide in the desired direction, whereas the mean squared error function makes very little use of it.

You have a lot of extra material to sift through, so you can do that if you want to gain more insight on the topics we covered, or else, you can just move to the next section of this deep learning course.

Convolutional Neural Networks (CNNs) explained

CNNs for deep learning. Blog for this vid! #21 in Machine Leaning / Deep Learning for Programmers Playlist ..

Deep Learning using Matlab

Deep Learning using Matlab - In this lesson, we will learn how to train a deep neural network using Matlab. It is divided into three sections - 1) Challenges of ...

Lecture 13: Convolutional Neural Networks

Lecture 13 provides a mini tutorial on Azure and GPUs followed by research highlight "Character-Aware Neural Language Models." Also covered are CNN ...

“DON’T TOUCH ME!” CNN COMMENTATOR FLIPS OUT ON TRUMP SUPPORTER! FULL SNOWFLAKE MELTDOWN!

Sub for more: | Lindsey Bruce for Liberty Writers reports, We have all dealt with them before, “tolerant” liberals who go off the deep ..

CMU Neural Nets for NLP 2017 (5): Convolutional Networks for Text

This lecture (by Graham Neubig) for CMU CS 11-747, Neural Networks for NLP (Fall 2017) covers: * Bag of Words, Bag of n-grams, and Convolution ...

WATCH as Trump Supporters In Iowa Blow Every Single One Of CNN’s Narratives Out Of The Water

ENTER to WIN at Get Next News Email Alerts! Subscribe to the channel: .

Lecture 12 | Visualizing and Understanding

In Lecture 12 we discuss methods for visualizing and understanding the internal mechanisms of convolutional networks. We also discuss the use of ...

Your choice of Deep Net - Ep. 4 (Deep Learning SIMPLIFIED)

Deep Nets come in a large variety of structures and sizes, so how do you decide which kind to use? The answer depends on whether you are classifying objects ...

Gradient descent, how neural networks learn | Deep learning, chapter 2

Subscribe for more (part 3 will be on backpropagation): Funding provided by Amplify Partners and viewers like you

Teaching my computer to give me friends (I... I mean images!) (convolutional neural networks)

The 1.5-month-long hiatus is over! Note to self: Never lip-sync things that don't need to be lip-synced. It takes forever Here's HyperGAN, the tool I used to create ...