# AI News, Neural Networks for Beginners: Popular Types and Applications

## Neural Networks for Beginners: Popular Types and Applications

Recently there has been a great buzz around the words “neural network” in the field of computer science and it has attracted a great deal of attention from many people.

Each neuron multiplies an initial value by some weight, sums results with other values coming into the same neuron, adjusts the resulting number by the neuron’s bias, and then normalizes the output with an activation function.

key feature of neural networks is an iterative learning process in which records (rows) are presented to the network one at a time, and the weights associated with the input values are adjusted each time.

The network processes the records in the “training set” one at a time, using the weights and functions in the hidden layers, then compares the resulting outputs against the desired outputs.

We know that after training, each layer extracts higher and higher-level features of the dataset (input), until the final layer essentially makes a decision on what the input features refer to.

This approach is based on the observation that random initialization is a bad idea and that pre-training each layer with an unsupervised learning algorithm can allow for better initial weights.

stochastic corruption process randomly sets some of the inputs to zero, forcing the denoising autoencoder to predict missing (corrupted) values for randomly selected subsets of missing patterns.

Also, MLP neural network prediction accuracy depended greatly on neural network architecture, pre-processing of data, and the type of problem for which the network was developed.

The detector evaluates the input image at low resolution to quickly reject non-face regions and carefully process the challenging regions at higher resolution for accurate detection.

## Deep learning

In the last chapter we learned that deep neural networks are often much harder to train than shallow neural networks.

We'll also look at the broader picture, briefly reviewing recent progress on using deep nets for image recognition, speech recognition, and other applications.

We'll work through a detailed example - code and all - of using convolutional nets to solve the problem of classifying handwritten digits from the MNIST data set:

As we go we'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of networks, and others.

We conclude our discussion of image recognition with a survey of some of the spectacular recent progress using networks (particularly convolutional nets) to do image recognition.

We'll briefly survey other models of neural networks, such as recurrent neural nets and long short-term memory units, and how such models can be applied to problems in speech recognition, natural language processing, and other areas.

And we'll speculate about the future of neural networks and deep learning, ranging from ideas like intention-driven user interfaces, to the role of deep learning in artificial intelligence.

For the $28 \times 28$ pixel images we've been using, this means our network has $784$ ($= 28 \times 28$) input neurons.

Our earlier networks work pretty well: we've obtained a classification accuracy better than 98 percent, using training and test data from the MNIST handwritten digit data set.

But the seminal paper establishing the modern subject of convolutional networks was a 1998 paper, 'Gradient-based learning applied to document recognition', by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.

LeCun has since made an interesting remark on the terminology for convolutional nets: 'The [biological] neural inspiration in models like convolutional nets is very tenuous.

That's why I call them 'convolutional nets' not 'convolutional neural nets', and why we call the nodes 'units' and not 'neurons' '.

Despite this remark, convolutional nets use many of the same ideas as the neural networks we've studied up to now: ideas such as backpropagation, gradient descent, regularization, non-linear activation functions, and so on.

In a convolutional net, it'll help to think instead of the inputs as a $28 \times 28$ square of neurons, whose values correspond to the $28 \times 28$ pixel intensities we're using as inputs:

To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a $5 \times 5$ region, corresponding to $25$ input pixels.

So, for a particular hidden neuron, we might have connections that look like this: That region in the input image is called the local receptive field for the hidden neuron.

To illustrate this concretely, let's start with a local receptive field in the top-left corner: Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:

Note that if we have a $28 \times 28$ input image, and $5 \times 5$ local receptive fields, then there will be $24 \times 24$ neurons in the hidden layer.

This is because we can only move the local receptive field $23$ neurons across (or $23$ neurons down), before colliding with the right-hand side (or bottom) of the input image.

In this chapter we'll mostly stick with stride length $1$, but it's worth knowing that people sometimes experiment with different stride lengths* *As was done in earlier chapters, if we're interested in trying different stride lengths then we can use validation data to pick out the stride length which gives the best performance.

The same approach may also be used to choose the size of the local receptive field - there is, of course, nothing special about using a $5 \times 5$ local receptive field.

In general, larger local receptive fields tend to be helpful when the input images are significantly larger than the $28 \times 28$ pixel MNIST images..

In other words, for the $j, k$th hidden neuron, the output is: \begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right).

Informally, think of the feature detected by a hidden neuron as the kind of input pattern that will cause the neuron to activate: it might be an edge in the image, for instance, or maybe some other type of shape.

To see why this makes sense, suppose the weights and bias are such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive field.

To put it in slightly more abstract terms, convolutional networks are well adapted to the translation invariance of images: move a picture of a cat (say) a little ways, and it's still an image of a cat* *In fact, for the MNIST digit classification problem we've been studying, the images are centered and size-normalized.

One of the early convolutional networks, LeNet-5, used $6$ feature maps, each associated to a $5 \times 5$ local receptive field, to recognize MNIST digits.

Let's take a quick peek at some of the features which are learned* *The feature maps illustrated come from the final convolutional network we train, see here.:

Each map is represented as a $5 \times 5$ block image, corresponding to the $5 \times 5$ weights in the local receptive field.

By comparison, suppose we had a fully connected first layer, with $784 = 28 \times 28$ input neurons, and a relatively modest $30$ hidden neurons, as we used in many of the examples earlier in the book.

That, in turn, will result in faster training for the convolutional model, and, ultimately, will help us build deep networks using convolutional layers.

Incidentally, the name convolutional comes from the fact that the operation in Equation (125)\begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right) \nonumber\end{eqnarray}$('#margin_223550267310_reveal').click(function() {$('#margin_223550267310').toggle('slow', function() {});});

A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is called a convolution operation.

In particular, I'm using 'feature map' to mean not the function computed by the convolutional layer, but rather the activation of the hidden neurons output from the layer.

In max-pooling, a pooling unit simply outputs the maximum activation in the $2 \times 2$ input region, as illustrated in the following diagram:

Note that since we have $24 \times 24$ neurons output from the convolutional layer, after pooling we have $12 \times 12$ neurons.

So if there were three feature maps, the combined convolutional and max-pooling layers would look like:

Here, instead of taking the maximum activation of a $2 \times 2$ region of neurons, we take the square root of the sum of the squares of the activations in the $2 \times 2$ region.

It's similar to the architecture we were just looking at, but has the addition of a layer of $10$ output neurons, corresponding to the $10$ possible values for MNIST digits ('0', '1', '2', etc):

Problem Backpropagation in a convolutional network The core equations of backpropagation in a network with fully-connected layers are (BP1)\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}$('#margin_511945174620_reveal').click(function() {$('#margin_511945174620').toggle('slow', function() {});});-(BP4)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}$('#margin_896578903066_reveal').click(function() {$('#margin_896578903066').toggle('slow', function() {});});

Suppose we have a network containing a convolutional layer, a max-pooling layer, and a fully-connected output layer, as in the network discussed above.

The program we'll use to do this is called network3.py, and it's an improved version of the programs network.py and network2.py developed in earlier chapters* *Note also that network3.py incorporates ideas from the Theano library's documentation on convolutional neural nets (notably the implementation of LeNet-5), from Misha Denil's implementation of dropout, and from Chris Olah..

But now that we understand those details, for network3.py we're going to use a machine learning library known as Theano* *See Theano: A CPU and GPU Math Expression Compiler in Python, by James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Ravzan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio (2010).

The examples which follow were run using Theano 0.6* *As I release this chapter, the current version of Theano has changed to version 0.7.

Note that the code in the script simply duplicates and parallels the discussion in this section.Note also that throughout the section I've explicitly specified the number of training epochs.

In practice, it's worth using early stopping, that is, tracking accuracy on the validation set, and stopping training when we are confident the validation accuracy has stopped improving.: &gt;&gt;&gt;

Using the validation data to decide when to evaluate the test accuracy helps avoid overfitting to the test data (see this earlier discussion of the use of validation data).

Your results may vary slightly, since the network's weights and biases are randomly initialized* *In fact, in this experiment I actually did three separate runs training a network with this architecture.

This $97.80$ percent accuracy is close to the $98.04$ percent accuracy obtained back in Chapter 3, using a similar network architecture and learning hyper-parameters.

Second, while the final layer in the earlier network used sigmoid activations and the cross-entropy cost function, the current network uses a softmax final layer, and the log-likelihood cost function.

I haven't made this switch for any particularly deep reason - mostly, I've done it because softmax plus log-likelihood cost is more common in modern image classification networks.

In this architecture, we can think of the convolutional and pooling layers as learning about local spatial structure in the input training image, while the later, fully-connected layer learns at a more abstract level, integrating global information from across the entire image.

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2)),

validation_data, test_data)

Can we improve on the $98.78$ percent classification accuracy?

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2)),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2)),

validation_data, test_data)

In fact, you can think of the second convolutional-pooling layer as having as input $12 \times 12$ 'images', whose 'pixels' represent the presence (or absence) of particular localized features in the original input image.

The output from the previous layer involves $20$ separate feature maps, and so there are $20 \times 12 \times 12$ inputs to the second convolutional-pooling layer.

In fact, we'll allow each neuron in this layer to learn from all $20 \times 5 \times 5$ input neurons in its local receptive field.

More informally: the feature detectors in the second convolutional-pooling layer have access to all the features from the previous layer, but only within their particular local receptive field* *This issue would have arisen in the first layer if the input images were in color.

In that case we'd have 3 input features for each pixel, corresponding to red, green and blue channels in the input image.

So we'd allow the feature detectors to have access to all color information, but only within a given local receptive field..

Problem Using the tanh activation function Several times earlier in the book I've mentioned arguments that the tanh function may be a better activation function than the sigmoid function.

Try training the network with tanh activations in the convolutional and fully-connected layers* *Note that you can pass activation_fn=tanh as a parameter to the ConvPoolLayer and FullyConnectedLayer classes..

Try plotting the per-epoch validation accuracies for both tanh- and sigmoid-based networks, all the way out to $60$ epochs.

If your results are similar to mine, you'll find the tanh networks train a little faster, but the final accuracies are very similar.

Can you get a similar training speed with the sigmoid, perhaps by changing the learning rate, or doing some rescaling* *You may perhaps find inspiration in recalling that $\sigma(z) = (1+\tanh(z/2))/2$.?

Try a half-dozen iterations on the learning hyper-parameters or network architecture, searching for ways that tanh may be superior to the sigmoid.

Personally, I did not find much advantage in switching to tanh, although I haven't experimented exhaustively, and perhaps you may find a way.

In any case, in a moment we will find an advantage in switching to the rectified linear activation function, and so we won't go any deeper into the use of tanh.

Using rectified linear units: The network we've developed at this point is actually a variant of one of the networks used in the seminal 1998 paper* *'Gradient-based learning applied to document recognition', by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998).

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

However, across all my experiments I found that networks based on rectified linear units consistently outperformed networks based on sigmoid activation functions.

The reason for that recent adoption is empirical: a few people tried rectified linear units, often on the basis of hunches or heuristic arguments* *A common justification is that $\max(0, z)$ doesn't saturate in the limit of large $z$, unlike sigmoid neurons, and this helps rectified linear units continue learning.

A simple way of expanding the training data is to displace each training image by a single pixel, either up one pixel, down one pixel, left one pixel, or right one pixel.

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

Just to remind you of the flavour of some of the results in that earlier discussion: in 2003 Simard, Steinkraus and Platt* *Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, by Patrice Simard, Dave Steinkraus, and John Platt (2003).

improved their MNIST performance to $99.6$ percent using a neural network otherwise very similar to ours, using two convolutional-pooling layers, followed by a hidden fully-connected layer with $100$ neurons.

There were a few differences of detail in their architecture - they didn't have the advantage of using rectified linear units, for instance - but the key to their improved performance was expanding the training data.

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

Using this, we obtain an accuracy of $99.60$ percent, which is a substantial improvement over our earlier results, especially our main benchmark, the network with $100$ hidden neurons, where we achieved $99.37$ percent.

In fact, I tried experiments with both $300$ and $1,000$ hidden neurons, and obtained (very slightly) better validation performance with $1,000$ hidden neurons.

Why we only applied dropout to the fully-connected layers: If you look carefully at the code above, you'll notice that we applied dropout only to the fully-connected section of the network, not to the convolutional layers.

But apart from that, they used few other tricks, including no convolutional layers: it was a plain, vanilla network, of the kind that, with enough patience, could have been trained in the 1980s (if the MNIST data set had existed), given enough computing power.

In particular, we saw that the gradient tends to be quite unstable: as we move from the output layer to earlier layers the gradient tends to either vanish (the vanishing gradient problem) or explode (the exploding gradient problem).

In particular, in our final experiments we trained for $40$ epochs using a data set $5$ times larger than the raw MNIST training data.

I've occasionally heard people adopt a deeper-than-thou attitude, holding that if you're not keeping-up-with-the-Joneses in terms of number of hidden layers, then you're not really doing deep learning.

To speed that process up you may find it helpful to revisit Chapter 3's discussion of how to choose a neural network's hyper-parameters, and perhaps also to look at some of the further reading suggested in that section.

Here's the code (discussion below)* *Note added November 2016: several readers have noted that in the line initializing self.w, I set scale=np.sqrt(1.0/n_out), when the arguments of Chapter 3 suggest a better initialization may be scale=np.sqrt(1.0/n_in).

np.random.normal(

loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),

dtype=theano.config.floatX),

dtype=theano.config.floatX),

I use the name inpt rather than input because input is a built-in function in Python, and messing with built-ins tends to cause unpredictable behavior and difficult-to-diagnose bugs.

So self.inpt_dropout and self.output_dropout are used during training, while self.inpt and self.output are used for all other purposes, e.g., evaluating accuracy on the validation and test data.

prev_layer, layer = self.layers[j-1], self.layers[j]

prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)

Now, this isn't a Theano tutorial, and so we won't get too deeply into what it means that these are symbolic variables* *The Theano documentation provides a good introduction to Theano.

0.5*lmbda*l2_norm_squared/num_training_batches

self.x:

training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

iteration = num_training_batches*epoch+minibatch_index

if iteration

print(&quot;Training mini-batch number {0}&quot;.format(iteration))

cost_ij = train_mb(minibatch_index)

if (iteration+1)

validation_accuracy = np.mean(

[validate_mb_accuracy(j) for j in xrange(num_validation_batches)])

print(&quot;Epoch {0}: validation accuracy {1:.2

epoch, validation_accuracy))

if validation_accuracy &gt;= best_validation_accuracy:

print(&quot;This is the best validation accuracy to date.&quot;)

best_validation_accuracy = validation_accuracy

best_iteration = iteration

if test_data:

test_accuracy = np.mean(

[test_mb_accuracy(j) for j in xrange(num_test_batches)])

print(&#39;The corresponding test accuracy is {0:.2

test_accuracy))

0.5*lmbda*l2_norm_squared/num_training_batches

In these lines we symbolically set up the regularized log-likelihood cost function, compute the corresponding derivatives in the gradient function, as well as the corresponding parameter updates.

With all these things defined, the stage is set to define the train_mb function, a Theano symbolic function which uses the updates to update the Network parameters, given a mini-batch index.

The remainder of the SGD method is self-explanatory - we simply iterate over the epochs, repeatedly training the network on mini-batches of training data, and computing the validation and test accuracies.

prev_layer, layer = self.layers[j-1], self.layers[j]

prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)

0.5*lmbda*l2_norm_squared/num_training_batches

self.x:

training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

iteration = num_training_batches*epoch+minibatch_index

if iteration % 1000 == 0:

print(&quot;Training mini-batch number {0}&quot;.format(iteration))

cost_ij = train_mb(minibatch_index)

if (iteration+1) % num_training_batches == 0:

validation_accuracy = np.mean(

[validate_mb_accuracy(j) for j in xrange(num_validation_batches)])

print(&quot;Epoch {0}: validation accuracy {1:.2%}&quot;.format(

epoch, validation_accuracy))

if validation_accuracy &gt;= best_validation_accuracy:

print(&quot;This is the best validation accuracy to date.&quot;)

best_validation_accuracy = validation_accuracy

best_iteration = iteration

if test_data:

test_accuracy = np.mean(

[test_mb_accuracy(j) for j in xrange(num_test_batches)])

print(&#39;The corresponding test accuracy is {0:.2%}&#39;.format(

test_accuracy))

activation_fn=sigmoid):

of filters, the number of input feature maps, the filter height, and the

poolsize is a tuple of length 2, whose entries are the y and

np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),

dtype=theano.config.floatX),

np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),

dtype=theano.config.floatX),

pooled_out + self.b.dimshuffle(&#39;x&#39;, 0, &#39;x&#39;, &#39;x&#39;))

np.random.normal(

loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),

dtype=theano.config.floatX),

dtype=theano.config.floatX),

Earlier in the book we discussed an automated way of selecting the number of epochs to train for, known as early stopping.

Hint: After working on this problem for a while, you may find it useful to see the discussion at this link.

Earlier in the chapter I described a technique for expanding the training data by applying (small) rotations, skewing, and translation.

Note: Unless you have a tremendous amount of memory, it is not practical to explicitly generate the entire expanded data set.

Show that rescaling all the weights in the network by a constant factor $c > 0$ simply rescales the outputs by a factor $c^{L-1}$, where $L$ is the number of layers.

Still, considering the problem will help you better understand networks containing rectified linear units.

Note: The word good in the second part of this makes the problem a research problem.

In 1998, the year MNIST was introduced, it took weeks to train a state-of-the-art workstation to achieve accuracies substantially worse than those we can achieve using a GPU and less than an hour of training.

With that said, the past few years have seen extraordinary improvements using deep nets to attack extremely difficult image recognition tasks.

They will identify the years 2011 to 2015 (and probably a few years beyond) as a time of huge breakthroughs, driven by deep convolutional nets.

The 2012 LRMD paper: Let me start with a 2012 paper* *Building high-level features using large scale unsupervised learning, by Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng (2012).

Note that the detailed architecture of the network used in the paper differed in many details from the deep convolutional networks we've been studying.

Details about ImageNet are available in the original ImageNet paper, ImageNet: a large-scale hierarchical image database, by Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei (2009).:

If you're looking for a challenge, I encourage you to visit ImageNet's list of hand tools, which distinguishes between beading planes, block planes, chamfer planes, and about a dozen other types of plane, amongst other categories.

The 2012 KSH paper: The work of LRMD was followed by a 2012 paper of Krizhevsky, Sutskever and Hinton (KSH)* *ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E.

By this top-$5$ criterion, KSH's deep convolutional network achieved an accuracy of $84.7$ percent, vastly better than the next-best contest entry, which achieved an accuracy of $73.8$ percent.

The input layer contains $3 \times 224 \times 224$ neurons, representing the RGB values for a $224 \times 224$ image.

The feature maps are split into two groups of $48$ each, with the first $48$ feature maps residing on one GPU, and the second $48$ feature maps residing on the other GPU.

Their respectives parameters are: (3) $384$ feature maps, with $3 \times 3$ local receptive fields, and $256$ input channels;

A Theano-based implementation has also been developed* *Theano-based large-scale visual recognition with multiple GPUs, by Weiguang Ding, Ruoyan Wang, Fei Mao, and Graham Taylor (2014)., with the code available here.

As in 2012, it involved a training set of $1.2$ million images, in $1,000$ categories, and the figure of merit was whether the top $5$ predictions included the correct category.

The winning team, based primarily at Google* *Going deeper with convolutions, by Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich (2014)., used a deep convolutional network with $22$ layers of neurons.

GoogLeNet achieved a top-5 accuracy of $93.33$ percent, a giant improvement over the 2013 winner (Clarifai, with $88.3$ percent), and the 2012 winner (KSH, with $84.7$ percent).

In 2014 a team of researchers wrote a survey paper about the ILSVRC competition* *ImageNet large scale visual recognition challenge, by Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C.

...the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while.

In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself...

Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort.

In other words, an expert human, working painstakingly, was with great effort able to narrowly beat the deep neural network.

In fact, Karpathy reports that a second human expert, trained on a smaller sample of images, was only able to attain a $12.0$ percent top-5 error rate, significantly below GoogLeNet's performance.

One encouraging practical set of results comes from a team at Google, who applied deep convolutional networks to the problem of recognizing street numbers in Google's Street View imagery* *Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks, by Ian J.

And they go on to make the broader claim: 'We believe with this model we have solved [optical character recognition] for short sequences [of characters] for many applications.'

For instance, a 2013 paper* *Intriguing properties of neural networks, by Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus (2013) showed that deep networks may suffer from what are effectively blind spots.

The existence of the adversarial negatives appears to be in contradiction with the network’s ability to achieve high generalization performance.

The explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found near virtually every test case.

For example, one recent paper* *Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, by Anh Nguyen, Jason Yosinski, and Jeff Clune (2014).

shows that given a trained network it's possible to generate images which look to a human like white noise, but which the network classifies as being in a known category with a very high degree of confidence.

If you read the neural networks literature, you'll run into many ideas we haven't discussed: recurrent neural networks, Boltzmann machines, generative models, transfer learning, reinforcement learning, and so on, on and on $\ldots$ and on!

One way RNNs are currently being used is to connect neural networks more closely to traditional ways of thinking about algorithms, ways of thinking based on concepts such as Turing machines and (conventional) programming languages.

A 2014 paper developed an RNN which could take as input a character-by-character description of a (very, very simple!) Python program, and use that description to predict the output.

For example, an approach based on deep nets has achieved outstanding results on large vocabulary continuous speech recognition.

And another system based on deep nets has been deployed in Google's Android operating system (for related technical work, see Vincent Vanhoucke's 2012-2015 papers).

Many other ideas used in feedforward nets, ranging from regularization techniques to convolutions to the activation and cost functions used, are also useful in recurrent nets.

Deep belief nets, generative models, and Boltzmann machines: Modern interest in deep learning began in 2006, with papers explaining how to train a type of neural network known as a deep belief network (DBN)* *See A fast learning algorithm for deep belief nets, by Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh (2006), as well as the related work in Reducing the dimensionality of data with neural networks, by Geoffrey Hinton and Ruslan Salakhutdinov (2006)..

A generative model like a DBN can be used in a similar way, but it's also possible to specify the values of some of the feature neurons and then 'run the network backward', generating values for the input activations.

And the ability to do unsupervised learning is extremely interesting both for fundamental scientific reasons, and - if it can be made to work well enough - for practical applications.

Active areas of research include using neural networks to do natural language processing (see also this informative review paper), machine translation, as well as perhaps more surprising applications such as music informatics.

In many cases, having read this book you should be able to begin following recent work, although (of course) you'll need to fill in gaps in presumed background knowledge.

It combines deep convolutional networks with a technique known as reinforcement learning in order to learn to play video games well (see also this followup).

The idea is to use the convolutional network to simplify the pixel data from the game screen, turning it into a simpler set of features, which can be used to decide which action to take: 'go left', 'go down', 'fire', and so on.

What is particularly interesting is that a single network learned to play seven different classic video games pretty well, outperforming human experts on three of the games.

But looking past the surface gloss, consider that this system is taking raw pixel data - it doesn't even know the game rules!

Google CEO Larry Page once described the perfect search engine as understanding exactly what [your queries] mean and giving you back exactly what you want.

In this vision, instead of responding to users' literal queries, search will use machine learning to take vague user input, discern precisely what was meant, and take action on the basis of those insights.

Over the next few decades, thousands of companies will build products which use machine learning to make user interfaces that can tolerate imprecision, while discerning and acting on the user's true intent.

Inspired user interface design is hard, and I expect many companies will take powerful machine learning technology and use it to build insipid user interfaces.

Machine learning, data science, and the virtuous circle of innovation: Of course, machine learning isn't just being used to build intention-driven interfaces.

But I do want to mention one consequence of this fashion that is not so often remarked: over the long run it's possible the biggest breakthrough in machine learning won't be any single conceptual breakthrough.

If a company can invest 1 dollar in machine learning research and get 1 dollar and 10 cents back reasonably rapidly, then a lot of money will end up in machine learning research.

So, for example, Conway's law suggests that the design of a Boeing 747 aircraft will mirror the extended organizational structure of Boeing and its contractors at the time the 747 was designed.

If the application's dashboard is supposed to be integrated with some machine learning algorithm, the person building the dashboard better be talking to the company's machine learning expert.

I won't define 'deep ideas' precisely, but loosely I mean the kind of idea which is the basis for a rich field of enquiry.

The backpropagation algorithm and the germ theory of disease are both good examples.: think of things like the germ theory of disease, for instance, or the understanding of how antibodies work, or the understanding that the heart, lungs, veins and arteries form a complete cardiovascular system.

Instead of a monolith, we have fields within fields within fields, a complex, recursive, self-referential social structure, whose organization mirrors the connections between our deepest insights.

Deep learning is the latest super-special weapon I've heard used in such arguments* *Interestingly, often not by leading experts in deep learning, who have been quite restrained.

And there is paper after paper leveraging the same basic set of ideas: using stochastic gradient descent (or a close variation) to optimize a cost function.

## Training an Artificial Neural Network - Intro

Artificial neural networks are relatively crude electronic networks of 'neurons' based on the neural structure of the brain.  They process records one at a time, and 'learn' by comparing their classification of the record (which, at the outset, is largely arbitrary) with the known actual classification of the record.  The errors from the initial classification of the first record is fed back into the network, and used to modify the networks algorithm the second time around, and so on for many iterations.  Roughly speaking, a neuron in an artificial neural network is

there may be several hidden layers.  The final layer is the output layer, where there is one node for each class.  A single sweep forward through the network results in the assignment of a value to each output node, and the record is assigned to whichever class's node had the highest value.

Training an Artificial Neural Network In the training phase, the correct class for each record is known (this is termed supervised training), and the output nodes can therefore be assigned 'correct' values -- '1' for the node corresponding to the correct class, and '0' for the others.  (In practice it has been found better to use values of 0.9 and 0.1, respectively.)  It is thus possible to compare the network's calculated values for the output nodes to these 'correct' values, and calculate an error term for each node (the 'Delta' rule).  These error terms are then used to adjust the weights in the hidden layers so that, hopefully, the next time around the output values will be closer to the 'correct' values.  The Iterative Learning Process A

key feature of neural networks is an iterative learning process in which data cases (rows) are presented to the network one at a time, and the weights associated with the input values are adjusted each time.  After all cases are presented, the process often starts over again.

Errors are then propagated back through the system, causing the system to adjust the weights for application to the next record to be processed.  This process occurs over and over as the weights are continually tweaked.  During the training of a network the same set of data is processed many times as the connection weights are continually refined.  Note that some networks never learn.

## Multi-Layer Neural Networks with Sigmoid Function— Deep Learning for Rookies (2)

Welcome back to my second post of the series Deep Learning for Rookies (DLFR), by yours truly, a rookie ;) Feel free to refer back to my first post here or my blog if you find it hard to follow.

You’ll be able to brag about your understanding soon ;) Last time, we introduced the field of Deep Learning and examined a simple a neural network — perceptron……or a dinosaur……ok, seriously, a single-layer perceptron.

After all, most problems in the real world are non-linear, and as individual humans, you and I are pretty darn good at the decision-making of linear or binary problems like should I study Deep Learning or not without needing a perceptron.

Fast forward almost two decades to 1986, Geoffrey Hinton, David Rumelhart, and Ronald Williams published a paper “Learning representations by back-propagating errors”, which introduced: If you are completely new to DL, you should remember Geoffrey Hinton, who plays a pivotal role in the progress of DL.

Remember that we iterated the importance of designing a neural network so that the network can learn from the difference between the desired output (what the fact is) and actual output (what the network returns) and then send a signal back to the weights and ask the weights to adjust themselves?

Secondly, when we multiply each of the m features with a weight (w1, w2, …, wm) and sum them all together, this is a dot product: So here are the takeaways for now: The procedure of how input values are forward propagated into the hidden layer, and then from hidden layer to the output is the same as in Graph 1.

One thing to remember is: If the activation function is linear, then you can stack as many hidden layers in the neural network as you wish, and the final output is still a linear combination of the original input data.

So basically, a small change in any weight in the input layer of our perceptron network could possibly lead to one neuron to suddenly flip from 0 to 1, which could again affect the hidden layer’s behavior, and then affect the final outcome.

Non-linear just means that the output we get from the neuron, which is the dot product of some inputs x (x1, x2, …, xm) and weights w (w1, w2, …,wm) plus bias and then put into a sigmoid function, cannot be represented by a linear combination of the input x (x1, x2, …,xm).

This non-linear activation function, when used by each neuron in a multi-layer neural network, produces a new “representation” of the original data, and ultimately allows for non-linear decision boundary, such as XOR.

if our output value is on the lower flat area on the two corners, then it’s false or 0 since it’s not right to say the weather is both hot and cold or neither hot or cold (ok, I guess the weather could be neither hot or cold…you get what I mean though…right?).

You can memorize these takeaways since they’re facts, but I encourage you to google a bit on the internet and see if you can understand the concept better (it is natural that we take some time to understand these concepts).

From the XOR example above, you’ve seen that adding two hidden neurons in 1 hidden layer could reshape our problem into a different space, which magically created a way for us to classify XOR with a ridge.

Now, the computer can’t really “see” a digit like we humans do, but if we dissect the image into an array of 784 numbers like [0, 0, 180, 16, 230, …, 4, 77, 0, 0, 0], then we can feed this array into our neural network.

So if the neural network thinks the handwritten digit is a zero, then we should get an output array of [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], the first output in this array that senses the digit to be a zero is “fired” to be 1 by our neural network, and the rest are 0.

If the neural network thinks the handwritten digit is a 5, then we should get [0, 0, 0, 0, 0, 1, 0, 0, 0, 0].

Remember we mentioned that neural networks become better by repetitively training themselves on data so that they can adjust the weights in each layer of the network to get the final results/actual output closer to the desired output?

For the sake of argument, let’s imagine the following case in Graph 14, which I borrow from Michael Nielsen’s online book: After training the neural network with rounds and rounds of labeled data in supervised learning, assume the first 4 hidden neurons learned to recognize the patterns above in the left side of Graph 14.

Then, if we feed the neural network an array of a handwritten digit zero, the network should correctly trigger the top 4 hidden neurons in the hidden layer while the other hidden neurons are silent, and then again trigger the first output neuron while the rest are silent.

If you train the neural network with a new set of randomized weights, it might produce the following network instead (compare Graph 15 with Graph 14), since the weights are randomized and we never know which one will learn which or what pattern.

It involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension.

It only makes sense to apply this preprocessing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In

case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.

Then, we can compute the covariance matrix that tells us about the correlation structure in the data: The (i,j) element of the data covariance matrix contains the covariance between i-th and j-th dimension of the data.

To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis: Notice that the columns of U are a set of orthonormal vectors (norm of 1, and orthogonal to each other), so they can be regarded as basis vectors.

This is also sometimes refereed to as Principal Component Analysis (PCA) dimensionality reduction: After this operation, we would have reduced the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the most variance.

The geometric interpretation of this transformation is that if the input data is a multivariable gaussian, then the whitened data will be a gaussian with zero mean and identity covariance matrix.

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.

Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative.

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

The implementation for one weight matrix might look like W = 0.01* np.random.randn(D,H), where randn samples from a zero mean, unit standard deviation gaussian.

With this formulation, every neuron’s weight vector is initialized as a random vector sampled from a multi-dimensional gaussian, so the neurons point in random direction in the input space.

That is, the recommended heuristic is to initialize each neuron’s weight vector as: w = np.random.randn(n) / sqrt(n), where n is the number of its inputs.

The sketch of the derivation is as follows: Consider the inner product $$s = \sum_i^n w_i x_i$$ between the weights $$w$$ and input $$x$$, which gives the raw activation of a neuron before the non-linearity.

And since $$\text{Var}(aX) = a^2\text{Var}(X)$$ for a random variable $$X$$ and a scalar $$a$$, this implies that we should draw from unit gaussian and then scale it by $$a = \sqrt{1/n}$$, to make its variance $$1/n$$.

In this paper, the authors end up recommending an initialization of the form $$\text{Var}(w) = 2/(n_{in} + n_{out})$$ where $$n_{in}, n_{out}$$ are the number of units in the previous layer and the next layer.

A more recent paper on this topic, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be $$2.0/n$$.

This gives the initialization w = np.random.randn(n) * sqrt(2.0/n), and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.

Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it.

For ReLU non-linearities, some people like to use small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient.

However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

A recently developed technique by Ioffe and Szegedy called Batch Normalization alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training.

In the implementation, applying this technique usually amounts to insert the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we’ll soon see), and before non-linearities.

It is common to see the factor of $$\frac{1}{2}$$ in front because then the gradient of this term with respect to the parameter $$w$$ is simply $$\lambda w$$ instead of $$2 \lambda w$$.

Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero.

L1 regularization is another relatively common form of regularization, where for each weight $$w$$ we add the term $$\lambda \mid w \mid$$ to the objective.

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.

In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $$\vec{w}$$ of every neuron to satisfy $$\Vert \vec{w} \Vert_2 &lt; Vanilla dropout in an example 3-layer Neural Network would be implemented as follows: In the code above, inside the train_step function we have performed dropout twice: on the first hidden layer and on the second hidden layer. It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction. Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched. Inverted dropout looks as follows: There has a been a large amount of research after the first introduction of dropout that tries to understand the source of its power in practice, and its relation to the other regularization techniques. As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective. For example, a binary classifier for each category independently would take the form: where the sum is over all categories \(j$$, and $$y_{ij}$$ is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score vector $$f_j$$ will be positive when the class is predicted to be present and negative otherwise.

A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as: Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is $$P(y = 0 \mid x; The expression above can look scary but the gradient on \(f$$ is in fact extremely simple and intuitive: $$\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)$$ (as you can double check yourself by taking the derivatives).

The L2 norm squared would compute the loss for a single example of the form: The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation.

For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss.

If you’re certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

Lecture 6 | Training Neural Networks I

In Lecture 6 we discuss many practical issues for training modern neural networks. We discuss different activation functions, the importance of data ...

Neural Network train in MATLAB

This video explain how to design and train a Neural Network in MATLAB.

How to train neural Network in Matlab ??

Normalized Inputs and Initial Weights

This video is part of the Udacity course "Deep Learning". Watch the full course at

What is a Neural Network - Ep. 2 (Deep Learning SIMPLIFIED)

With plenty of machine learning tools currently available, why would you ever choose an artificial neural network over all the rest? This clip and the next could ...

Processing our own Data - Deep Learning with Neural Networks and TensorFlow part 5

Welcome to part five of the Deep Learning with Neural Networks and TensorFlow tutorials. Now that we've covered a simple example of an artificial neural ...

Neural Network Training (Part 2): Neural Network Error Calculation

From An important part of the training process is error calculation. In this video, see how errors are calculated

Neural Networks 6: solving XOR with a hidden layer

Neural Networks Learn Logic Gates?

Install Tensorflow first, then keras. Follow instructions here: Tensorflow: keras: I installed tf and .

Batch Size in a Neural Network explained

In this video, we explain the concept of the batch size used during training of an artificial neural network and also show how to specify the batch size in code with ...