# AI News, Satellite Image Segmentation: a Workflow with U-Net

## Satellite Image Segmentation: a Workflow with U-Net

Image Segmentation is a topic of machine learning where one needs to not only categorize what’s seen in an image, but to also do it on a per-pixel level.

The thing here is that despite competition winners shared some code to reproduce exactly their winning submission (which was released after I started working on my pipeline), this does not include a lot of the required things to be able to come up with that code in the first place if one want to apply the pipeline to another problem or dataset.

Moreover, building neural networks is an iterative process where one needs to start somewhere and slowly increase a metric / evaluation score by modifying the neural network architecture and hyperparameters (configuration).

First off, here is a glance at the training data of the competition: The DSTL’s Satellite Imagery Feature Detection Challenge is a challenge where participants need to code a model capable of doing those predictions — the images just above are taken from the dataset, it represents an (X, Y) pair example from the training data.

If red, green and blue represents 3 bands (RGB), the 20 bands contain a lot more information for the neural network to ponder upon, and hence ease learning and the quality of the predictions: it is aware of visual features that humans do not see naturally.

U-Net is like a convolutional autoencoder, But it also has skip-like connections with the feature maps located before the bottleneck (compressed embedding) layer, in such a way that in the decoder part some information comes from previous layers, bypassing the compressive bottleneck.

See the figure below, taken from the official paper of the U-Net: Thus, in the decoder, data is not only recovered from a compression, but is also concatenated with the information’s state before it was passed into the compression bottleneck so as to augment context for the next decoding layers to come.

That way, the neural networks still learns to generalize in the compressed latent representation (located at the bottom of the “U” shape in the figure), but also recovers its latent generalizations to a spatial representation with the proper per-pixel semantic alignment in the right part of the U of the U-Net.

The 3rd place winners used the 4 possible 90 degrees rotations, as well as using a mirrored version of those rotation, which can increase the training data 8-fold: this data transformation belongs to the D4 Dihedral group.

The four extra channels (from indexes) are shown below, along with the original image’s RGB human-visible channels for comparison: Here is the U-Net architecture from the 3rd place winners: The only open-source code we found online from winners was from the 3rd place winners.

I started working on the project before the winners’ official code was available publicly, so I made my own development and production pipeline going in that direction, which reveals useful not only for solving the problem, but to code a neural network that can eventually be transferred on other datasets.

The architecture I managed to develop was first derived from public open-source code pre-release before the end of the competition: https://www.kaggle.com/ceperaang/lb-0-42-ultimate-full-solution-run-on-your-hw It seems that a lot of participants developed their architectures on top of precisely this shared code, which has both been very helpful and acted as a bar raiser for participants of the competition to keep up in the leaderboard.

In the following image, our “U”-Net is flipped like a “⊂”-Net, it is automatically made from Keras’ visualization tool, hence why it seems skewed: Hyperopt is used here to automatically figure out the best neural network architecture to best fit the data of the given problem, so this humongous neural network has been grown automatically.

Then running hyperopt takes time, but it proceeds to do what could be compared to using genetic algorithms to perform breeding and natural selection, except that there is no breeding here: just a readjustment from the past trials to try new trials in a way that balances exploration of new architectures versus optimization of the architecture near local maximas of performance.

Once searched, an hyperspace can be refined to narrow down the range in which hyperparameters are tested in the event of having to restart the meta-optimization — and in case some hyperparameters were already too narrow (e.g.: best points are near the limit of the parameter), it is still possible to widen the range.

A team workflow can be interesting where beginners learn to manipulate the data and to launch the optimisation to slowly start modifying the hyperparameter space and eventually add new parameters based on new research.

More details here: https://github.com/Vooban/Smoothly-Blend-Image-Patches.Star Fork Running the improved 3rd place winners’ code, it is possible to get a score worth of the 2nd position because they coded their neural networks from scratch after the competition to make the code public and more useable.

I used the 3rd place winners’ post processing code, which rounds the prediction masks in a smoother way and which corrects a few bugs such as removing building predictions while water is also predicted for a given pixel, and on, but the neural network behind that is still quite custom.

On my side, I have used one neural architecture optimized with hyperopt on all classes at once, to then take this already-evolved architecture to train it on single classes at a time, so there is an imbalance in the bias/variance of the neural networks, each used on a single task rather than on all tasks at once as when meta-optimized.

The One Hundred Layers Tiramisu might not help for fitting on the DSTL’s dataset according to the 3rd place winner, but at least it has a lot of capacity potential in being applied to larger and richer datasets due to the use of the recently discovered densely connected convolutional blocks.

## Deep learning

In the last chapter we learned that deep neural networks are often much harder to train than shallow neural networks.

We'll also look at the broader picture, briefly reviewing recent progress on using deep nets for image recognition, speech recognition, and other applications.

We'll work through a detailed example - code and all - of using convolutional nets to solve the problem of classifying handwritten digits from the MNIST data set:

As we go we'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of networks, and others.

We conclude our discussion of image recognition with a survey of some of the spectacular recent progress using networks (particularly convolutional nets) to do image recognition.

We'll briefly survey other models of neural networks, such as recurrent neural nets and long short-term memory units, and how such models can be applied to problems in speech recognition, natural language processing, and other areas.

And we'll speculate about the future of neural networks and deep learning, ranging from ideas like intention-driven user interfaces, to the role of deep learning in artificial intelligence.

For the $28 \times 28$ pixel images we've been using, this means our network has $784$ ($= 28 \times 28$) input neurons.

Our earlier networks work pretty well: we've obtained a classification accuracy better than 98 percent, using training and test data from the MNIST handwritten digit data set.

But the seminal paper establishing the modern subject of convolutional networks was a 1998 paper, 'Gradient-based learning applied to document recognition', by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.

LeCun has since made an interesting remark on the terminology for convolutional nets: 'The [biological] neural inspiration in models like convolutional nets is very tenuous.

That's why I call them 'convolutional nets' not 'convolutional neural nets', and why we call the nodes 'units' and not 'neurons' '.

Despite this remark, convolutional nets use many of the same ideas as the neural networks we've studied up to now: ideas such as backpropagation, gradient descent, regularization, non-linear activation functions, and so on.

In a convolutional net, it'll help to think instead of the inputs as a $28 \times 28$ square of neurons, whose values correspond to the $28 \times 28$ pixel intensities we're using as inputs:

To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a $5 \times 5$ region, corresponding to $25$ input pixels.

So, for a particular hidden neuron, we might have connections that look like this: That region in the input image is called the local receptive field for the hidden neuron.

To illustrate this concretely, let's start with a local receptive field in the top-left corner: Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:

Note that if we have a $28 \times 28$ input image, and $5 \times 5$ local receptive fields, then there will be $24 \times 24$ neurons in the hidden layer.

This is because we can only move the local receptive field $23$ neurons across (or $23$ neurons down), before colliding with the right-hand side (or bottom) of the input image.

In this chapter we'll mostly stick with stride length $1$, but it's worth knowing that people sometimes experiment with different stride lengths* *As was done in earlier chapters, if we're interested in trying different stride lengths then we can use validation data to pick out the stride length which gives the best performance.

The same approach may also be used to choose the size of the local receptive field - there is, of course, nothing special about using a $5 \times 5$ local receptive field.

In general, larger local receptive fields tend to be helpful when the input images are significantly larger than the $28 \times 28$ pixel MNIST images..

In other words, for the $j, k$th hidden neuron, the output is: \begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right).

Informally, think of the feature detected by a hidden neuron as the kind of input pattern that will cause the neuron to activate: it might be an edge in the image, for instance, or maybe some other type of shape.

To see why this makes sense, suppose the weights and bias are such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive field.

To put it in slightly more abstract terms, convolutional networks are well adapted to the translation invariance of images: move a picture of a cat (say) a little ways, and it's still an image of a cat* *In fact, for the MNIST digit classification problem we've been studying, the images are centered and size-normalized.

One of the early convolutional networks, LeNet-5, used $6$ feature maps, each associated to a $5 \times 5$ local receptive field, to recognize MNIST digits.

Let's take a quick peek at some of the features which are learned* *The feature maps illustrated come from the final convolutional network we train, see here.:

Each map is represented as a $5 \times 5$ block image, corresponding to the $5 \times 5$ weights in the local receptive field.

By comparison, suppose we had a fully connected first layer, with $784 = 28 \times 28$ input neurons, and a relatively modest $30$ hidden neurons, as we used in many of the examples earlier in the book.

That, in turn, will result in faster training for the convolutional model, and, ultimately, will help us build deep networks using convolutional layers.

Incidentally, the name convolutional comes from the fact that the operation in Equation (125)\begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right) \nonumber\end{eqnarray}$('#margin_223550267310_reveal').click(function() {$('#margin_223550267310').toggle('slow', function() {});});

A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is called a convolution operation.

In particular, I'm using 'feature map' to mean not the function computed by the convolutional layer, but rather the activation of the hidden neurons output from the layer.

In max-pooling, a pooling unit simply outputs the maximum activation in the $2 \times 2$ input region, as illustrated in the following diagram:

Note that since we have $24 \times 24$ neurons output from the convolutional layer, after pooling we have $12 \times 12$ neurons.

So if there were three feature maps, the combined convolutional and max-pooling layers would look like:

Here, instead of taking the maximum activation of a $2 \times 2$ region of neurons, we take the square root of the sum of the squares of the activations in the $2 \times 2$ region.

It's similar to the architecture we were just looking at, but has the addition of a layer of $10$ output neurons, corresponding to the $10$ possible values for MNIST digits ('0', '1', '2', etc):

Problem Backpropagation in a convolutional network The core equations of backpropagation in a network with fully-connected layers are (BP1)\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}$('#margin_511945174620_reveal').click(function() {$('#margin_511945174620').toggle('slow', function() {});});-(BP4)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}$('#margin_896578903066_reveal').click(function() {$('#margin_896578903066').toggle('slow', function() {});});

Suppose we have a network containing a convolutional layer, a max-pooling layer, and a fully-connected output layer, as in the network discussed above.

The program we'll use to do this is called network3.py, and it's an improved version of the programs network.py and network2.py developed in earlier chapters* *Note also that network3.py incorporates ideas from the Theano library's documentation on convolutional neural nets (notably the implementation of LeNet-5), from Misha Denil's implementation of dropout, and from Chris Olah..

But now that we understand those details, for network3.py we're going to use a machine learning library known as Theano* *See Theano: A CPU and GPU Math Expression Compiler in Python, by James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Ravzan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio (2010).

The examples which follow were run using Theano 0.6* *As I release this chapter, the current version of Theano has changed to version 0.7.

Note that the code in the script simply duplicates and parallels the discussion in this section.Note also that throughout the section I've explicitly specified the number of training epochs.

In practice, it's worth using early stopping, that is, tracking accuracy on the validation set, and stopping training when we are confident the validation accuracy has stopped improving.: &gt;&gt;&gt;

Using the validation data to decide when to evaluate the test accuracy helps avoid overfitting to the test data (see this earlier discussion of the use of validation data).

Your results may vary slightly, since the network's weights and biases are randomly initialized* *In fact, in this experiment I actually did three separate runs training a network with this architecture.

This $97.80$ percent accuracy is close to the $98.04$ percent accuracy obtained back in Chapter 3, using a similar network architecture and learning hyper-parameters.

Second, while the final layer in the earlier network used sigmoid activations and the cross-entropy cost function, the current network uses a softmax final layer, and the log-likelihood cost function.

I haven't made this switch for any particularly deep reason - mostly, I've done it because softmax plus log-likelihood cost is more common in modern image classification networks.

In this architecture, we can think of the convolutional and pooling layers as learning about local spatial structure in the input training image, while the later, fully-connected layer learns at a more abstract level, integrating global information from across the entire image.

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2)),

validation_data, test_data)

Can we improve on the $98.78$ percent classification accuracy?

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2)),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2)),

validation_data, test_data)

In fact, you can think of the second convolutional-pooling layer as having as input $12 \times 12$ 'images', whose 'pixels' represent the presence (or absence) of particular localized features in the original input image.

The output from the previous layer involves $20$ separate feature maps, and so there are $20 \times 12 \times 12$ inputs to the second convolutional-pooling layer.

In fact, we'll allow each neuron in this layer to learn from all $20 \times 5 \times 5$ input neurons in its local receptive field.

More informally: the feature detectors in the second convolutional-pooling layer have access to all the features from the previous layer, but only within their particular local receptive field* *This issue would have arisen in the first layer if the input images were in color.

In that case we'd have 3 input features for each pixel, corresponding to red, green and blue channels in the input image.

So we'd allow the feature detectors to have access to all color information, but only within a given local receptive field..

Problem Using the tanh activation function Several times earlier in the book I've mentioned arguments that the tanh function may be a better activation function than the sigmoid function.

Try training the network with tanh activations in the convolutional and fully-connected layers* *Note that you can pass activation_fn=tanh as a parameter to the ConvPoolLayer and FullyConnectedLayer classes..

Try plotting the per-epoch validation accuracies for both tanh- and sigmoid-based networks, all the way out to $60$ epochs.

If your results are similar to mine, you'll find the tanh networks train a little faster, but the final accuracies are very similar.

Can you get a similar training speed with the sigmoid, perhaps by changing the learning rate, or doing some rescaling* *You may perhaps find inspiration in recalling that $\sigma(z) = (1+\tanh(z/2))/2$.?

Try a half-dozen iterations on the learning hyper-parameters or network architecture, searching for ways that tanh may be superior to the sigmoid.

Personally, I did not find much advantage in switching to tanh, although I haven't experimented exhaustively, and perhaps you may find a way.

In any case, in a moment we will find an advantage in switching to the rectified linear activation function, and so we won't go any deeper into the use of tanh.

Using rectified linear units: The network we've developed at this point is actually a variant of one of the networks used in the seminal 1998 paper* *'Gradient-based learning applied to document recognition', by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998).

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

However, across all my experiments I found that networks based on rectified linear units consistently outperformed networks based on sigmoid activation functions.

The reason for that recent adoption is empirical: a few people tried rectified linear units, often on the basis of hunches or heuristic arguments* *A common justification is that $\max(0, z)$ doesn't saturate in the limit of large $z$, unlike sigmoid neurons, and this helps rectified linear units continue learning.

A simple way of expanding the training data is to displace each training image by a single pixel, either up one pixel, down one pixel, left one pixel, or right one pixel.

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

Just to remind you of the flavour of some of the results in that earlier discussion: in 2003 Simard, Steinkraus and Platt* *Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, by Patrice Simard, Dave Steinkraus, and John Platt (2003).

improved their MNIST performance to $99.6$ percent using a neural network otherwise very similar to ours, using two convolutional-pooling layers, followed by a hidden fully-connected layer with $100$ neurons.

There were a few differences of detail in their architecture - they didn't have the advantage of using rectified linear units, for instance - but the key to their improved performance was expanding the training data.

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

filter_shape=(20, 1, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

filter_shape=(40, 20, 5, 5),

poolsize=(2, 2),

activation_fn=ReLU),

Using this, we obtain an accuracy of $99.60$ percent, which is a substantial improvement over our earlier results, especially our main benchmark, the network with $100$ hidden neurons, where we achieved $99.37$ percent.

In fact, I tried experiments with both $300$ and $1,000$ hidden neurons, and obtained (very slightly) better validation performance with $1,000$ hidden neurons.

Why we only applied dropout to the fully-connected layers: If you look carefully at the code above, you'll notice that we applied dropout only to the fully-connected section of the network, not to the convolutional layers.

But apart from that, they used few other tricks, including no convolutional layers: it was a plain, vanilla network, of the kind that, with enough patience, could have been trained in the 1980s (if the MNIST data set had existed), given enough computing power.

In particular, we saw that the gradient tends to be quite unstable: as we move from the output layer to earlier layers the gradient tends to either vanish (the vanishing gradient problem) or explode (the exploding gradient problem).

In particular, in our final experiments we trained for $40$ epochs using a data set $5$ times larger than the raw MNIST training data.

I've occasionally heard people adopt a deeper-than-thou attitude, holding that if you're not keeping-up-with-the-Joneses in terms of number of hidden layers, then you're not really doing deep learning.

To speed that process up you may find it helpful to revisit Chapter 3's discussion of how to choose a neural network's hyper-parameters, and perhaps also to look at some of the further reading suggested in that section.

Here's the code (discussion below)* *Note added November 2016: several readers have noted that in the line initializing self.w, I set scale=np.sqrt(1.0/n_out), when the arguments of Chapter 3 suggest a better initialization may be scale=np.sqrt(1.0/n_in).

np.random.normal(

loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),

dtype=theano.config.floatX),

dtype=theano.config.floatX),

I use the name inpt rather than input because input is a built-in function in Python, and messing with built-ins tends to cause unpredictable behavior and difficult-to-diagnose bugs.

So self.inpt_dropout and self.output_dropout are used during training, while self.inpt and self.output are used for all other purposes, e.g., evaluating accuracy on the validation and test data.

prev_layer, layer = self.layers[j-1], self.layers[j]

prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)

Now, this isn't a Theano tutorial, and so we won't get too deeply into what it means that these are symbolic variables* *The Theano documentation provides a good introduction to Theano.

# define the (regularized) cost function, symbolic gradients, and updates

0.5*lmbda*l2_norm_squared/num_training_batches

self.x:

training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

iteration = num_training_batches*epoch+minibatch_index

if iteration

print(&quot;Training mini-batch number {0}&quot;.format(iteration))

cost_ij = train_mb(minibatch_index)

if (iteration+1)

validation_accuracy = np.mean(

[validate_mb_accuracy(j) for j in xrange(num_validation_batches)])

print(&quot;Epoch {0}: validation accuracy {1:.2

epoch, validation_accuracy))

if validation_accuracy &gt;= best_validation_accuracy:

print(&quot;This is the best validation accuracy to date.&quot;)

best_validation_accuracy = validation_accuracy

best_iteration = iteration

if test_data:

test_accuracy = np.mean(

[test_mb_accuracy(j) for j in xrange(num_test_batches)])

print(&#39;The corresponding test accuracy is {0:.2

test_accuracy))

# define the (regularized) cost function, symbolic gradients, and updates

0.5*lmbda*l2_norm_squared/num_training_batches

In these lines we symbolically set up the regularized log-likelihood cost function, compute the corresponding derivatives in the gradient function, as well as the corresponding parameter updates.

With all these things defined, the stage is set to define the train_mb function, a Theano symbolic function which uses the updates to update the Network parameters, given a mini-batch index.

The remainder of the SGD method is self-explanatory - we simply iterate over the epochs, repeatedly training the network on mini-batches of training data, and computing the validation and test accuracies.

prev_layer, layer = self.layers[j-1], self.layers[j]

prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)

# define the (regularized) cost function, symbolic gradients, and updates

0.5*lmbda*l2_norm_squared/num_training_batches

self.x:

training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],

self.y:

test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

self.x:

test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]

iteration = num_training_batches*epoch+minibatch_index

if iteration % 1000 == 0:

print(&quot;Training mini-batch number {0}&quot;.format(iteration))

cost_ij = train_mb(minibatch_index)

if (iteration+1) % num_training_batches == 0:

validation_accuracy = np.mean(

[validate_mb_accuracy(j) for j in xrange(num_validation_batches)])

print(&quot;Epoch {0}: validation accuracy {1:.2%}&quot;.format(

epoch, validation_accuracy))

if validation_accuracy &gt;= best_validation_accuracy:

print(&quot;This is the best validation accuracy to date.&quot;)

best_validation_accuracy = validation_accuracy

best_iteration = iteration

if test_data:

test_accuracy = np.mean(

[test_mb_accuracy(j) for j in xrange(num_test_batches)])

print(&#39;The corresponding test accuracy is {0:.2%}&#39;.format(

test_accuracy))

activation_fn=sigmoid):

of filters, the number of input feature maps, the filter height, and the

poolsize is a tuple of length 2, whose entries are the y and

np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),

dtype=theano.config.floatX),

np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),

dtype=theano.config.floatX),

pooled_out + self.b.dimshuffle(&#39;x&#39;, 0, &#39;x&#39;, &#39;x&#39;))

np.random.normal(

loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),

dtype=theano.config.floatX),

dtype=theano.config.floatX),

Earlier in the book we discussed an automated way of selecting the number of epochs to train for, known as early stopping.

Hint: After working on this problem for a while, you may find it useful to see the discussion at this link.

Earlier in the chapter I described a technique for expanding the training data by applying (small) rotations, skewing, and translation.

Note: Unless you have a tremendous amount of memory, it is not practical to explicitly generate the entire expanded data set.

Show that rescaling all the weights in the network by a constant factor $c > 0$ simply rescales the outputs by a factor $c^{L-1}$, where $L$ is the number of layers.

Still, considering the problem will help you better understand networks containing rectified linear units.

Note: The word good in the second part of this makes the problem a research problem.

In 1998, the year MNIST was introduced, it took weeks to train a state-of-the-art workstation to achieve accuracies substantially worse than those we can achieve using a GPU and less than an hour of training.

With that said, the past few years have seen extraordinary improvements using deep nets to attack extremely difficult image recognition tasks.

They will identify the years 2011 to 2015 (and probably a few years beyond) as a time of huge breakthroughs, driven by deep convolutional nets.

The 2012 LRMD paper: Let me start with a 2012 paper* *Building high-level features using large scale unsupervised learning, by Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng (2012).

Note that the detailed architecture of the network used in the paper differed in many details from the deep convolutional networks we've been studying.

Details about ImageNet are available in the original ImageNet paper, ImageNet: a large-scale hierarchical image database, by Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei (2009).:

If you're looking for a challenge, I encourage you to visit ImageNet's list of hand tools, which distinguishes between beading planes, block planes, chamfer planes, and about a dozen other types of plane, amongst other categories.

The 2012 KSH paper: The work of LRMD was followed by a 2012 paper of Krizhevsky, Sutskever and Hinton (KSH)* *ImageNet classification with deep convolutional neural networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E.

By this top-$5$ criterion, KSH's deep convolutional network achieved an accuracy of $84.7$ percent, vastly better than the next-best contest entry, which achieved an accuracy of $73.8$ percent.

The input layer contains $3 \times 224 \times 224$ neurons, representing the RGB values for a $224 \times 224$ image.

The feature maps are split into two groups of $48$ each, with the first $48$ feature maps residing on one GPU, and the second $48$ feature maps residing on the other GPU.

Their respectives parameters are: (3) $384$ feature maps, with $3 \times 3$ local receptive fields, and $256$ input channels;

A Theano-based implementation has also been developed* *Theano-based large-scale visual recognition with multiple GPUs, by Weiguang Ding, Ruoyan Wang, Fei Mao, and Graham Taylor (2014)., with the code available here.

As in 2012, it involved a training set of $1.2$ million images, in $1,000$ categories, and the figure of merit was whether the top $5$ predictions included the correct category.

The winning team, based primarily at Google* *Going deeper with convolutions, by Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich (2014)., used a deep convolutional network with $22$ layers of neurons.

GoogLeNet achieved a top-5 accuracy of $93.33$ percent, a giant improvement over the 2013 winner (Clarifai, with $88.3$ percent), and the 2012 winner (KSH, with $84.7$ percent).

In 2014 a team of researchers wrote a survey paper about the ILSVRC competition* *ImageNet large scale visual recognition challenge, by Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C.

...the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while.

In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself...

Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort.

In other words, an expert human, working painstakingly, was with great effort able to narrowly beat the deep neural network.

In fact, Karpathy reports that a second human expert, trained on a smaller sample of images, was only able to attain a $12.0$ percent top-5 error rate, significantly below GoogLeNet's performance.

One encouraging practical set of results comes from a team at Google, who applied deep convolutional networks to the problem of recognizing street numbers in Google's Street View imagery* *Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks, by Ian J.

And they go on to make the broader claim: 'We believe with this model we have solved [optical character recognition] for short sequences [of characters] for many applications.'

For instance, a 2013 paper* *Intriguing properties of neural networks, by Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus (2013) showed that deep networks may suffer from what are effectively blind spots.

The existence of the adversarial negatives appears to be in contradiction with the network’s ability to achieve high generalization performance.

The explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found near virtually every test case.

For example, one recent paper* *Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, by Anh Nguyen, Jason Yosinski, and Jeff Clune (2014).

shows that given a trained network it's possible to generate images which look to a human like white noise, but which the network classifies as being in a known category with a very high degree of confidence.

If you read the neural networks literature, you'll run into many ideas we haven't discussed: recurrent neural networks, Boltzmann machines, generative models, transfer learning, reinforcement learning, and so on, on and on $\ldots$ and on!

One way RNNs are currently being used is to connect neural networks more closely to traditional ways of thinking about algorithms, ways of thinking based on concepts such as Turing machines and (conventional) programming languages.

A 2014 paper developed an RNN which could take as input a character-by-character description of a (very, very simple!) Python program, and use that description to predict the output.

For example, an approach based on deep nets has achieved outstanding results on large vocabulary continuous speech recognition.

And another system based on deep nets has been deployed in Google's Android operating system (for related technical work, see Vincent Vanhoucke's 2012-2015 papers).

Many other ideas used in feedforward nets, ranging from regularization techniques to convolutions to the activation and cost functions used, are also useful in recurrent nets.

Deep belief nets, generative models, and Boltzmann machines: Modern interest in deep learning began in 2006, with papers explaining how to train a type of neural network known as a deep belief network (DBN)* *See A fast learning algorithm for deep belief nets, by Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh (2006), as well as the related work in Reducing the dimensionality of data with neural networks, by Geoffrey Hinton and Ruslan Salakhutdinov (2006)..

A generative model like a DBN can be used in a similar way, but it's also possible to specify the values of some of the feature neurons and then 'run the network backward', generating values for the input activations.

And the ability to do unsupervised learning is extremely interesting both for fundamental scientific reasons, and - if it can be made to work well enough - for practical applications.

Active areas of research include using neural networks to do natural language processing (see also this informative review paper), machine translation, as well as perhaps more surprising applications such as music informatics.

In many cases, having read this book you should be able to begin following recent work, although (of course) you'll need to fill in gaps in presumed background knowledge.

It combines deep convolutional networks with a technique known as reinforcement learning in order to learn to play video games well (see also this followup).

The idea is to use the convolutional network to simplify the pixel data from the game screen, turning it into a simpler set of features, which can be used to decide which action to take: 'go left', 'go down', 'fire', and so on.

What is particularly interesting is that a single network learned to play seven different classic video games pretty well, outperforming human experts on three of the games.

But looking past the surface gloss, consider that this system is taking raw pixel data - it doesn't even know the game rules!

Google CEO Larry Page once described the perfect search engine as understanding exactly what [your queries] mean and giving you back exactly what you want.

In this vision, instead of responding to users' literal queries, search will use machine learning to take vague user input, discern precisely what was meant, and take action on the basis of those insights.

Over the next few decades, thousands of companies will build products which use machine learning to make user interfaces that can tolerate imprecision, while discerning and acting on the user's true intent.

Inspired user interface design is hard, and I expect many companies will take powerful machine learning technology and use it to build insipid user interfaces.

Machine learning, data science, and the virtuous circle of innovation: Of course, machine learning isn't just being used to build intention-driven interfaces.

But I do want to mention one consequence of this fashion that is not so often remarked: over the long run it's possible the biggest breakthrough in machine learning won't be any single conceptual breakthrough.

If a company can invest 1 dollar in machine learning research and get 1 dollar and 10 cents back reasonably rapidly, then a lot of money will end up in machine learning research.

So, for example, Conway's law suggests that the design of a Boeing 747 aircraft will mirror the extended organizational structure of Boeing and its contractors at the time the 747 was designed.

If the application's dashboard is supposed to be integrated with some machine learning algorithm, the person building the dashboard better be talking to the company's machine learning expert.

I won't define 'deep ideas' precisely, but loosely I mean the kind of idea which is the basis for a rich field of enquiry.

The backpropagation algorithm and the germ theory of disease are both good examples.: think of things like the germ theory of disease, for instance, or the understanding of how antibodies work, or the understanding that the heart, lungs, veins and arteries form a complete cardiovascular system.

Instead of a monolith, we have fields within fields within fields, a complex, recursive, self-referential social structure, whose organization mirrors the connections between our deepest insights.

Deep learning is the latest super-special weapon I've heard used in such arguments* *Interestingly, often not by leading experts in deep learning, who have been quite restrained.

And there is paper after paper leveraging the same basic set of ideas: using stochastic gradient descent (or a close variation) to optimize a cost function.

One of our core aspirations at OpenAI is to develop algorithms and techniques that endow computers with an understanding of our world.

To train a generative model we first collect a large amount of data in some domain (e.g., think millions of images, sentences, or sounds, etc.) and then train a model to generate data like it.

The intuition behind this approach follows a famous quote from Richard Feynman: “What I cannot create, I do not understand.” The trick is that the neural networks we use as generative models have a number of parameters significantly smaller than the amount of data we train them on, so the models are forced to discover and efficiently internalize the essence of the data in order to generate it.

Suppose we have some large collection of images, such as the 1.2 million images in the ImageNet dataset (but keep in mind that this could eventually be a large collection of images or videos from the internet or robots).

This network takes as input 100 random numbers drawn from a uniform distribution (we refer to these as a code, or latent variables, in red) and outputs an image (in this case 64x64x3 images on the right, in green).

But in addition to that — and here's the trick — we can also backpropagate through both the discriminator and the generator to find how we should change the generator's parameters to make its 200 samples slightly more confusing for the discriminator.

These two networks are therefore locked in a battle: the discriminator is trying to distinguish real images from fake images and the generator is trying to create images that make the discriminator think they are real.

both cases the samples from the generator start out noisy and chaotic, and over time converge to have more plausible image statistics: This is exciting — these neural networks are learning what the visual world looks like!

Eventually, the model may discover many more complex regularities: that there are certain types of backgrounds, objects, textures, that they occur in certain likely arrangements, or that they transform in certain ways over time in videos, etc.

In the example image below, the blue region shows the part of the image space that, with a high probability (over some threshold) contains real images, and black dots indicate our data points (each is one image in our dataset).

Now, our model also describes a distribution \hat{p}_{\theta}(x) (green) that is defined implicitly by taking points from a unit Gaussian distribution (red) and mapping them through a (deterministic) neural network — our generative model (yellow).

Therefore, you can imagine the green distribution starting out random and then the training process iteratively changing the parameters \theta to stretch and squeeze it to better match the blue distribution.

First, as mentioned above GANs are a very promising family of generative models because, unlike other methods, they produce very clean and sharp images and learn codes that contain valuable information about these textures.

These techniques allow us to scale up GANs and obtain nice 128x128 ImageNet samples: Our CIFAR-10 samples also look very sharp - Amazon Mechanical Turk workers can distinguish our samples from real data with an error rate of 21.3% (50% would be random guessing): In addition to generating pretty pictures, we introduce an approach for semi-supervised learning with GANs that involves the discriminator producing an additional output indicating the label of the input.

On MNIST, for example, we achieve 99.14% accuracy with only 10 labeled examples per class with a fully connected neural network — a result that’s very close to the best known results with fully supervised approaches using all 60,000 labeled examples.

The core contribution of this work, termed inverse autoregressive flow (IAF), is a new approach that, unlike previous work, allows us to parallelize the computation of rich approximate posteriors, and make them almost arbitrarily flexible.

A regular GAN achieves the objective of reproducing the data distribution in the model, but the layout and organization of the code space is underspecified — there are many possible solutions to mapping the unit Gaussian to images and the one we end up with might be intricate and highly entangled.

It's clear from the five provided examples (along each row) that the resulting dimensions in the code capture interpretable dimensions, and that the model has perhaps understood that there are camera angles, facial variations, etc., without having been told that these features exist and are important: The next two recent projects are in a reinforcement learning (RL) setting (another area of focus at OpenAI), but they both involve a generative model component.

Additional presently known applications include image denoising, inpainting, super-resolution, structured prediction, exploration in reinforcement learning, and neural network pretraining in cases where labeled data is expensive.

In the section on linear classification we computed scores for different visual categories given the image using the formula $$s = W x$$, where $$W$$ was a matrix and $$x$$ was an input column vector containing all pixel data of the image.

In the case of CIFAR-10, $$x$$ is a [3072x1] column vector, and $$W$$ is a [10x3072] matrix, so that the output scores is a vector of 10 class scores.

There are several choices we could make for the non-linearity (which we’ll study below), but this one is a common choice and simply thresholds all activations that are below zero to zero.

Notice that the non-linearity is critical computationally - if we left it out, the two matrices could be collapsed to a single matrix, and therefore the predicted class scores would again be a linear function of the input.

three-layer neural network could analogously look like $$s = W_3 \max(0, W_2 \max(0, W_1 x))$$, where all of $$W_3, W_2, W_1$$ are parameters to be learned.

The area of Neural Networks has originally been primarily inspired by the goal of modeling biological neural systems, but has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks.

Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 synapses.

The idea is that the synaptic strengths (the weights $$w$$) are learnable and control the strength of influence (and its direction: excitory (positive weight) or inhibitory (negative weight)) of one neuron on another.

Based on this rate code interpretation, we model the firing rate of the neuron with an activation function $$f$$, which represents the frequency of the spikes along the axon.

Historically, a common choice of activation function is the sigmoid function $$\sigma$$, since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1.

An example code for forward-propagating a single neuron might look as follows: In other words, each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function), in this case the sigmoid $$\sigma(x) = 1/(1+e^{-x})$$.

As we saw with linear classifiers, a neuron has the capacity to “like” (activation near one) or “dislike” (activation near zero) certain linear regions of its input space.

With this interpretation, we can formulate the cross-entropy loss as we have seen in the Linear Classification section, and optimizing it would lead to a binary Softmax classifier (also known as logistic regression).

The regularization loss in both SVM/Softmax cases could in this biological view be interpreted as gradual forgetting, since it would have the effect of driving all synaptic weights $$w$$ towards zero after every parameter update.

The sigmoid non-linearity has the mathematical form $$\sigma(x) = 1 / (1 + e^{-x})$$ and is shown in the image above on the left.

The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1).

Also note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: $$\tanh(x) = 2 \sigma(2x) -1$$.

Other types of units have been proposed that do not have the functional form $$f(w^Tx + b)$$ where a non-linearity is applied on the dot product between the weights and the data.

TLDR: “What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network.

For regular neural networks, the most common layer type is the fully-connected layer in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections.

Working with the two example networks in the above picture: To give you some context, modern Convolutional Networks contain on orders of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning).

The full forward pass of this 3-layer neural network is then simply three matrix multiplications, interwoven with the application of the activation function: In the above code, W1,W2,W3,b1,b2,b3 are the learnable parameters of the network.

Notice also that instead of having a single input column vector, the variable x could hold an entire batch of training data (where each input example would be a column of x) and then all examples would be efficiently evaluated in parallel.

Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and are also easy to learn using our optimization algorithms (e.g.

Similarly, the fact that deeper networks (with multiple hidden layers) can work better than a single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.

As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more.

We could train three separate neural networks, each with one hidden layer of some size and obtain the following classifiers: In the diagram above, we can see that Neural Networks with more neurons can express more complicated functions.

For example, the model with 20 hidden neurons fits all the training data but at the cost of segmenting the space into many disjoint red and green decision regions.

The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It’s clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e.

Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss.

In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima.

How to Make an Image Classifier - Intro to Deep Learning #6

We're going to make our own Image Classifier for cats & dogs in 40 lines of Python! First we'll go over the history of image classification, then we'll dive into the ...

How to Learn from Little Data - Intro to Deep Learning #17

One-shot learning! In this last weekly video of the course, i'll explain how memory augmented neural networks can help achieve one-shot classification for a ...

Convolutional Neural Network (CNN) | Convolutional Neural Networks With TensorFlow | Edureka

TensorFlow Training - ) This Edureka "Convolutional Neural Network Tutorial" video (Blog: ..

Intro and preprocessing - Using Convolutional Neural Network to Identify Dogs vs Cats p. 1

In this tutorial, we're going to be running through taking raw images that have been labeled for us already, and then feeding them through a convolutional neural ...

Neural Network train and form recognition

This video disscusses the application of Neural Network in a simple form recognition task using binary images. The different steps from dataset peparation, ...

Neural Network train in MATLAB

This video explain how to design and train a Neural Network in MATLAB.

How to Convert Text to Images - Intro to Deep Learning #16

Generative Adversarial Networks are back! We'll use the cutting edge StackGAN architecture to let us generate images from text descriptions alone. This is pretty ...

Build a Neural Net in 4 Minutes

How does a Neural network work? Its the basis of deep learning and the reason why image recognition, chatbots, self driving cars, and language translation ...

How CNN (Convolutional Neural Networks - Deep Learning) algorithm works

In this video I present a simple example of a CNN (Convolutional Neural Network) applied to image classification of digits. CNN is one of the well known Deep ...

Deep Learning with MATLAB: Training a Neural Network from Scratch with MATLAB

This demo uses MATLAB® to train a CNN from scratch for classifying images of four different animal types: cat, dog, deer, and frog. Images are used from the ...