AI News, BOOK REVIEW: How to rapidly test dozens of deep learning models in Python

How to rapidly test dozens of deep learning models in Python

Let’s develop a neural network assembly line that allows us to easily experiment with numerous model configurations.

Several primary hyperparameters govern neural networks, and we can package all of them in a hash table. Before we begin experimenting with various model architectures, let’s quickly visualize the data to see what we’re working with.
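As a rough illustration of the hash-table idea, the sketch below packs a set of hyperparameters into a Python dict. The key names and values are assumptions chosen for illustration; the original list of hyperparameters is not reproduced in this excerpt.

```python
# Hypothetical hyperparameter names and values, for illustration only.
default_hypers = {
    "n_hidden_layers": 2,      # depth of the network
    "n_neurons": 64,           # width of each hidden layer
    "activation": "relu",      # non-linearity applied after each layer
    "dropout_rate": 0.2,       # fraction of units dropped during training
    "learning_rate": 1e-3,     # optimizer step size
    "batch_size": 32,
    "epochs": 10,
}

def with_overrides(base, **overrides):
    """Return a copy of `base` with some hyperparameters replaced."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown hyperparameters: {sorted(unknown)}")
    return {**base, **overrides}

# A wider variant of the default configuration; the default is left untouched.
wider = with_overrides(default_hypers, n_neurons=256)
```

Keeping every configuration as a plain dict makes it easy to generate many variants from one baseline and to log exactly which settings produced which score.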

Next, we compile a neural network given a hyperparameter hash table. We can quickly test a few baseline models now that we have a fast, flexible way of constructing and compiling neural networks.

This allows us to draw quick inferences about which hyperparameters seem to be working best. Using the function above, and after evaluating over a dozen model architectures with 5-fold cross validation, I discovered that deeper and wider architectures are necessary to obtain high performance on the data.

In the section on linear classification we computed scores for different visual categories given the image using the formula \( s = W x \), where \(W\) was a matrix and \(x\) was an input column vector containing all pixel data of the image.

In the case of CIFAR-10, \(x\) is a [3072x1] column vector, and \(W\) is a [10x3072] matrix, so that the output is a vector of 10 class scores.

There are several choices we could make for the non-linearity (which we’ll study below), but \(\max(0, \cdot)\) is a common choice: it simply thresholds all activations that are below zero to zero.
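The shapes in the linear score function \( s = W x \) can be checked with a small NumPy sketch. The random values stand in for real pixel data and learned weights; only the dimensions come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((3072, 1))               # flattened 32x32x3 image as a column vector
W = rng.standard_normal((10, 3072))     # one row of weights per class
s = W @ x                               # class scores, one per class
predicted_class = int(np.argmax(s))     # index of the highest-scoring class
```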

Notice that the non-linearity is critical computationally - if we left it out, the two matrices could be collapsed to a single matrix, and therefore the predicted class scores would again be a linear function of the input.
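The collapse argument can be verified numerically. In the sketch below the hidden size of 100 and the random weights are illustrative assumptions: without a non-linearity, \(W_2 (W_1 x)\) equals \((W_2 W_1) x\), while inserting \(\max(0, \cdot)\) breaks that equality.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3072, 1))
W1 = rng.standard_normal((100, 3072))   # hidden size 100 is an illustrative choice
W2 = rng.standard_normal((10, 100))

# Without a non-linearity, two layers equal one layer with W = W2 @ W1.
two_layer_linear = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
linear_matches = np.allclose(two_layer_linear, collapsed)

# With max(0, .) in between, the collapsed matrix no longer reproduces the scores.
two_layer_relu = W2 @ np.maximum(0, W1 @ x)
relu_matches = np.allclose(two_layer_relu, collapsed)
```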

A three-layer neural network could analogously look like \( s = W_3 \max(0, W_2 \max(0, W_1 x)) \), where all of \(W_3, W_2, W_1\) are parameters to be learned.

The area of Neural Networks was originally inspired primarily by the goal of modeling biological neural systems, but has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks.

Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 synapses.

The idea is that the synaptic strengths (the weights \(w\)) are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another.

Based on this rate code interpretation, we model the firing rate of the neuron with an activation function \(f\), which represents the frequency of the spikes along the axon.

Historically, a common choice of activation function is the sigmoid function \(\sigma\), since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1.

Forward-propagating a single neuron is straightforward: each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function), in this case the sigmoid \(\sigma(x) = 1/(1+e^{-x})\).
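A minimal sketch of that computation in plain Python follows. The class name, weights, and inputs are illustrative assumptions, not values from the original notes.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def forward(self, inputs):
        # Dot product of weights and inputs, plus the bias.
        cell_body_sum = sum(w * x for w, x in zip(self.weights, inputs)) + self.bias
        # Apply the activation function to get the modeled firing rate.
        return sigmoid(cell_body_sum)

neuron = Neuron(weights=[0.5, -1.0, 2.0], bias=0.1)
firing_rate = neuron.forward([1.0, 2.0, 0.5])
```

Here the weighted sum is 0.5 - 2.0 + 1.0 + 0.1 = -0.4, so the output is \(\sigma(-0.4)\), a value a little below 0.5.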

As we saw with linear classifiers, a neuron has the capacity to “like” (activation near one) or “dislike” (activation near zero) certain linear regions of its input space.

With this interpretation, we can formulate the cross-entropy loss as we have seen in the Linear Classification section, and optimizing it would lead to a binary Softmax classifier (also known as logistic regression).

The regularization loss in both SVM/Softmax cases could in this biological view be interpreted as gradual forgetting, since it would have the effect of driving all synaptic weights \(w\) towards zero after every parameter update.

The sigmoid non-linearity has the mathematical form \(\sigma(x) = 1 / (1 + e^{-x})\) and is shown in the image above on the left.

The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1).

Also note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: \( \tanh(x) = 2 \sigma(2x) -1 \).
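The identity \( \tanh(x) = 2 \sigma(2x) - 1 \) is easy to confirm numerically. The sample points below are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

xs = [-3.0, -1.0, 0.0, 0.5, 2.0]
# Largest absolute difference between tanh(x) and the scaled, shifted sigmoid.
max_gap = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in xs)
```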

Other types of units have been proposed that do not have the functional form \(f(w^Tx + b)\) where a non-linearity is applied on the dot product between the weights and the data.

TLDR: “What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network.
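One way to monitor "dead" units, sketched here under assumptions, is to check which ReLU units output zero for every example in a batch. The function name and toy batch are hypothetical, and a unit that is silent on one batch is only a proxy for a unit that never activates across the whole training set.

```python
import numpy as np

def dead_fraction(pre_activations):
    """Fraction of units whose ReLU output is zero for every example in the batch.

    pre_activations: array of shape (batch_size, n_units), values before ReLU.
    """
    activations = np.maximum(0, pre_activations)
    dead = np.all(activations == 0, axis=0)
    return float(dead.mean())

# Toy batch: unit 0 never fires, units 1 and 2 fire for at least one example.
batch = np.array([[-1.0,  0.5, -0.2],
                  [-2.0, -0.1,  0.3],
                  [-0.5,  1.2, -0.7]])
frac = dead_fraction(batch)
```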

For regular neural networks, the most common layer type is the fully-connected layer in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections.

To give you some context, modern Convolutional Networks contain on the order of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning).

The full forward pass of this 3-layer neural network is then simply three matrix multiplications, interwoven with the application of the activation function. Here, W1, W2, W3, b1, b2, b3 are the learnable parameters of the network.

Notice also that instead of having a single input column vector, the variable x could hold an entire batch of training data (where each input example would be a column of x) and then all examples would be efficiently evaluated in parallel.
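The forward pass and the batching remark can be sketched together in NumPy. The layer sizes (3 inputs, two hidden layers of 4 units, 1 output), the sigmoid activation, and the random parameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid activation

# Layer sizes (3 -> 4 -> 4 -> 1) are illustrative assumptions.
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((4, 4)), np.zeros((4, 1))
W3, b3 = rng.standard_normal((1, 4)), np.zeros((1, 1))

def forward(x):
    h1 = f(W1 @ x + b1)                # first hidden layer
    h2 = f(W2 @ h1 + b2)               # second hidden layer
    return W3 @ h2 + b3                # output layer, no activation

single = forward(rng.standard_normal((3, 1)))   # one input column vector
batch = rng.standard_normal((3, 5))             # 5 examples, one per column
out = forward(batch)                            # all 5 evaluated at once
```

Because each example is a column, the same three matrix multiplications serve one input or a whole batch, and column j of the output equals the result of passing column j through the network alone.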

Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and are also easy to learn using our optimization algorithms (e.g.

Similarly, the fact that deeper networks (with multiple hidden layers) can work better than single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.

As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more.

We could train three separate neural networks, each with one hidden layer of some size, and compare the resulting classifiers. In the diagram above, we can see that Neural Networks with more neurons can express more complicated functions.

For example, the model with 20 hidden neurons fits all the training data but at the cost of segmenting the space into many disjoint red and green decision regions.

The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It’s clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e.

Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss.

In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima.

A guide to an efficient way to build neural network architectures- Part I: Hyper-parameter selection and tuning for Dense Networks using Hyperas on Fashion-MNIST

That is what is said when people give lessons on life. It might be true for life, but for training neural networks I hold a different belief. This stems from the fact that many people I met, when asked why they chose certain values in their neural network architecture, generally replied “well, just intuition and a little hit and trial”. This didn’t seem appropriate, as it doesn’t tell us how efficient our model actually is, or whether there is any architecture combination that could help us achieve better generalized results.

We load the data from keras.datasets. We are only provided with a train and a test set; since we also need a validation set, we use scikit-learn’s train-test split to divide the training data into 80% train and 20% validation. As we will be working at first with dense neural networks, we need to do a little pre-processing of the data before we input it to the networks.

The pre-processing consists of changing the shape of the train, validation and test data from 28x28 format to a list of 784 values, and normalizing our input so that the values range from 0 to 1 rather than 0 to 255 (standardization can also be done). Also, since we use categorical cross-entropy, we need to convert our output to one-hot encoded form. It is a good practice to start with a basic model first, and then keep trying to improve it at every step.

Our baseline model would be a softmax classifier that takes a list of pixel values as input and outputs the class label. To further improve the model we need to know which hyper-parameters we can tune in our dense network. The coming sections give a brief overview of the hyper-parameters and how they affect learning, but before that let’s understand why we need hyper-parameter tuning.

We have two options here. One is the scikit-learn GridSearchCV way, which can be used through the scikit-learn API wrapper in Keras; this method does not make use of GPU acceleration and hence cannot achieve great speeds even after parallelizing. The other method, which we will be discussing here, is Hyperas, which enables us to take advantage of GPU acceleration, enabling you to train models at least 10x faster, and is also an easy way to approach hyper-parameter tuning.
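The reshaping, scaling, and one-hot steps described above could look roughly like this in NumPy. The arrays are synthetic stand-ins so the sketch runs without downloading anything; in the article's setting they would come from keras.datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for Fashion-MNIST: 10 grayscale 28x28 images, labels 0..9.
images = rng.integers(0, 256, size=(10, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=10)

# Flatten each 28x28 image into a vector of 784 values and scale to [0, 1].
x = images.reshape(len(images), 28 * 28).astype("float32") / 255.0

# One-hot encode the labels for categorical cross-entropy (10 classes).
num_classes = 10
y = np.eye(num_classes, dtype="float32")[labels]
```

Indexing an identity matrix by the label array is a compact way to build one-hot rows: row k of the identity matrix is exactly the one-hot vector for class k.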

To use Hyperas you will first need to install the package using pip. Then, in your project, you will need to add the Hyperas import statements. Refer to the FAQ section if you encounter any error. Once you are done with the above, carrying out hyper-parameter optimization requires 3 code snippets. First, you need to create a function that directly loads train and validation data from the source; if you have done any pre-processing, it is recommended to store the pre-processed data in a pickle/numpy/hdf5/csv file and write code in the data function to access data from that file.

model.add(Activation({{choice(['relu', 'sigmoid'])}})): this line conveys that we wish to tune the activation function and find the best fit among ReLU and sigmoid. model.add(Dropout({{uniform(0, 1)}})): this line conveys that we wish to tune the Dropout rate (in Keras this argument is the fraction of units to drop, not the keep probability) and find the best fit among the real numbers between 0 and 1.

The values contained in choice are the candidate values we wish to try for our hyper-parameter, and the range contained within uniform is the range of real numbers within which we expect the best value of our hyper-parameter to lie. The same kind of construct can be used to tune the number of layers, to find out whether a two- or three-layer dense network architecture would be a good choice.

Hyperparameter tuning for machine learning models.

When creating a machine learning model, you'll be presented with design choices as to how to define your model architecture.

In true machine learning fashion, we'll ideally ask the machine to perform this exploration and select the optimal model architecture automatically.

Model parameters are learned during training when we optimize a loss function using something like gradient descent. The process for learning parameter values is shown generally below.

The ultimate goal for any machine learning model is to learn from examples in such a manner that the model is capable of generalizing the learning to new instances which it has not yet seen.

At a very basic level, you should train on a subset of your total dataset, holding out the remaining data for evaluation to gauge the model's ability to generalize - in other words, "how well will my model do on data which it hasn't directly learned from during training?"

The introduction of a validation dataset allows us to evaluate the model on different data than it was trained on and select the best model architecture, while still holding out a subset of the data for the final evaluation at the end of our model development.
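A three-way split along these lines can be sketched in plain Python. The function name and the 60/20/20 proportions are assumptions for illustration; the text does not prescribe specific fractions here.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle `items` and cut them into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                    # held out for the final evaluation
    val = shuffled[n_test:n_test + n_val]       # used to compare architectures
    train = shuffled[n_test + n_val:]           # used to learn model parameters
    return train, val, test

train, val, test = train_val_test_split(range(100))
```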

You can also leverage more advanced techniques such as K-fold cross validation in order to essentially combine training and validation data for both learning the model parameters and evaluating the model without introducing data leakage.
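K-fold cross validation can be sketched as a generator of index splits. The helper names below are hypothetical, `evaluate` is a user-supplied function assumed to train on one index list and score on the other, and the data is assumed to be shuffled beforehand since the folds are contiguous.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, val_indices) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_val_score(evaluate, n_samples, k=5):
    """Average a user-supplied `evaluate(train_idx, val_idx)` score over k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, k=5))
```

Every sample lands in the validation set exactly once, and within a given fold the model never sees its validation samples during training.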

Recall that I previously mentioned that the hyperparameter tuning methods relate to how we sample possible model architecture candidates from the space of possible hyperparameter values.

In the following visualization, the $x$ and $y$ dimensions represent two hyperparameters, and the $z$ dimension represents the model's score (defined by some evaluation metric) for the architecture defined by $x$ and $y$.

With this technique, we simply build a model for each possible combination of all of the hyperparameter values provided, evaluating each model, and selecting the architecture which produces the best results.
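This exhaustive approach is commonly called grid search. A minimal sketch follows; the grid values are hypothetical and `toy_score` is a stand-in for training a model and measuring its validation score.

```python
from itertools import product

def grid_search(grid, score_fn):
    """Evaluate every combination in `grid` and return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid and a toy score function standing in for validation accuracy.
grid = {"n_neurons": [32, 64, 128], "learning_rate": [0.1, 0.01, 0.001]}

def toy_score(params):
    return -abs(params["n_neurons"] - 64) - 100 * abs(params["learning_rate"] - 0.01)

best_params, best_score = grid_search(grid, toy_score)
```

The number of models grows as the product of the list lengths (here 3 x 3 = 9), which is why grid search becomes expensive as more hyperparameters are added.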

“Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets.” - Bergstra, 2012

In the following example, we're searching over a hyperparameter space where one hyperparameter has significantly more influence on optimizing the model score; the distributions shown on each axis represent the model's score.
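Random search, the sampling approach associated with this observation, can be sketched as follows. The search space, the log-uniform learning-rate range, and the toy score (in which only the learning rate matters) are assumptions for illustration.

```python
import random

def random_search(space, score_fn, n_trials=20, seed=0):
    """Sample `n_trials` configurations from `space` and keep the best one.

    `space` maps each hyperparameter name to a function that draws one value
    from a random.Random instance.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space: one discrete choice, one log-uniform range.
space = {
    "activation": lambda rng: rng.choice(["relu", "sigmoid"]),
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),
}

def toy_score(params):
    # Toy stand-in where only the learning rate matters.
    return -abs(params["learning_rate"] - 0.01)

best_params, best_score = random_search(space, toy_score)
```

Because every trial draws a fresh learning rate, 20 trials probe 20 distinct values of the hyperparameter that matters, whereas a grid with the same budget would revisit only a handful of values.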

This model uses the hyperparameter values $\lambda_{1, \ldots, i}$ and corresponding scores $v_{1, \ldots, i}$ we've observed thus far to approximate a continuous score function over the hyperparameter space.

This approximated function also includes the degree of certainty of our estimate, which we can use to identify the candidate hyperparameter values that would yield the largest expected improvement over the current score.
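To make expected improvement concrete, the sketch below assumes the approximated function gives a Gaussian prediction (mean and standard deviation) at each candidate and that higher scores are better. The closed-form expression used is the standard one for a Gaussian prediction; the candidate names and numbers are hypothetical.

```python
import math

def expected_improvement(mu, sigma, best_score):
    """Expected improvement over `best_score` for a Gaussian prediction N(mu, sigma^2).

    Assumes higher scores are better.
    """
    if sigma <= 0.0:
        return max(mu - best_score, 0.0)
    z = (mu - best_score) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return (mu - best_score) * cdf + sigma * pdf

# Hypothetical surrogate predictions (mean, std) for three candidates.
candidates = {"A": (0.80, 0.01), "B": (0.78, 0.10), "C": (0.70, 0.02)}
best_so_far = 0.79
ei = {name: expected_improvement(mu, sd, best_so_far)
      for name, (mu, sd) in candidates.items()}
pick = max(ei, key=ei.get)
```

In this toy setup candidate B is selected even though its predicted mean is below the current best, because its large uncertainty leaves more room for a sizeable improvement than the confident but marginal candidate A.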

37 Reasons why your Neural Network is not working

Check if the input data you are feeding the network makes sense.

So print/display a couple of batches of input and target output and make sure they are OK.

Try passing random numbers instead of actual data and see if the error behaves the same way.

Also make sure shuffling input samples works the same way for output labels.
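One way to keep inputs and labels aligned, sketched here in plain Python, is to shuffle a single list of indices and apply it to both. The function name and toy data are illustrative.

```python
import random

def shuffle_together(inputs, labels, seed=0):
    """Shuffle inputs and labels with one shared permutation so pairs stay aligned."""
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels must have the same length")
    order = list(range(len(inputs)))
    random.Random(seed).shuffle(order)
    return [inputs[i] for i in order], [labels[i] for i in order]

# Toy data where each label is recoverable from its input, so alignment can be checked.
inputs = [[i, i * i] for i in range(8)]
labels = [i % 2 for i in range(8)]
shuffled_x, shuffled_y = shuffle_together(inputs, labels)
still_aligned = all(x[0] % 2 == y for x, y in zip(shuffled_x, shuffled_y))
```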

Maybe the non-random part of the relationship between the input and output is too small compared to the random part (one could argue that stock prices are like this).

The cutoff point is up for debate, as this paper got above 50% accuracy on MNIST using 50% corrupted labels.

If your dataset hasn’t been shuffled and has a particular order to it (ordered by label) this could negatively impact the learning.
