AI News, BOOK REVIEW: Bayesian Deep Learning Part II: Bridging PyMC3 and Lasagne to build a Hierarchical Neural Network

Bayesian Deep Learning Part II: Bridging PyMC3 and Lasagne to build a Hierarchical Neural Network

I blogged about Bayesian Deep Learning with PyMC3 where I built a simple hand-coded Bayesian Neural Network and fit it on a toy data set.

Today, we will build a more interesting model using Lasagne, a flexible Theano library for constructing various types of Neural Networks.

As you may know, PyMC3 also uses Theano, so it should be possible to build the Artificial Neural Network (ANN) in Lasagne, place Bayesian priors on its parameters, and then use variational inference (ADVI) in PyMC3 to estimate the model.

We'll then use mini-batch ADVI to fit the model on the MNIST handwritten digit data set.

Then, we will follow up on another idea expressed in my last blog post -- hierarchical ANNs.

Finally, due to the power of Lasagne, we can just as easily build a Hierarchical Bayesian Convolutional ANN with max-pooling layers to achieve 98% accuracy on MNIST.

Data set: MNIST

We will be using the classic MNIST data set of handwritten digits.

Contrary to my previous blog post, which was limited to a toy data set, MNIST is an actually challenging ML task (though of course not among the most challenging ones).

# We first define a download function, supporting both Python 2 and 3.
# We then define functions for loading MNIST images and labels.
# For convenience, they also download the requested files if needed.
# The inputs are vectors now, we reshape them to monochrome 2D images,
# following the shape convention: (examples, channels, rows, columns)
data = data.reshape(-1, 1, 28, 28)
# The inputs come as bytes, we convert them to float32 in range [0,1].
# (Actually to range [0, 255/256], for compatibility to the version
# The labels are vectors of integers now, that's exactly what we want.
# We can now download and read the training and test set images and labels.
# We reserve the last 10000 training examples for validation.
# We just return all the arrays in order, as expected in main().
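The comments above describe the loading steps, but most of the code is missing. A minimal sketch of the reshaping, scaling, and validation split they describe might look like the following. It runs on synthetic bytes rather than the downloaded files; the helper names `prepare_images` and `split_validation` are hypothetical, and the random buffer only stands in for raw MNIST pixel data.

```python
import numpy as np

def prepare_images(raw_bytes):
    """Turn a flat uint8 pixel buffer into float32 images of shape (N, 1, 28, 28)."""
    data = np.frombuffer(raw_bytes, dtype=np.uint8)
    # Shape convention: (examples, channels, rows, columns)
    data = data.reshape(-1, 1, 28, 28)
    # Dividing by 256 maps bytes 0..255 into the range [0, 255/256]
    return data / np.float32(256)

def split_validation(X, y, n_val):
    """Reserve the last n_val examples for validation."""
    return X[:-n_val], y[:-n_val], X[-n_val:], y[-n_val:]

# Synthetic stand-in for the pixel bytes of 50 images (assumption: not real MNIST)
rng = np.random.RandomState(0)
raw = rng.randint(0, 256, size=50 * 28 * 28).astype(np.uint8).tobytes()
labels = rng.randint(0, 10, size=50).astype(np.uint8)

X = prepare_images(raw)
X_train, y_train, X_val, y_val = split_validation(X, labels, n_val=10)
print(X_train.shape, X_val.shape, X.dtype)
```

With the real data, the same split would be called with `n_val=10000`, matching the comment about reserving the last 10000 training examples.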

# Building a theano.shared variable with a subset of the data to make construction of the model faster.

I opened a GitHub issue on Lasagne's repo and a few days later, PR #695 was merged, which allowed for an even nicer integration of the two, as I show below.

First comes the Lasagne function that creates an ANN with 2 fully connected hidden layers of 800 neurons each. This is pure Lasagne code taken almost directly from the tutorial.

The trick comes in when creating the layer with lasagne.layers.DenseLayer where we can pass in a function init which has to return a Theano expression to be used as the weight and bias matrices.

# Add a fully-connected layer of 800 units, using the linear rectifier, and

# Finally, we'll add the fully-connected output layer, of 10 softmax units:

Because PyMC3 requires every random variable to have a different name, we're creating a class instead which creates uniquely named priors.

The priors act as regularizers here to try and keep the weights of the ANN small.
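The naming pattern can be sketched without PyMC3 installed. In the sketch below, `UniquelyNamedPriors`, `make_prior`, and `record_prior` are hypothetical names chosen for illustration. With PyMC3, `make_prior` could be something like `lambda name, shape: pm.Normal(name, mu=0, sd=0.1, shape=shape)` called inside a model context, where the `sd` value is only a placeholder.

```python
class UniquelyNamedPriors(object):
    """Callable that hands out a fresh, uniquely named prior for each parameter.

    `make_prior(name, shape)` is a hypothetical hook that builds the actual
    random variable; here it is injected so the sketch runs without PyMC3.
    """

    def __init__(self, make_prior, prefix="w"):
        self.make_prior = make_prior
        self.prefix = prefix
        self.count = 0

    def __call__(self, shape):
        # Each call bumps the counter, so no two priors share a name
        self.count += 1
        name = "%s%d" % (self.prefix, self.count)
        return self.make_prior(name, shape)

# Stand-in factory that only records what would be created (assumption)
created = []

def record_prior(name, shape):
    created.append((name, shape))
    return name

init = UniquelyNamedPriors(record_prior)
# Weight and bias shapes for 784 inputs, two hidden layers of 800 units, 10 outputs
for shape in [(784, 800), (800,), (800, 800), (800,), (800, 10), (10,)]:
    init(shape)

print(created)
```

Because the object is callable with a shape, the same instance can be handed to every layer as its initializer, and each parameter still ends up with its own name.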

For point (MAP) estimates, placing a zero-mean Gaussian prior on the weights is mathematically equivalent to putting an L2 loss term that penalizes large weights into the objective function, as is commonly done.
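A quick numerical check makes the link between a Gaussian prior and an L2 penalty concrete. The negative log density of independent Normal(0, sd) priors differs from an L2 penalty of strength 1 / (2 * sd**2) only by a constant that does not depend on the weights. The function names and the value `sd = 0.1` are illustrative assumptions.

```python
import numpy as np

def neg_log_gaussian_prior(w, sd):
    """Negative log density of independent Normal(0, sd) priors on w."""
    return np.sum(0.5 * (w / sd) ** 2 + np.log(sd) + 0.5 * np.log(2 * np.pi))

def l2_penalty(w, sd):
    """L2 penalty with strength 1 / (2 * sd**2)."""
    return np.sum(w ** 2) / (2.0 * sd ** 2)

rng = np.random.RandomState(42)
sd = 0.1  # illustrative prior standard deviation (assumption)
w_a = rng.normal(size=1000)
w_b = rng.normal(size=1000)

# The gap between the two quantities is the same for any weight vector
gap_a = neg_log_gaussian_prior(w_a, sd) - l2_penalty(w_a, sd)
gap_b = neg_log_gaussian_prior(w_b, sd) - l2_penalty(w_b, sd)
print(gap_a, gap_b)
```

Since the gap is constant, minimizing either quantity over the weights gives the same result, and a smaller `sd` corresponds to a stronger penalty.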

If you compare what we have done so far to the previous blog post, it's apparent that using Lasagne is much more comfortable.

We don't have to manually keep track of the shapes of the individual matrices, nor do we have to handle the underlying matrix math to make it all fit together.

Next are some functions to set up mini-batch ADVI; you can find more information in the prior blog post.

# Return random data samples of set size batchsize each iteration
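A generator matching that comment could look like the sketch below. The function name, the default `batchsize`, and the fixed seed are assumptions made for illustration; the toy array stands in for the training data.

```python
import numpy as np

def create_minibatch(data, batchsize=500, seed=0):
    """Yield an endless stream of random mini-batches drawn with replacement."""
    rng = np.random.RandomState(seed)
    while True:
        # Return random data samples of set size batchsize each iteration
        ixs = rng.randint(data.shape[0], size=batchsize)
        yield data[ixs]

# Toy data (assumption): 1000 rows where the second column is the first plus one
data = np.arange(2000, dtype=np.float32).reshape(1000, 2)
batches = create_minibatch(data, batchsize=50)
first = next(batches)
second = next(batches)
print(first.shape, second.shape)
```

When inputs and labels are batched by two separate generators, both need the same seed so that the sampled rows stay aligned.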

Hierarchical Neural Network: Learning Regularization from data

The connection between the standard deviation of the weight prior and the strength of the L2 penalization term leads to an interesting idea.

In Bayesian modeling it is quite common to just place hyperpriors in cases like this and learn the optimal regularization to apply from the data.
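To illustrate what a hyperprior on the regularization strength means, here is a forward-sampling sketch of the generative side only. Each layer draws its own prior standard deviation, and the weights are then drawn using it. The half-normal hyperprior, its scale, and the layer shapes are assumptions. In the actual model these standard deviations would be inferred from data by ADVI rather than sampled like this.

```python
import numpy as np

def sample_hierarchical_weights(shapes, rng, hyper_scale=1.0):
    """Draw one sd per layer from a half-normal hyperprior, then that layer's weights."""
    layers = []
    for shape in shapes:
        # Half-normal draw: absolute value of a zero-mean normal
        sd = abs(rng.normal(scale=hyper_scale))
        w = rng.normal(loc=0.0, scale=sd, size=shape)
        layers.append((sd, w))
    return layers

rng = np.random.RandomState(1)
layer_shapes = [(784, 800), (800, 800), (800, 10)]  # assumed layer sizes
layers = sample_hierarchical_weights(layer_shapes, rng)

for sd, w in layers:
    print("prior sd %.3f, empirical weight sd %.3f" % (sd, w.std()))
```

Because each layer has its own standard deviation, the model is free to regularize each layer by a different amount.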

The values learned for the individual layers are all pretty different, suggesting that it makes sense to change the amount of regularization that gets applied at each layer of the network.

Convolutional Neural Network

This is pretty nice but everything so far would have also been pretty simple to implement directly in PyMC3 as I have shown in my previous post.

# Another convolution with 32 5x5 kernels, and another 2x2 pooling:

# Finally, we'll add the fully-connected output layer, of 10 softmax units:
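The comments above mention 5x5 convolutions with 32 kernels and 2x2 pooling. A short calculation shows how the spatial size shrinks through such a stack. It assumes two convolution and pooling stages, unpadded ("valid") convolutions with stride 1, and that the first convolution also uses 5x5 kernels; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, pool):
    """Spatial output size of non-overlapping max-pooling along one axis."""
    return size // pool

size = 28                      # MNIST images are 28x28
size = conv_out(size, 5)       # first 5x5 convolution  -> 24
size = pool_out(size, 2)       # first 2x2 max-pooling  -> 12
size = conv_out(size, 5)       # second 5x5 convolution -> 8
size = pool_out(size, 2)       # second 2x2 max-pooling -> 4
n_features = 32 * size * size  # 32 feature maps of 4x4 -> 512
print(size, n_features)
```

Under these assumptions, the dense layers that follow receive 512 features per image.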

I also tried this with the hierarchical model, but it achieved lower accuracy (95%), I assume due to overfitting.

As our predictions are categories, we can't simply compute the posterior predictive standard deviation.

I'm not quite sure if this is the best way to do this; leave a comment if there's a more established method that I don't know about.
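One common option for categorical predictions, not necessarily the one used in this post, is the entropy of the predicted class frequencies across posterior predictive draws. An example that receives the same label in every draw gets zero entropy, and the entropy grows as the draws disagree. The function name and the toy draws below are assumptions.

```python
import numpy as np

def predictive_entropy(pred_samples, n_classes=10):
    """Entropy of the class frequencies across posterior predictive draws.

    pred_samples has shape (n_draws, n_examples) and holds predicted labels.
    """
    n_draws, n_examples = pred_samples.shape
    entropy = np.zeros(n_examples)
    for i in range(n_examples):
        counts = np.bincount(pred_samples[:, i], minlength=n_classes)
        p = counts / float(n_draws)
        p = p[p > 0]  # treat 0 * log(0) as 0
        entropy[i] = -np.sum(p * np.log(p))
    return entropy

# Toy draws (assumption): example 0 always gets label 3, example 1 is split evenly
draws = np.array([[3, 1], [3, 7], [3, 1], [3, 7]])
h = predictive_entropy(draws)
print(h)
```

Sorting test examples by this value gives a simple way to inspect the predictions the model is least sure about.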

By bridging Lasagne and PyMC3 and by using mini-batch ADVI to train a Bayesian Neural Network on a decently sized and complex data set (MNIST), we took a big step towards practical Bayesian Deep Learning on real-world problems.

By relying on a commonly used language (Python) and abstracting the computational backend (Theano) we were able to quite easily leverage the power of that ecosystem and use PyMC3 in a manner that was never thought about when creating it.

New Lasagne feature: arbitrary expressions as layer parameters

This post is another collaboration with Jan Schlüter from the OFAI (@f0k on GitHub), a fellow MIR researcher and one of the lead developers of Lasagne.

He recently added a cool new feature that we wanted to highlight: enabling the use of arbitrary Theano expressions as layer parameters.

One of the key design principles of Lasagne is transparency: we try not to hide Theano or numpy behind an additional layer of abstractions and encapsulation, but rather expose their functionality and data types and try to follow their conventions.

In keeping with this philosophy, Jan recently added a feature that we had been discussing since early on in the design of the API (#11): it allows any learnable layer parameter to be specified as a mathematical expression evaluating to a correctly-shaped tensor.

This new feature makes it possible to constrain network parameters in various, potentially creative ways.

You might also be tempted to try sticking a ReLU in there (T.maximum(w, 0)), but note that applying the linear rectifier to the weight matrix would lead to many of the underlying weights getting stuck at negative values, as the linear rectifier has zero gradient for negative inputs!
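The zero-gradient problem can be seen in a small NumPy sketch that writes out the derivatives by hand. The exponential is included only as a contrasting example of a positivity constraint whose gradient never vanishes; it is an assumption for illustration, not something taken from the post.

```python
import numpy as np

def relu_constraint_grad(w):
    # d/dw max(w, 0) is 1 for w > 0 and 0 for w < 0
    return (w > 0).astype(w.dtype)

def exp_constraint_grad(w):
    # d/dw exp(w) = exp(w), which is strictly positive everywhere
    return np.exp(w)

w = np.array([-2.0, -0.5, 0.5, 2.0])  # underlying, unconstrained weights
upstream = np.ones_like(w)            # pretend gradient arriving from the loss

relu_grad = upstream * relu_constraint_grad(w)
exp_grad = upstream * exp_constraint_grad(w)
print(relu_grad)  # negative entries receive no gradient at all
print(exp_grad)   # every entry receives some gradient
```

With the rectifier, a gradient step leaves the two negative weights exactly where they are, which is the "stuck" behaviour described above.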

There are plenty of other creative uses, such as constraining weights to be positive semi-definite (for whatever reason).

There are only a couple of limitations to using Theano expressions as layer parameters.
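As for the positive semi-definite constraint mentioned above, one way to obtain it is to build the weight matrix as the product of an unconstrained matrix with its own transpose. The NumPy sketch below checks the math. In Theano the analogous expression would presumably be `T.dot(a, a.T)` on a shared variable, which is an assumption about how it might be written rather than code from the post.

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.normal(size=(5, 5))  # unconstrained underlying parameter

# A matrix of the form A A^T is symmetric positive semi-definite
w = a.dot(a.T)

eigenvalues = np.linalg.eigvalsh(w)
print(eigenvalues.min())
```

Gradients flow through the product to the underlying matrix, so the constraint holds throughout training without any projection step.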

In frameworks building on hard-coded layer implementations rather than an automatic expression compiler, all these examples would require writing custom backpropagation code.
