
Classifying plankton with deep neural networks

The National Data Science Bowl, a data science competition where the goal was to classify images of plankton, has just ended.

We decided to participate together because we are all very interested in deep learning, and a collaborative effort to solve a practical problem is a great way to learn.

There were seven of us, so over the course of three months, we were able to try a plethora of different things, including a bunch of recently published techniques, and a couple of novelties.

The images obtained using the camera were already processed by a segmentation algorithm to identify and isolate individual organisms, and then cropped accordingly.

It is also differentiable, which means that models trained with gradient-based methods (such as neural networks) can optimize it directly - it is unnecessary to use a surrogate loss function.
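As a concrete reference point, here is a minimal numpy sketch of the multiclass log loss discussed here; the clipping constant is a common practical detail to keep the logarithm finite, not something taken from the original write-up:

    import numpy as np

    def multiclass_log_loss(y_true, probs, eps=1e-15):
        # y_true: (n,) integer class labels; probs: (n, n_classes) predicted probabilities
        probs = np.clip(probs, eps, 1 - eps)                # avoid log(0)
        probs = probs / probs.sum(axis=1, keepdims=True)    # renormalize after clipping
        return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

    # toy usage: 3 samples, 4 classes
    y = np.array([0, 2, 1])
    p = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.2, 0.5, 0.1],
                  [0.1, 0.6, 0.2, 0.1]])
    print(multiclass_log_loss(y, p))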

Although the two are obviously correlated, we paid special attention to this because it was often the case that significant improvements to the log loss would barely affect the classification accuracy of the models.

Image classification problems are often approached using convolutional neural networks these days, and with good reason: they achieve record-breaking performance on some really difficult tasks.

Deep learning approaches are often said to require enormous amounts of data to work well, but recently this notion has been challenged, and our results in this competition also indicate that this is not necessarily true.

Judicious use of techniques to prevent overfitting, such as dropout, weight decay, data augmentation, pre-training, pseudo-labeling and parameter sharing, has enabled us to train very large models with up to 27 million parameters on this dataset.

We performed very little pre-processing, other than rescaling the images in various ways and then performing global zero mean unit variance (ZMUV) normalization, to improve the stability of training and increase the convergence speed.
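To make the normalization step concrete, here is a small numpy sketch of global ZMUV normalization; the (n, 1, 95, 95) shape matches the input shape mentioned further down, and the idea is that the mean and standard deviation are estimated once on the training set and then reused for validation and test data:

    import numpy as np

    def zmuv_normalize(images, mean=None, std=None):
        # Global zero mean, unit variance: one mean and one std for the whole dataset.
        if mean is None:
            mean = images.mean()
        if std is None:
            std = images.std()
        return (images - mean) / std, mean, std

    train = np.random.rand(100, 1, 95, 95).astype(np.float32)   # placeholder data
    test = np.random.rand(10, 1, 95, 95).astype(np.float32)
    train_n, mu, sigma = zmuv_normalize(train)
    test_n, _, _ = zmuv_normalize(test, mean=mu, std=sigma)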

Rescaling the images was necessary because they vary in size a lot: the smallest ones are less than 40 by 40 pixels, whereas the largest ones are up to 400 by 400 pixels.

Unfortunately, centering and rescaling the images based on image moments did not improve results, but the image moments themselves turned out to be useful as additional features for classification (see below).

We ended up with some pretty extreme augmentation parameters. We augmented the data on-demand during training (realtime augmentation), which allowed us to combine the image rescaling and augmentation into a single affine transform.
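The sketch below illustrates the idea of folding rescaling, rotation, zooming and flipping into one affine transform that is applied in a single interpolation step; it uses scipy.ndimage and made-up parameter ranges, so it should be read as an illustration of the approach rather than a reproduction of the exact augmentation pipeline:

    import numpy as np
    from scipy import ndimage

    def random_affine_patch(image, rng, out_size=95):
        rot = rng.uniform(0, 2 * np.pi)          # rotation angle
        zoom = rng.uniform(1 / 1.6, 1.6)         # illustrative zoom range
        flip = rng.choice([1, -1])               # random horizontal flip

        # 2x2 matrix mapping output pixel coordinates to input pixel coordinates
        c, s = np.cos(rot), np.sin(rot)
        matrix = np.array([[c, -s], [s, c]]) / zoom
        matrix[:, 1] *= flip

        # keep the transform centred on the middle of the image
        center_out = np.array([out_size, out_size]) / 2.0
        center_in = np.array(image.shape, dtype=float) / 2.0
        offset = center_in - matrix @ center_out

        return ndimage.affine_transform(image, matrix, offset=offset,
                                        output_shape=(out_size, out_size), order=1)

    rng = np.random.default_rng(0)
    patch = random_affine_patch(np.random.rand(200, 300), rng)   # (95, 95) output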

We used 'same' convolutions (i.e. the output feature maps are the same size as the input feature maps) and overlapping pooling with window size 3 and stride 2.

We started with fairly shallow models by modern standards (~ 6 layers) and gradually added more layers when we noticed this improved performance (it usually did).

We experimented with strided convolutions with 7x7 filters in the first two layers for a while, inspired by the work of He et al., but we were unable to achieve the same performance with this in our networks.

We applied the same stack of convolutional layers to several rotated and flipped versions of the same input image, concatenated the resulting feature representations, and fed those into a stack of dense layers.

Cyclic pooling also allowed us to reduce the batch size by a factor of 4: instead of having batches of 128 images, each batch now contained 32 images and was then turned into a batch with an effective size of 128 again inside the network, by stacking the original batch in 4 orientations.
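A rough numpy sketch of this idea: each batch of 32 images is expanded to 128 by stacking the four 90-degree rotations, the shared convolutional stack processes all 128, and the resulting feature vectors are pooled back down to 32 per-image representations. The root-mean-square pooling function below is just one of several pooling functions one could use here:

    import numpy as np

    def stack_orientations(batch):
        # (n, 1, h, w) -> (4n, 1, h, w): the four 90-degree rotations of every image
        rots = [np.rot90(batch, k, axes=(2, 3)) for k in range(4)]
        return np.concatenate(rots, axis=0)

    def cyclic_pool(features, n):
        # (4n, d) feature vectors -> (n, d) by pooling over the 4 orientations
        f = features.reshape(4, n, -1)
        return np.sqrt(np.mean(f ** 2, axis=0))   # root-mean-square pooling

    batch = np.random.rand(32, 1, 95, 95).astype(np.float32)
    big_batch = stack_orientations(batch)                      # effective batch size 128
    features = np.random.rand(128, 256).astype(np.float32)     # stand-in for convnet output
    pooled = cyclic_pool(features, n=32)                       # (32, 256)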

We tried several pooling functions over the course of the competition, as well as different positions in the network for the pooling operation (just before the output layer, between hidden layers, …).

We also considered having the model do 8-way pooling by including flipped versions of each rotated image copy (dihedral pooling, after dihedral groups).

Later on we spent some time implementing CUDA kernels for the roll operations and their gradients, because networks with many rolled layers were getting pretty slow to train.

In most of the models we evaluated, we only inserted convolutional roll operations after the pooling layers, because this reduced the size of the feature maps that needed to be copied and stacked together.

Note that it is perfectly possible to build a cyclic pooling convnet without any roll operations, but it’s not possible to have roll operations in a network without cyclic pooling.

Instead of taking the maximum of the input and zero, y = max(x, 0), leaky ReLUs take the maximum of the input and a scaled version of the input, y = max(x, a*x).
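In code, the leaky rectifier is a one-liner; the value of a below is only an example (a = 0 recovers the ordinary ReLU, and small positive values keep a non-zero gradient for negative inputs):

    import numpy as np

    def leaky_relu(x, a=1.0 / 3):
        return np.maximum(x, a * x)   # y = max(x, a*x), assuming 0 <= a < 1

    x = np.array([-3.0, -0.6, 0.0, 2.0])
    print(leaky_relu(x))   # [-1.  -0.2  0.   2. ]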

We started out using networks with 2 or 3 spatial pooling layers, and we initially had some trouble getting networks with more pooling stages to work well.

We started out with the traditional approach of 2x2 max-pooling, but eventually switched to 3x3 max-pooling with stride 2 (which we’ll refer to as 3x3s2), mainly because it allowed us to use a larger input size while keeping the same feature map size at the topmost convolutional layer, and without increasing the computational cost significantly.

As an example, a network with 80x80 input and 4 2x2 pooling stages will have feature maps of size 5x5 at the topmost convolutional layer.
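The arithmetic behind this is easy to check with a small helper, assuming 'same' convolutions so that only the pooling stages shrink the feature maps; with 3x3s2 pooling, a 95x95 input (matching the input shape given further down) also ends up at 5x5 after four stages:

    def top_feature_map_size(input_size, n_stages, window, stride):
        size = input_size
        for _ in range(n_stages):
            size = (size - window) // stride + 1   # valid pooling
        return size

    print(top_feature_map_size(80, 4, window=2, stride=2))   # 5  (2x2 pooling)
    print(top_feature_map_size(95, 4, window=3, stride=2))   # 5  (3x3s2 pooling)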

To allow the network to learn this, we experimented with combinations of different rescaling strategies within the same network, by combining multiple networks with different rescaled inputs together into ‘multiscale’ networks.

What worked best was to combine a network with inputs rescaled based on image size, and a smaller network with inputs rescaled by a fixed factor.

Of the types of features we evaluated, the image size, the features based on image moments and the Haralick texture features were the ones that stood out the most in terms of performance.
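As an indication of what such features look like, here is a small numpy sketch computing the image size and a few moment-based quantities (centroid and second-order central moments); texture features such as Haralick's would come from a dedicated library and are omitted here:

    import numpy as np

    def moment_features(img):
        # image size plus centroid and second-order central moments of the intensity
        h, w = img.shape
        ys, xs = np.mgrid[0:h, 0:w]
        m00 = img.sum()
        cy, cx = (img * ys).sum() / m00, (img * xs).sum() / m00
        mu20 = (img * (ys - cy) ** 2).sum() / m00
        mu02 = (img * (xs - cx) ** 2).sum() / m00
        mu11 = (img * (ys - cy) * (xs - cx)).sum() / m00
        return np.array([h, w, cy, cx, mu20, mu02, mu11])

    print(moment_features(np.random.rand(60, 80)))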

To deal with variance due to the random weight initialization, we trained each feature network 10 times and blended the copies with uniform weights.
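Blending with uniform weights is simply an average of the predicted probability matrices; a minimal sketch (121 is the number of plankton classes in the competition):

    import numpy as np

    def uniform_blend(prediction_sets):
        # average class probabilities of several retrained copies of the same model
        return np.mean(np.stack(prediction_sets, axis=0), axis=0)

    # 10 copies, each predicting probabilities for 500 images and 121 classes
    preds = [np.random.dirichlet(np.ones(121), size=500) for _ in range(10)]
    blended = uniform_blend(preds)   # shape (500, 121)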

The input shape is (32, 1, 95, 95), in bc01 order (batch size, number of channels, height, width).

We trained most of the models with about 215000 gradient steps and eventually settled on a discrete learning rate schedule with two 10-fold decreases (following Krizhevsky et al.), after about 180000 and 205000 gradient steps respectively.
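As a sketch, such a schedule can be written as a simple step function of the gradient step count; the base learning rate below is a placeholder, only the step counts follow the description above:

    def learning_rate(step, base_lr=0.003):
        # discrete schedule with two 10-fold decreases near the end of training
        if step < 180000:
            return base_lr
        elif step < 205000:
            return base_lr / 10.0
        else:
            return base_lr / 100.0

    for s in (0, 180000, 205000, 214999):
        print(s, learning_rate(s))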

In line with the literature, we found that pre-training a network serves as an excellent regularizer (much higher train error, slightly better validation score), but the validation results with test-time augmentation (see below) were consistently slightly worse for some reason.

We did not try a denoising autoencoder approach for two reasons: first of all, according to the results described by Masci et al., the max- and unpooling approach produces way better filters than the denoising approach, and the further improvement of combining these approaches is negligible.

Possibly, by the time the randomly initialized dense layers are in a suitable parameter range, the network has already forgotten a substantial amount of the information it acquired during the pre-training phase.

We experimented both with hard targets (one-hot coded) and soft targets (predicted probabilities), but quickly settled on soft targets as these gave much better results.

Another notable difference is that knowledge distillation is mainly intended for training smaller and faster networks that work nearly as well as bigger models, whereas we used it to train bigger models that perform better than the original model(s).

When pseudo-labeled test data is added to the training set, the network is optimized (or constrained) to generate predictions similar to the pseudo-labels for all possible variations and transformations of the data resulting from augmentation.
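A rough sketch of how pseudo-labeled test data with soft targets can be mixed into the training set; the fraction of pseudo-labeled examples and the helper itself are illustrative, not taken from the original pipeline:

    import numpy as np

    def add_pseudo_labels(train_x, train_y, test_x, test_soft_labels, rng, fraction=0.33):
        # make pseudo-labeled test examples roughly `fraction` of the combined set
        n_test = int(fraction * len(train_x) / (1.0 - fraction))
        idx = rng.choice(len(test_x), size=min(n_test, len(test_x)), replace=False)
        x = np.concatenate([train_x, test_x[idx]], axis=0)
        y = np.concatenate([train_y, test_soft_labels[idx]], axis=0)   # soft targets throughout
        perm = rng.permutation(len(x))
        return x[perm], y[perm]

    rng = np.random.default_rng(0)
    train_x = np.random.rand(1000, 1, 95, 95)
    train_y = np.eye(121)[rng.integers(0, 121, 1000)]        # one-hot training labels
    test_x = np.random.rand(3000, 1, 95, 95)
    test_soft = rng.dirichlet(np.ones(121), size=3000)       # stand-in ensemble predictions
    x, y = add_pseudo_labels(train_x, train_y, test_x, test_soft, rng)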

We saw the biggest gains in the beginning (up to 0.015 improvement on the leaderboard), but even in the end we were able to improve on very large ensembles of (bagged) models (between 0.003 and 0.009).

After a while, we looked for better ways to tile the augmentation parameter space, and settled on a quasi-random set of 70 transformations, using slightly more modest augmentation parameter ranges than those used for training.
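The averaging step of test-time augmentation is straightforward once the set of transforms is fixed; in this sketch the transforms and the "model" are dummies, and only the structure (predict each transformed copy, then average the probabilities) is the point:

    import numpy as np

    def predict_with_tta(predict_fn, image, transforms):
        # average predicted class probabilities over a fixed set of augmented copies
        preds = [predict_fn(t(image)) for t in transforms]
        return np.mean(preds, axis=0)

    transforms = [lambda im, k=k: np.rot90(im, k) for k in range(4)]   # stand-in transforms
    predict_fn = lambda im: np.full(121, 1.0 / 121)                    # stand-in model
    p = predict_with_tta(predict_fn, np.random.rand(95, 95), transforms)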

To improve the score of the ensemble further, we replaced some of the models by an average of 5 models (including the original one), where each model was trained on a different subset of the data.

We tried a few other things with varying levels of success, and relied on a number of techniques (including the obvious ones) to reduce overfitting. We also monitored the classification accuracy of our models during the competition.
