Classifying plankton with deep neural networks
The National Data Science Bowl, a data science competition where the goal was to classify images of plankton, has just ended.
We decided to participate together because we are all very interested in deep learning, and a collaborative effort to solve a practical problem is a great way to learn.
There were seven of us, so over the course of three months, we were able to try a plethora of different things, including a bunch of recently published techniques, and a couple of novelties.
The images obtained using the camera were already processed by a segmentation algorithm to identify and isolate individual organisms, and then cropped accordingly.
The competition was scored using the multiclass log loss. This metric is differentiable, which means that models trained with gradient-based methods (such as neural networks) can optimize it directly; it is unnecessary to use a surrogate loss function.
Although log loss and classification accuracy are obviously correlated, we paid special attention to the difference between them, because significant improvements to the log loss would often barely affect the classification accuracy of the models.
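A minimal sketch of the multiclass log loss, and of why it can improve while accuracy stays flat. The `eps` clipping is a common convention for Kaggle-style log loss rather than something stated in this post, and the function names are illustrative.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer labels, shape (n,); probs: class probabilities, shape (n, k).
    """
    # Assumption: clip and renormalize so log(0) can never occur.
    probs = np.clip(probs, eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

def accuracy(y_true, probs):
    return np.mean(probs.argmax(axis=1) == y_true)

y = np.array([0, 1, 2])
confident = np.array([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]])
hesitant = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]])

# Same (perfect) accuracy, very different log loss.
print(accuracy(y, confident), accuracy(y, hesitant))
print(multiclass_log_loss(y, confident), multiclass_log_loss(y, hesitant))
```

Both prediction sets classify every example correctly, yet the hesitant one is penalized much more heavily by the log loss.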
Image classification problems are often approached using convolutional neural networks these days, and with good reason: they achieve record-breaking performance on some really difficult tasks.
Deep learning approaches are often said to require enormous amounts of data to work well, but recently this notion has been challenged, and our results in this competition also indicate that this is not necessarily true.
Judicious use of techniques to prevent overfitting, such as dropout, weight decay, data augmentation, pre-training, pseudo-labeling and parameter sharing, enabled us to train very large models with up to 27 million parameters on this dataset.
We performed very little pre-processing, other than rescaling the images in various ways and then performing global zero mean unit variance (ZMUV) normalization, to improve the stability of training and increase the convergence speed.
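Global ZMUV normalization can be sketched as follows: one mean and one standard deviation are estimated over all pixels of all training images, and the same pair is applied to every image. Reading "global" as a single scalar pair for the whole dataset (rather than per-image or per-pixel statistics) is our assumption, as are the function names.

```python
import numpy as np

def fit_zmuv(train_images):
    """Estimate one mean and one standard deviation over every training pixel."""
    # Assumption: "global" = a single scalar mean/std for the whole dataset.
    pixels = np.concatenate([np.asarray(img, dtype=np.float64).ravel() for img in train_images])
    return pixels.mean(), pixels.std()

def apply_zmuv(image, mean, std):
    """Shift and scale an image with the dataset statistics (reuse them at test time)."""
    return (np.asarray(image, dtype=np.float64) - mean) / std

rng = np.random.default_rng(0)
train = [rng.uniform(0, 255, size=(95, 95)) for _ in range(10)]
mean, std = fit_zmuv(train)
normalized = np.stack([apply_zmuv(img, mean, std) for img in train])
print(normalized.mean(), normalized.std())  # approximately 0 and 1
```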
Rescaling the images was necessary because they vary in size a lot: the smallest ones are less than 40 by 40 pixels, whereas the largest ones are up to 400 by 400 pixels.
Unfortunately, centering and rescaling the images based on image moments did not improve results, but they turned out to be useful as additional features for classification (see below).
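The post does not list which moment-based features were used, so the following is only an illustrative sketch of what image moments provide: total intensity ("mass"), the centroid, and the spread along the principal axes derived from second-order central moments. It assumes a grayscale image in which the organism is brighter than the background; the feature set and names are hypothetical.

```python
import numpy as np

def moment_features(image):
    """Hypothetical moment-based shape features of a grayscale image.

    Assumes the object is brighter than the background (invert first if not).
    Returns [mass, centroid_y, centroid_x, major_spread, minor_spread, eccentricity].
    """
    img = image.astype(np.float64)
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()                                   # zeroth moment: total intensity
    cy = (ys * img).sum() / m00                       # centroid (first moments / m00)
    cx = (xs * img).sum() / m00
    # Normalized second-order central moments: covariance of the intensity distribution.
    mu20 = (((xs - cx) ** 2) * img).sum() / m00
    mu02 = (((ys - cy) ** 2) * img).sum() / m00
    mu11 = (((xs - cx) * (ys - cy)) * img).sum() / m00
    # Eigenvalues of the 2x2 covariance matrix: spread along the principal axes.
    half_diff = np.sqrt(((mu20 - mu02) / 2) ** 2 + mu11 ** 2)
    lam1 = (mu20 + mu02) / 2 + half_diff
    lam2 = (mu20 + mu02) / 2 - half_diff
    eccentricity = np.sqrt(max(0.0, 1 - lam2 / lam1)) if lam1 > 0 else 0.0
    return np.array([m00, cy, cx, lam1, lam2, eccentricity])

image = np.zeros((20, 40))
image[5:15, 10:30] = 1.0          # a 10-row by 20-column bright rectangle
print(moment_features(image))
```

Features like these are cheap to compute and, unlike the rescaled pixels, keep information about the organism's original size and shape.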
We ended up with some pretty extreme augmentation parameters. We augmented the data on demand during training (realtime augmentation), which allowed us to combine the image rescaling and augmentation into a single affine transform.
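Folding rescaling and every augmentation step into one affine transform works because affine maps compose by matrix multiplication in homogeneous coordinates. The sketch below builds a single 3x3 matrix from a flip, zoom, stretch, shear, rotation and translation; the parameter names, the composition order, and folding a fixed rescaling factor into `scale` are our assumptions, not the authors' code.

```python
import numpy as np

def augmentation_matrix(rotation_deg=0.0, shear_deg=0.0, scale=1.0, stretch=1.0,
                        flip=False, translation=(0.0, 0.0)):
    """Compose affine steps into one 3x3 matrix acting on (x, y, 1) column vectors.

    The rightmost factor is applied first: flip, zoom/stretch, shear, rotation, shift.
    Assumption: `scale` = fixed rescaling factor * random zoom factor.
    """
    r, s = np.deg2rad(rotation_deg), np.deg2rad(shear_deg)
    mirror = np.diag([-1.0 if flip else 1.0, 1.0, 1.0])
    zoom = np.diag([scale * stretch, scale / stretch, 1.0])   # stretch: opposite factors per axis
    shear = np.array([[1.0, np.tan(s), 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    rotate = np.array([[np.cos(r), -np.sin(r), 0.0],
                       [np.sin(r), np.cos(r), 0.0],
                       [0.0, 0.0, 1.0]])
    shift = np.array([[1.0, 0.0, translation[0]], [0.0, 1.0, translation[1]], [0.0, 0.0, 1.0]])
    return shift @ rotate @ shear @ zoom @ mirror

point = np.array([1.0, 0.0, 1.0])
print(augmentation_matrix(rotation_deg=90) @ point)                    # approximately [0, 1, 1]
print(augmentation_matrix(scale=2.0, translation=(1.0, 0.0)) @ point)  # [3, 0, 1]
```

To warp an actual image, the matrix would typically be conjugated with shifts to and from the image center and its inverse passed to a resampling routine such as `scipy.ndimage.affine_transform`, so each output pixel is interpolated only once.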
We used "same" convolutions (i.e. the output feature maps are the same size as the input feature maps) and overlapping pooling with window size 3 and stride 2.
We started with fairly shallow models by modern standards (~6 layers) and gradually added more layers when we noticed that it improved performance (it usually did).
We experimented with strided convolutions with 7x7 filters in the first two layers for a while, inspired by the work of He et al., but we were unable to achieve the same performance with this in our networks.
I applied the same stack of convolutional layers to several rotated and flipped versions of the same input image, concatenated the resulting feature representations, and fed those into a stack of dense layers.
Cyclic pooling also allowed us to reduce the batch size by a factor of 4: instead of having batches of 128 images, each batch now contained 32 images and was then turned into a batch with an effective size of 128 again inside the network, by stacking the original batch in 4 orientations.
We tried several pooling functions over the course of the competition, as well as different positions in the network for the pooling operation (just before the output layer, between hidden layers, …).
We also considered having the model do 8-way pooling by including flipped versions of each rotated image copy (dihedral pooling, after dihedral groups).
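A minimal NumPy sketch of cyclic pooling as described here: the batch is expanded with four rotated copies (or eight, for dihedral pooling) so the same convolutional stack sees every orientation, and the resulting feature vectors are pooled back to one per original image. Mean and max are shown only as example pooling functions, and the function names are our own.

```python
import numpy as np

def cyclic_slice(batch, dihedral=False):
    """Stack rotated (and optionally mirrored) copies of a bc01 batch along axis 0.

    (B, C, H, W) -> (4B, C, H, W), or (8B, C, H, W) for dihedral pooling.
    Images must be square so quarter turns keep the shape.
    """
    copies = [np.rot90(batch, k, axes=(2, 3)) for k in range(4)]
    if dihedral:
        copies += [np.flip(c, axis=3) for c in copies]
    return np.concatenate(copies, axis=0)

def cyclic_pool(features, n_copies=4, mode="mean"):
    """Pool per-orientation feature vectors back to one vector per original image.

    features: (n_copies * B, D), ordered as produced by cyclic_slice -> (B, D)
    """
    stacked = features.reshape(n_copies, -1, features.shape[1])
    if mode == "mean":
        return stacked.mean(axis=0)
    if mode == "max":
        return stacked.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 1, 16, 16))   # 32 images per batch...
expanded = cyclic_slice(batch)             # ...become an effective batch of 128
print(expanded.shape)                      # (128, 1, 16, 16)
```

Because the copies are stacked along the batch axis, the convolutional layers need no changes: they simply process a four (or eight) times larger batch.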
Later on we spent some time implementing CUDA kernels for the roll operations and their gradients, because networks with many rolled layers were getting pretty slow to train.
In most of the models we evaluated, we only inserted convolutional roll operations after the pooling layers, because this reduced the size of the feature maps that needed to be copied and stacked together.
Note that it is perfectly possible to build a cyclic pooling convnet without any roll operations, but it’s not possible to have roll operations in a network without cyclic pooling.
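The roll operation is not spelled out in this excerpt; the sketch below reflects our reading of it. It assumes the batch holds the four orientations back to back (as in a cyclic-pooling batch) and that copy k was produced by k counterclockwise quarter turns. For each orientation, the feature maps of all four orientations are rotated into that orientation's frame and stacked along the channel axis. This also shows why a roll needs cyclic pooling: without the rotated copies in the batch, there is nothing to roll.

```python
import numpy as np

def cyclic_roll(feature_maps):
    """Roll operation for a 4-way cyclic batch of bc01 feature maps (our interpretation).

    Input (4B, C, H, W), ordered [rotation 0, rotation 1, rotation 2, rotation 3].
    For orientation i, the maps of orientation (i + j) % 4 are rotated back by j
    quarter turns into i's frame and concatenated along channels.
    Output: (4B, 4C, H, W).
    """
    parts = np.split(feature_maps, 4, axis=0)   # one (B, C, H, W) array per orientation
    rolled = []
    for i in range(4):
        aligned = [np.rot90(parts[(i + j) % 4], -j, axes=(2, 3)) for j in range(4)]
        rolled.append(np.concatenate(aligned, axis=1))
    return np.concatenate(rolled, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 5, 5))
stack = np.concatenate([np.rot90(x, k, axes=(2, 3)) for k in range(4)], axis=0)
out = cyclic_roll(stack)
print(out.shape)  # (8, 12, 5, 5)
```

With an identity "layer", every stacked block equals the map of the orientation it is attached to, which is a convenient sanity check. The fourfold growth in channels is also a reason to insert rolls only after pooling layers, where the maps are small.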
Instead of taking the maximum of the input and zero, y = max(x, 0), leaky ReLUs take the maximum of the input and a scaled version of the input, y = max(x, a*x), where 0 < a < 1, so negative inputs are scaled down rather than set to zero.
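A direct NumPy transcription of y = max(x, a*x), together with its gradient. The default slope `a = 0.01` is only a placeholder; the value used in the competition models is not given in this excerpt.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """y = max(x, a*x): identity for x > 0, slope a for x < 0 (requires 0 < a < 1)."""
    if not 0 < a < 1:
        raise ValueError("max(x, a*x) is a leaky ReLU only for 0 < a < 1")
    return np.maximum(x, a * x)

def leaky_relu_grad(x, a=0.01):
    """Gradient is 1 for positive inputs and a (not 0) for the rest."""
    return np.where(x > 0, 1.0, a)

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x, a=0.5))       # [-1.  0.  3.]
print(leaky_relu_grad(x, a=0.5))  # [0.5 0.5 1. ]
```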
We started out using networks with 2 or 3 spatial pooling layers, and we initially had some trouble getting networks with more pooling stages to work well.
We started out with the traditional approach of 2x2 max-pooling, but eventually switched to 3x3 max-pooling with stride 2 (which we’ll refer to as 3x3s2), mainly because it allowed us to use a larger input size while keeping the same feature map size at the topmost convolutional layer, and without increasing the computational cost significantly.
As an example, a network with 80x80 input and 4 2x2 pooling stages will have feature maps of size 5x5 at the topmost convolutional layer.
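The feature-map arithmetic can be checked with a few lines of Python, assuming the convolutions preserve spatial size and that pooling uses no padding and drops incomplete windows (floor division). Under those assumptions, a 95x95 input (the input size that appears for the network in this post) also ends up at 5x5 after four 3x3s2 pooling stages; this is our own calculation, not a figure quoted from the authors.

```python
def pooled_size(size, window, stride, stages):
    """Spatial size after `stages` unpadded pooling layers (size-preserving convs in between)."""
    for _ in range(stages):
        size = (size - window) // stride + 1
    return size

print(pooled_size(80, 2, 2, 4))  # 2x2 pooling:   80 -> 40 -> 20 -> 10 -> 5
print(pooled_size(95, 3, 2, 4))  # 3x3s2 pooling: 95 -> 47 -> 23 -> 11 -> 5
```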
To allow the network to learn this, we experimented with combinations of different rescaling strategies within the same network, by combining multiple networks with different rescaled inputs together into ‘multiscale’ networks.
What worked best was to combine a network with inputs rescaled based on image size, and a smaller network with inputs rescaled by a fixed factor.
We evaluated several types of hand-engineered features. The image size, the features based on image moments and the Haralick texture features were the ones that stood out the most in terms of performance.
To deal with variance due to the random weight initialization, we trained each feature network 10 times and blended the copies with uniform weights.
The input shape is (32, 1, 95, 95), in bc01 order (batch size, number of channels, height, width).
We trained most of the models with about 215000 gradient steps and eventually settled on a discrete learning rate schedule with two 10-fold decreases (following Krizhevsky et al.), after about 180000 and 205000 gradient steps respectively.
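The discrete schedule can be written as a small step function. The boundaries (180000 and 205000 steps) and the 10-fold factor come from the text; `base_lr` is left as a parameter because the initial learning rate is not stated here.

```python
def learning_rate(step, base_lr, boundaries=(180000, 205000), factor=10.0):
    """Piecewise-constant schedule: divide the rate by `factor` at each boundary passed."""
    drops = sum(step >= b for b in boundaries)
    return base_lr / factor ** drops

# With a hypothetical base rate of 0.003:
for step in (0, 179999, 180000, 205000, 215000):
    print(step, learning_rate(step, 0.003))
```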
In line with the literature, we found that pre-training a network serves as an excellent regularizer (much higher train error, slightly better validation score), but the validation results with test-time augmentation (see below) were consistently slightly worse for some reason.
We did not try a denoising autoencoder approach. According to the results described by Masci et al., the max- and unpooling approach produces way better filters than the denoising approach, and the further improvement from combining these approaches is negligible.
Possibly, by the time the randomly initialized dense layers are in a suitable parameter range, the network has already forgotten a substantial amount of the information it acquired during the pre-training phase.
We experimented both with hard targets (one-hot coded) and soft targets (predicted probabilities), but quickly settled on soft targets as these gave much better results.
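Training on pseudo-labeled test data with soft targets amounts to using a cross-entropy loss against probability vectors instead of one-hot labels. A minimal NumPy sketch follows. The helper names are hypothetical, the predicted probabilities are assumed to come from previously trained models, and the ratio in which pseudo-labeled and labeled examples are mixed is not specified here.

```python
import numpy as np

def soft_cross_entropy(logits, targets):
    """Mean cross-entropy against soft targets (each row of `targets` sums to 1)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

def build_pseudo_labeled_set(x_train, y_train, x_test, test_probs, num_classes):
    """Hypothetical helper: labeled data gets one-hot targets, test data gets the
    previously predicted probabilities as soft pseudo-labels."""
    one_hot = np.eye(num_classes)[y_train]
    x = np.concatenate([x_train, x_test])
    targets = np.concatenate([one_hot, test_probs])
    return x, targets

x, t = build_pseudo_labeled_set(np.zeros((4, 2)), np.array([0, 1, 2, 1]),
                                np.ones((3, 2)), np.full((3, 3), 1 / 3), num_classes=3)
print(x.shape, t.shape)  # (7, 2) (7, 3)
```

With one-hot rows this reduces to the ordinary cross-entropy, so the same loss handles both parts of the combined training set.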
Another notable difference is that knowledge distillation is mainly intended for training smaller and faster networks that work nearly as well as bigger models, whereas we used it to train bigger models that perform better than the original model(s).
When pseudo-labeled test data is added to the training set, the network is optimized (or constrained) to generate predictions similar to the pseudo-labels for all possible variations and transformations of the data resulting from augmentation.
We saw the biggest gains in the beginning (up to 0.015 improvement on the leaderboard), but even in the end we were able to improve on very large ensembles of (bagged) models (between 0.003 and 0.009).
After a while, we looked for better ways to tile the augmentation parameter space, and settled on a quasi-random set of 70 transformations, using slightly more modest augmentation parameter ranges than those used for training.
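The post does not name the quasi-random generator or the exact transformation ranges, so the sketch below makes assumptions: it uses a Halton sequence (a standard low-discrepancy sequence) to spread 70 points evenly over a rotation and translation space with illustrative ranges, and averages the model's class probabilities over the transformed copies. `predict_fn` and the transform callables are hypothetical stand-ins for a trained network and an image warping routine.

```python
import numpy as np

def halton(n, base):
    """First n points of the 1-D Halton (radical inverse) sequence in [0, 1)."""
    seq = np.zeros(n)
    for i in range(n):
        f, k, x = 1.0, i + 1, 0.0
        while k > 0:
            f /= base
            x += f * (k % base)
            k //= base
        seq[i] = x
    return seq

def tta_parameters(n=70, max_rotation=360.0, max_shift=5.0):
    """Assumed parameter space: rotation, x-shift, y-shift (Halton bases 2, 3, 5)."""
    u = np.stack([halton(n, b) for b in (2, 3, 5)], axis=1)
    rotation = u[:, 0] * max_rotation
    shifts = (u[:, 1:] * 2 - 1) * max_shift
    return rotation, shifts

def tta_predict(predict_fn, image, transforms):
    """Average predicted class probabilities over all transformed copies of one image."""
    return np.mean([predict_fn(t(image)) for t in transforms], axis=0)

rotations, shifts = tta_parameters()
print(halton(4, 2))  # [0.5   0.25  0.75  0.125]
```

Compared with random sampling, a low-discrepancy set covers the parameter space more evenly for the same number of transformations, which keeps test-time cost fixed while avoiding clusters and gaps.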
To improve the score of the ensemble further, we replaced some of the models by an average of 5 models (including the original one), where each model was trained on a different subset of the data.
We also tried a number of other things, with varying levels of success, and found several techniques (including the obvious ones) that reduced overfitting. We also monitored the classification accuracy of our models during the competition.
How you should apply transfer learning to a new dataset is a function of several factors, but the two most important ones are the size of the new dataset (small or big) and its similarity to the original dataset.
Keeping in mind that ConvNet features are more generic in early layers and more original-dataset-specific in later layers, these two factors give rise to 4 major scenarios (small or big dataset, similar to or different from the original), each with its own rules of thumb.
How to Retrain an Image Classifier for New Categories
Modern image recognition models have millions of parameters. Training them from scratch requires a lot of labeled training data and a lot of computing power. Transfer learning shortcuts much of this by taking a piece of a model that has already been trained on a related task and reusing it in a new model. Here we reuse the feature extraction capabilities of powerful image classifiers trained on ImageNet and train a new classification layer on top. This approach works well for many applications, works with moderate amounts of training data (thousands, not millions, of labeled images), and can be run in as little as thirty minutes. This tutorial shows how to run the example script on your own images, and explains some of the options you have to control the training process.
Before you start any training, you'll need a set of images to teach the network about the new classes you want to recognize. Later sections cover how to prepare your own images, but to make it easy to get started we've created an archive of flower photos you can download. Once you have the images, you can download the example code from GitHub (it is not part of the library installation). In the simplest cases, the retrainer can then be run by pointing it at the directory of flower photos, and it can print a full listing of its options. The script loads the pre-trained module and trains a new classifier on top for the flower photos you've downloaded.
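The magic of transfer learning is that lower layers that have been trained to distinguish between some objects can be reused for many recognition tasks without any alteration.
The script can take thirty minutes or more to complete, depending on the speed of your machine. It works by computing a "bottleneck" for each image: the output of the penultimate layer, just before the final layer that actually does the classification. This penultimate layer has been trained to output a set of values that the final layer can use to distinguish between all the classes it was originally asked to recognize. The reason final-layer retraining can work on new classes is that it turns out the same kind of information is often also useful for distinguishing between new kinds of objects.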
Because every image is reused multiple times during training and calculating each bottleneck takes a significant amount of time, it speeds things up to cache these bottleneck values on disk so they don't have to be recalculated. By default they're stored in the /tmp/bottleneck directory, and if you rerun the script they'll be reused so you don't have to wait for this part again.
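Bottleneck caching can be sketched as a keyed lookup on disk: the first request for an image runs the expensive forward pass and saves the result, and later requests just load it. This is a minimal sketch, not the script's implementation; `compute_fn` is a hypothetical stand-in for the frozen network, and the SHA-1 file naming and `.npy` format are our choices.

```python
import hashlib
import os
import tempfile

import numpy as np

def cached_bottleneck(image_path, compute_fn, cache_dir):
    """Return the bottleneck vector for `image_path`, computing it at most once."""
    key = hashlib.sha1(image_path.encode("utf-8")).hexdigest()
    cache_file = os.path.join(cache_dir, key + ".npy")
    if os.path.exists(cache_file):
        return np.load(cache_file)           # cache hit: skip the forward pass
    os.makedirs(cache_dir, exist_ok=True)
    vector = np.asarray(compute_fn(image_path), dtype=np.float32)
    np.save(cache_file, vector)
    return vector

calls = []

def fake_network(path):
    """Hypothetical stand-in for the expensive forward pass; records each call."""
    calls.append(path)
    return np.full(4, len(path), dtype=np.float32)

with tempfile.TemporaryDirectory() as cache:
    a = cached_bottleneck("flowers/daisy/1.jpg", fake_network, cache)
    b = cached_bottleneck("flowers/daisy/1.jpg", fake_network, cache)
print(len(calls))  # 1: the second lookup came from the cache
```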
Once the bottlenecks are complete, the actual training of the top layer of the network begins. The training accuracy shows what percent of the images used in the current training batch were labeled with the correct class. Because the training accuracy is based on images that the network has been able to learn from, the network can overfit to the noise in the training data. A true measure of the performance of the network is its performance on a data set not contained in the training data; this is measured by the validation accuracy. If the training accuracy is high but the validation accuracy remains low, that means the network is overfitting and memorizing particular features of the training images that don't generalize. The training objective is to make the loss (cross-entropy) as small as possible, so you can tell whether learning is working by keeping an eye on whether the loss keeps trending downwards, ignoring the short-term noise.
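By default this script will run 4,000 training steps. Each step picks a small batch of images at random from the training set, finds their bottlenecks in the cache, and feeds them into the final layer to get predictions. Those predictions are then compared against the actual labels to update the final layer's weights through back-propagation. As the process continues you should see the reported accuracy improve, and after all the steps are done, a final test accuracy evaluation is run on a set of images kept separate from the training and validation images. You should see a test accuracy value of between 90% and 95%, though the exact value will vary from run to run. This number is the percent of the images in the test set that are given the correct label after the model is fully trained.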
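The training loop described above can be sketched in NumPy as softmax regression on cached bottleneck vectors: each step samples a random batch, computes predictions and the cross-entropy, and updates the final layer's weights by gradient descent, with periodic training and validation accuracy reports and a final test accuracy. This illustrates the idea rather than reproducing the script's TensorFlow code; the batch size, learning rate and the synthetic "bottlenecks" in the usage example are assumptions.

```python
import numpy as np

def train_top_layer(train_x, train_y, val_x, val_y, num_classes,
                    steps=4000, batch_size=100, lr=0.01, eval_every=1000, seed=0):
    """Train a softmax layer on fixed bottleneck vectors with minibatch gradient descent."""
    rng = np.random.default_rng(seed)
    w = np.zeros((train_x.shape[1], num_classes))
    b = np.zeros(num_classes)

    def probs(x):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)

    for step in range(1, steps + 1):
        idx = rng.integers(0, len(train_x), size=batch_size)   # random batch from the cache
        x, y = train_x[idx], train_y[idx]
        p = probs(x)
        cross_entropy = -np.log(p[np.arange(batch_size), y] + 1e-12).mean()
        grad = p.copy()
        grad[np.arange(batch_size), y] -= 1.0                  # d(loss)/d(logits) per example
        w -= lr * x.T @ grad / batch_size
        b -= lr * grad.mean(axis=0)
        if step % eval_every == 0 or step == steps:
            train_acc = (p.argmax(axis=1) == y).mean()          # accuracy on this batch
            val_acc = (probs(val_x).argmax(axis=1) == val_y).mean()
            print(f"step {step}: train acc {train_acc:.2f}, val acc {val_acc:.2f}, "
                  f"cross entropy {cross_entropy:.3f}")
    return w, b, probs

# Synthetic stand-in for cached bottlenecks: 3 classes, 16-dimensional vectors.
rng = np.random.default_rng(1)
centers = rng.normal(scale=3.0, size=(3, 16))

def make(n):
    y = rng.integers(0, 3, size=n)
    return centers[y] + rng.normal(size=(n, 16)), y

(tx, ty), (vx, vy), (sx, sy) = make(800), make(100), make(100)
w, b, probs = train_top_layer(tx, ty, vx, vy, num_classes=3)
test_acc = (probs(sx).argmax(axis=1) == sy).mean()
print(f"final test accuracy {test_acc:.2f}")
```

Only the final layer's parameters are trained, which is why the loop is fast once the bottlenecks are cached.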
The script includes TensorBoard summaries that make it easier to understand, debug, and optimize the retraining.
For example, you can visualize the graph and statistics, such as how the weights or accuracy varied during training.
To view them, launch TensorBoard during or after retraining and point it at the directory where the script writes its summaries. Once TensorBoard is running, navigate your web browser to localhost:6006.
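The script will write out the new model trained on your categories to /tmp/output_graph.pb, in a format that can be read in directly, so you can start using your new model immediately. Because you've replaced the top layer, you will need to specify the new output layer name in the script you use for classification. The module expects color values in the fixed range [0,1], so you do not need to set the --input_mean or --input_std flags. When classifying a daisy photo, you should see a list of flower labels, in most cases with daisy on top, though results can differ between retrained models.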
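Two details of using the retrained model are easy to get wrong: input pixels must be scaled to [0, 1], and raw scores need a softmax before they read as probabilities. The sketch below shows both in NumPy, assuming you already have unnormalized final-layer scores for one image (if your model already outputs probabilities, skip the softmax). The helper names and example labels are hypothetical.

```python
import numpy as np

def to_unit_range(image_uint8):
    """Scale 8-bit pixel values into the [0, 1] range the module expects."""
    return image_uint8.astype(np.float32) / 255.0

def top_k_labels(scores, labels, k=5):
    """Softmax the final-layer scores and return the k most likely (label, prob) pairs."""
    z = scores - scores.max()
    p = np.exp(z) / np.exp(z).sum()
    order = np.argsort(p)[::-1][:k]
    return [(labels[i], float(p[i])) for i in order]

print(to_unit_range(np.array([0, 128, 255], dtype=np.uint8)))
print(top_k_labels(np.array([2.0, 0.0, 1.0]), ["daisy", "roses", "tulips"], k=2))
```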
If you find that the default Inception V3 module is too large or slow for your application, consider one of the smaller modules discussed below.
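If you've managed to get the script working on the flower example images, you can start teaching it to recognize categories you care about instead. The script expects a folder containing one subfolder per category, each holding the images for that category; the flowers archive is organized this way, with one subfolder per flower type.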
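The folder-per-category layout can be read into a label-to-files mapping with `pathlib`. A minimal sketch; the accepted file extensions and the function name are assumptions, not necessarily what the script itself accepts.

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}   # assumption: adjust to your data

def list_images_by_label(root):
    """Map each subfolder name (the category label) to the sorted image files inside it."""
    images = {}
    for sub in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        files = sorted(f for f in sub.iterdir() if f.suffix.lower() in IMAGE_EXTENSIONS)
        if files:
            images[sub.name] = files
    return images

# Build a tiny throwaway tree to demonstrate.
with tempfile.TemporaryDirectory() as root:
    for name in ["daisy/1.jpg", "daisy/2.JPG", "daisy/readme.txt", "roses/3.png"]:
        path = Path(root, name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = list_images_by_label(root)
    summary = {label: [p.name for p in paths] for label, paths in found.items()}
print(summary)  # {'daisy': ['1.jpg', '2.JPG'], 'roses': ['3.png']}
```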
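The first place to start is by looking at the images you've gathered, since the most common issues we see with training come from the data that's being fed in. Make sure the photos are a good representation of what your application will actually encounter. For example, if you take all your photos indoors against a blank wall and your users are trying to recognize objects outdoors, you probably won't see good results when you deploy. Also watch out for anything the images of one class have in common that isn't relevant: if each class is photographed against a different background, the model may end up basing its prediction on the background color, not the features of the object you actually care about. Consider, too, whether the only things you'll ever be asked to categorize are the classes of object you know about. Finally, check that your labels are reliable, because user-generated tags often aren't. For example, pictures tagged #daisy might also include people and characters named Daisy.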
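If you're happy with your images, you can take a look at improving your results by altering the details of the learning process. A common way of improving the results of image training is by deforming, cropping, or brightening the training inputs in random ways. This has the advantage of expanding the effective size of the training data thanks to all the possible variations of the same images, and tends to help the network learn to cope with the distortions that occur in real-life use of the classifier. The biggest disadvantage is that the bottleneck caching is no longer useful, since input images are never reused exactly, so training takes much longer. For that reason, it's recommended you try this as a way of polishing your model only after you have one you're reasonably happy with. A simple example is to randomly mirror half of the images horizontally, which makes sense as long as those mirrored versions are likely to appear in your application.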
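Random mirroring, cropping and brightness changes can be sketched in a few lines of NumPy. The crop fraction, brightness range and flip probability below are illustrative defaults rather than the script's values, and a real pipeline would resize the crop back to the module's input size.

```python
import numpy as np

def random_distort(image, rng, crop_fraction=0.9, max_brightness=0.2, flip_prob=0.5):
    """Randomly mirror, crop and brighten an HxWxC float image with values in [0, 1]."""
    if rng.random() < flip_prob:
        image = image[:, ::-1]                        # horizontal mirror
    h, w = image.shape[:2]
    ch, cw = int(h * crop_fraction), int(w * crop_fraction)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    image = image[top:top + ch, left:left + cw]       # random crop
    factor = 1.0 + rng.uniform(-max_brightness, max_brightness)
    return np.clip(image * factor, 0.0, 1.0)          # random brightness, kept in range

rng = np.random.default_rng(0)
out = random_distort(np.full((10, 10, 3), 0.5), rng)
print(out.shape)  # (9, 9, 3)
```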
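The script also holds some images out as a validation set. The validation images serve as a check to make sure that overfitting isn't occurring, since accuracy measured on images the network has trained on can be misleadingly high. The default split is to put 80% of the images into the main training set, keep 10% aside for validation, and use the remaining 10% as a test set. Don't worry too much about individual misclassifications in the test set, since they are likely to merely reflect more general problems in the data.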
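One way to make the 80/10/10 split reproducible is to derive each file's split from a hash of its name, so an image stays in the same set across runs and as new images are added. A minimal sketch under that assumption; the function name and percentage computation are illustrative, not a claim about the script's exact code.

```python
import hashlib

def assign_split(file_name, validation_pct=10, testing_pct=10):
    """Deterministically assign a file to 'training', 'validation' or 'testing'."""
    digest = hashlib.sha1(file_name.encode("utf-8")).hexdigest()
    percentage = int(digest, 16) % 10000 / 100.0   # stable pseudo-random value in [0, 100)
    if percentage < validation_pct:
        return "validation"
    if percentage < validation_pct + testing_pct:
        return "testing"
    return "training"

splits = [assign_split(f"img_{i}.jpg") for i in range(10000)]
print({name: splits.count(name) for name in ("training", "validation", "testing")})
```

Unlike a shuffle, the assignment does not depend on which other files exist, so adding images never moves an existing image from the training set into the test set.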
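By default the script uses an image feature extraction module with a pretrained instance of the Inception V3 architecture, which provides high accuracy results with moderate running time for the retraining script. On the other hand, if you intend to deploy your model on mobile devices or other resource-constrained environments, you may want to trade a little accuracy for a smaller file size or faster speed. To do this, pass the URL of a different module to the script. Using a baseline MobileNet module, for example, creates a 9 MB model file in /tmp/output_graph.pb. Smaller MobileNet variants shrink the number of channels in each layer by a fixed fraction; the number of weights (and hence the file size and speed) shrinks with the square of that fraction.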
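The square law follows from how MobileNet-style networks are built from depthwise separable convolutions (as described in the MobileNet paper, not in this tutorial): a width multiplier alpha scales both the input and output channel counts, and the dominant 1x1 term is their product. The quick calculation below, for a hypothetical 512-to-512 channel layer, shows the weight count tracking alpha squared; the layer size is an assumption.

```python
def separable_conv_params(in_ch, out_ch, kernel=3):
    """Weights in a depthwise separable conv: one kxk filter per input channel,
    followed by a 1x1 convolution mixing channels (biases ignored)."""
    return kernel * kernel * in_ch + in_ch * out_ch

def params_with_width_multiplier(in_ch, out_ch, alpha, kernel=3):
    return separable_conv_params(round(alpha * in_ch), round(alpha * out_ch), kernel)

base = separable_conv_params(512, 512)   # hypothetical layer size
for alpha in (1.0, 0.75, 0.5, 0.25):
    ratio = params_with_width_multiplier(512, 512, alpha) / base
    print(f"alpha={alpha}: {ratio:.3f} of the weights (alpha^2 = {alpha ** 2:.4f})")
```

The small excess over alpha squared comes from the depthwise term, which scales only linearly with alpha.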
If you switch to a module with a different input resolution, you will also need to specify the image size that your model expects when classifying images. Deploying the retrained model to a mobile device is covered in the TensorFlow documentation.
How to use transfer learning and fine-tuning in Keras and Tensorflow to build an image recognition system and classify (almost) any object
In the last post, I covered how to use Keras to recognize any of the 1000 object categories in the ImageNet visual recognition challenge.
Razavian et al. (2014) showed that by simply using features extracted with the weights of an ImageNet ILSVRC-trained model, they achieved state-of-the-art or near state-of-the-art performance on a large variety of computer vision tasks.
When using data augmentation in our image recognition system, your task is to decide which transformations make sense for your data (for example, X-ray images should probably not be rotated by more than 45 degrees, because that would mean there was an error in the image acquisition step).
We’ll use Kaggle’s Dogs vs Cats dataset as our example, and set up our data with a training directory and a validation directory, each containing one subdirectory per class. Let’s start by preparing our data generators. Recall from our previous blog post on image recognition the importance of the preprocessing step.
We can plot the training accuracy and loss using the history object. Now that we have a saved Keras model, we can modify the same predict() function we wrote in the last blog post to predict the class of a local image file or any file via a web URL.
Even after only 2 epochs, the performance is pretty high. The saved model (compatible with keras==1.2.2) can be tried on images from the web:
python predict.py --image_url https://goo.gl/Xws7Tp --model dc.model
python predict.py --image_url https://goo.gl/6TRUol --model dc.model
Stay tuned for the next post in the series.
Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images
Malaria is a mosquito-borne blood disease caused by the Plasmodium parasites transmitted through the bite of the female Anopheles mosquito.
The diagnostic accuracy heavily depends on human expertise and can be adversely impacted by the inter-observer variability and the liability imposed by large-scale diagnoses in disease-endemic/resource-constrained regions (Mitiku, Mengistu &
In the process of applying machine learning (ML) methods to medical data analysis, meaningful feature representation lies at the core of their success in accomplishing the desired results.
To overcome challenges of devising hand-engineered features that capture variations in the underlying data, Deep Learning (DL), also known as deep hierarchical learning, is used with significant success (LeCun, Bengio &
A model named Xception was proposed that uses depth-wise separable convolutions (Chollet, 2016) to outperform the Inception-V3 model (Szegedy et al., 2016) on the ImageNet (Deng et al., 2009) data classification task.
Given the scarcity of annotated medical imagery, Transfer Learning (TL) methods are used, in which pre-trained DL models are either fine-tuned on the underlying data or used as feature extractors to aid in visual recognition tasks (Razavian et al., 2014).
(2014), it is recognized that CNNs trained on large-scale datasets could serve as feature extractors for a wide range of computer vision tasks to aid in improved performance, as compared to state-of-the-art methods (Bousetouane &
At present, researchers across the world have begun to apply DL tools and obtain promising results in a wide variety of medical image analyses/understanding tasks (Rajaraman et al., 2017;
(2018) employed a customized CNN model to analyze videos containing a focus stack of the fields of view of Leishman-stained slide images for automated parasite detection.
Evaluation at the patient level provides a more realistic estimate of the predictive models' performance, because the images in the independent test set are truly unseen during training, with no information about staining variations or other artifacts leaking into the training data.
The important contributions of this work are as follows: (a) presentation of a comparative analysis of the performance of customized and pre-trained DL models as feature extractors toward classifying parasitized and uninfected cells, (b) cross-validating the performance of the predictive models at the patient level to reduce bias and generalization errors, (c) analysis and selection of the optimal layer in the pre-trained models to extract features from the underlying data, and (d) testing for the presence/absence of a statistically significant difference in the performance of customized and pre-trained CNN models under study.
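Patient-level cross-validation means splitting by patient rather than by cell image, so that all cells from one patient fall on the same side of every train/test split. A minimal NumPy sketch follows; the number of folds, the seed and the function name are assumptions, and scikit-learn's `GroupKFold` offers a ready-made alternative.

```python
import numpy as np

def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no patient appears on both sides.

    Patients, not individual cell images, are shuffled and dealt into folds.
    """
    patient_ids = np.asarray(patient_ids)
    patients = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(patients)
    for fold_patients in np.array_split(patients, n_folds):
        test_mask = np.isin(patient_ids, fold_patients)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical data: 20 patients with 7 cell images each.
ids = np.repeat(np.arange(20), 7)
folds = list(patient_level_folds(ids))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx), sorted(set(ids[test_idx])))
```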
- On Wednesday, January 16, 2019