
Recognizing and Localizing Endangered Right Whales with Extremely Deep Neural Networks

In this post I’ll share my experience and explain my approach for the Kaggle Right Whale challenge.

As part of an ongoing preservation effort, experienced marine scientists track these right whales across the ocean to understand their behaviors and monitor their health.

The process starts with photographing these whales during aerial surveys, then selecting and importing the photos into a catalog, and finally having trained researchers compare the photos against the known whales in the catalog.

A reasonable local validation set was essential to evaluate how the model would perform on the test set and to estimate the public / private score on Kaggle.

Note that putting whales with only 1 image into the validation set would result in the classifier not being able to predict those whales at all!
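
To make this concrete, here is a small sketch of how such a split could be built, assuming the labels are loaded into a pandas DataFrame with one row per photo (the "Image" and "whaleID" column names below are placeholders, not necessarily the exact names in the competition CSV):

```python
import pandas as pd

# Assumed layout of the training labels: one row per photo, with placeholder
# column names "Image" and "whaleID" (adjust to the actual CSV).
train_df = pd.read_csv("train.csv")

train_rows, valid_rows = [], []
for whale_id, group in train_df.groupby("whaleID"):
    if len(group) == 1:
        # Whales with a single photo stay in the training set only; otherwise
        # the classifier could never predict them at all.
        train_rows.append(group)
    else:
        shuffled = group.sample(frac=1, random_state=42)
        valid_rows.append(shuffled.iloc[:1])   # hold out one photo per whale
        train_rows.append(shuffled.iloc[1:])

train_split = pd.concat(train_rows)
valid_split = pd.concat(valid_rows)
```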

Unlike many commonly cited classification tasks, which classify images into different species (birds, tree leaves, dogs in ImageNet), this task is to classify images of the same species into different individuals.

I made use of AWS EC2 near the very end of the competition, which will be explained further in Section 5.2. Who needs a heater when your machine is crunching numbers all the time!

All my approaches were based on deep convolutional neural networks (CNNs), as I initially believed that humans are no match for machines at extracting image features.

This naive approach yielded a validation score of just ~5.8 (log loss, lower is better), which was barely better than a random guess.

My hypothesis for the low score was that the whale labels did not provide a strong enough training signal in this relatively small dataset.

The saliency map suggested that the network was “looking at” the ocean waves instead of the whale head to identify the whale.
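
For reference, this kind of saliency map is just the gradient of the class score with respect to the input pixels. A minimal Theano/Lasagne sketch with a toy stand-in network (the real whale classifier would be substituted in):

```python
import numpy as np
import theano
import theano.tensor as T
import lasagne

# Toy stand-in network; the real whale classifier would be swapped in here.
x = T.tensor4("x")
net = lasagne.layers.InputLayer((None, 3, 64, 64), input_var=x)
net = lasagne.layers.Conv2DLayer(net, num_filters=8, filter_size=3)
net = lasagne.layers.GlobalPoolLayer(net)
net = lasagne.layers.DenseLayer(net, num_units=10,
                                nonlinearity=lasagne.nonlinearities.softmax)

probs = lasagne.layers.get_output(net, deterministic=True)
class_score = probs[0].max()  # probability of the predicted class

# The saliency map is the gradient of that score w.r.t. the input pixels.
saliency_fn = theano.function([x], T.grad(class_score, x))

img = np.random.rand(1, 3, 64, 64).astype("float32")    # placeholder image
saliency_map = np.abs(saliency_fn(img)[0]).max(axis=0)  # collapse color channels
```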

The localization CNN took the original photos as input and output a bounding box around the whale head, and the classifier was fed the cropped image.

I treated localization as a regression problem, so the objective of the localizer CNN was to minimize the mean squared error (MSE) between the predicted and actual bounding boxes.

The bounding boxes were represented by x, y, width and height, and were normalized into (0, 1) by dividing by the image size.
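
In code, the normalization is nothing more than this (a small NumPy helper with illustrative names):

```python
import numpy as np

def normalize_bbox(x, y, w, h, img_w, img_h):
    """Scale a pixel-space (x, y, width, height) box into the (0, 1) range."""
    return np.array([x / img_w, y / img_h, w / img_w, h / img_h],
                    dtype=np.float32)

def denormalize_bbox(bbox, img_w, img_h):
    """Map a normalized prediction back to pixel coordinates."""
    x, y, w, h = bbox
    return x * img_w, y * img_h, w * img_w, h * img_h
```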

To calculate the bounding box of the transformed image, I created a boolean mask denoting the bounding box, applied the transformation to this mask, and extracted the normalized bounding box from the mask.
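
A sketch of that mask trick, assuming the augmentation is expressed as a skimage affine warp (the function and variable names here are illustrative):

```python
import numpy as np
from skimage.transform import AffineTransform, warp

def bbox_after_transform(bbox, tform, img_shape):
    """Recompute an (x, y, w, h) bounding box after an affine augmentation.

    tform is the skimage AffineTransform applied to the image, and
    img_shape is its (height, width).
    """
    x, y, w, h = bbox
    mask = np.zeros(img_shape, dtype=float)
    mask[int(y):int(y + h), int(x):int(x + w)] = 1.0

    # Warp the mask in exactly the same way the image is warped.
    warped = warp(mask, tform.inverse, output_shape=img_shape)

    ys, xs = np.nonzero(warped > 0.5)
    if len(xs) == 0:
        return None  # the box ended up entirely outside the frame
    return xs.min(), ys.min(), xs.max() - xs.min(), ys.max() - ys.min()

# Example: a small rotation plus translation
new_bbox = bbox_after_transform(
    (120, 80, 200, 150),
    AffineTransform(rotation=0.2, translation=(15, -10)),
    img_shape=(480, 640))
```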

I suspected that MSE with normalized coordinates was not ideal for regressing bounding boxes, but I could not find any alternative objective function in the related literature.

So I further evaluated the localizer with intersection over union (IoU), which is the ratio of the area of intersection of the predicted and actual bounding boxes to the area of their union.
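
For reference, IoU for two axis-aligned (x, y, w, h) boxes can be computed as:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h

    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```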

At this point, it was clear that the main performance bottleneck was that the classifier was not able to focus on the actual discriminating part of the whales (i.e. the whale's head).

While the architecture of this approach looked very similar to the previous one, the fact that the images were aligned had a huge implication for the classifier – the classifier no longer needed to learn features that are invariant to extreme translation and rotation.

Obviously, it was not possible to perform similar alignment with just 2 points, but it was reasonable to assume that accuracy could be improved with more annotation keypoints, as that would allow a more non-linear transformation.
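
With just the bonnet and blowhead, the alignment amounts to a similarity transform (rotation, scale and translation). A sketch of what that looks like with skimage (the canonical keypoint positions and output size below are made up):

```python
import numpy as np
from skimage.transform import estimate_transform, warp

# Hypothetical canonical positions for the two keypoints in the aligned crop.
CANONICAL = np.array([[96.0, 128.0],    # where the bonnet should land
                      [160.0, 128.0]])  # where the blowhead should land

def align_head(image, bonnet_xy, blowhead_xy, out_shape=(256, 256)):
    """Warp the image so the two annotated points land on fixed positions.

    With only two point pairs, a similarity transform (rotation, scale,
    translation) is the most that can be estimated.
    """
    src = np.array([bonnet_xy, blowhead_xy], dtype=float)
    tform = estimate_transform("similarity", src, CANONICAL)
    return warp(image, tform.inverse, output_shape=out_shape)
```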

However, I did not end up using locally-connected convolutional layers in my models, simply because the implementation in TheanoLinear did not seem to be compatible with the Theano version I was using.

Inspired by recent papers on face image recognition, I replaced the 2 stacks of fully connected layers with a global averaging layer, and used stride-2 convolutions instead of max-pooling when reducing the feature map size.
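
A minimal Lasagne-style sketch of that design (the filter counts, input size and number of whale classes below are illustrative, not the exact values I used):

```python
from lasagne.layers import InputLayer, Conv2DLayer, GlobalPoolLayer, DenseLayer
from lasagne.nonlinearities import rectify, softmax

net = InputLayer((None, 3, 256, 256))
# Stride-2 convolutions take the place of max-pooling when shrinking
# the feature maps.
for num_filters in (32, 64, 128, 256):
    net = Conv2DLayer(net, num_filters, filter_size=3, pad="same",
                      nonlinearity=rectify)
    net = Conv2DLayer(net, num_filters, filter_size=3, stride=2, pad="same",
                      nonlinearity=rectify)

# Global average pooling replaces the usual stacks of fully connected layers.
net = GlobalPoolLayer(net)
net = DenseLayer(net, num_units=447, nonlinearity=softmax)  # one unit per whale
```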

Since the aligner was optimized with an MSE objective similar to the previous approach, I observed similarly slow convergence after about 10% of the training time.

I applied the inverse of the affine transformation to the predicted bonnet and blowhead coordinates and simply took the average of those coordinates.
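
In other words, something like the following (NumPy sketch; each 3x3 matrix is the augmentation that produced the corresponding test-time view):

```python
import numpy as np

def average_keypoint(preds, tforms):
    """Undo each augmentation on its predicted (x, y) keypoint and average.

    preds:  list of (x, y) predictions, one per augmented view
    tforms: list of 3x3 affine matrices that produced those views
    """
    restored = []
    for (x, y), tform in zip(preds, tforms):
        px, py, _ = np.linalg.inv(tform) @ np.array([x, y, 1.0])
        restored.append((px, py))
    return np.mean(restored, axis=0)
```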

I empirically found that heavy augmentation prevented the network from converging, while lighter augmentation did not lead to overfitting.

The success of deep learning is usually attributed to the highly non-linear nature of neural networks built from stacks of layers.

However, the ResNet authors observed a counter-intuitive phenomenon – simply adding more layers to a neural network can increase training error.

The first ResNet-based network I experimented with was somewhat similar to the paper’s CIFAR10 network with n=3, resulting in 19 layers with 9 shortcut layers.

I chose the CIFAR10 network structure first because a) I needed to verify that my implementation was correct at all, and b) the images fed to the classifier were already aligned, so it should not require a huge, highly nonlinear network.
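
For reference, a single residual block in such a network looks roughly like this (Lasagne sketch; batch normalization is omitted for brevity, and the 1x1 projection shortcut is one of the options described in the paper):

```python
from lasagne.layers import Conv2DLayer, ElemwiseSumLayer, NonlinearityLayer
from lasagne.nonlinearities import rectify, identity

def residual_block(incoming, num_filters, downsample=False):
    """y = F(x) + x, where F is two 3x3 convolutions."""
    stride = 2 if downsample else 1

    conv = Conv2DLayer(incoming, num_filters, 3, stride=stride, pad="same",
                       nonlinearity=rectify)
    conv = Conv2DLayer(conv, num_filters, 3, pad="same",
                       nonlinearity=identity)

    shortcut = incoming
    if downsample:
        # 1x1 projection so the shortcut matches the residual branch's shape.
        shortcut = Conv2DLayer(incoming, num_filters, 1, stride=stride,
                               nonlinearity=identity)

    return NonlinearityLayer(ElemwiseSumLayer([conv, shortcut]),
                             nonlinearity=rectify)
```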

So I followed the advice in section 4.2 and regularized the network by reducing the number of filters, and the network overfitted much later in the training process.

For example, if residual learning is so effective, would learning the residual of the residual error be even more effective (shortcut of shortcut layer)?

If the degradation problem is largely overcome, are the existing regularization techniques (maxout, dropout, l2 etc.) still applicable?

So one week before the deadline, I hacked together a system that allowed me to easily train a model on AWS EC2 GPU instances (g2.xlarge) as if I were training it locally, with a single command.

You can find the source code of this system on GitHub (felixlaumon/docker-deeplearning).

The final submission was an ensemble of 6 models: the outputs of the global averaging layer were extracted, and a simple logistic regression classifier was trained on the concatenated features.
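
The stacking step itself is straightforward, roughly like the following scikit-learn sketch (the feature arrays below are random placeholders standing in for the extracted activations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder features: one array per CNN, each (n_images, n_features),
# standing in for the extracted global-average-pooling activations.
rng = np.random.RandomState(0)
feats_per_model = [rng.rand(100, 64) for _ in range(6)]
y_train = rng.randint(0, 10, size=100)          # placeholder whale IDs

X_train = np.hstack(feats_per_model)            # concatenate features from all 6 models
clf = LogisticRegression(C=1.0, max_iter=1000)  # C needs tuning to limit overfitting
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_train)              # per-whale probabilities for the submission
```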

A funny side note – 24 hours before the final deadline, I discovered that the logistic regression classifier was overfitting horrendously, because its accuracy on the training set was 100% and the log loss was 0.

As mentioned before, one of the main challenges was the uneven distribution of the number of images per whale, and the limited number of images in general.

The objective of the classifier was to maximize the Euclidean distance between the feature vectors of different whales, and to minimize the distance between feature vectors of the same whale.
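
This kind of objective is commonly written as a contrastive loss over pairs of images; a rough sketch of what I mean (the margin value is arbitrary):

```python
import numpy as np

def contrastive_loss(feat_a, feat_b, same_whale, margin=1.0):
    """Pull features of the same whale together, push different whales apart."""
    dist = np.linalg.norm(feat_a - feat_b)
    if same_whale:
        return dist ** 2                 # minimize distance for the same whale
    return max(0.0, margin - dist) ** 2  # push different whales beyond the margin
```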

I was particularly confident that the spatial transformer network (ST-CNN) would work well because it achieved state-of-the-art performance on the CUB-200-2011 bird classification dataset using multiple localizers.

I believed my earlier explanation applied here as well – the whale labels alone did not provide a strong enough training signal.

So in my next attempt, I tried to supervise the localization net by adding a crude error term to the objective function – the MSE between the predicted affine transformation matrix and the actual matrix generated from the bonnet and blowhead annotations.
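
In Theano terms, the combined objective looked roughly like this (a symbolic sketch; the weight on the extra term is a made-up name and would need tuning):

```python
import theano.tensor as T

# Symbolic variables (shapes illustrative):
#   probs        - (batch, n_whales) softmax output of the classification net
#   targets      - (batch,) integer whale labels
#   pred_theta   - (batch, 6) affine parameters from the localization net
#   target_theta - (batch, 6) affine parameters derived from the bonnet /
#                  blowhead annotations
probs = T.matrix("probs")
targets = T.ivector("targets")
pred_theta = T.matrix("pred_theta")
target_theta = T.matrix("target_theta")

lambda_affine = 1.0  # relative weight of the supervision term (needs tuning)

classification_loss = T.nnet.categorical_crossentropy(probs, targets).mean()
affine_loss = T.sqr(pred_theta - target_theta).mean()

total_loss = classification_loss + lambda_affine * affine_loss
```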

One approach I did not try was to 1) pre-train the localization network to learn the affine transformation that would align the image to the whale's blowhead and bonnet, and then 2) follow the normal procedure to train the whole ST-CNN.

Transferring learned features to a slightly different network was a much more common use case for me, because my goal was to optimize the number of filters and the number of layers I used.
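
With Lasagne, that kind of transfer can be sketched as copying parameters tensor by tensor and skipping any whose shapes no longer match (this assumes the two networks have the same number of parameter tensors; the toy networks below are placeholders):

```python
import lasagne
from lasagne.layers import InputLayer, Conv2DLayer, GlobalPoolLayer, DenseLayer

def build_net(num_classes):
    net = InputLayer((None, 3, 64, 64))
    net = Conv2DLayer(net, 32, 3, pad="same")
    net = Conv2DLayer(net, 32, 3, pad="same")
    net = GlobalPoolLayer(net)
    return DenseLayer(net, num_units=num_classes)

old_net = build_net(10)   # stands in for the already-trained network
new_net = build_net(20)   # slightly different architecture to experiment with

# Copy parameters tensor by tensor, keeping the new network's own values
# wherever the shapes no longer match.
old_params = lasagne.layers.get_all_param_values(old_net)
new_params = lasagne.layers.get_all_param_values(new_net)
merged = [o if o.shape == n.shape else n
          for o, n in zip(old_params, new_params)]
lasagne.layers.set_all_param_values(new_net, merged)
```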

Other approaches worth mentioning include an unsupervised physics-based whale detector, a detector based on principal components of the color channels, histogram similarity, and mask-based regression.

Finally, I'd like to thank Kaggle for hosting this competition, and MathWorks for sponsoring it and providing a free copy of MATLAB to all participants.
