AI News, Recognizing and Localizing Endangered Right Whales with Extremely Deep Neural Networks


In this post I’ll share my experience and explain my approach for the Kaggle Right Whale challenge.

As part of an ongoing preservation effort, experienced marine scientists track these whales across the ocean to understand their behavior and monitor their health.

It starts with photographing these whales during aerial surveys and selecting and importing the photos into a catalog; finally, trained researchers compare the photos against the known whales in the catalog.

A reasonable local validation set was essential for evaluating how the model would perform on the test set and for estimating the public/private score on Kaggle.

Note that putting the whales with only 1 image into the validation set would leave the classifier unable to predict those whales at all!

Unlike many commonly cited classification tasks, which classify images into different species (birds, tree leaves, dogs in ImageNet), this task is to classify images of the same species into different individuals.

I made use of AWS EC2 near the very end of the competition, which will be explained further in Section 5.2. Who needs a heater when your machine is crunching numbers all the time!

All my approaches were based on deep convolutional neural networks (CNNs), as I initially believed that humans are no match for machines at extracting image features.

This naive approach yielded a validation score of just ~5.8 (log loss; lower is better), which was barely better than a random guess.
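For reference, the competition metric is multiclass log loss, and a uniform random guess over N classes scores exactly log(N). A minimal sketch (assuming ~447 individual whales in the dataset, so log(447) ≈ 6.1, which puts ~5.8 in context):

```python
import numpy as np

def multiclass_logloss(y_true, y_pred, eps=1e-15):
    """Mean negative log-probability of the true class (Kaggle's metric)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(np.log(y_pred[np.arange(len(y_true)), y_true]))

# A uniform guess over N classes scores exactly log(N); with roughly
# 447 individual whales that is about 6.1, so ~5.8 is barely better.
n_classes = 447
uniform = np.full((10, n_classes), 1.0 / n_classes)
labels = np.arange(10)
print(round(multiclass_logloss(labels, uniform), 2))  # 6.1
```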

My hypothesis for the low score was that the whale labels did not provide a strong enough training signal in this relatively small dataset.

The saliency map suggested that the network was “looking at” the ocean waves instead of the whale head to identify the whale.

The localization CNN took the original photos as input and output a bounding box around the whale head, and the classifier was fed the cropped image.

I treated the localization problem as a regression problem, so the objective of the localizer CNN was to minimize the mean squared error (MSE) between the predicted and actual bounding boxes.

The bounding boxes were represented by x, y, width, and height, and were normalized into (0, 1) by dividing by the image size.

To calculate the bounding box of the transformed image, I created a boolean mask denoting the bounding box, applied the transformation to this mask, and extracted the normalized bounding box from the mask.
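That mask trick can be sketched as follows, using a horizontal flip as a stand-in for the actual augmentation transform (`bbox_from_mask` is a hypothetical helper, not code from the original post):

```python
import numpy as np

def bbox_from_mask(mask):
    """Recover (x, y, w, h) from a boolean mask marking the box region."""
    ys, xs = np.where(mask)
    x0, y0 = xs.min(), ys.min()
    return int(x0), int(y0), int(xs.max() - x0 + 1), int(ys.max() - y0 + 1)

# Paint the original bbox into a mask, apply the same transform used for
# augmentation (here a horizontal flip), then read the new bbox back out.
mask = np.zeros((100, 100), dtype=bool)
mask[20:40, 10:30] = True          # bbox: x=10, y=20, w=20, h=20
flipped = mask[:, ::-1]            # horizontal flip
print(bbox_from_mask(flipped))     # (70, 20, 20, 20)
```

This works for any invertible spatial transform, since the mask is warped exactly like the image itself.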

I suspected that MSE with normalized coordinates was not ideal for regressing bounding boxes, but I could not find an alternative objective function in the related literature.

So I further evaluated the localizer with intersection over union (IoU), the ratio of the area of intersection of the predicted and actual bounding boxes to the area of their union.
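IoU can be computed directly from the two (x, y, width, height) boxes; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union

# Two 10x10 boxes overlapping in a 5x5 corner: 25 / (100 + 100 - 25)
print(round(iou((0, 0, 10, 10), (5, 5, 10, 10)), 3))  # 0.143
```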

At this point, it was clear that the main performance bottleneck was that the classifier was not able to focus on the actual discriminating part of the whales (i.e. the head).

While the architecture of this approach looked very similar to the previous one, the fact that the images were aligned had a huge implication for the classifier: it no longer needed to learn features that are invariant to extreme translation and rotation.

Obviously, it was not possible to perform similar alignment with just 2 points, but it was reasonable to assume that accuracy could be improved with more annotation keypoints, as that would allow a more non-linear transformation.

However, I did not end up using the locally-connected convolutional layers in my models, simply because the implementation in TheanoLinear did not seem to be compatible with the Theano version I was using.

Inspired by recent papers on face image recognition, I replaced the 2 stacks of fully connected layers with a global averaging layer, and used stride-2 convolutions instead of max-pooling when reducing the feature map size.
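Global averaging simply collapses each final feature map to one number per channel, replacing the large fully connected stacks; a minimal numpy sketch:

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each (channels, height, width) feature map down to a
    single number per channel, yielding a (channels,) vector."""
    return feature_maps.mean(axis=(1, 2))

# 2 channels of 4x4 feature maps -> 2 pooled values
fm = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
print(global_average_pool(fm))  # [ 7.5 23.5]
```

Besides cutting parameters drastically, this makes the head of the network independent of the spatial size of the feature maps.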

Since the aligner was optimized with an MSE objective similar to the previous approach's, I observed similarly slow convergence after about 10% of the training time.

I applied the inverse of the affine transformation to the predicted bonnet and blowhead coordinates and simply took the average of those coordinates.
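Undoing the augmentation transform can be sketched with homogeneous coordinates (`invert_points` is a hypothetical helper, and a pure translation stands in for the real affine matrix):

```python
import numpy as np

def invert_points(points, affine):
    """Map predicted (x, y) keypoints back into the original image
    frame by applying the inverse of the 3x3 affine matrix."""
    homo = np.hstack([points, np.ones((len(points), 1))])  # (x, y, 1)
    return (np.linalg.inv(affine) @ homo.T).T[:, :2]

# Predictions made on an image shifted by (+5, -3); un-shift, then average.
shift = np.array([[1.0, 0.0,  5.0],
                  [0.0, 1.0, -3.0],
                  [0.0, 0.0,  1.0]])
preds = np.array([[15.0, 7.0],
                  [15.2, 6.8]])
restored = invert_points(preds, shift)
print(restored.mean(axis=0))  # ~[10.1, 9.9]
```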

I empirically found that heavy augmentation prevented the network from converging, while lighter augmentation did not lead to overfitting.

The success of deep learning is usually attributed to the highly non-linear nature of neural networks built from stacks of layers.

However, the ResNet authors observed a counter-intuitive phenomenon: simply adding more layers to a neural network can increase training error.
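The ResNet remedy is the identity shortcut: each block outputs y = F(x) + x, so the stacked layers only need to learn the residual F(x). A toy numpy sketch (a single linear "layer" plus ReLU stands in for the block's conv layers):

```python
import numpy as np

def residual_block(x, weight):
    """y = F(x) + x: the stacked layers only learn the residual F(x);
    here F is a toy linear layer followed by ReLU."""
    f_x = np.maximum(0.0, x @ weight)
    return f_x + x  # identity shortcut

# With zero weights the block is exactly the identity mapping, so extra
# residual layers can always fall back to "do nothing" instead of
# degrading training the way extra plain layers can.
x = np.array([[1.0, -2.0, 3.0]])
print(residual_block(x, np.zeros((3, 3))))  # [[ 1. -2.  3.]]
```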

The first ResNet-based network I experimented with was somewhat similar to the paper’s CIFAR10 network with n=3, resulting in 19 layers with 9 shortcut layers.

I chose the CIFAR10 network structure first because a) I needed to verify that my implementation was correct at all, and b) the images fed to the classifier were already aligned, so it should not require a huge, highly non-linear network.

So I followed the advice in section 4.2 and regularized the network by reducing the number of filters, and the network overfitted much later in the training process.

For example, if residual learning is so effective, would learning the residual of the residual error be even more effective (shortcut of shortcut layer)?

If the degradation problem is largely overcome, are the existing regularization techniques (maxout, dropout, l2 etc.) still applicable?

So one week before the deadline, I hacked together a system that allowed me to easily train a model on AWS EC2 GPU instances (g2.xlarge) as if I were training it locally, by running a single command.

You can find the source code of this system on GitHub (felixlaumon/docker-deeplearning). The final submission was an ensemble of 6 models: the outputs of the global averaging layer were extracted, and a simple logistic regression classifier was trained on the concatenated features.

A funny sidenote: 24 hours before the final deadline, I discovered that the logistic regression classifier was overfitting horrendously, because the model accuracy on the training set was 100% and the log loss was 0.

As mentioned before, one of the main challenges was the uneven distribution of number of images per whale, and the limited number of images in general.

The objective of the classifier was to maximize the Euclidean distance between the feature vectors of different whales, and minimize the distance between those of the same whale.
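A sketch of such a distance-based objective, in the style of a contrastive loss (the exact loss function used is not specified here, so this is illustrative only):

```python
import numpy as np

def contrastive_loss(f_a, f_b, same_whale, margin=1.0):
    """Pull embeddings of the same whale together; push embeddings of
    different whales at least `margin` apart (hinged penalty)."""
    d = np.linalg.norm(f_a - f_b)
    return d ** 2 if same_whale else max(0.0, margin - d) ** 2

a, b = np.array([0.0, 0.0]), np.array([0.3, 0.4])  # distance 0.5
print(contrastive_loss(a, b, same_whale=True))     # penalty: same whale, far apart
print(contrastive_loss(a, b, same_whale=False))    # penalty: different whales, too close
```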

I was particularly confident that ST-CNN would work well because it achieved state-of-the-art performance on the CUB-200-2011 bird classification dataset using multiple localizers.

I believed my explanation before applied here as well – the whale labels alone did not provide a strong enough training signal.

So in my next attempt, I tried to supervise the localization net by adding a crude error term to the objective function: the MSE between the predicted affine transformation matrix and the actual matrix generated from the bonnet and blowhead annotations.

One approach I did not try was to 1) pre-train the localization network to learn the affine transformation that would align the image to the whale's blowhead and bonnet, then 2) follow the normal procedure to train the whole ST-CNN.

Transferring learned features to a slightly different network was a much more common use case, because my goal was to optimize the number of filters and the number of layers.

Other ideas included an unsupervised physics-based whale detector, a detector based on principal components of the color channels, histogram similarity, and mask-based regression. Finally, I'd like to thank Kaggle for hosting this competition, and MathWorks for sponsoring and providing a free copy of MATLAB for all participants.

NOAA Right Whale Recognition, Winner's Interview: 2nd place, Felix Lau

With fewer than 500 North Atlantic right whales left in the world's oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction.

In the NOAA Right Whale Recognition challenge, 470 players on 364 teams competed to build a model that could identify any individual, living North Atlantic right whale from its aerial photographs.

Felix Lau entered the competition with the goal of practicing new techniques in deep learning, and ended up taking second place.

I was inspired by the new deep learning ideas and techniques proposed in the last few months, and I wanted to apply them to a real problem.

In particular, the classifier had trouble focusing on the “whale face” on its own; I suspect the “whale name” alone was not a strong enough training signal.

The goal is to output the x-y coordinates of the 2 key points (blowhead and bonnet) to create aligned whale face images.

For frameworks, I used Theano and Lasagne to build up the neural networks, nolearn for the training loop, my nolearn_utils for real-time augmentation, scikit-learn for ensembling, and scikit-image for image processing.

I spent about 20% of my time building a baseline submission, 30% building infrastructure and refactoring code, and 50% experimenting with alternative approaches.

Supervising the network with extra annotations (e.g. the bonnet and blowhead coordinates, or the type of the callosity pattern as in the 1st place approach) proved to be a very important trick for this competition.

Image classification with Keras and deep learning

I was walking down our long, curvy driveway when, at the bottom of the hill, I saw my dad laying out Christmas lights that would later decorate our house, bushes, and trees, transforming our home into a Christmas wonderland.

For the next few hours, my dad patiently helped me untangle the knotted ball of Christmas lights, lay them out, and then watched as I haphazardly threw the lights over the bushes and trees (that were many times my size), ruining any methodical, well-planned decorating blueprint he had so tirelessly designed.

Even if you’re busy, don’t have the time, or simply don’t care about deep learning (the subject matter of today’s tutorial), slow down and give this blog post a read, if for nothing else than for my dad.

This blog post is part two in our three-part series on building a Not Santa deep learning classifier (i.e., a deep learning model that can recognize whether Santa Claus is in an image or not). In the first part of this tutorial, we’ll examine our “Santa” and “Not Santa” datasets.

In order to train our Not Santa deep learning model, we require two sets of images. Last week we used our Google Images hack to quickly grab training images for deep learning networks.

I then randomly sampled 461 images that do not contain Santa (Figure 1, right) from the UKBench dataset, a collection of ~10,000 images used for building and evaluating Content-Based Image Retrieval (CBIR) systems (i.e., image search engines).

If you are interested in taking a deep dive into deep learning, please take a look at my book, Deep Learning for Computer Vision with Python, where I discuss deep learning in detail (and with lots of code + practical, hands-on implementations as well).

Whenever I define a new Convolutional Neural Network architecture I like to: The build  method, as the name suggests, takes a number of parameters, each of which I discuss below. We define our model  on Line 14.

Our final code block handles flattening out the volume into a set of fully-connected layers: On Line 33 we take the output of the preceding MaxPooling2D  layer and flatten it into a single vector.

Open up a new file, name it , and insert the following code (or simply follow along with the code download): On Lines 2-18 we import required packages.

From there, we parse command line arguments: Here we have two required command line arguments, --dataset  and --model , as well as an optional path to our accuracy/loss chart, --plot .

Next, we’ll set some training variables, initialize lists, and gather paths to images: On Lines 32-34 we define the number of training epochs, initial learning rate, and batch size.

Now let’s pre-process the images: This loop simply loads and resizes each image to a fixed 28×28 pixels (the spatial dimensions required for LeNet), and appends the image array to the data  list (Lines 49-52) followed by extracting the class label  from the imagePath  on Lines 56-58.
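The resize-and-scale step can be sketched as follows (nearest-neighbour subsampling in plain numpy stands in for the post's actual resize call; the post also scales pixels to [0, 1], which is included here):

```python
import numpy as np

def preprocess(image):
    """Shrink an image to the 28x28 input LeNet expects and scale
    pixel values from [0, 255] to [0, 1]."""
    h, w = image.shape[:2]
    rows = np.arange(28) * h // 28      # nearest-neighbour row picks
    cols = np.arange(28) * w // 28      # nearest-neighbour column picks
    small = image[rows][:, cols]
    return small.astype("float32") / 255.0

img = np.random.randint(0, 256, (64, 96, 3), dtype=np.uint8)
out = preprocess(img)
print(out.shape)  # (28, 28, 3)
```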

We are able to perform this class label extraction since our dataset directory structure is organized in the following fashion: Therefore, an example imagePath  would be: After extracting the label  from the imagePath , the result is: I prefer organizing deep learning image datasets in this manner as it allows us to efficiently organize our dataset and parse out class labels without having to use a separate index/lookup file.
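The parsing itself can be sketched like this, assuming a `images/<class_name>/<file>` layout as described above (the concrete path here is an invented example):

```python
import os

def label_from_path(image_path):
    """The class label is the name of the image's parent directory,
    e.g. images/santa/00000001.png -> 'santa'."""
    return image_path.split(os.path.sep)[-2]

path = os.path.join("images", "santa", "00000001.png")
label = label_from_path(path)
print(label)                               # santa
print(1 if label == "santa" else 0)        # binary training label
```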

Next, we’ll scale images and create the training and testing splits: On Line 61 we further pre-process our input data by scaling the data points from [0, 255] (the minimum and maximum RGB values of the image) to the range [0, 1].

Training our network is initiated on Lines 87-89 where we call model.fit_generator , supplying our data augmentation object, training/testing data, and the number of epochs we wish to train for.

Finally, let’s plot the results and see how our deep learning image classifier performed: Using matplotlib, we build our plot and save the plot to disk using the --plot  command line argument which contains the path + filename.

Use the “Downloads” section of this blog post to download the code + images, then open up a terminal and execute the following command. As you can see, the network trained for 25 epochs and we achieved high accuracy (97.40% testing accuracy) and low loss that follows the training loss, as is apparent from the plot below. The next step is to evaluate our Not Santa model on example images not part of the training/testing splits.

Next, we’ll parse our command line arguments: We require two command line arguments: our --model  and an input --image  (i.e., the image we are going to classify).

From there we’ll load the Not Santa image classifier model and make a prediction: This block is pretty self-explanatory, but since this is where the heavy lifting of this script is performed, let’s take a second and understand what’s going on under the hood.

The label  and proba  values are used on Line 37 to build the label text shown in the top left corner of the output images below.

For some example images (where Santa is already small), resizing the input image down to 28×28 pixels effectively reduces Santa down to a tiny red/white blob that is only 2-3 pixels in size.

In these types of situations it’s likely that our LeNet model is just predicting when there is a significant amount of red and white localized together in our input image (and likely green as well, as red, green, and white are Christmas colors).

However, using larger resolution images would also require us to utilize a deeper network architecture, which in turn would mean that we need to gather additional training data and utilize a more computationally expensive training process.

Deep Learning with Tensorflow: Part 2 — Image classification

Hi everybody, welcome back to my TensorFlow series; this is part 2. In the first part I described the logic and functionality of neural networks and TensorFlow, and showed you how to set up your coding environment.

Using the Inception-v3 model, we’ll start classifying images using Google’s pre-trained ImageNet dataset and later move on to build our own classifier.

The Inception v3 model is a deep convolutional neural network, which has been pre-trained for the ImageNet Large Visual Recognition Challenge using data from 2012, and it can differentiate between 1,000 different classes, like “cat”, “dishwasher” or “plane”.

If the model runs correctly, the script will produce the following output: If you wish to supply other images, you may do so by editing the --image_file argument. That wasn’t that hard, was it?

Since we’re classifying whether an object is a triangle, square, plus, or circle, we need to add a ‘training_dataset’ directory and fill it with four subfolders, each named after its class label.
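The expected layout can be created in one step (folder names taken from the four classes above; you would then drop each class's example images into its folder):

```shell
# One subfolder per class label under training_dataset/
mkdir -p training_dataset/triangle training_dataset/square \
         training_dataset/plus training_dataset/circle
ls training_dataset
```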

The retrained labels, graphs, and training summary will be saved in a folder named tf_files , in case you want to take a look at them.

(Optional): If you have added some images or subfolders for new classes, but you don’t want to call the and then the , you can combine the input by typing: Although the classification does work most of the time, there are some issues: Inception is trained for single-label image classification, which means that multi-label classification is not possible.

Basically, the script loads the pre-trained Inception v3 model, removes the old top layer, and trains a new one on the geometric shapes classes you wanted to add.
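Conceptually, "training a new top layer" means fitting a small classifier on the frozen network's bottleneck features. A toy numpy sketch with random stand-in features (in the real script these come from the frozen pre-trained Inception network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the fixed "bottleneck" feature vector of each image.
bottlenecks = rng.normal(size=(200, 32))
labels = (bottlenecks[:, 0] > 0).astype(int)   # toy, linearly separable 2-class task

# The "new top layer" is just a logistic-regression layer trained on
# those frozen features with plain gradient descent.
w = np.zeros(32)
for _ in range(500):
    p = 1 / (1 + np.exp(-bottlenecks @ w))     # sigmoid predictions
    w -= 0.1 * bottlenecks.T @ (p - labels) / len(labels)

acc = np.mean((bottlenecks @ w > 0) == labels)
print(acc)  # should be high: the toy features are linearly separable
```

Because only this small layer is trained, retraining takes minutes rather than the weeks needed to train Inception from scratch.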

Classifying Handwritten Digits with TF.Learn - Machine Learning Recipes #7

Last time we wrote an image classifier using TensorFlow for Poets. This time, we'll write a basic one using TF.Learn. To make it easier for you to try this out, ...

Lecture 5 | Convolutional Neural Networks

In Lecture 5 we move from fully-connected neural networks to convolutional neural networks. We discuss some of the key historical milestones in the ...

Blue Planet II : The Prequel

This world-exclusive introduction to the show is narrated by series presenter Sir David Attenborough and set to an exclusive track developed by Hans Zimmer ...

Cloud AI: How to Get Started Injecting AI Into Your Applications (Cloud Next '18)

Every organization's goal in adopting AI is to solve real-world problems. How can developers and data scientists be more productive using a comprehensive set ...

Introduction To Marine Life Course: Whales, Dolphins & Porpoises

This course gives students of all ages a wonderful introduction to the marine life of British Columbia. Building on the Aquarium's successful research and ...


Lesson 7: Practical Deep Learning for Coders

EXOTIC CNN ARCHITECTURES; RNN FROM SCRATCH This is the last lesson of part 1 of Practical Deep Learning For Coders! This lesson is in two parts: 1) ...

