Urban Sound Classification using Neural Network

In this blog post, we will learn techniques to classify urban sounds into categories using machine learning.

For example, in a textual dataset, each word in the corpus becomes a feature and its tf-idf score becomes the value.
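
As a quick illustration, here is a minimal scikit-learn sketch of that representation (the toy corpus is made up):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["dogs bark loudly", "drills drill loudly"]  # made-up toy corpus
    vec = TfidfVectorizer()
    X = vec.fit_transform(corpus)               # rows: documents, columns: words
    print(vec.get_feature_names_out())          # each word becomes a feature
    print(X.toarray())                          # tf-idf scores become the values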

Today, we will first see what features can be extracted from sound data, and how easy it is to extract such features in Python using an open-source library called Librosa.

Urban Sound Classification, Part 1

Now that we have our training and testing sets ready, let’s implement a two-layer neural network in TensorFlow to classify each sound clip into its category.

The cross-entropy cost function will be minimised using a gradient descent optimizer; the code provided below initializes the cost function and optimizer.
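
A minimal sketch of what that code might look like, in the TensorFlow 1.x style of the period (the feature dimension, hidden-layer size and learning rate are assumptions):

    import tensorflow as tf

    n_dim = 193          # number of extracted audio features per clip (assumed)
    n_classes = 10       # UrbanSound8K has 10 sound classes
    n_hidden = 280       # hidden-layer size (assumed)
    learning_rate = 0.01

    X = tf.placeholder(tf.float32, [None, n_dim])
    Y = tf.placeholder(tf.float32, [None, n_classes])

    # First (hidden) layer.
    W1 = tf.Variable(tf.random_normal([n_dim, n_hidden]))
    b1 = tf.Variable(tf.random_normal([n_hidden]))
    h1 = tf.nn.tanh(tf.matmul(X, W1) + b1)

    # Second (output) layer with softmax activation.
    W2 = tf.Variable(tf.random_normal([n_hidden, n_classes]))
    b2 = tf.Variable(tf.random_normal([n_classes]))
    y_ = tf.nn.softmax(tf.matmul(h1, W2) + b2)

    # Cross-entropy cost, minimised with plain gradient descent.
    cost = tf.reduce_mean(-tf.reduce_sum(Y * tf.log(y_ + 1e-10), axis=1))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)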

Now let’s train the neural network model, visualise whether the cost decreases with each epoch, and make predictions on the test set, using the following code:
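
Continuing the sketch above (train_x, train_y, test_x and test_y are assumed to be NumPy arrays of extracted features and one-hot labels):

    import numpy as np
    import matplotlib.pyplot as plt

    training_epochs = 5000
    cost_history = []

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(training_epochs):
            _, c = sess.run([optimizer, cost], feed_dict={X: train_x, Y: train_y})
            cost_history.append(c)

        # Predict on the test set and measure accuracy.
        pred = sess.run(tf.argmax(y_, 1), feed_dict={X: test_x})
        print("Test accuracy:", np.mean(pred == np.argmax(test_y, 1)))

    plt.plot(cost_history)   # check that the cost decreases with each epoch
    plt.xlabel("epoch")
    plt.ylabel("cross-entropy cost")
    plt.show()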

In this tutorial, we saw how to extract features from a sound dataset and train a two-layer neural network model in TensorFlow to categorize sounds.

Getting Started with Audio Data Analysis using Deep Learning (with case study)

When you get started with data science, you start simple.

Also, a person’s body language can tell you many more things about them, because actions speak louder than words!

In this article, I intend to cover an overview of audio / voice processing with a case study so that you would get a hands-on introduction to solving audio processing problems.

Even when you think you are in a quiet environment, you tend to catch much more subtle sounds, like the rustling of leaves or the splatter of rain.

Examples of these formats are wav (Waveform Audio File), mp3 (MPEG-1 Audio Layer 3) and WMA (Windows Media Audio). If you think about what audio looks like, it is nothing but a wave-like format of data, where the amplitude of the audio changes with respect to time.

As with all unstructured data formats, audio data has a couple of preprocessing steps which have to be followed before it is presented for analysis.

Another way of representing audio data is by converting it into a different domain of data representation, namely the frequency domain.

When we sample audio data in the time domain, we require many more data points to represent the whole signal, and the sampling rate should be as high as possible.

Here, we separate one audio signal into 3 different pure signals, which can now be represented as three unique values in the frequency domain.
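
A minimal NumPy sketch of the same idea, mixing three made-up pure tones and recovering their frequencies:

    import numpy as np

    sr = 1000                                # sampling rate in Hz
    t = np.arange(0, 1, 1 / sr)              # one second of samples
    signal = (np.sin(2 * np.pi * 50 * t)     # 50 Hz tone
              + np.sin(2 * np.pi * 120 * t)  # 120 Hz tone
              + np.sin(2 * np.pi * 300 * t)) # 300 Hz tone

    spectrum = np.abs(np.fft.rfft(signal))   # magnitude spectrum
    freqs = np.fft.rfftfreq(len(signal), 1 / sr)
    print(freqs[spectrum > 100])             # peaks at 50, 120 and 300 Hz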

Now the next step is to extract features from these audio representations, so that our algorithm can work on them and perform the task it is designed for.
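
A common choice of feature for audio is the set of Mel-Frequency Cepstral Coefficients (MFCCs); here is a minimal librosa sketch (the file name is hypothetical):

    import librosa
    import numpy as np

    y, sr = librosa.load("dog_bark.wav")                 # hypothetical clip
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # 40 MFCCs per frame
    features = np.mean(mfccs, axis=1)                    # average over time
    print(features.shape)                                # a 40-dimensional vector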

The dataset contains 8732 sound excerpts (<= 4 s) of urban sounds from 10 classes, namely: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren and street_music. Here’s a sound excerpt from the dataset.

To install librosa, just type the command below in the command line. Then we can run a couple of lines of Python to load the data; when you load it, librosa gives you two objects: the audio time series as a numpy array and its sampling rate. A minimal sketch (the file path is hypothetical):
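
    pip install librosa

    import librosa
    data, sampling_rate = librosa.load("Train/2022.wav")  # hypothetical file path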

Let us now visually inspect our data and see if we can find patterns in it. We can see that it may be difficult to differentiate between jackhammer and drilling, but it is still easy to discern between dog_barking and drilling.
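
A minimal sketch using librosa’s display helpers (waveplot was the function name in librosa versions of that era; newer versions call it waveshow):

    import matplotlib.pyplot as plt
    import librosa.display

    plt.figure(figsize=(12, 4))
    librosa.display.waveplot(data, sr=sampling_rate)  # amplitude against time
    plt.show()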

We will take a similar approach to the one we used for the Age Detection problem: look at the class distributions and simply predict the most frequent class for all test cases.
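
A minimal pandas sketch of that baseline (the label values are made up):

    import pandas as pd

    labels = pd.Series(["dog_bark", "drilling", "dog_bark", "siren"])  # toy training labels
    most_common = labels.value_counts().idxmax()  # the most frequent class
    predictions = [most_common] * 3               # predict it for every test case
    print(most_common, predictions)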

With this introduction in hand, I hope you will try your own algorithms on the Urban Sound challenge, or try solving your own audio problems in daily life.

Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification.

In Figure 1 above, a ConvNet is able to recognize scenes and the system is able to suggest relevant captions (“a soccer player is kicking a soccer ball”) while Figure 2 shows an example of ConvNets being used for recognizing everyday objects, humans and animals.

There have been several new architectures proposed in the recent years which are improvements over the LeNet, but they all use the main concepts from the LeNet and are relatively easier to understand if you have a clear understanding of the former.

The Convolutional Neural Network in Figure 3 is similar in architecture to the original LeNet and classifies an input image into four categories: dog, cat, boat or bird (the original LeNet was used mainly for character recognition tasks).

As evident from the figure above, on receiving a boat image as input, the network correctly assigns the highest probability for boat (0.94) among all four categories.

There are four main operations in the ConvNet shown in Figure 3 above: convolution, non-linearity (ReLU), pooling (sub-sampling) and classification (the fully connected layer). These operations are the basic building blocks of every Convolutional Neural Network, so understanding how they work is an important step to developing a sound understanding of ConvNets.

The primary purpose of Convolution in case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. We will not go into the mathematical details of Convolution here, but will try to understand how it works over images.

Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255; the green matrix below is a special case where pixel values are only 0 and 1):

Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation in Figure 5. Take a moment to understand how the computation is being done.

We slide the orange matrix over our original image (green) by 1 pixel (also called ‘stride’) and for every position, we compute element wise multiplication (between the two matrices) and add the multiplication outputs to get the final integer which forms a single element of the output matrix (pink).
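
A minimal NumPy sketch of this sliding-window computation, using a 5 x 5 binary image and a 3 x 3 filter (the values are illustrative):

    import numpy as np

    image = np.array([[1, 1, 1, 0, 0],
                      [0, 1, 1, 1, 0],
                      [0, 0, 1, 1, 1],
                      [0, 0, 1, 1, 0],
                      [0, 1, 1, 0, 0]])

    kernel = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 0, 1]])

    out = np.zeros((3, 3), dtype=int)            # (5 - 3 + 1) x (5 - 3 + 1) output
    for i in range(3):
        for j in range(3):
            window = image[i:i + 3, j:j + 3]     # slide by 1 pixel (stride 1)
            out[i, j] = np.sum(window * kernel)  # element-wise multiply, then sum

    print(out)  # the 3 x 3 feature map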

In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as the number of filters, filter size, architecture of the network, etc.).

In case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window.

In the network shown in Figure 11, pooling operation is applied separately to each feature map (notice that, due to this, we get three output maps from three input maps).
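
A minimal NumPy sketch of 2 x 2 max pooling with stride 2 over a single feature map (the values are illustrative):

    import numpy as np

    fmap = np.array([[1, 1, 2, 4],
                     [5, 6, 7, 8],
                     [3, 2, 1, 0],
                     [1, 2, 3, 4]])

    # Group the map into 2 x 2 windows and keep the largest element of each.
    pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled)  # [[6 8]
                   #  [3 4]]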

Together these layers extract the useful features from the images, introduce non-linearity in our network and reduce feature dimension while aiming to make the features somewhat equivariant to scale and translation [18].

The output from the convolutional and pooling layers represent high-level features of the input image. The purpose of the Fully Connected layer is to use these features for classifying the input image into various classes based on the training dataset.

For example, the image classification task we set out to perform has four possible outputs as shown in Figure 14 below (note that Figure 14 does not show connections between the nodes in the fully connected layer).

This is ensured by using the Softmax as the activation function in the output layer of the Fully Connected Layer. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
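
A minimal NumPy sketch of that squashing behaviour:

    import numpy as np

    def softmax(scores):
        exps = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
        return exps / exps.sum()

    print(softmax(np.array([2.0, 1.0, 0.1, -1.0])))  # values in (0, 1) that sum to 1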

When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples).

In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize. For example, in Image Classification a ConvNet may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers [14].

These features were learnt using a Convolutional Deep Belief Network, and the figure is included here just to demonstrate the idea (this is only an example: real-life convolution filters may detect objects that have no meaning to humans).

The input image contains 1024 pixels (32 x 32 image) and the first Convolution layer (Convolution Layer 1) is formed by convolution of six unique 5 × 5 (stride 1) filters with the input image.
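
As a sanity check on the sizes: a 5 x 5 filter slid with stride 1 over a 32 x 32 image fits in 32 - 5 + 1 = 28 positions along each axis, so each of the six filters produces a 28 x 28 feature map.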

For a more thorough understanding of some of these concepts, I would encourage you to go through the notes from Stanford’s course on ConvNets as well as other excellent resources mentioned under References below. If you face any issues understanding any of the above concepts or have questions / suggestions, feel free to leave a comment below.

Discover Feature Engineering, How to Engineer Features and How to Get Good at It

Feature engineering is an informal topic, but one that is absolutely known and agreed to be key to success in applied machine learning.

Feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success.

The flexibility of good features will allow you to use less complex models that are faster to run, easier to understand and easier to maintain.

With good features, you are closer to the underlying problem and a representation of all the data you have available and could use to best characterize that underlying problem.

Here is how I define feature engineering: feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

You can see the dependencies in this definition. As Tomasz Malisiewicz put it in his answer to “What is feature engineering?”: feature engineering is manually designing what the input x’s should be.

In this context, feature engineering asks: what is the best representation of the sample data to learn a solution to your problem?

I would think to myself “I’m doing feature engineering now” and I would pursue the question “How can I decompose or aggregate raw data to better describe the underlying problem?” The goal was right, but the approach was one of many.

Feature importance scores can also provide you with information that you can use to extract or construct new features, similar but different to those that have been estimated to be useful.

More complex predictive modeling algorithms perform feature importance and selection internally while constructing their model.

Common examples include image, audio, and textual data, but could just as easily include tabular data with millions of attributes.

Key to feature extraction is that the methods are automatic (although may need to be designed and constructed from simpler methods) and solve the problem of unmanageably high dimensional data, most typically used for analog observations stored in digital formats.

Feature selection algorithms may use a scoring method to rank and choose features, such as correlation or other feature importance methods.

More advanced methods may search subsets of features by trial and error, creating and evaluating models automatically in pursuit of the objectively most predictive sub-group of features.

Stepwise regression is an example of an algorithm that automatically performs feature selection as part of the model construction process.

Regularization methods like LASSO and ridge regression may also be considered algorithms with feature selection baked in, as they actively seek to remove or discount the contribution of features as part of the model building process.
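
A minimal scikit-learn sketch of both flavours, univariate scoring and L1 regularization (the dataset and parameter values are illustrative):

    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import Lasso

    X, y = load_diabetes(return_X_y=True)

    # Rank features by a univariate score and keep the top 5.
    selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)
    print("kept columns:", selector.get_support(indices=True))

    # LASSO drives the coefficients of unhelpful features to exactly zero.
    lasso = Lasso(alpha=0.5).fit(X, y)
    print("zeroed-out features:", (lasso.coef_ == 0).sum())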

This requires spending a lot of time with actual sample data (not aggregates) and thinking about the underlying form of the problem, structures in the data and how best to expose them to predictive modeling algorithms.

With tabular data, it often means a mixture of aggregating or combining features to create new features, and decomposing or splitting features to create new features.

With image data, it can often mean enormous amounts of time prescribing automatic filters to pick out relevant structures.

This is the part of feature engineering that is most often talked about as an art form, the part that is credited with the importance and signalled as the differentiator in competitive machine learning.

They have been shown to automatically, and in an unsupervised or semi-supervised way, learn abstract representations of features (a compressed form) that in turn have supported state-of-the-art results in domains such as speech recognition, image classification, object recognition and other areas.

They cannot (yet, or easily) inform you and the process on how to create more similar and different features like those that are doing well, on a given problem or on similar problems in the future.

The process of applied machine learning (for lack of a better name), in a broad-brush sense, involves lots of activities.

Up front is problem definition, next is data selection and preparation, in the middle is model preparation, evaluation and tuning, and at the end is the presentation of results.

The traditional idea of “Transforming Data” from a raw state to a state suitable for modeling is where feature engineering fits in.

You can see that before feature engineering, we are munging our data into a format we can even look at, and just before that we are collating and denormalizing data from databases into some kind of central picture.

It suggests a strong interaction with modeling, reminding us of the interplay of devising features and testing them against the coalface of our test harness and final performance measures.

This also suggests we may need to leave the data in a form suitable for the chosen modeling algorithm, such as normalize or standardize the features as a final step.

This sounds like a preprocessing step, and it probably is, but it helps us consider what types of finishing touches the data needs before effective modeling.
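
A minimal scikit-learn sketch of such a finishing touch (the raw values are made up):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[6289.0, 2.0], [1200.0, 7.0], [3050.0, 4.0]])  # made-up raw features
    X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
    print(X_std)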

The process might look as follows: brainstorm features, devise features, select features, and evaluate the resulting models. You need a well-defined problem so that you know when to stop this process and move on to trying other models, other model configurations, ensembles of models, and so on.

You could create a new binary feature called “Has_Color” and assign it a value of “1” when an item has a color and “0” when the color is unknown.

These additional features could be used instead of the Item_Color feature (if you wanted to try a simpler linear model) or in addition to it (if you wanted to get more out of something like a decision tree).
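
A minimal pandas sketch of both options (the item data is made up):

    import pandas as pd

    items = pd.DataFrame({"Item_Color": ["red", "blue", None, "green"]})

    # Option 1: a binary indicator for whether the color is known.
    items["Has_Color"] = items["Item_Color"].notna().astype(int)

    # Option 2: one indicator column per color value, usable alongside Item_Color.
    items = items.join(pd.get_dummies(items["Item_Color"], prefix="Is"))
    print(items)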

If you suspect there are relationships between times and other attributes, you can decompose a date-time into constituent parts that may allow models to discover and exploit these relationships.

You could create a new ordinal feature called Part_Of_Day with 4 values Morning, Midday, Afternoon, Night with whatever hour boundaries you think are relevant.

You can use similar approaches to pick out time of week relationships, time of month relationships and various structures of seasonality across a year.
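
A minimal pandas sketch of such a decomposition (the timestamps and hour boundaries are illustrative):

    import pandas as pd

    events = pd.DataFrame({"timestamp": pd.to_datetime(
        ["2017-03-06 08:15", "2017-03-06 13:40", "2017-03-06 22:05"])})

    events["Hour_Of_Day"] = events["timestamp"].dt.hour
    events["Day_Of_Week"] = events["timestamp"].dt.dayofweek
    events["Month"] = events["timestamp"].dt.month
    # Bin hours into Part_Of_Day; late evening and early morning both count as Night.
    events["Part_Of_Day"] = pd.cut(events["timestamp"].dt.hour,
                                   bins=[0, 6, 12, 18, 22, 24],
                                   labels=["Night", "Morning", "Midday",
                                           "Afternoon", "Night"],
                                   right=False, ordered=False)
    print(events)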

That magic domain number could be used to create a new binary feature Item_Above_4kg with a value of “1” for our example of 6289 grams.
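
A minimal pandas sketch of that indicator, with 4000 grams as the assumed cutoff:

    import pandas as pd

    items = pd.DataFrame({"Item_Weight": [6289, 1200, 3050]})  # weights in grams
    items["Item_Above_4kg"] = (items["Item_Weight"] > 4000).astype(int)
    print(items)  # 6289 g maps to 1, matching the example above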

In this case you may want to go back to the data collection step and create new features in addition to this aggregate and try to expose more temporal structure in the purchases, like perhaps seasonality.

The simple structure allowed the team to use highly performant but very simple linear methods to achieve the winning predictive model.

The paper provides details of how specific temporal and other non-linearities in the problem structure were reduced to simple composite binary indicators.

The Heritage Health Prize was a $3 million prize awarded to the team that could best predict which patients would be admitted to hospital within the next year.

If you are working with digital representations of analog observations like images, video, sound or text, you might like to dive deeper into some feature extraction literature.

Creating a Modern OCR Pipeline Using Computer Vision and DeepLearning

In this post we will take you behind the scenes on how we built a state-of-the-art Optical Character Recognition (OCR) pipeline for our mobile document scanner.

Our mobile document scanner only outputs an image — any text in the image is just a set of pixels as far as the computer is concerned, and can’t be copy-pasted, searched for, or any of the other things you can do with text.

Once OCR is run, we can then enable features for our Dropbox Business users such as searching inside scanned documents and copying and pasting their text. When we built the first version of the mobile document scanner, we used a commercial off-the-shelf OCR library, in order to do product validation before diving too deep into creating our own machine learning-based OCR system.

Second, the commercial system was tuned for the traditional OCR world of images from flat bed scanners, whereas our operating scenario was much tougher, because mobile phone photos are far more unconstrained, with crinkled or curved documents, shadows and uneven lighting, blurriness and reflective highlights, etc.

Traditionally, OCR systems were heavily pipelined, with hand-built and highly-tuned modules taking advantage of all kinds of conditions they could assume to be true for images captured using a flatbed scanner.

For example, one module might find lines of text, then the next module would find words and segment letters, then another module might apply different techniques to each piece of a character to figure out what the character is, etc.

The last few years has seen the successful application of deep learning to numerous problems in computer vision that have given us powerful new tools for tackling OCR without having to replicate the complex processing pipelines of the past, relying instead on large quantities of data to have the system automatically learn how to do many of the previously manually-designed steps.

We use a wide variety of safety precautions with such user-donated data, including never keeping donated data on local machines in permanent storage, maintaining extensive auditing, requiring strong authentication to access any of it, and more.

Most current machine learning techniques are strongly-supervised, meaning that they require explicit manual labeling of input data so that the algorithms can learn to make predictions themselves.

DropTurk contains a standard list of annotation task UI templates that we can rapidly assemble and customize for new datasets and labeling tasks, which enables us to annotate our datasets quite fast.

DropTurk UI for adding ground truth data for word images

Our DropTurk platform includes dashboards to get an overview of past jobs, watch the progress of current jobs, and access the results securely.

DropTurk Dashboard

Using DropTurk, we collected both a word-level dataset, which has images of individual words and their annotated text, as well as a full document-level dataset, which has images of full documents (like receipts) and fully transcribed text.

The visual features that are output by the CNN are then fed as a sequence to a Bidirectional LSTM (Long Short Term Memory) — common in speech recognition systems — which makes sense of our word “pieces,” and finally arrives at a text prediction using a Connectionist Temporal Classification (CTC) layer.
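
This is not Dropbox’s actual code, but a minimal Keras sketch of that CNN-to-Bidirectional-LSTM shape may make the data flow concrete (the input shape, all layer sizes and the charset size are assumptions; the CTC loss is only indicated in a comment):

    import tensorflow as tf
    from tensorflow.keras import layers

    NUM_CLASSES = 80  # assumed charset size, plus one slot for the CTC "blank" token
    image = layers.Input(shape=(32, 128, 1), name="word_image")

    # CNN: extract visual features while keeping the width axis as the "time" axis.
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(image)
    x = layers.MaxPooling2D((2, 2))(x)   # 16 x 64
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)   # 8 x 32

    # Make each image column one timestep of the sequence.
    x = layers.Permute((2, 1, 3))(x)     # (width, height, channels)
    x = layers.Reshape((32, 8 * 64))(x)

    # Bidirectional LSTM reads the column features in both directions.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

    # Per-timestep character probabilities; training would use a CTC loss
    # (for example tf.nn.ctc_loss) to align predictions with the label text.
    logits = layers.Dense(NUM_CLASSES, activation="softmax")(x)

    model = tf.keras.Model(image, logits)
    model.summary()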

However, in most computer vision problems it’s currently too difficult to generate realistic-enough images for training algorithms: the variety of imaging environments and transformations is too varied to effectively simulate.

(One promising area of current research is Generative Adversarial Networks (GANs), which seem to be well-suited to generating realistic data.) Fortunately, our problem in this case is a perfect match for using synthetic data, since the types of images we need to generate are quite constrained and can thus be rendered automatically.

Synthetically generated word images

We started simply with all three, with words coming from a collection of Project Gutenberg books from the 19th century, about a thousand fonts we collected, and some simple distortions like rotations, underlines, and blurs.

It tracked everything needed for machine learning reproducibility, such as a unique git hash for the code that was used, pointers to S3 with generated data sets and results, evaluation results, graphs, a high-level description of the goal of that experiment, and more.

Precision refers to the fraction of words returned by the deep net that were actually correct, while recall refers to the fraction of evaluation data that is correctly predicted by the deep net.
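
A toy illustration of those two definitions (the word sets are made up):

    predicted = {"invoice", "total", "42.00", "thanks"}    # words the deep net returned
    truth = {"invoice", "total", "42.00", "visa", "2017"}  # ground-truth words

    correct = predicted & truth
    precision = len(correct) / len(predicted)  # 3/4 = 0.75
    recall = len(correct) / len(truth)         # 3/5 = 0.60
    print(precision, recall)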

Screenshot from early end-to-end experiments in our lab notebook

At a certain point our synthetic data pipeline was resulting in a Single Word Accuracy (SWA) percentage in the high-80s on our OCR benchmark set, and we decided we were done with that portion.

However, most documents don’t just have a handful of words — they have hundreds or even thousands of them, i.e., a few orders of magnitude more objects than most neural network-based object detection systems were capable of finding at the time.

Another important consideration was that traditional computer vision approaches using feature detectors might be easier to debug, as neural networks are notoriously opaque and have internal representations that are hard to understand and interpret.

This thus requires the word detector to sometimes include more than one word in a single detection box, or to chop a single word in half if it is too long to fit the deep net’s input size.

Another bit of trickiness is dealing with images with white text on dark backgrounds, as opposed to dark text on white backgrounds, forcing our MSER detector to be able to handle both scenarios.

We then used this confidence score to bucket predictions in three ways. We also had to deal with issues caused by the previously mentioned fixed receptive image size of the Word Deep Net: namely, that a single “word” window might actually contain multiple words or only part of a very long word.

Finally, now that we had a fully working end-to-end system, we generated more than ten million synthetic words and trained our neural net for a very large number of iterations to squeeze out as much accuracy as we could.

Our jail infrastructure allows us to efficiently set up expensive resources a single time at startup, such as loading our trained models, then have these resources be cloned into a jail to satisfy a single OCR request.

We ended up essentially rewriting OpenCV’s C++ MSER implementation in a more modular way to avoid duplicating slow work when doing two passes (to be able to handle both black-on-white text as well as white-on-black text).

We then used these donated images, being very careful about their privacy, to do a qualitative blackbox test of both OCR systems end-to-end, and were elated to find that we indeed performed the same or better than the older commercial OCR SDK, allowing us to ramp up our system to 100% of Dropbox Business users.

Next, we tested whether fine-tuning our trained deep net on these donated documents, versus our hand-chosen fine-tuning image suite, helped accuracy.

We built an orientation predictor using another deep net based on the Inception Resnet v2 architecture, changed the final layer to predict orientation, collected an orientation training and validation data set, and fine-tuned from an ImageNet-trained model biased towards our own needs.

Finally, we were surprised to find some tough issues with the PDF file format containing our scanned OCRed hidden layer in Apple’s native Preview application. Most PDF renderers respect spaces embedded in the text for copy and paste, but Apple’s Preview application performs its own heuristics to determine word boundaries based on text position.

In all, this entire round of researching, productionization, and refinement took about 8 months, at the end of which we had built and deployed a state-of-the-art OCR pipeline to millions of users using modern computer vision and deep neural network techniques.

Open Source TensorFlow Models (Google I/O '17)

Come to this talk for a tour of the latest open source TensorFlow models for Image Classification, Natural Language Processing, and Computer Generated ...

Lesson 1: Deep Learning 2018

NB: Please go to to view this video since there is important updated information there. If you have questions, use the forums at ..

Convolutional Neural Networks - The Math of Intelligence (Week 4)

Convolutional Networks allow us to classify images, generate them, and can even be applied to other types of data. We're going to build one in numpy that can ...

Lec03 Feature Extraction with Python (Hands on)

Introduction to Python2.7 for visual computing, reading images, displaying images, computing features and saving computed matrices and files for later use.

What's New with TensorFlow? (Cloud Next '18)

As fast as Machine Learning and AI are evolving, so is TensorFlow. In this session you'll get a tour of everything new in TensorFlow, from the release of ...

Effective TensorFlow for Non-Experts (Google I/O '17)

TensorFlow is Google's machine learning framework. In this talk, you will learn how to use TensorFlow effectively. TensorFlow offers high level interfaces like ...

Lecture 8: Recurrent Neural Networks and Language Models

Lecture 8 covers traditional language models, RNNs, and RNN language models. Also reviewed are important training problems and tricks, RNNs for other ...

Lecture 9: Machine Translation and Advanced Recurrent LSTMs and GRUs

Lecture 9 recaps the most important concepts and equations covered so far followed by machine translation and fancy RNN models tackling MT. Key phrases: ...
