AI News, Learning Plan - Artificial Intelligence Applications on Microsoft Azure
- On Wednesday, June 6, 2018
Learning Plan - Artificial Intelligence Applications on Microsoft Azure
The amount of visual data being produced is growing at a truly incredible pace, largely due to the number of sensors and cameras in the world (which by some estimates is larger than the number of humans on the planet) and the ubiquity of cheap data storage.
Prior to the deep learning revolution, practitioners attempting to tackle computer vision problems like object detection would need years of experience to learn how to identify the necessary features from visual data to solve the problem at hand.
CNTK enables developers to build custom deep learning solutions that can scale across multiple GPUs and systems, and can be embedded in small mobile devices, enabling the full end-to-end training-to-production pipeline for deploying deep learning algorithms in the wild.
Essentials of Deep Learning: Exploring Unsupervised Deep Learning Algorithms for Computer Vision
There are two important concepts of an AutoEncoder, which make it a very powerful algorithm for unsupervised learning problems. We will take an example of an AutoEncoder trained on images of cats, each of size 100×100.
So the input dimension is 10,000, and the AutoEncoder has to represent all this information in a vector of size 10, which makes the model learn only the important parts of the images so that it can re-create the original image just from this vector.
The task of the encoder is to convert the input to a lower dimensional representation, while the task of the decoder is to recreate the input from this lower dimensional representation.
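The encoder/decoder split can be made concrete with a toy example. The sketch below is a minimal linear autoencoder trained with plain stochastic gradient descent, using only the Python standard library. The sizes (4 inputs, a code of size 2), the learning rate, and the synthetic data are all assumptions chosen to keep the example small; a real image autoencoder would use nonlinear layers and a deep learning framework.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def train_autoencoder(data, code_size, epochs=300, lr=0.02, seed=0):
    """Train a linear autoencoder with SGD; return weights and loss history."""
    rng = random.Random(seed)
    n_in = len(data[0])
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(code_size)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(code_size)] for _ in range(n_in)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            code = matvec(w_enc, x)       # encoder: input -> low-dimensional code
            recon = matvec(w_dec, code)   # decoder: code -> reconstructed input
            err = [r - xi for r, xi in zip(recon, x)]
            total += sum(e * e for e in err)
            # Gradient of the squared error with respect to the code,
            # computed before the decoder weights are changed.
            grad_code = [sum(w_dec[i][k] * err[i] for i in range(n_in))
                         for k in range(code_size)]
            for i in range(n_in):
                for k in range(code_size):
                    w_dec[i][k] -= lr * err[i] * code[k]
            for k in range(code_size):
                for j in range(n_in):
                    w_enc[k][j] -= lr * grad_code[k] * x[j]
        history.append(total / len(data))  # mean squared error for this epoch
    return w_enc, w_dec, history


# Toy data (assumption): 4 values per sample, driven by only 2 underlying factors.
data_rng = random.Random(1)
data = []
for _ in range(20):
    a, b = data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)
    data.append([a, b, 0.5 * (a + b), 0.5 * (a - b)])

w_enc, w_dec, history = train_autoencoder(data, code_size=2)
code = matvec(w_enc, data[0])
print(len(data[0]), "inputs ->", len(code), "code values")
print("reconstruction error:", history[0], "->", history[-1])
```

Because the toy data really has only two degrees of freedom, a code of size 2 is enough for the decoder to rebuild all four inputs, which is the same idea as squeezing a 10,000-pixel image into a short vector.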
We have a total of 70,000 images, out of which 60,000 are a part of train images with the label of the type of apparel (total classes: 10) and the remaining 10,000 images are unlabelled (known as test images).
A straightforward task could be to compress a given image into discrete bits of information, and reconstruct the image back from these discrete bits.
One optimization is to make this representation sparse, so that we require even fewer bits than before to transfer the compressed image properties and reconstruct the original image at the other end.
Technically speaking, to make representations more compact, we add a sparsity constraint on the activity of the hidden representations (called an activity regularizer in Keras), so that fewer units get activated at a given time to give us an optimal reconstruction.
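A sparsity constraint of this kind is commonly expressed as an extra penalty term added to the reconstruction loss. The sketch below assumes an L1 penalty on the hidden activations; the function name `sparse_loss` and the weight `l1_weight` are illustrative and not taken from the original article. In Keras, the rough equivalent is passing an L1 `activity_regularizer` to the encoding layer.

```python
def sparse_loss(x, recon, hidden, l1_weight=1e-3):
    """Reconstruction error plus an L1 penalty on hidden activations.

    x         -- original input values
    recon     -- reconstructed values produced by the decoder
    hidden    -- activations of the hidden (code) layer
    l1_weight -- strength of the sparsity penalty (assumed hyperparameter)
    """
    mse = sum((r - xi) ** 2 for r, xi in zip(recon, x)) / len(x)
    l1_penalty = sum(abs(h) for h in hidden)
    return mse + l1_weight * l1_penalty


# Two codes that reconstruct the input equally well: the sparse one costs less.
x = [1.0, 0.0]
recon = [1.0, 0.0]
dense_code = [0.5, 0.5, 0.5]
sparse_code = [0.0, 0.0, 0.5]
print(sparse_loss(x, recon, dense_code, l1_weight=0.1))
print(sparse_loss(x, recon, sparse_code, l1_weight=0.1))
```

With equal reconstruction quality, the optimizer is pushed toward the code that activates fewer units, which is the behavior the sparsity constraint is meant to encourage.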
Learning can be supervised, semi-supervised or unsupervised. Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design and board game programs, where they have produced results comparable to and in some cases superior to human experts. Deep learning models are vaguely inspired by information processing and communication patterns in biological nervous systems yet have various differences from the structural and functional properties of biological brains, which make them incompatible with neuroscience evidence. Deep learning is a class of machine learning algorithms that: Most modern deep learning models are based on an artificial neural network, although they can also include propositional formulas or latent variables organized layer-wise in deep generative models such as the nodes in Deep Belief Networks and Deep Boltzmann Machines. In deep learning, each level learns to transform its input data into a slightly more abstract and composite representation.
Examples of deep structures that can be trained in an unsupervised manner are neural history compressors and deep belief networks. Deep neural networks are generally interpreted in terms of the universal approximation theorem or probabilistic inference. The universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions. In 1989, the first proof was published by George Cybenko for sigmoid activation functions and was generalised to feed-forward multi-layer architectures in 1991 by Kurt Hornik. The probabilistic interpretation derives from the field of machine learning.
More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function. The probabilistic interpretation led to the introduction of dropout as regularizer in neural networks. The probabilistic interpretation was introduced by researchers including Hopfield, Widrow and Narendra and popularized in surveys such as the one by Bishop. The term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986, and to Artificial Neural Networks by Igor Aizenberg and colleagues in 2000, in the context of Boolean threshold neurons. The first general, working learning algorithm for supervised, deep, feedforward, multilayer perceptrons was published by Alexey Ivakhnenko and Lapa in 1965. A 1971 paper described a deep network with 8 layers trained by the group method of data handling algorithm. Other deep learning working architectures, specifically those built for computer vision, began with the Neocognitron introduced by Kunihiko Fukushima in 1980. In 1989, Yann LeCun et al.
Each layer in the feature extraction module extracted features with growing complexity regarding the previous layer. In 1995, Brendan Frey demonstrated that it was possible to train (over two days) a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton. Many factors contribute to the slow speed, including the vanishing gradient problem analyzed in 1991 by Sepp Hochreiter. Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of ANNs' computational cost and a lack of understanding of how the brain wires its biological networks.
In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks. Later it was combined with connectionist temporal classification (CTC) in stacks of LSTM RNNs. In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search. In 2006, publications by Geoff Hinton, Ruslan Salakhutdinov, Osindero and Teh  showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation. The papers referred to learning for deep belief nets.
It was believed that pre-training DNNs using generative models of deep belief nets (DBN) would overcome the main difficulties of neural nets. However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation when using DNNs with large, context-dependent output layers produced error rates dramatically lower than then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) and also than more-advanced generative model-based systems. The nature of the recognition errors produced by the two types of systems was characteristically different, offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems. Analysis around 2009-2010, contrasted the GMM (and other generative speech models) vs.
While there, Ng determined that GPUs could increase the speed of deep-learning systems by about 100 times. In particular, GPUs are well-suited for the matrix/vector math involved in machine learning. GPUs speed up training algorithms by orders of magnitude, reducing running times from weeks to days. Specialized hardware and algorithm optimizations can be used for efficient processing. In 2012, a team led by Dahl won the 'Merck Molecular Activity Challenge' using multi-task deep neural networks to predict the biomolecular target of one drug. In 2014, Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products and drugs and won the 'Tox21 Data Challenge' of NIH, FDA and NCATS. Significant additional impacts in image or object recognition were felt from 2011 to 2012.
DNNs have proven themselves capable, for example, of a) identifying the style period of a given painting, b) 'capturing' the style of a given painting and applying it in a visually pleasing manner to an arbitrary photograph, and c) generating striking imagery based on random visual input fields. Neural networks have been used for implementing language models since the early 2000s. LSTM helped to improve machine translation and language modeling. Other key techniques in this field are negative sampling and word embedding.
A compositional vector grammar can be thought of as probabilistic context free grammar (PCFG) implemented by an RNN. Recursive auto-encoders built atop word embeddings can assess sentence similarity and detect paraphrasing. Deep neural architectures provide the best results for constituency parsing, sentiment analysis, information retrieval, spoken language understanding, machine translation, contextual entity linking, writing style recognition, text classification and others. Google Translate (GT) uses a large end-to-end long short-term memory network. GNMT uses an example-based machine translation method in which the system 'learns from millions of examples.' It translates 'whole sentences at a time, rather than pieces.'
These failures are caused by insufficient efficacy (on-target effect), undesired interactions (off-target effects), or unanticipated toxic effects. Research has explored use of deep learning to predict biomolecular target, off-target and toxic effects of environmental chemicals in nutrients, household products and drugs. AtomNet is a deep learning system for structure-based rational drug design. AtomNet was used to predict novel candidate biomolecules for disease targets such as the Ebola virus and multiple sclerosis. Deep reinforcement learning has been used to approximate the value of possible direct marketing actions, defined in terms of RFM variables.
An autoencoder ANN was used in bioinformatics, to predict gene ontology annotations and gene-function relationships. In medical informatics, deep learning was used to predict sleep quality based on data from wearables and predictions of health complications from electronic health record data. Deep learning has also shown efficacy in healthcare. Finding the appropriate mobile audience for mobile advertising is always challenging, since many data points must be considered and assimilated before a target segment can be created and used in ad serving by any ad server. Deep learning has been used to interpret large, many-dimensioned advertising datasets.
On the one hand, several variants of the backpropagation algorithm have been proposed in order to increase its processing realism. Other researchers have argued that unsupervised forms of deep learning, such as those based on hierarchical generative models and deep belief networks, may be closer to biological reality. In this respect, generative neural network models have been related to neurobiological evidence about sampling-based processing in the cerebral cortex. Although a systematic comparison between the human brain organization and the neuronal encoding in deep networks has not yet been established, several analogies have been reported.
systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning.' As an alternative to this emphasis on the limits of deep learning, one author speculated that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between 'old master' and amateur figure drawings, and hypothesized that such a sensitivity might represent the rudiments of a non-trivial machine empathy. This same author proposed that this would be in line with anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity. In further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks attempting to discern within essentially random data the images on which they were trained demonstrate a visual appeal: the original research notice received well over 1,000 comments, and was the subject of what was for a time the most frequently accessed article on The Guardian's web site.
Some deep learning architectures display problematic behaviors, such as confidently classifying unrecognizable images as belonging to a familiar category of ordinary images and misclassifying minuscule perturbations of correctly classified images. Goertzel hypothesized that these behaviors are due to limitations in their internal representations and that these limitations would inhibit integration into heterogeneous multi-component AGI architectures. These issues may possibly be addressed by deep learning architectures that internally form states homologous to image-grammar decompositions of observed entities and events. Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules and is a basic goal of both human language acquisition and AI. As deep learning moves from the lab into the world, research and experience show that artificial neural networks are vulnerable to hacks and deception.
Computational Intelligence and Neuroscience
Over the last years deep learning methods have been shown to outperform previous state-of-the-art machine learning techniques in several fields, with computer vision being one of the most prominent cases.
A brief account of their history, structure, advantages, and limitations is given, followed by a description of their applications in various computer vision tasks, such as object detection, face recognition, action and activity recognition, and human pose estimation.
Deep learning allows computational models of multiple processing layers to learn and represent data with multiple levels of abstraction mimicking how the brain perceives and understands multimodal information, thus implicitly capturing intricate structures of large‐scale data.
The recent surge of interest in deep learning methods is due to the fact that they have been shown to outperform previous state-of-the-art techniques in several tasks, as well as the abundance of complex data from different sources (e.g., visual, audio, medical, social, and sensor).
A series of major contributions in the field is presented in Table 1, including LeNet and Long Short-Term Memory, leading up to today’s “era of deep learning.” One of the most substantial breakthroughs in deep learning came in 2006, when Hinton et al.
Guiding the training of intermediate levels of representation using unsupervised learning, performed locally at each level, was the main principle behind a series of developments that brought about the last decade’s surge in deep architectures and deep learning algorithms.
Among the most prominent factors that contributed to the huge boost of deep learning are the appearance of large, high-quality, publicly available labelled datasets, along with the empowerment of parallel GPU computing, which enabled the transition from CPU-based to GPU-based training thus allowing for significant acceleration in deep models’ training.
Additional factors may have played a lesser role as well, such as the alleviation of the vanishing gradient problem owing to the disengagement from saturating activation functions (such as hyperbolic tangent and the logistic function), the proposal of new regularization techniques (e.g., dropout, batch normalization, and data augmentation), and the appearance of powerful frameworks like TensorFlow, Theano, and MXNet, which allow for faster prototyping.
Deep learning has fueled great strides in a variety of computer vision problems, such as object detection (e.g., [8, 9]), motion tracking (e.g., [10, 11]), action recognition (e.g., [12, 13]), human pose estimation (e.g., [14, 15]), and semantic segmentation (e.g., [16, 17]).
For example, Long Short-Term Memory (LSTM), in the category of Recurrent Neural Networks, although of great significance as a deep learning scheme, is not presented in this review, since it is predominantly applied in problems such as language modeling, text classification, handwriting recognition, machine translation, and speech/music recognition, and less so in computer vision problems.
The overview is intended to be useful to computer vision and multimedia analysis researchers, as well as to general machine learning researchers, who are interested in the state of the art in deep learning for computer vision tasks, such as object detection and recognition, face recognition, action/activity recognition, and human pose estimation.
In Section 3, we describe the contribution of deep learning algorithms to key computer vision tasks, such as object detection and recognition, face recognition, action/activity recognition, and human pose estimation;
The first computational models based on these local connectivities between neurons and on hierarchically organized transformations of the image are found in Neocognitron, which describes that when neurons with the same parameters are applied on patches of the previous layer at different locations, a form of translational invariance is acquired.
Every layer of a CNN transforms the input volume to an output volume of neuron activation, eventually leading to the final fully connected layers, resulting in a mapping of the input data to a 1D feature vector.
A detailed theoretical analysis of max pooling and average pooling performances has been given in the literature, and separate work has shown that max pooling can lead to faster convergence, select superior invariant features, and improve generalization.
There are also a number of other variations of the pooling layer in the literature, each inspired by different motivations and serving distinct needs, for example, stochastic pooling, spatial pyramid pooling [28, 29], and def-pooling.
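The difference between the two basic pooling variants is easy to see on a small grid. The following is a minimal sketch of non-overlapping pooling over square windows; the function name `pool2d` and the 2×2 window size are illustrative choices, not something prescribed by the review.

```python
def pool2d(image, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows of a 2D list."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            if mode == "max":
                pooled_row.append(max(window))                 # keep the strongest response
            else:
                pooled_row.append(sum(window) / len(window))   # average the responses
        pooled.append(pooled_row)
    return pooled


image = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 10, 13, 14],
    [11, 12, 15, 16],
]
print(pool2d(image, mode="max"))   # [[4, 8], [12, 16]]
print(pool2d(image, mode="avg"))   # [[2.5, 6.5], [10.5, 14.5]]
```

Both variants halve each spatial dimension here; max pooling keeps only the largest activation in each window, while average pooling blends all four.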
This construction is equivalent to a convolution operation, followed by an additive bias term and sigmoid function: $$y^{(d)} = \sigma\left(W y^{(d-1)} + b\right),$$ where $d$ stands for the depth of the convolutional layer, $W$ is the weight matrix, and $b$ is the bias term.
If the input to convolutional layer $d$ is of dimension $N \times N$ and the receptive field of units at a specific plane of convolutional layer $d$ is of dimension $m \times m$, then the constructed feature map will be a matrix of dimensions $(N - m + 1) \times (N - m + 1)$.
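The size relationship can be checked directly. For an N × N input and an m × m receptive field, a stride-1 convolution without padding produces (N − m + 1) values per side. The sketch below applies one kernel, adds a bias, and passes the result through a sigmoid; like most deep learning libraries, it does not flip the kernel (strictly speaking a cross-correlation). The function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conv_layer(x, kernel, bias):
    """Stride-1, no-padding convolution of a square input with a square kernel,
    followed by an additive bias and a sigmoid."""
    n, m = len(x), len(kernel)
    out_size = n - m + 1
    feature_map = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = sum(kernel[i][j] * x[r + i][c + j]
                        for i in range(m) for j in range(m))
            row.append(sigmoid(total + bias))
        feature_map.append(row)
    return feature_map


x = [[float(r * 5 + c) for c in range(5)] for r in range(5)]   # 5 x 5 input
kernel = [[0.0] * 3 for _ in range(3)]                          # 3 x 3 receptive field
fmap = conv_layer(x, kernel, bias=0.0)
print(len(fmap), "x", len(fmap[0]))   # 3 x 3, i.e. (5 - 3 + 1) per side
```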
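The conditional distributions over hidden $\mathbf{h}$ and visible $\mathbf{v}$ vectors can be derived by (5) and (6) as $$P(h_j = 1 \mid \mathbf{v}; \theta) = \sigma\Big(\sum_i W_{ij} v_i + a_j\Big), \qquad P(v_i = 1 \mid \mathbf{h}; \theta) = \sigma\Big(\sum_j W_{ij} h_j + b_i\Big).$$ Given a set of observations $\{\mathbf{v}_n\}_{n=1}^{N}$, the derivative of the log-likelihood with respect to the model parameters $W_{ij}$ can be derived by (6) as $$\frac{1}{N}\sum_{n=1}^{N} \frac{\partial \log P(\mathbf{v}_n; \theta)}{\partial W_{ij}} = \mathbb{E}_{P_{\text{data}}}[v_i h_j] - \mathbb{E}_{P_{\text{model}}}[v_i h_j],$$ where $\mathbb{E}_{P_{\text{data}}}[\cdot]$ denotes an expectation with respect to the data distribution $P_{\text{data}}(\mathbf{h}, \mathbf{v}; \theta) = P(\mathbf{h} \mid \mathbf{v}; \theta)\, P_{\text{data}}(\mathbf{v})$, with $P_{\text{data}}(\mathbf{v}) = \frac{1}{N}\sum_n \delta(\mathbf{v} - \mathbf{v}_n)$ representing the empirical distribution, and $\mathbb{E}_{P_{\text{model}}}[\cdot]$ is an expectation with respect to the distribution defined by the model, as in (6).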
A detailed explanation along with the description of a practical way to train RBMs has been given elsewhere, whereas other work discusses the main difficulties of training RBMs and their underlying reasons and proposes a new algorithm with an adaptive learning rate and an enhanced gradient, so as to address the aforementioned difficulties.
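To give a feel for what training an RBM involves in practice, here is a minimal sketch of a single update using one-step contrastive divergence (CD-1), a widely used approximation to the log-likelihood gradient. The passage does not name this algorithm, so its use here is an assumption, and it is not the adaptive-learning-rate method mentioned above. The bias naming (`a` for hidden units, `b` for visible units) and all sizes are likewise illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample(probs, rng):
    """Draw a binary vector with the given per-unit activation probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def hidden_probs(v, W, a):
    """P(h_j = 1 | v), with W[i][j] linking visible unit i to hidden unit j."""
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + a[j])
            for j in range(len(a))]


def visible_probs(h, W, b):
    """P(v_i = 1 | h)."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]


def cd1_step(v0, W, a, b, lr, rng):
    """One CD-1 update for a binary RBM; modifies W, a, b in place."""
    ph0 = hidden_probs(v0, W, a)        # positive phase, driven by the data
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, b), rng)   # one Gibbs step: reconstruction
    ph1 = hidden_probs(v1, W, a)        # negative phase, driven by the model
    for i in range(len(b)):
        for j in range(len(a)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for j in range(len(a)):
        a[j] += lr * (ph0[j] - ph1[j])
    for i in range(len(b)):
        b[i] += lr * (v0[i] - v1[i])
    return v1


rng = random.Random(0)
n_visible, n_hidden = 4, 3
W = [[0.0] * n_hidden for _ in range(n_visible)]
a = [0.0] * n_hidden   # hidden biases
b = [0.0] * n_visible  # visible biases
reconstruction = cd1_step([1, 0, 1, 1], W, a, b, lr=0.1, rng=rng)
print(reconstruction)
```

The update contrasts statistics gathered with the data clamped on the visible units against statistics gathered after a single reconstruction step, which stands in for the intractable expectation under the model.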
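They model the joint distribution between observed vector $\mathbf{x}$ and the $l$ hidden layers $\mathbf{h}^{k}$ as follows: $$P(\mathbf{x}, \mathbf{h}^{1}, \ldots, \mathbf{h}^{l}) = \left(\prod_{k=0}^{l-2} P(\mathbf{h}^{k} \mid \mathbf{h}^{k+1})\right) P(\mathbf{h}^{l-1}, \mathbf{h}^{l}),$$ where $\mathbf{x} = \mathbf{h}^{0}$, $P(\mathbf{h}^{k} \mid \mathbf{h}^{k+1})$ is a conditional distribution for the visible units at level $k$ conditioned on the hidden units of the RBM at level $k+1$, and $P(\mathbf{h}^{l-1}, \mathbf{h}^{l})$ is the visible-hidden joint distribution in the top-level RBM.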
This representation can be chosen as being the mean activation or samples of the first layer's hidden units given the input.
(3) Train the second layer as an RBM, taking the transformed data (samples or mean activation) as training examples (for the visible layer of that RBM).
(4) Iterate steps (2) and (3) for the desired number of layers, each time propagating upward either samples or mean values.
(5) Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g., a linear classifier).
However, a later variation of the DBN, the Convolutional Deep Belief Network (CDBN) [42, 43], uses the spatial information of neighboring pixels by introducing convolutional RBMs, thus producing a translation invariant generative model that successfully scales when it comes to high dimensional images.
During network training, a DBM jointly trains all layers of a specific unsupervised model, and instead of maximizing the likelihood directly, the DBM uses a stochastic maximum likelihood (SML)-based algorithm to maximize the lower bound on the likelihood.
Instead, a greedy layer-wise training strategy was proposed, which essentially consists of pretraining the layers of the DBM, similarly to DBN, namely, by stacking RBMs and training each layer to independently model the output of the previous layer, followed by a final joint fine-tuning.
Furthermore, in DBMs, by following the approximate gradient of a variational lower bound on the likelihood objective, one can jointly optimize the parameters of all layers, which is very beneficial especially in cases of learning models from heterogeneous data originating from different modalities.
These include accelerating inference by using separate models to initialize the values of the hidden units in all layers [47, 49], or other improvements at the pretraining stage [50, 51] or at the training stage [52, 53].
If there is one linear hidden layer and the mean squared error criterion is used to train the network, then the hidden units learn to project the input in the span of the first principal components of the data.
The aforementioned optimization process results in low reconstruction error on test examples from the same distribution as the training examples but generally high reconstruction error on samples arbitrarily chosen from the input space.
In simple terms, there are two main aspects in the function of a denoising autoencoder: first it tries to encode the input (namely, preserve the information about the input), and second it tries to undo the effect of a corruption process stochastically applied to the input of the autoencoder (see Figure 3).
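The two aspects can be sketched in a few lines: corrupt the input, reconstruct from the corrupted version, and measure the error against the clean input. The corruption used below is masking noise (randomly zeroing inputs), which is one common choice and an assumption here, since the passage does not specify the corruption process. The `reconstruct` argument is a hypothetical stand-in for a trained encoder plus decoder.

```python
import random


def corrupt(x, drop_prob, rng):
    """Masking noise: each input value is set to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]


def denoising_loss(x_clean, reconstruct, drop_prob, rng):
    """Mean squared error between the clean input and the reconstruction
    obtained from a corrupted copy of that input."""
    x_noisy = corrupt(x_clean, drop_prob, rng)
    recon = reconstruct(x_noisy)
    return sum((r - xi) ** 2 for r, xi in zip(recon, x_clean)) / len(x_clean)


def identity(values):
    """Placeholder model that just copies its input (no denoising at all)."""
    return list(values)


rng = random.Random(0)
x = [1.0, 2.0, 3.0]
print(corrupt(x, drop_prob=0.5, rng=rng))
print(denoising_loss(x, identity, drop_prob=1.0, rng=rng))
```

The key point is that the target of the loss is the clean input, not the corrupted one, so a model that merely copies what it sees is penalized wherever the corruption removed information.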
It should be mentioned that using autoencoders for denoising was introduced in earlier works, but the substantial contribution of the later work lies in the demonstration of the successful use of the method for unsupervised pretraining of a deep architecture and in linking the denoising autoencoder to a generative model.
Invariance to translation, rotation, and scale is one of the most important assets of CNNs, especially in computer vision problems, such as object detection, because it allows abstracting an object’s identity or category from the specifics of the visual input (e.g., relative positions/orientation of the camera and the object), thus enabling the network to effectively recognize a given object in cases where the actual pixel values on the image can significantly differ.
In this section, we survey works that have leveraged deep learning methods to address key tasks in computer vision, such as object detection, face recognition, action and activity recognition, and human pose estimation.
For example, one method employs selective search to derive object proposals, extracts CNN features for each proposal, and then feeds the features to an SVM classifier to decide whether the windows include the object or not.
The vast majority of works on object detection using deep learning apply a variation of CNNs, for example, [8, 67, 68] (in which a new def-pooling layer and new learning strategy are proposed), weakly supervised cascaded CNNs, and subcategory-aware CNNs.
Furthermore, CNNs constitute the core of OpenFace, an open-source face recognition tool, which is of comparable (albeit a little lower) accuracy and is suitable for mobile computing because of its smaller size and fast execution time.
In one work, deep learning was used for complex event detection and recognition in video sequences: first, saliency maps were used for detecting and localizing events, and then deep learning was applied to the pretrained features for identifying the most important frames that correspond to the underlying event.
They find that, due to the challenges of large intraclass variances, small interclass variances, and limited training samples per activity, an approach that directly uses deep features learned from ImageNet in an SVM classifier is preferable.
In another approach, the authors, instead of training the network using the whole image, use the local part patches and background patches to train a CNN, in order to learn conditional probabilities of the part presence and spatial relationships.
A further approach trains multiple smaller CNNs to perform independent binary body-part classification, followed by a higher-level weak spatial model to remove strong outliers and to enforce global pose consistency.
The three key categories of deep learning for computer vision that have been reviewed in this paper, namely, CNNs, the “Boltzmann family” including DBNs and DBMs, and SdAs, have been employed to achieve significant performance rates in a variety of visual understanding tasks, such as object detection, face recognition, action and activity recognition, human pose estimation, image retrieval, and semantic segmentation.
As a closing note, in spite of the promising (in some cases impressive) results that have been documented in the literature, significant challenges do remain, especially concerning the theoretical groundwork that would clearly explain how to select the optimal model type and structure for a given task, or why a specific architecture or algorithm is effective in a given task or not.
Solving real-world business problems with computer vision
To learn more about how organizations are using deep learning methods to solve real business problems, check out the Enterprise Adoption sessions at the Strata Data Conference in New York City, Sept.
This aspect highlights a key property of deep learning networks—the ability of data scientists to choose the right architecture for the input data type so the network can automatically learn features.
We’re seeing applications of computer vision across the spectrum of the enterprise: In insurance, we see companies such as Orbital Insights analyzing satellite imagery to count cars and oil tank levels automatically to predict such things as mall sales and oil production, respectively.
The automotive industry has embraced computer vision (and deep learning) aggressively in the past five years with applications such as scene analysis, automated lane detection, and automated road sign reading to set speed limits.
Beyond convolutional neural networks, the automotive industry has leveraged deep learning and long short-term memory networks to analyze sensor data to automatically detect other cars and objects around the car.
James Long shared with us this anecdote on how he sees integrated machine learning as a force multiplier, as opposed to job replacement: It’s small examples like this that show how latent integrated intelligence in vehicles is slowly making them “progressively automated”—as opposed to the idea that all cars will be self-driving tomorrow.
These challenges include: Most organizations do not collect enough quality data to produce the model their line of business wants in terms of accuracy (e.g., “Our model has an F1 of .80, but the line of business says the F1 has to be .95 to be financially viable to them”).
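For readers less familiar with the metric in that example, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the counts in the usage example are made up for illustration and are not from the article.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 from raw counts: the harmonic mean of precision and recall."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 correct detections, 20 false alarms, 20 misses.
print(f1_score(true_pos=80, false_pos=20, false_neg=20))
```

Because it is a harmonic mean, F1 is dragged down by whichever of precision or recall is weaker, which is part of why closing the gap from .80 to .95 usually takes substantially more or better data rather than minor tuning.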
Another tip for enterprises is to focus on leveraging good, tried-and-true convolutional architectures from the past few years, as opposed to trying to implement the “hot new ICML paper of the week.” Twitter is great for discovering new papers as they come out, but it can also encourage folks to jump from one hot idea to the next before they can actually leverage real production value from new networks.
Most of the time, what people mean when they say “unstructured data” is that “it doesn’t look like a CSV file or an RDBMS table.” Ingest systems can also involve real-time tagging of images as they are ingested, helping us to understand if we have certain images as soon as they are ingested or serving an image detection system.
Many folks tend to conflate the idea of software engineering being fairly (within reason) deterministic (e.g., “We built a house out of these materials”) and data science having a wider range of outcomes with the same labor (e.g., “We mined for gold as long as the other team, but only found half as much gold on our land”).
A best practice is to invest in the best possible infrastructure that builds, secures, and deploys our model in a way that IT can consume, then let the data science team focus on building as many models as possible to find the best one for the task at hand.
To learn in more detail how to implement convolutional neural networks into enterprise applications, see our post “Integrating convolutional neural networks into enterprise applications.” And, to hear more about applied machine learning in the context of streaming data infrastructure, attend our session “Real-time image classification: Using convolutional neural networks on real-time streaming data” at the Strata Data Conference in New York City, Sept.
Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning
In this post we will take you behind the scenes on how we built a state-of-the-art Optical Character Recognition (OCR) pipeline for our mobile document scanner.
Our mobile document scanner only outputs an image — any text in the image is just a set of pixels as far as the computer is concerned, and can’t be copy-pasted, searched for, or any of the other things you can do with text.
Once OCR is run, we can then enable the following features for our Dropbox Business users: When we built the first version of the mobile document scanner, we used a commercial off-the-shelf OCR library, in order to do product validation before diving too deep into creating our own machine learning-based OCR system.
Second, the commercial system was tuned for the traditional OCR world of images from flat bed scanners, whereas our operating scenario was much tougher, because mobile phone photos are far more unconstrained, with crinkled or curved documents, shadows and uneven lighting, blurriness and reflective highlights, etc.
Traditionally, OCR systems were heavily pipelined, with hand-built and highly-tuned modules taking advantage of all kinds of conditions they could assume to be true for images captured using a flatbed scanner.
For example, one module might find lines of text, then the next module would find words and segment letters, then another module might apply different techniques to each piece of a character to figure out what the character is, etc.
The last few years have seen the successful application of deep learning to numerous problems in computer vision, which has given us powerful new tools for tackling OCR without having to replicate the complex processing pipelines of the past, relying instead on large quantities of data to have the system automatically learn how to do many of the previously manually-designed steps.
We use a wide variety of safety precautions with such user-donated data, including never keeping donated data on local machines in permanent storage, maintaining extensive auditing, requiring strong authentication to access any of it, and more.
Most current machine learning techniques are strongly-supervised, meaning that they require explicit manual labeling of input data so that the algorithms can learn to make predictions themselves.
DropTurk contains a standard list of annotation task UI templates that we can rapidly assemble and customize for new datasets and labeling tasks, which enables us to annotate our datasets quite fast.
[Figure: DropTurk UI for adding ground truth data for word images]
Our DropTurk platform includes dashboards to get an overview of past jobs, watch the progress of current jobs, and access the results securely.
[Figure: DropTurk dashboard]
Using DropTurk, we collected both a word-level dataset, which has images of individual words and their annotated text, as well as a full document-level dataset, which has images of full documents (like receipts) and fully transcribed text.
The visual features that are output by the CNN are then fed as a sequence to a Bidirectional LSTM (Long Short-Term Memory) network, common in speech recognition systems, which makes sense of our word “pieces,” and the system finally arrives at a text prediction using a Connectionist Temporal Classification (CTC) layer.
However, in most computer vision problems it’s currently too difficult to generate realistic-enough images for training algorithms: the variety of imaging environments and transformations is too great to simulate effectively.
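To make the final CTC step more concrete, here is a minimal sketch of best-path (greedy) CTC decoding at inference time. It is not the pipeline's actual code: the function name, the choice of index 0 as the blank symbol, and the plain-list input format are all assumptions made for illustration. The idea is that the network emits one score vector per horizontal "piece" of the word; the decoder takes the best class per piece, merges consecutive repeats, and then drops blanks.

```python
from itertools import groupby

BLANK = 0  # assumption: class index 0 is reserved for the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet):
    """Best-path CTC decoding.

    frame_scores: one list of per-class scores for each time step,
                  where class 0 is the blank and class i maps to alphabet[i - 1].
    alphabet:     string of the characters the network can emit.
    """
    # Pick the highest-scoring class at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Merge runs of the same class, then remove blanks.
    collapsed = [idx for idx, _ in groupby(best_path)]
    return "".join(alphabet[idx - 1] for idx in collapsed if idx != BLANK)
```

Because repeats are merged before blanks are removed, a doubled letter such as the "ll" in "hello" can only be produced if the network emits a blank between the two occurrences.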
(One promising area of current research is Generative Adversarial Networks (GANs), which seem to be well-suited to generating realistic data.) Fortunately, our problem in this case is a perfect match for using synthetic data, since the types of images we need to generate are quite constrained and can thus be rendered automatically.
[Figure: Synthetically generated word images]
We started simply with all three, with words coming from a collection of Project Gutenberg books from the 19th century, about a thousand fonts we collected, and some simple distortions like rotations, underlines, and blurs.
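As a rough illustration of how such a generator can be organized, the sketch below samples a "recipe" for each synthetic word image: a word, a font, and a few distortion parameters. Everything here is hypothetical (the `WordSample` fields, the parameter ranges, and the 10% underline rate are invented for the example), and the actual rendering step, which would draw the text with an imaging library, is deliberately left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class WordSample:
    text: str            # also serves as the ground-truth label
    font: str            # font name or path (hypothetical)
    rotation_deg: float  # small random rotation
    underline: bool      # whether to draw an underline
    blur_radius: float   # blur strength


def sample_word_specs(words, fonts, n, seed=0):
    """Draw n random rendering recipes from a word list and a font list."""
    rng = random.Random(seed)  # seeded so a data set can be regenerated exactly
    return [
        WordSample(
            text=rng.choice(words),
            font=rng.choice(fonts),
            rotation_deg=rng.uniform(-5.0, 5.0),  # illustrative range
            underline=rng.random() < 0.1,         # illustrative rate
            blur_radius=rng.uniform(0.0, 1.5),    # illustrative range
        )
        for _ in range(n)
    ]
```

The appeal of this setup is that every generated image comes with its label for free, since the text was chosen before the image was rendered.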
It tracked everything needed for machine learning reproducibility, such as a unique git hash for the code that was used, pointers to S3 with generated data sets and results, evaluation results, graphs, a high-level description of the goal of that experiment, and more.
Precision refers to the fraction of words returned by the deep net that were actually correct, while recall refers to the fraction of evaluation data that is correctly predicted by the deep net.
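These two definitions can be written down in a few lines. The sketch below assumes, purely for illustration, that the model signals "no answer" for a word image by returning `None`; the function name and input format are likewise hypothetical.

```python
def precision_recall(predictions, ground_truth):
    """Word-level precision and recall.

    predictions:  predicted strings, or None where the model returned nothing.
    ground_truth: the correct strings, in the same order.
    """
    # Only words the model actually returned count towards precision.
    returned = [(p, t) for p, t in zip(predictions, ground_truth) if p is not None]
    correct = sum(1 for p, t in returned if p == t)
    precision = correct / len(returned) if returned else 0.0
    # Recall is measured against the whole evaluation set.
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

For example, if the model returns three of four words and two of those are right, precision is 2/3 while recall is 2/4.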
[Figure: Screenshot from early end-to-end experiments in our lab notebook]
At a certain point our synthetic data pipeline was resulting in a Single Word Accuracy (SWA) percentage in the high-80s on our OCR benchmark set, and we decided we were done with that portion.
However, most documents don’t just have a handful of words — they have hundreds or even thousands of them, i.e., a few orders of magnitude more objects than most neural network-based object detection systems were capable of finding at the time.
Another important consideration was that traditional computer vision approaches using feature detectors might be easier to debug, as neural networks are notoriously opaque and have internal representations that are hard to understand and interpret.
This sometimes requires the word detector to include more than one word in a single detection box, or to chop a single word in half if it is too long to fit the deep net’s input size.
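The post does not say how the chopping is done, so the following is only one plausible sketch: a detection box that is wider than some maximum is cut into the smallest number of near-equal pieces that each fit. The `(x, y, width, height)` box layout, the function name, and the `max_width` parameter are assumptions for the example.

```python
def split_wide_box(box, max_width):
    """Split an (x, y, w, h) box into side-by-side pieces no wider than max_width."""
    x, y, w, h = box
    if w <= 0 or max_width <= 0:
        raise ValueError("box width and max_width must be positive")
    n_pieces = -(-w // max_width)     # ceiling division
    base, extra = divmod(w, n_pieces)  # spread any remainder over the first pieces
    pieces = []
    left = x
    for i in range(n_pieces):
        width = base + (1 if i < extra else 0)
        pieces.append((left, y, width, h))
        left += width
    return pieces
```

Cutting into near-equal pieces avoids leaving a tiny sliver at the end of a long word, but it can still cut through the middle of a character, which is one reason the downstream recognizer has to tolerate partial words.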
Another bit of trickiness is dealing with images with white text on dark backgrounds, as opposed to dark text on white backgrounds, forcing our MSER detector to be able to handle both scenarios.
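A simple way to picture the two scenarios is to run a dark-on-light region detector twice: once on the image and once on its inverse, where white text on a dark background becomes dark text on a light one. The sketch below uses a nested list of grayscale values and a caller-supplied `detect_dark_regions` function; both are stand-ins invented for the example, not the detector used in the pipeline.

```python
def invert(image, max_value=255):
    """Flip a grayscale image (list of rows) so light pixels become dark."""
    return [[max_value - px for px in row] for row in image]


def detect_both_polarities(image, detect_dark_regions):
    """Run a dark-on-light detector on the image and on its inverse.

    detect_dark_regions is a hypothetical callable that takes an image and
    returns an iterable of regions (for example, bounding boxes).
    """
    regions = list(detect_dark_regions(image))
    # Second pass: regions that were light-on-dark in the original image.
    for region in detect_dark_regions(invert(image)):
        if region not in regions:
            regions.append(region)
    return regions
```

This naive version repeats all of the detector's work for the second pass, which is exactly the kind of duplicated effort a production implementation would want to avoid.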
We then used this confidence score to bucket predictions in three ways: We also had to deal with issues caused by the previously mentioned fixed receptive image size of the Word Deep Net: namely, that a single “word” window might actually contain multiple words or only part of a very long word.
Finally, now that we had a fully working end-to-end system, we generated more than ten million synthetic words and trained our neural net for a very large number of iterations to squeeze out as much accuracy as we could.
Our jail infrastructure allows us to efficiently set up expensive resources a single time at startup, such as loading our trained models, then have these resources be cloned into a jail to satisfy a single OCR request.
We ended up essentially rewriting OpenCV’s C++ MSER implementation in a more modular way to avoid duplicating slow work when doing two passes (to be able to handle both black on white text as well as white on black text).
We then used these donated images, being very careful about their privacy, to do a qualitative blackbox test of both OCR systems end-to-end, and were elated to find that we indeed performed the same or better than the older commercial OCR SDK, allowing us to ramp up our system to 100% of Dropbox Business users.
Next, we tested whether fine-tuning our trained deep net on these donated documents, rather than on our hand-chosen fine-tuning image suite, helped accuracy.
We built an orientation predictor using another deep net based on the Inception-ResNet-v2 architecture, changed the final layer to predict orientation, collected an orientation training and validation data set, and fine-tuned from an ImageNet-trained model biased towards our own needs.
Finally, we were surprised to find some tough issues with the PDF file format containing our scanned OCRed hidden layer in Apple’s native Preview application. Most PDF renderers respect spaces embedded in the text for copy and paste, but Apple’s Preview application performs its own heuristics to determine word boundaries based on text position.
In all, this entire round of researching, productionization, and refinement took about 8 months, at the end of which we had built and deployed a state-of-the-art OCR pipeline to millions of users using modern computer vision and deep neural network techniques.
- On Thursday, January 17, 2019
Machine Learning and Computer Vision for Biological Imaging Applications - MATLAB Video
In this video, learn how to use computer vision and machine learning techniques in MATLAB® to solve practical image analysis, automation, and classification ...
Learning From and Dealing With Real, Rare World Data In Computer Vision
Computer Vision has achieved tremendous progress in recent years. Primarily because of the availability of massive datasets (e.g., ImageNet or the Yahoo ...
Computer Vision: Crash Course Computer Science #35
Today we're going to talk about how computers see. We've long known that our digital cameras and smartphones can take incredibly detailed images, but taking ...
Lecture 6 | Training Neural Networks I
In Lecture 6 we discuss many practical issues for training modern neural networks. We discuss different activation functions, the importance of data ...
Lecture 3 | Loss Functions and Optimization
Lecture 3 continues our discussion of linear classifiers. We introduce the idea of a loss function to quantify our unhappiness with a model's predictions, and ...
9 Cool Deep Learning Applications | Two Minute Papers #35
Machine learning provides us an incredible set of tools. If you have a difficult problem at hand, you don't need to hand craft an algorithm for it. It finds out by itself ...
Lecture 11 | Detection and Segmentation
In Lecture 11 we move beyond image classification, and show how convolutional networks can be applied to other core computer vision tasks. We show how ...
Mask Region based Convolution Neural Networks - EXPLAINED!
In this video, we will take a look at a new type of neural network architecture called "Masked Region based Convolution Neural Networks", Masked R-CNN for ...
Tutorials Session A - Deep Learning for Computer Vision
This tutorial will look at how deep learning methods can be applied to problems in computer vision, most notably object recognition. It will start by motivating the ...
Riemannian manifolds, kernels and learning
I will talk about recent results from a number of people in the group on Riemannian manifolds in computer vision. In many Vision problems Riemannian ...