AI News, Deep Learning - The Past, Present and Future of Artificial Intelligence
- On Thursday, June 7, 2018
- By Read More
Deep Learning - The Past, Present and Future of Artificial Intelligence
One by one, the abilities and techniques that humans once imagined were uniquely our own have begun to fall to the onslaught of ever more powerful machines.
This talk will introduce the core concept of deep learning, explore why Sundar Pichai (CEO Google) recently announced that “machine learning is a core transformative way by which Google is rethinking everything they are doing” and explain why “deep learning is probably one of the most exciting things that is happening in the computer industry“ (Jen-Hsun Huang – CEO NVIDIA).
The Dark Secret at the Heart of AI
Last year, a strange self-driving car was released onto the quiet roads of Monmouth County, New Jersey.
The experimental vehicle, developed by researchers at the chip maker Nvidia, didn’t look different from other autonomous cars, but it was unlike anything demonstrated by Google, Tesla, or General Motors, and it showed the rising power of artificial intelligence.
Information from the vehicle’s sensors goes straight into a huge network of artificial neurons that process the data and then deliver the commands required to operate the steering wheel, the brakes, and other systems.
The car’s underlying AI technology, known as deep learning, has proved very powerful at solving problems in recent years, and it has been widely deployed for tasks like image captioning, voice recognition, and language translation.
There is now hope that the same techniques will be able to diagnose deadly diseases, make million-dollar trading decisions, and do countless other things to transform whole industries.
But this won’t happen—or shouldn’t happen—unless we find ways of making techniques like deep learning more understandable to their creators and accountable to their users.
But banks, the military, employers, and others are now turning their attention to more complex machine-learning approaches that could make automated decision-making altogether inscrutable.
“Whether it’s an investment decision, a medical decision, or maybe a military decision, you don’t want to just rely on a ‘black box’ method.” There’s already an argument that being able to interrogate an AI system about how it reached its conclusions is a fundamental legal right.
The resulting program, which the researchers named Deep Patient, was trained using data from about 700,000 individuals, and when tested on new records, it proved incredibly good at predicting disease.
Without any expert instruction, Deep Patient had discovered patterns hidden in the hospital data that seemed to indicate when people were on the way to a wide range of ailments, including cancer of the liver.
If something like Deep Patient is actually going to help doctors, it will ideally give them the rationale for its prediction, to reassure them that it is accurate and to justify, say, a change in the drugs someone is being prescribed.
Many thought it made the most sense to build machines that reasoned according to rules and logic, making their inner workings transparent to anyone who cared to examine some code.
But it was not until the start of this decade, after several clever tweaks and refinements, that very large—or “deep”—neural networks demonstrated dramatic improvements in automated perception.
It has given computers extraordinary powers, like the ability to recognize spoken words almost as well as a person could, a skill too complex to code into the machine by hand.
The same approach can be applied, roughly speaking, to other inputs that lead a machine to teach itself: the sounds that make up words in speech, the letters and words that create sentences in text, or the steering-wheel movements required for driving.
The resulting images, produced by a project known as Deep Dream, showed grotesque, alien-like animals emerging from clouds and plants, and hallucinatory pagodas blooming across forests and mountain ranges.
In 2015, Clune’s group showed how certain images could fool such a network into perceiving things that aren’t there, because the images exploit the low-level patterns the system searches for.
The images that turn up are abstract (imagine an impressionistic take on a flamingo or a school bus), highlighting the mysterious nature of the machine’s perceptual abilities.
It is the interplay of calculations inside a deep neural network that is crucial to higher-level pattern recognition and complex decision-making, but those calculations are a quagmire of mathematical functions and variables.
“But once it becomes very large, and it has thousands of units per layer and maybe hundreds of layers, then it becomes quite un-understandable.” In the office next to Jaakkola is Regina Barzilay, an MIT professor who is determined to apply machine learning to medicine.
The diagnosis was shocking in itself, but Barzilay was also dismayed that cutting-edge statistical and machine-learning methods were not being used to help with oncological research or to guide patient treatment.
She envisions using more of the raw data that she says is currently underutilized: “imaging data, pathology data, all this information.” How well can we get along with machines that are
After she finished cancer treatment last year, Barzilay and her students began working with doctors at Massachusetts General Hospital to develop a system capable of mining pathology reports to identify patients with specific clinical characteristics that researchers might want to study.
Barzilay and her students are also developing a deep-learning algorithm capable of finding early signs of breast cancer in mammogram images, and they aim to give this system some ability to explain its reasoning, too.
The U.S. military is pouring billions into projects that will use machine learning to pilot vehicles and aircraft, identify targets, and help analysts sift through huge piles of intelligence data.
A silver-haired veteran of the agency who previously oversaw the DARPA project that eventually led to the creation of Siri, Gunning says automation is creeping into countless areas of the military.
But soldiers probably won’t feel comfortable in a robotic tank that doesn’t explain itself to them, and analysts will be reluctant to act on information without some reasoning.
A chapter of Dennett’s latest book, From Bacteria to Bach and Back, an encyclopedic treatise on consciousness, suggests that a natural part of the evolution of intelligence itself is the creation of systems capable of performing tasks their creators do not know how to do.
But since there may be no perfect answer, we should be as cautious of AI explanations as we are of each other’s—no matter how clever a machine seems.“If it can’t do better than us at explaining what it’s doing,” he says, “then don’t trustit.”
Learning can be supervised, semi-supervised or unsupervised. Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design and board game programs, where they have produced results comparable to and in some cases superior to human experts. Deep learning models are vaguely inspired by information processing and communication patterns in biological nervous systems yet have various differences from the structural and functional properties of biological brains, which make them incompatible with neuroscience evidences. Deep learning is a class of machine learning algorithms that:(pp199–200) Most modern deep learning models are based on an artificial neural network, although they can also include propositional formulas or latent variables organized layer-wise in deep generative models such as the nodes in Deep Belief Networks and Deep Boltzmann Machines.
Examples of deep structures that can be trained in an unsupervised manner are neural history compressors and deep belief networks. Deep neural networks are generally interpreted in terms of the universal approximation theorem or probabilistic inference. The universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions. In 1989, the first proof was published by George Cybenko for sigmoid activation functions and was generalised to feed-forward multi-layer architectures in 1991 by Kurt Hornik. The probabilistic interpretation derives from the field of machine learning.
More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function. The probabilistic interpretation led to the introduction of dropout as regularizer in neural networks. The probabilistic interpretation was introduced by researchers including Hopfield, Widrow and Narendra and popularized in surveys such as the one by Bishop. The term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986, and to Artificial Neural Networks by Igor Aizenberg and colleagues in 2000, in the context of Boolean threshold neurons. The first general, working learning algorithm for supervised, deep, feedforward, multilayer perceptrons was published by Alexey Ivakhnenko and Lapa in 1965. A 1971 paper described a deep network with 8 layers trained by the group method of data handling algorithm. Other deep learning working architectures, specifically those built for computer vision, began with the Neocognitron introduced by Kunihiko Fukushima in 1980. In 1989, Yann LeCun et al.
Each layer in the feature extraction module extracted features with growing complexity regarding the previous layer. In 1995, Brendan Frey demonstrated that it was possible to train (over two days) a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton. Many factors contribute to the slow speed, including the vanishing gradient problem analyzed in 1991 by Sepp Hochreiter. Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of ANNs' computational cost and a lack of understanding of how the brain wires its biological networks.
In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks. Later it was combined with connectionist temporal classification (CTC) in stacks of LSTM RNNs. In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search. In 2006, publications by Geoff Hinton, Ruslan Salakhutdinov, Osindero and Teh  showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation. The papers referred to learning for deep belief nets.
It was believed that pre-training DNNs using generative models of deep belief nets (DBN) would overcome the main difficulties of neural nets. However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation when using DNNs with large, context-dependent output layers produced error rates dramatically lower than then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) and also than more-advanced generative model-based systems. The nature of the recognition errors produced by the two types of systems was characteristically different, offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems. Analysis around 2009-2010, contrasted the GMM (and other generative speech models) vs.
While there, Ng determined that GPUs could increase the speed of deep-learning systems by about 100 times. In particular, GPUs are well-suited for the matrix/vector math involved in machine learning. GPUs speed up training algorithms by orders of magnitude, reducing running times from weeks to days. Specialized hardware and algorithm optimizations can be used for efficient processing. In 2012, a team led by Dahl won the 'Merck Molecular Activity Challenge' using multi-task deep neural networks to predict the biomolecular target of one drug. In 2014, Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products and drugs and won the 'Tox21 Data Challenge' of NIH, FDA and NCATS. Significant additional impacts in image or object recognition were felt from 2011 to 2012.
DNNs have proven themselves capable, for example, of a) identifying the style period of a given painting, b) 'capturing' the style of a given painting and applying it in a visually pleasing manner to an arbitrary photograph, and c) generating striking imagery based on random visual input fields. Neural networks have been used for implementing language models since the early 2000s. LSTM helped to improve machine translation and language modeling. Other key techniques in this field are negative sampling and word embedding.
A compositional vector grammar can be thought of as probabilistic context free grammar (PCFG) implemented by an RNN. Recursive auto-encoders built atop word embeddings can assess sentence similarity and detect paraphrasing. Deep neural architectures provide the best results for constituency parsing, sentiment analysis, information retrieval, spoken language understanding, machine translation, contextual entity linking, writing style recognition, Text classifcation and others. Google Translate (GT) uses a large end-to-end long short-term memory network. GNMT uses an example-based machine translation method in which the system 'learns from millions of examples.' It translates 'whole sentences at a time, rather than pieces.
These failures are caused by insufficient efficacy (on-target effect), undesired interactions (off-target effects), or unanticipated toxic effects. Research has explored use of deep learning to predict biomolecular target, off-target and toxic effects of environmental chemicals in nutrients, household products and drugs. AtomNet is a deep learning system for structure-based rational drug design. AtomNet was used to predict novel candidate biomolecules for disease targets such as the Ebola virus and multiple sclerosis. Deep reinforcement learning has been used to approximate the value of possible direct marketing actions, defined in terms of RFM variables.
An autoencoder ANN was used in bioinformatics, to predict gene ontology annotations and gene-function relationships. In medical informatics, deep learning was used to predict sleep quality based on data from wearables and predictions of health complications from electronic health record data. Deep learning has also showed efficacy in healthcare. Finding the appropriate mobile audience for mobile advertising is always challenging, since many data points must be considered and assimilated before a target segment can be created and used in ad serving by any ad server. Deep learning has been used to interpret large, many-dimensioned advertising datasets.
On the one hand, several variants of the backpropagation algorithm have been proposed in order to increase its processing realism. Other researchers have argued that unsupervised forms of deep learning, such as those based on hierarchical generative models and deep belief networks, may be closer to biological reality. In this respect, generative neural network models have been related to neurobiological evidence about sampling-based processing in the cerebral cortex. Although a systematic comparison between the human brain organization and the neuronal encoding in deep networks has not yet been established, several analogies have been reported.
systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning.' As an alternative to this emphasis on the limits of deep learning, one author speculated that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between 'old master' and amateur figure drawings, and hypothesized that such a sensitivity might represent the rudiments of a non-trivial machine empathy. This same author proposed that this would be in line with anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity. In further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks attempting to discern within essentially random data the images on which they were trained demonstrate a visual appeal: the original research notice received well over 1,000 comments, and was the subject of what was for a time the most frequently accessed article on The Guardian's web site.
Some deep learning architectures display problematic behaviors, such as confidently classifying unrecognizable images as belonging to a familiar category of ordinary images and misclassifying minuscule perturbations of correctly classified images. Goertzel hypothesized that these behaviors are due to limitations in their internal representations and that these limitations would inhibit integration into heterogeneous multi-component AGI architectures. These issues may possibly be addressed by deep learning architectures that internally form states homologous to image-grammar decompositions of observed entities and events. Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules and is a basic goal of both human language acquisition and AI. As deep learning moves from the lab into the world, research and experience shows that artificial neural networks are vulnerable to hacks and deception.
ANNs have been trained to defeat ANN-based anti-malware software by repeatedly attacking a defense with malware that was continually altered by a genetic algorithm until it tricked the anti-malware while retaining its ability to damage the target. Another group demonstrated that certain sounds could make the Google Now voice command system open a particular web address that would download malware. In “data poisoning”, false data is continually smuggled into a machine learning system’s training set to prevent it from achieving mastery.
Google Just Open Sourced TensorFlow, Its Artificial Intelligence Engine
Tech pundit Tim O'Reilly had just tried the new Google Photos app, and he was amazed by the depth of its artificial intelligence.
O'Reilly was standing a few feet from Google CEO and co-founder Larry Page this past May, at a small cocktail reception for the press at the annual Google I/O conference—the centerpiece of the company's year.
But its accuracy is enormously impressive—so impressive that O'Reilly couldn't understand why Google didn't sell access to its AI engine via the Internet, cloud-computing style, letting others drive their apps with the same machine learning.
"What we're hoping is that the community adopts this as a good way of expressing machine learning algorithms of lots of different types, and also contributes to building and improving [TensorFlow] in lots of different and interesting ways,"
And it's not sharing access to the remarkably advanced hardware infrastructure that drives this engine (that would certainly come with a price tag).
Google became the Internet's most dominant force in large part because of the uniquely powerful software and hardware it built inside its computer data centers—software and hardware that could help run all its online services, that could juggle traffic and data from an unprecedented number of people across the globe.
Typically, Google trains these neural nets using a vast array of machines equipped with GPU chips—computer processors that were originally built to render graphics for games and other highly visual applications, but have also proven quite adept at deep learning.
But after they've been trained—when it's time to put them into action—these neural nets run in different ways.
They often run on traditional computer processors inside the data center, and in some cases, they can run on mobile phones.
It can run entirely on a phone—without connecting to a data center across the 'net—letting you translate foreign text into your native language even when you don't have a good wireless signal.
It's a set of software libraries—a bunch of code—that you can slip into any application so that it too can learn tasks like image recognition, speech recognition, and language translation.
In open sourcing the tool, Google will also provide some sample neural networking models and algorithms, including models for recognizing photographs, identifying handwritten numbers, and analyzing text.
The rub is that Google is not yet open sourcing a version of TensorFlow that lets you train models across a vast array of machines.
But at the execution stage, the open source incarnation of TensorFlow will run on phones as well as desktops and laptops, and Google indicates that the company may eventually open source a version that runs across hundreds of machines.
Why this apparent change in Google philosophy—this decision to open source TensorFlow after spending so many years keeping important code to itself?
Deep learning originated with academics who openly shared their ideas, and many of them now work at Google—including University of Toronto professor Geoff Hinton, the godfather of deep learning.
The open source movement—where Internet companies share so many of their tools in order to accelerate the rate of development—has picked up considerable speed over the past decade.
Google has not handed the open source project to an independent third party, as many others have done in open sourcing major software.
But it has shared the code under what's called an Apache 2 license, meaning anyone is free to use the code as they please.
Like Torch and Theano, he says, it's good for quickly spinning up research projects, and like Caffe, it's good for pushing those research projects into the real world.
"A fair bit of the advancement in deep learning in the past three or four years has been helped by these kinds of libraries, which help researchers focus on their models.
Must Read Books for Beginners on Machine Learning and Artificial Intelligence
The power to run tasks in automated manner, the power to make our lives comfrotable, the power to improve things continuously by studying decisions at large sacle .
We’re in the early days, but you’ll see us in a systematic way think about how we can apply machine learning to all these areas.’
When Elon Musk, the busiest man of planet right now, was asked about his secret of success, he replied, ‘I used to read books.
The motive of this article is not to promote any particular book, but you make you aware of a world which exists beyond video tutorials, blogs and podcasts.
Programming Collective Intelligence, PCI as it is popularly known, is one of the best books to start learning machine learning. If there is one book to choose on machine learning – it is this one.
The book was written long before data science and machine learning acquired the cult status they have today – but the topics and chapters are entirely relevant even today!
Some of the topics covered in the book are collaborative filtering techniques, search engine features, Bayesian filtering and Support vector machines. If you don’t have a copy of this book – order it as soon as you finish reading this article! The book uses Python to deliver machine learning in a fascinating manner.
It has interesting case studies which will help you to understand the importance of using machine learning algorithms.
This book provides a perfect introduction to machine learning. This book prepares you to understand complex areas of machine learning.
This book serves as a excellent reference for students keen to understand the use of statistical techniques in machine learning and pattern recognition.
More than just providing an overview of artificial intelligence, this book thoroughly covers subjects from search algorithms, reducing problems to search problems, working with logic, planning, and more advanced topics in AI such as reasoning with partial observability, machine learning and language processing.
It teaches basic artificial intelligence algorithms such as dimensionality, distance metrics, clustering, error calculation, hill climbing, Nelder Mead, and linear regression.
It delves deep into the practical aspects of A.I and teaches its readers the method to build and debug robust practical programs.
This books covers topics such as Neural networks, genetic programming, computer vision, heuristic search, knowledge representation and reasoning, Bayes networks and explains them with great ease.
In these books, the authors have not only explained the ML concepts precisely, but also mentioned their perspective and experiences using those concepts, which you would have missed otherwise!
Speech recognition is the inter-disciplinary sub-field of computational linguistics that develops methodologies and technologies that enables the recognition and translation of spoken language into text by computers.
Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice or it can be used to authenticate or verify the identity of a speaker as part of a security process. From the technology perspective, speech recognition has a long history with several waves of major innovations.
It was thought that speech understanding would be key to making progress in speech recognition, although that later proved to not be true. BBN, IBM, Carnegie Mellon and Stanford Research Institute all participated in the program. The government funding revived speech recognition research that had been largely abandoned in the United States after John Pierce's letter.
Under Fred Jelinek's lead, IBM created a voice activated typewriter called Tangora, which could handle a 20,000 word vocabulary by the mid 1980s. Jelinek's statistical approach put less emphasis on emulating the way the human brain processes and understands speech in favor of using statistical modeling techniques like HMMs.
(Jelinek's group independently discovered the application of HMMs to speech.) This was controversial with linguists since HMMs are too simplistic to account for many common features of human languages. However, the HMM proved to be a highly useful way for modeling speech and replaced dynamic time warping to become the dominant speech recognition algorithm in the 1980s. IBM had a few competitors including Dragon Systems founded by James and Janet M.
During the same time, also CSELT was using HMM (the diphonies were studied since 1980) to recognize language like Italian. At the same time, CSELT led a series of European projects (Esprit I, II), and summarized the state-of-the-art in a book, later (2013) reprinted. Much of the progress in the field is owed to the rapidly increasing capabilities of computers.
At the end of the DARPA program in 1976, the best computer available to researchers was the PDP-10 with 4 MB ram. Using these computers it could take up to 100 minutes to decode just 30 seconds of speech. A few decades later, researchers had access to tens of thousands of times as much computing power.
Further reductions in word error rate came as researchers shifted acoustic models to be discriminative instead of using maximum likelihood estimation. In the mid-Eighties new speech recognition microprocessors were released: for example RIPAC, an independent-speaker recognition (for continuous speech) chip tailored for telephone services, was presented in the Netherlands in 1986. It was designed by CSELT/Elsag and manufactured by SGS.
Two of the earliest products were Dragon Dictate, a consumer product released in 1990 and originally priced at $9,000, and a recognizer from Kurzweil Applied Intelligence released in 1987. AT&T deployed the Voice Recognition Call Processing service in 1992 to route telephone calls without the use of a human operator. The technology was developed by Lawrence Rabiner and others at Bell Labs.
In the early 2000s, speech recognition was still dominated by traditional approaches such as Hidden Markov Models combined with feedforward artificial neural networks. Today, however, many aspects of speech recognition have been taken over by a deep learning method called Long short-term memory (LSTM), a recurrent neural network published by Sepp Hochreiter &
Around 2007, LSTM trained by Connectionist Temporal Classification (CTC) started to outperform traditional speech recognition in certain applications. In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which is now available through Google Voice to all smartphone users. The use of deep feedforward (non-recurrent) networks for acoustic modeling was introduced during later part of 2009 by Geoffrey Hinton and his students at University of Toronto and by Li Deng and colleagues at Microsoft Research, initially in the collaborative work between Microsoft and University of Toronto which was subsequently expanded to include IBM and Google (hence 'The shared views of four research groups' subtitle in their 2012 review paper). A Microsoft research executive called this innovation 'the most dramatic change in accuracy since 1979.' In contrast to the steady incremental improvements of the past few decades, the application of deep learning decreased word error rate by 30%. This innovation was quickly adopted across the field.
recurrent nets) of artificial neural networks had been explored for many years during 1980s, 1990s and a few years into the 2000s. But these methods never won over the non-uniform internal-handcrafting Gaussian mixture model/Hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively. A number of key difficulties had been methodologically analyzed in the 1990s, including gradient diminishing and weak temporal correlation structure in the neural predictive models. All these difficulties were in addition to the lack of big training data and big computing power in these early days.
reviewed part of this recent history about how their collaboration with each other and then with colleagues across four groups (University of Toronto, Microsoft, Google, and IBM) ignited a renaissance of applications of deep feedforward neural networks to speech recognition. Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms.
Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach).
Re scoring is usually done by trying to minimize the Bayes risk (or an approximation thereof): Instead of taking the source sentence with maximal probability, we try to take the sentence that minimizes the expectancy of a given loss function with regards to all possible transcriptions (i.e., we take the sentence that minimizes the average distance to other possible sentences weighted by their estimated probability).
Efficient algorithms have been devised to re score lattices represented as weighted finite state transducers with edit distances represented themselves as a finite state transducer verifying certain assumptions. Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach.
success of DNNs in large vocabulary speech recognition occurred in 2010 by industrial researchers, in collaboration with academic researchers, where large output layers of the DNN based on context dependent HMM states constructed by decision trees were adopted.  See comprehensive reviews of this development and of the state of the art as of October 2014 in the recent Springer book from Microsoft Research. See also the related background of automatic speech recognition and the impact of various machine learning paradigms including notably deep learning in recent overview articles. One fundamental principle of deep learning is to do away with hand-crafted feature engineering and to use raw features.
For example, a n-gram language model is required for all HMM-based systems, and a typical n-gram language model often takes several gigabytes in memory making them impractical to deploy on mobile devices. Consequently, modern commercial ASR systems from Google and Apple (as of 2017) are deployed on the cloud and require a network connection as opposed to the device locally.
Later, Baidu expanded on the work with extremely large datasets and demonstrated some commercial success in Chinese Mandarin and English. In 2016, University of Oxford presented LipNet, the first end-to-end sentence-level lip reading model, using spatiotemporal convolutions coupled with an RNN-CTC architecture, surpassing human-level performance in a restricted grammar dataset. An alternative approach to CTC-based models are attention-based models.
Latent Sequence Decompositions (LSD) was proposed by Carnegie Mellon University, MIT and Google Brain to directly emit sub-word units which are more natural than English characters; University of Oxford and Google DeepMind extended LAS to 'Watch, Listen, Attend and Spell' (WLAS) to handle lip reading surpassing human-level performance. Typically a manual control input, for example by means of a finger control on the steering-wheel, enables the speech recognition system and this is signalled to the driver by an audio prompt.
Following the audio prompt, the system has a 'listening window' during which it may accept a speech input for recognition. Simple voice commands may be used to initiate phone calls, select radio stations or play music from a compatible smartphone, MP3 player or music-loaded flash drive.
Back-end or deferred speech recognition is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine and the recognized draft document is routed along with the original voice file to the editor, where the draft is edited and report finalized.
The use of speech recognition is more naturally suited to the generation of narrative text, as part of a radiology/pathology interpretation, progress note or discharge summary: the ergonomic gains of using speech recognition to enter structured discrete data (e.g., numeric values or codes from a list or a controlled vocabulary) are relatively minimal for people who are sighted and who can operate a keyboard and mouse.
By contrast, many highly customized systems for radiology or pathology dictation implement voice 'macros', where the use of certain phrases – e.g., 'normal report', will automatically fill in a large number of default values and/or generate boilerplate, which will vary with the type of the exam – e.g., a chest X-ray vs.
In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight display.
The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to his aircraft with two simple voice commands or to any of his wingmen with only five commands. Speaker-independent systems are also being developed and are under test for the F35 Lightning II (JSF) and the Alenia Aermacchi M-346 Master lead-in fighter trainer.
It can teach proper pronunciation, in addition to helping a person develop fluency with their speaking skills. Students who are blind (see Blindness and education) or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding with their voice, instead of having to look at the screen and keyboard. Students who are physically disabled or suffer from Repetitive strain injury/other injuries to the upper extremities can be relieved from having to worry about handwriting, typing, or working with scribe on school assignments by using speech-to-text programs.
For individuals that are Deaf or Hard of Hearing, speech recognition software is used to automatically generate a closed-captioning of conversations such as discussions in conference rooms, classroom lectures, and/or religious services. Speech recognition is also very useful for people who have difficulty using their hands, ranging from mild repetitive stress injuries to involve disabilities that preclude using conventional computer input devices.
Individuals with learning disabilities who have problems with thought-to-paper communication (essentially they think of an idea but it is processed incorrectly causing it to end up differently on paper) can possibly benefit from the software but the technology is not bug proof. Also the whole idea of speak to text can be hard for intellectually disabled person's due to the fact that it is rare that anyone tries to learn the technology to teach the person with the disability. This type of technology can help those with dyslexia but other disabilities are still in question.
Giving them more work to fix, causing them to have to take more time with fixing the wrong word. The performance of speech recognition systems is usually evaluated in terms of accuracy and speed. Accuracy is usually rated with word error rate (WER), whereas speed is measured with the real time factor.
When a person reads it's usually in a context that has been previously prepared, but when a person uses spontaneous speech, it is difficult to recognize the speech because of the disfluencies (like 'uh' and 'um', false starts, incomplete sentences, stuttering, coughing, and laughter) and limited vocabulary.
For example, activation words like 'Alexa' spoken in an audio or video broadcast can cause devices in homes and offices to start listening for input inappropriately, or possibly take an unwanted action. Voice-controlled devices are also accessible to visitors to the building, or even those outside the building if they can be heard inside.
One transmits ultrasound and attempt to send commands without nearby people noticing. The other adds small, inaudible distortions to other speech or music that are specially crafted to confuse the specific speech recognition system into recognizing music as speech, or to make what sounds like one command to a human sound like a different command to the system. Popular speech recognition conferences held each year or two include SpeechTEK and SpeechTEK Europe, ICASSP, Interspeech/Eurospeech, and the IEEE ASRU.
A most recent comprehensive textbook, 'Fundamentals of Speaker Recognition' is an in depth source for up to date details on the theory and practice. A good insight into the techniques used in the best modern systems can be gained by paying attention to government sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).
Deng published near the end of 2014, with highly mathematically-oriented technical detail on how deep learning methods are derived and implemented in modern speech recognition systems based on DNNs and related deep learning methods. A related book, published earlier in 2014, 'Deep Learning: Methods and Applications' by L.
Yu provides a less technical but more methodology-focused overview of DNN-based speech recognition during 2009–2014, placed within the more general context of deep learning applications including not only speech recognition but also image recognition, natural language processing, information retrieval, multimodal processing, and multitask learning. In terms of freely available resources, Carnegie Mellon University's Sphinx toolkit is one place to start to both learn about speech recognition and to start experimenting.
- On Thursday, September 19, 2019
Use machine learning & artificial intelligence in your apps (Innovation track - Playtime EMEA 2017)
Hear examples and practical advice from developers who are already successfully using Machine Learning and Artificial Intelligence to improve the ...
Chris Voss: "Never Split the Difference" | Talks at Google
Everything we've previously been taught about negotiation is wrong: people are not rational; there is no such thing as 'fair'; compromise is the worst thing you ...
Towards ambient intelligence in AI-assisted healthcare spaces - Dr Fei-Fei Li, Stanford University
Abstract: Artificial intelligence has begun to impact healthcare in areas including electronic health records, medical images, and genomics. But one aspect of ...
2018 Isaac Asimov Memorial Debate: Artificial Intelligence
Isaac Asimov's famous Three Laws of Robotics might be seen as early safeguards for our reliance on artificial intelligence, but as Alexa guides our homes and ...
Keynote (Google I/O '18)
Learn about the latest product and platform innovations at Google in a Keynote led by Sundar Pichai. This video is also subtitled in Chinese, Indonesian, Italian, ...
Tech Talk: AI @ Google: Vicarious AI, Agent.AI Duncan Davidson
6:05 Presentation 1: Vicarious AI Vicarious AI is an artificial intelligence core technology R&D startup based in Silicon Valley, USA. Vicarious approaches AI with ...
Margaret Mitchell, Senior Research Scientist, Google at MLconf Seattle 2017
Margaret Mitchell is the Senior Research Scientist in Google's Research & Machine Intelligence group, working on advancing artificial intelligence towards ...
How a poker-playing AI is learning to negotiate better than any human
In 2012, a comic made its way around the internet listing games on a scale of how close they were to being dominated by artificial intelligence. Checkers and ...
Lecture 3 | GloVe: Global Vectors for Word Representation
Lecture 3 introduces the GloVe model for training word vectors. Then it extends our discussion of word vectors (interchangeably called word embeddings) by ...
Brian Greene in To Unweave a Rainbow: Science and the Essence of Being Human
As long ago as the early 19th century, the poet Keats bemoaned the washing away of the world's beauty and mystery in the wake of natural philosophy's ...