Machine-learning system tackles speech and object recognition, all at once


MIT computer scientists have developed a system that learns to identify objects within an image, based on a spoken description of the image.

“We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to,” says David Harwath, a researcher in MIT’s Computer Science and Artificial Intelligence Laboratory.

In the paper, the researchers demonstrate their model on an image of a young girl with blonde hair and blue eyes, wearing a blue dress, with a white lighthouse with a red roof in the background.

The model learned to associate which pixels in the image corresponded with the words “girl,” “blonde hair,” “blue eyes,” “blue dress,” “white lighthouse,” and “red roof.” When an audio caption was narrated, the model then highlighted each of those objects in the image as they were described.

If the model learns speech signals from language A that correspond to objects in the image, and learns the signals in language B that correspond to those same objects, it could assume those two signals — and matching words — are translations of one another.

With the correct image and caption pair, the model matches the first cell of the grid to the first segment of audio, then matches that same cell with the second segment of audio, and so on, all the way through each grid cell and across all time segments.
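The cell-by-segment matching described above can be sketched as a similarity tensor with one score per (grid cell, audio segment) pair. This is a minimal illustration, not the paper's implementation: the grid size, feature dimension, and variable names are all invented here, with random arrays standing in for learned image and audio features.

```python
import numpy as np

# Hypothetical encoder outputs: an 8x8 grid of 512-d image features
# and 64 time segments of 512-d audio features.
rng = np.random.default_rng(0)
image_grid = rng.standard_normal((8, 8, 512))  # (H, W, D)
audio_segs = rng.standard_normal((64, 512))    # (T, D)

# The "matchmap": one similarity score per (grid cell, audio segment),
# computed as a dot product between the two feature vectors.
matchmap = np.einsum('hwd,td->hwt', image_grid, audio_segs)

print(matchmap.shape)  # (8, 8, 64)
```

High values in `matchmap[h, w, t]` would indicate that grid cell (h, w) is being talked about during audio segment t, which is what lets the model highlight regions as the caption plays.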

“The biggest contribution of the paper,” Harwath says, “is demonstrating that these cross-modal [audio and visual] alignments can be inferred automatically by simply teaching the network which images and captions belong together and which pairs don’t.” The authors dub this automatically learned association between a spoken caption’s waveform and the image pixels a “matchmap.” After training on thousands of image-caption pairs, the network narrows those alignments down to the specific words representing specific objects in that matchmap.
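One common way to teach a network which image-caption pairs belong together, sketched here under assumptions (the pooling rule and margin value are illustrative choices, not necessarily the paper's), is to collapse each matchmap to a single pair score and train with a margin ranking loss, so that a true pair must outscore mismatched pairs by a fixed margin:

```python
import numpy as np

def pair_score(matchmap):
    # Collapse an (H, W, T) matchmap into one image-caption similarity:
    # take the best-matching grid cell for each audio segment, then
    # average over segments.
    return matchmap.max(axis=(0, 1)).mean()

def ranking_loss(pos, neg_image, neg_audio, margin=1.0):
    # Margin ranking loss: the true pair's score `pos` should beat the
    # score of a mismatched image and a mismatched caption by `margin`.
    return max(0.0, margin - pos + neg_image) + max(0.0, margin - pos + neg_audio)
```

Driving this loss to zero requires no alignment labels at all, which is the point of the quote above: the per-cell, per-segment alignments emerge as a side effect of learning which whole pairs match.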

“Predictions start dispersed everywhere but, as you go through training, they converge into an alignment that represents meaningful semantic groundings between spoken words and visual objects,” Harwath says.

“It is exciting to see that neural methods are now also able to associate image elements with audio segments, without requiring text as an intermediary,” says Florian Metze, an associate research professor at the Language Technologies Institute at Carnegie Mellon University.

Machine-learning system tackles speech and object recognition

MIT computer scientists have developed a system that learns to identify objects within an image based on a spoken description of the image. Given an image and an audio caption, the model highlights in real time the relevant regions of the image being described.

David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory, said, “We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to.”

The model learned to associate which pixels in the image corresponded with the words ‘girl’, ‘blonde hair’, ‘blue eyes’, ‘blue dress’, ‘white lighthouse’, and ‘red roof’.

Given a correct image and caption pair, the model matches the first cell of the grid to the first segment of audio, then matches that same cell with the second segment of audio, and so on. For each cell and audio segment, it produces a similarity score reflecting how closely the signal corresponds to the object.

Dr. Harwath said, “The challenge is that, during training, the model doesn’t have access to any true alignment information between the speech and the image. The biggest contribution of the paper is demonstrating that these cross-modal alignments can be inferred automatically by simply teaching the network which images and captions belong together and which pairs don’t.”

Harwath said, “It’s kind of like the Big Bang, where the matter was really dispersed, but then coalesced into planets and stars. Predictions start dispersed everywhere but, as you go through training, they converge into an alignment that represents meaningful semantic groundings between spoken words and visual objects.”

MIT boffins develop system that can recognise speech and objects at the same time

MIT computer scientists have developed a system that tackles speech and object recognition at the same time.

In their new paper, the researchers demonstrate their model on an image of a young girl with blonde hair and blue eyes, wearing a blue dress, with a white lighthouse with a red roof in the background.

The model learned to associate which pixels in the image corresponded with the words 'girl,' 'blonde hair,' 'blue eyes,' 'blue dress,' 'white lighthouse,' and 'red roof.'

If the model learns speech signals from language A that correspond to objects in the image and learns the signals in language B that correspond to those same objects, it could assume those two signals - and matching words - are translations of one another.

He and his fellow MIT researchers also hope that one day their combined speech-object recognition technique could save countless hours of manual labour and open new doors in speech and image recognition.

However, generating speech with computers — a process usually referred to as speech synthesis or text-to-speech (TTS) — is still largely based on so-called concatenative TTS, where a very large database of short speech fragments is recorded from a single speaker and then recombined to form complete utterances.

This has led to a great demand for parametric TTS, where all the information required to generate the data is stored in the parameters of the model, and the contents and characteristics of the speech can be controlled via the inputs to the model.
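As a toy sketch of the concatenative approach described above (the fragment names and lengths are invented, and random arrays stand in for recorded audio), an utterance is assembled by looking up prerecorded unit waveforms and joining them end to end:

```python
import numpy as np

# Toy "database" of prerecorded speech fragments: each unit name maps to
# a short waveform of 800 samples (random arrays stand in for real audio).
rng = np.random.default_rng(1)
unit_db = {name: rng.standard_normal(800) for name in ["he", "el", "lo"]}

def synthesize(units):
    # Concatenative TTS in miniature: look up each stored fragment and
    # join them end to end into one utterance.
    return np.concatenate([unit_db[u] for u in units])

utterance = synthesize(["he", "el", "lo"])
print(utterance.shape)  # (2400,)
```

A parametric system would instead generate those samples from model parameters conditioned on the input text, which is why it can vary the contents and characteristics of the speech rather than being limited to recombinations of one speaker's recordings.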

