AI News, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

TensorBoard was designed to help users visualize the structure of their graphs, as well as understand the behavior of their models. The paper also gives a brief overview of what EEG, an internal performance-tracing tool, does under the hood; see pages 14 and 15 of the November 2015 white paper for a specific example of an EEG visualization along with descriptions of the current UI. A further section lists areas of improvement and extension for TensorFlow identified for consideration by the TensorFlow team, grouped into extensions and improvements. The related-work discussion covers systems designed primarily for neural networks, systems that support symbolic differentiation, systems with a core written in C++, systems that represent complex workflows as dataflow graphs, systems that support data-dependent control flow, systems optimized for accessing the same data repeatedly, and systems that execute dataflow graphs across heterogeneous devices, including GPUs; for each feature, the implementations most similar to TensorFlow are listed after the feature. It also notes the similarities TensorFlow shares with DistBelief and Project Adam, and the differences between TensorFlow and DistBelief/Project Adam.

Neural Machine Translation (seq2seq) Tutorial

Authors: Thang Luong, Eugene Brevdo, Rui Zhao (Google Research Blogpost, Github). This version of the tutorial requires TensorFlow Nightly. To use the stable TensorFlow versions, please consider other branches such as tf-1.4.

Sequence-to-sequence (seq2seq) models have enjoyed great success in a variety of tasks such as machine translation, speech recognition, and text summarization.

This tutorial gives readers a full understanding of seq2seq models and shows how to build a competitive seq2seq model from scratch.

We achieve this goal by using the recent decoder/attention wrapper API and data iterator, and by providing tips and tricks for building the best possible NMT models. We believe that it is important to provide benchmarks that people can easily replicate.

As a result, we have provided full experimental results and pretrained models on the following publicly available datasets: the IWSLT English-Vietnamese corpus (small-scale) and the WMT German-English corpus (large-scale).

We first build up some basic knowledge about seq2seq models for NMT, explaining how to build and train a vanilla NMT model. We then go into the details of building a competitive NMT model with an attention mechanism, and discuss tips and tricks to build the best possible NMT models (both in speed and translation quality), such as bidirectional RNNs, beam search, as well as scaling up to multiple GPUs using GNMT attention.

Back in the old days, traditional phrase-based translation systems performed their task by breaking up source sentences into multiple chunks and then translating them phrase by phrase. This led to disfluency in the translation outputs and was not quite like how we humans translate: we read the entire source sentence, understand its meaning, and then produce a translation. Neural Machine Translation (NMT) mimics that.

An encoder converts a source sentence into a 'meaning' vector which is passed through a decoder to produce a translation.

Specifically, an NMT system first reads the source sentence using an encoder to build a 'thought' vector, a sequence of numbers that represents the sentence meaning; a decoder then processes this vector to emit a translation. In this manner, NMT addresses the local translation problem of the traditional phrase-based approach: it can capture long-range dependencies in languages and produce much more fluent translations.

RNN models, however, differ in terms of: (a) directionality – unidirectional or bidirectional; (b) depth – single- or multi-layer; and (c) type – often either a vanilla RNN, a Long Short-Term Memory (LSTM), or a gated recurrent unit (GRU).

In this tutorial, we consider as examples a deep multi-layer RNN which is unidirectional and uses LSTM as a recurrent unit.

The encoder simply consumes the input source words without making any prediction; the decoder, on the other hand, processes the target sentence while predicting the next words.

Let's first dive into the heart of building an NMT model with concrete code snippets, through which we will explain the architecture in more detail; we defer data preparation and the full code to later.

At the bottom layer, the encoder and decoder RNNs receive as input the following: first, the source sentence, then a boundary marker which indicates the transition from the encoding to the decoding mode, and the target sentence. For training, we feed the system tensors that are in time-major format and contain word indices (encoder_inputs, decoder_inputs, decoder_outputs). Here, for efficiency, we train with multiple sentences (batch_size) at once.

The embedding weights, one set per language, are usually learned during training. Note that one can choose to initialize embedding weights with pretrained word representations such as word2vec or GloVe vectors; in general, given a large amount of training data, we can learn these embeddings from scratch.
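As a rough illustration of the lookup step (a sketch, not the tutorial's exact code), the snippet below builds a source-side embedding matrix and looks up the word indices; src_vocab_size and embedding_size are assumed example hyperparameters, and encoder_inputs is the time-major index tensor described above.

    import tensorflow as tf  # TF 1.x API assumed throughout these sketches

    src_vocab_size = 10000   # example value; use the real source vocabulary size
    embedding_size = 128     # example value
    # encoder_inputs: time-major [max_time, batch_size] tensor of word indices.
    encoder_inputs = tf.placeholder(tf.int32, shape=[None, None])

    # Embedding matrix for the source language and the corresponding lookup.
    embedding_encoder = tf.get_variable(
        "embedding_encoder", [src_vocab_size, embedding_size])
    encoder_emb_inp = tf.nn.embedding_lookup(embedding_encoder, encoder_inputs)
    # encoder_emb_inp has shape [max_time, batch_size, embedding_size].

The decoder-side embedding_decoder and decoder_emb_inp can be built in the same way.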

Once retrieved, the word embeddings are then fed as input into the main network, which consists of two multi-layer RNNs – an encoder for the source language and a decoder for the target language. These two RNNs can, in principle, share the same weights; in practice, however, we often use two different sets of RNN parameters (such models do a better job when fitting large training datasets).

The encoder RNN uses zero vectors as its starting states and is built as follows (see the sketch below). Note that sentences have different lengths; to avoid wasting computation, we tell dynamic_rnn the exact source sentence lengths through source_sequence_length.
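A minimal sketch of that encoder, assuming num_units, encoder_emb_inp, and source_sequence_length are defined as in the surrounding text (TF 1.x API):

    # Build a single LSTM cell for the encoder.
    encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

    # Run it over the embedded source sentence. Zero vectors are used as the
    # initial state (the default), and source_sequence_length tells dynamic_rnn
    # where each sentence actually ends so padded steps are not wasted.
    encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
        encoder_cell, encoder_emb_inp,
        sequence_length=source_sequence_length,
        time_major=True, dtype=tf.float32)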

In a later section, we will describe how to build multi-layer LSTMs, add dropout, and use attention.

Given the logits above, we are now ready to compute our training loss (see the sketch below). Here, target_weights is a zero-one matrix of the same size as decoder_outputs, masking padding positions outside the target sequence lengths with value 0.
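A sketch of the loss computation, assuming logits are the decoder's projected outputs and decoder_outputs, target_weights, and batch_size are as described above:

    # Cross-entropy at each target position, masked by target_weights so that
    # padding positions do not contribute, then divided by the batch size.
    crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=decoder_outputs, logits=logits)
    train_loss = tf.reduce_sum(crossent * target_weights) / batch_size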

It's worth pointing out that we divide the loss by batch_size, so our hyperparameters are 'invariant' to batch_size; some people divide the loss by (batch_size * num_time_steps) instead, which plays down the errors made on short sentences. If both approaches use SGD with a learning rate of 1.0, the latter approach effectively uses a much smaller learning rate.

We have now defined the forward pass of our NMT model; computing the backpropagation pass is just a matter of a few lines of code (sketched after the next paragraph). One of the important steps in training RNNs is gradient clipping; here, we clip by the global norm, with the maximum value, max_gradient_norm, often set to something like 5 or 1.

In our experiments, we use standard SGD with a decreasing learning rate schedule, which yields better performance.
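Putting the backward pass together, here is a sketch of gradient computation, clipping by global norm, and the SGD update; max_gradient_norm and learning_rate are hyperparameters, and learning_rate may itself be a decaying tensor:

    # Compute gradients of the training loss w.r.t. all trainable parameters.
    params = tf.trainable_variables()
    gradients = tf.gradients(train_loss, params)

    # Clip by global norm; max_gradient_norm is often set to a value like 5 or 1.
    clipped_gradients, _ = tf.clip_by_global_norm(gradients, max_gradient_norm)

    # Standard SGD; the learning rate can be decayed as training progresses.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    update_step = optimizer.apply_gradients(zip(clipped_gradients, params))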

To download a small-scale IWSLT English-Vietnamese dataset, run nmt/scripts/download_iwslt15.sh /tmp/nmt_data, then start training with the nmt.nmt module. The training command trains a 2-layer LSTM seq2seq model with 128-dim hidden units and embeddings.

We can start TensorBoard to view the summary of the model during training. Training in the reverse direction, from English to Vietnamese, can be done simply by changing --src=en --tgt=vi.

While you're training your NMT models (and once you have trained models), you can obtain translations given previously unseen source sentences.

Greedy decoding – an example of how a trained NMT model produces a translation for the source sentence 'Je suis étudiant' using greedy search.

The difference from training is that instead of using the correct target words as an input, inference uses words predicted by the model.

Since we do not know the target sequence lengths in advance, we use maximum_iterations to limit the translation lengths; one heuristic is to decode up to two times the source sentence length.
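A sketch of greedy decoding with the contrib.seq2seq API, assuming embedding_decoder, decoder_cell, encoder_state, projection_layer, batch_size, and the special token ids tgt_sos_id / tgt_eos_id are defined as in the training code:

    # The helper feeds back the argmax of the previous step instead of the gold word.
    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
        embedding_decoder,
        start_tokens=tf.fill([batch_size], tgt_sos_id),
        end_token=tgt_eos_id)

    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, encoder_state,
        output_layer=projection_layer)

    # Bound the translation length, e.g. at twice the longest source sentence.
    maximum_iterations = tf.reduce_max(source_sequence_length) * 2
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
        decoder, maximum_iterations=maximum_iterations)
    translations = outputs.sample_id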

Having trained a model, we can now create an inference file and translate some sentences.

Remember that in the vanilla seq2seq model, we pass the last source state from the encoder to the decoder when starting the decoding process. This works well for short and medium-length sentences; for long sentences, however, the single fixed-size hidden state becomes an information bottleneck, which the attention mechanism addresses by letting the decoder peek at all of the source hidden states.

It consists of the following stages: the current target hidden state is compared with all source states to derive attention weights; a context vector is then computed as the weighted average of the source states; and the context vector is combined with the current target hidden state to yield the final attention vector. Here, the function score is used to compare the target hidden state \( h_t \) with each of the source hidden states \( \overline{h}_s \), and the result is normalized to produce the attention weights (a distribution over source positions). The attention vector \( a_t \) is used to derive the softmax logit and loss.
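Written out (here \( W \) and \( W_c \) denote learned weight matrices and \( [c_t; h_t] \) is concatenation), the computation described above is:

\[ \alpha_{ts} = \frac{\exp\big(\mathrm{score}(h_t, \overline{h}_s)\big)}{\sum_{s'=1}^{S} \exp\big(\mathrm{score}(h_t, \overline{h}_{s'})\big)} \quad \text{(attention weights)} \]

\[ c_t = \sum_{s} \alpha_{ts} \, \overline{h}_s \quad \text{(context vector)} \]

\[ a_t = f(c_t, h_t) = \tanh\big(W_c [c_t; h_t]\big) \quad \text{(attention vector)} \]

\[ \mathrm{score}(h_t, \overline{h}_s) = h_t^{\top} W \overline{h}_s \quad \text{(Luong's multiplicative form)} \]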

There are many attention variants; they depend on the form of the scoring function and on whether the previous state \( h_{t-1} \) is used instead of \( h_t \) in the scoring function, as originally suggested in (Bahdanau et al., 2015).

Empirically, only certain choices matter: the basic form of attention, i.e., direct connections between target and source, needs to be present.

The attention mechanism can also be viewed as a form of memory access: we use the current target hidden state as a 'query' to decide which parts of the source to read from. In this instance of the attention mechanism, we happen to use the set of source hidden states (or their transformed versions, e.g., \( W_1\overline{h}_s \) in Bahdanau's scoring style) as 'keys'.

Thanks to the attention wrapper, extending our vanilla seq2seq code with attention is trivial. This part refers to the file attention_model.py. First, we need to define an attention mechanism, e.g., from (Luong et al., 2015), as sketched below.
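A sketch, assuming encoder_outputs is the time-major tensor built earlier and num_units, decoder_cell, and source_sequence_length already exist:

    # attention_states: [batch_size, max_time, num_units] (batch-major memory).
    attention_states = tf.transpose(encoder_outputs, [1, 0, 2])

    # Luong-style (multiplicative) attention over the source hidden states.
    attention_mechanism = tf.contrib.seq2seq.LuongAttention(
        num_units, attention_states,
        memory_sequence_length=source_sequence_length)

    # Wrap the decoder cell so every decoding step attends over the source.
    decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
        decoder_cell, attention_mechanism,
        attention_layer_size=num_units)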

Similar to before, we need to create a new directory for the attention model, so we don't reuse the previously trained model.

Run the training command again, this time with attention enabled, to start the training. After training, we can use the same inference command with the new out_dir for inference.

When building an NMT system, we recommend building three separate graphs: a training graph, an evaluation graph, and an inference graph. Building separate graphs has several benefits; the primary source of complexity then becomes how to share Variables across the three graphs in a single-machine setting. This is solved by using a separate session for each graph: the training session periodically saves checkpoints, and the eval/inference sessions restore parameters from those checkpoints.

Before: three models in a single graph, sharing a single Session. After: three models in three graphs, with three Sessions sharing the same Variables. Notice how the latter approach is 'ready' to be converted to a distributed version.
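A minimal toy sketch of this pattern, shown with just a train graph and an eval graph and a single dummy variable standing in for the full NMT model; variables are shared between sessions by saving and restoring checkpoints:

    import tensorflow as tf

    def build_model():
        # Stand-in for the real model: one trainable variable and a fake "train" op.
        w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
        train_op = tf.assign_add(w, 1.0)
        saver = tf.train.Saver()
        return w, train_op, saver

    train_graph, eval_graph = tf.Graph(), tf.Graph()
    with train_graph.as_default():
        train_w, train_op, train_saver = build_model()
        init = tf.global_variables_initializer()
    with eval_graph.as_default():
        eval_w, _, eval_saver = build_model()

    train_sess = tf.Session(graph=train_graph)
    eval_sess = tf.Session(graph=eval_graph)
    train_sess.run(init)

    for step in range(3):
        train_sess.run(train_op)                                    # "training" step
        path = train_saver.save(train_sess, "/tmp/toy_model_ckpt",  # save checkpoint
                                global_step=step)
        eval_saver.restore(eval_sess, path)                         # share variables
        print("eval graph sees w =", eval_sess.run(eval_w))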

There are two classic approaches to feeding data into the TensorFlow training and eval pipelines: feeding data at each session.run call (and thereby performing our own batching, bucketing, and manipulating of data), or using an input pipeline such as queues or tf.data. The first approach is easier for users who aren't familiar with TensorFlow or need to do exotic input modification (i.e., their own minibatch queueing) that can only be done in Python; the pipeline-based approaches are more efficient, and the Dataset API used below combines much of the flexibility of the former with the efficiency of the latter.

All datasets can be treated similarly via input processing.

To convert each sentence into a vector of word strings, for example, we use the dataset map transformation. We can then turn each sentence vector into a tuple containing both the vector and its dynamic length and, given a vocabulary lookup table object table, a further map converts the first tuple element from a vector of strings to a vector of integers. If two files contain line-by-line translations of each other and each one is read into its own dataset, then a new dataset containing the tuples of the zipped lines can be created via Dataset.zip. Batching of variable-length sentences is straightforward: a padded_batch transformation batches batch_size elements and pads the source and target vectors to the length of the longest source and target vector in each batch.

Values emitted from this dataset will be nested tuples whose tensors have a leftmost dimension of size batch_size; the structure will be ((padded source sentences, source lengths), (padded target sentences, target lengths)). Finally, bucketing that batches similarly-sized source sentences together is also possible; see utils/iterator_utils.py for details.

Reading data from a Dataset requires three lines of code: create the iterator, get its values, and initialize it.
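Putting the pieces above together, here is a rough, consolidated sketch of the input pipeline (TF 1.x). The file names, vocabulary files, batch size, and end-of-sentence ids are placeholders for illustration:

    import tensorflow as tf

    batch_size = 128
    src_eos_id = 2   # assumed id of the end-of-sentence token in the source vocab
    tgt_eos_id = 2   # assumed id of the end-of-sentence token in the target vocab

    # One vocabulary lookup table per language (placeholder file names).
    src_table = tf.contrib.lookup.index_table_from_file("vocab.vi", default_value=0)
    tgt_table = tf.contrib.lookup.index_table_from_file("vocab.en", default_value=0)

    src_dataset = tf.data.TextLineDataset("train.vi")
    tgt_dataset = tf.data.TextLineDataset("train.en")

    # Split each line into word strings.
    src_dataset = src_dataset.map(lambda s: tf.string_split([s]).values)
    tgt_dataset = tgt_dataset.map(lambda s: tf.string_split([s]).values)

    # Look up word ids and pair each sentence with its dynamic length.
    src_dataset = src_dataset.map(
        lambda words: (tf.cast(src_table.lookup(words), tf.int32), tf.size(words)))
    tgt_dataset = tgt_dataset.map(
        lambda words: (tf.cast(tgt_table.lookup(words), tf.int32), tf.size(words)))

    # Zip line-by-line translations together.
    src_tgt_dataset = tf.data.Dataset.zip((src_dataset, tgt_dataset))

    # Pad each batch to the longest source/target sentence in that batch.
    batched_dataset = src_tgt_dataset.padded_batch(
        batch_size,
        padded_shapes=((tf.TensorShape([None]), tf.TensorShape([])),
                       (tf.TensorShape([None]), tf.TensorShape([]))),
        padding_values=((src_eos_id, 0),
                        (tgt_eos_id, 0)))

    # Reading data: create the iterator, get its values, initialize it.
    iterator = batched_dataset.make_initializable_iterator()
    (source, source_lengths), (target, target_lengths) = iterator.get_next()
    # Before the first use: session.run([tf.tables_initializer(), iterator.initializer])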

Bidirectionality on the encoder side generally gives better performance (with some degradation in speed as more layers are used). Here, we give a simplified example of how to build an encoder with a single bidirectional layer (sketched below). The variables encoder_outputs and encoder_state can then be used in the same way as in the unidirectional case.
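A sketch of a single bidirectional encoder layer, assuming num_units, encoder_emb_inp, and source_sequence_length are as defined earlier:

    # Separate forward and backward LSTM cells.
    forward_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
    backward_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

    # Run both directions over the embedded source sentence and concatenate
    # the forward and backward outputs along the feature dimension.
    bi_outputs, encoder_state = tf.nn.bidirectional_dynamic_rnn(
        forward_cell, backward_cell, encoder_emb_inp,
        sequence_length=source_sequence_length,
        time_major=True, dtype=tf.float32)
    encoder_outputs = tf.concat(bi_outputs, -1)
    # Note: encoder_state is a (forward, backward) tuple and may need reshaping
    # before being fed to the decoder, especially with multiple layers.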

While greedy decoding can give us quite reasonable translation quality, a beam search decoder can further boost performance. The idea of beam search is to better explore the search space of all possible translations by keeping around a small set of top candidates as we translate. The size of this set is called the beam width; a minimal beam width of, say, 10 is generally sufficient.
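A sketch of beam-search decoding, assuming decoder_cell, embedding_decoder, encoder_state, projection_layer, batch_size, tgt_sos_id, tgt_eos_id, beam_width, and maximum_iterations are defined as in the earlier snippets:

    # Replicate the encoder state beam_width times so each beam has its own copy.
    decoder_initial_state = tf.contrib.seq2seq.tile_batch(
        encoder_state, multiplier=beam_width)

    # Beam-search decoder: keeps the beam_width best partial translations per step.
    inference_decoder = tf.contrib.seq2seq.BeamSearchDecoder(
        cell=decoder_cell,
        embedding=embedding_decoder,
        start_tokens=tf.fill([batch_size], tgt_sos_id),
        end_token=tgt_eos_id,
        initial_state=decoder_initial_state,
        beam_width=beam_width,
        output_layer=projection_layer,
        length_penalty_weight=0.0)

    # Dynamic decoding, bounded by maximum_iterations as before.
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
        inference_decoder, maximum_iterations=maximum_iterations)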

You may notice that the speed improvement of the attention-based NMT model is very small.

For this benchmark we use a bidirectional encoder (i.e., 1 bidirectional layer for the encoder); the embedding dimension equals the number of LSTM units.

We measure the translation quality in terms of BLEU scores (Papineni et al., 2002).

Here, step-time means the time taken to run one mini-batch (of size 128).

For the WMT benchmark we again use a bidirectional encoder (i.e., 2 bidirectional layers for the encoder); the embedding dimension again equals the number of LSTM units.

These results show that our code builds strong baseline systems for NMT. (Note that WMT systems generally utilize a huge amount of monolingual data, which we currently do not.) Training speed: 2.1s step-time, 3.4K wps on an Nvidia K40m.

To see the speed-ups with GNMT attention, we benchmark on K40m only. These results show that without GNMT attention, the gains from using multiple GPUs are minimal; with GNMT attention, the speed-ups from multiple GPUs are substantially larger.

The above results show our models are very competitive among models of similar architectures. [Note that OpenNMT uses smaller models and the current best result (as of this writing) is 28.4, obtained by the Transformer network (Vaswani et al., 2017), which has a significantly different architecture.] We have also provided pretrained models and full experimental results for these benchmarks.

There's a wide variety of tools for building seq2seq models, so we pick one per language; examples include Stanford NMT [Matlab] and OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py) [PyTorch].

We would like to thank Denny Britz, Anna Goldie, Derek Murray, and Cinjon Resnick for their work bringing new features to TensorFlow and the seq2seq library.

Serving Models in Production with TensorFlow Serving (TensorFlow Dev Summit 2017)

Serving is the process of applying a trained model in your application. In this talk, Noah Fiedel describes TensorFlow Serving, a flexible, high-performance ML serving system designed for production...

Lecture 7: Introduction to TensorFlow

Lecture 7 covers TensorFlow. TensorFlow is an open source software library for numerical computation using data flow graphs. It was originally developed by researchers and engineers working...

Mobile and Embedded TensorFlow (TensorFlow Dev Summit 2017)

Did you know that TensorFlow models can be deployed in iOS and Android apps, and even run on Raspberry Pi? In this talk Pete Warden will go through everything you need to know to make this...

A.I. Experiments: Visualizing High-Dimensional Space

This experiment helps visualize what's happening in machine learning. It allows coders to see and explore their high-dimensional data...

Distributed TensorFlow (TensorFlow Dev Summit 2017)

TensorFlow gives you the flexibility to scale up to hundreds of GPUs, train models with a huge number of parameters, and customize every last detail of the training process. In this talk, Derek...

Sigmoid

Watch on Udacity: check out the full Advanced Operating Systems course for free.

Generate Music in TensorFlow

In this video, I go over some of the state of the art advances in music generation coming out of DeepMind. Then we build our own music generation script in Python using TensorFlow and a type...

TensorFlow and Deep Learning without a PhD, Part 1 (Google Cloud Next '17)

With TensorFlow, deep machine learning transitions from an area of research to mainstream software engineering. In this video, Martin Gorner demonstrates how to construct and train a neural...

TensorFlow Tutorial #04 Save & Restore

How to save and restore a Neural Network in TensorFlow. Also shows how to do Early Stopping using the validation set. NOTE: This is much easier using the Keras API in Tutorial #03-C! ...

Effective TensorFlow for Non-Experts (Google I/O '17)

TensorFlow is Google's machine learning framework. In this talk, you will learn how to use TensorFlow effectively. TensorFlow offers high level interfaces like Keras and Estimators, which...