AI News: Distributed Neural Networks with GPUs in the AWS Cloud

Distributed Neural Networks with GPUs in the AWS Cloud

by Alex Chen, Justin Basilico, and Xavier Amatriain

As we have described previously on this blog, at Netflix we are constantly innovating by looking for better ways to find the best movies and TV shows for our members.

This involves designing and implementing architectures that can execute these techniques using a reasonable amount of resources in a reasonable amount of time.

The first successful instance of large-scale Deep Learning made use of 16,000 CPU cores in 1,000 machines in order to train an Artificial Neural Network in a matter of days.

Given our well-known approach and leadership in cloud computing, we set out to implement a large-scale Neural Network training system that leveraged both the advantages of GPUs and the AWS cloud.

We also wanted to avoid needing special machines in a dedicated data center and instead leverage the full, on-demand computing power we can obtain from AWS.

In architecting our approach for leveraging computing power in the cloud, we sought to strike a balance that would make it fast and easy to train Neural Networks by looking at the entire training process.

In our solution, we take the approach of using GPU-based parallelism for training and using distributed computation for handling hyperparameter tuning and different configurations.

However, as explained above, training a single instance actually implies training and testing several models, each corresponding to a different combination of hyperparameters.

Given that you are likely to have thousands of cores available in a single GPU instance, it is very convenient if you can squeeze the most out of that GPU and avoid getting into costly across-machine communication scenarios.

Note that one of the reasons we did not need to address level 3 distribution is because our model has millions of parameters (compared to the billions in the original paper by Ng).

We approached this by first getting a proof-of-concept to work on our own development machines and then addressing the issue of how to scale and use the cloud as a second stage.

While we tried to uncover the root cause, we worked around the issue by reimplementing the npps functions as customized CUDA kernels, e.g. replacing the nppsMulC_32f_I function with a hand-written equivalent.
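The original replacement kernel is not reproduced in this excerpt. As a rough illustration of the idea (an in-place multiply-by-constant on float32 data implemented as a custom kernel instead of the npps call), here is a minimal sketch using CuPy's RawKernel; the kernel body, names, and launch configuration are illustrative assumptions, not the original Netflix code.

import cupy as cp

# Hand-written stand-in for the in-place "multiply by constant" that
# nppsMulC_32f_I performs; everything here is illustrative.
mulc_32f_inplace = cp.RawKernel(r'''
extern "C" __global__
void mulc_32f_inplace(float* data, const float scale, const int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        data[i] *= scale;
    }
}
''', 'mulc_32f_inplace')

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
threads = 256
blocks = (n + threads - 1) // threads
mulc_32f_inplace((blocks,), (threads,), (x, cp.float32(0.5), cp.int32(n)))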

Replacing all npps functions in this way in the Neural Network code reduced the total training time on the cg1 instance from over 20 hours to just 47 minutes when training on 4 million samples.

NVreg_CheckPCIConfigSpace is a parameter of the nvidia-current kernel module that can be set when the module is loaded (e.g. as a modprobe option). We tested the effect of changing this parameter using a benchmark that calls MulC repeatedly (128x1000 times).

In the results (runtime in seconds) on our cg1.4xlarge instances, disabling accesses to PCI space had a spectacular effect on the original npps functions, decreasing their runtime by 95%.

First, we could look at optimizing our code by applying a kernel fusion trick that combines several computation steps into one kernel to reduce memory accesses.
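As a rough illustration of kernel fusion (the operation and names below are hypothetical, not the actual Netflix code), compare an unfused sequence of elementwise steps with a single fused CuPy ElementwiseKernel:

import cupy as cp

# Unfused: three separate kernels, each making a full pass over GPU memory.
def scale_shift_tanh_unfused(x, a, b):
    y = x * a          # kernel launch 1
    y = y + b          # kernel launch 2
    return cp.tanh(y)  # kernel launch 3

# Fused: a single elementwise kernel, one pass over memory.
scale_shift_tanh_fused = cp.ElementwiseKernel(
    'float32 x, float32 a, float32 b', 'float32 y',
    'y = tanhf(a * x + b)', 'scale_shift_tanh_fused')

x = cp.random.rand(1 << 20, dtype=cp.float32)
a, b = cp.float32(2.0), cp.float32(0.5)
assert cp.allclose(scale_shift_tanh_unfused(x, a, b), scale_shift_tanh_fused(x, a, b))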

While our initial work was done using cg1.4xlarge EC2 instances, we were interested in moving to the new EC2 GPU g2.2xlarge instance type, which has a GRID K520 GPU (GK104 chip) with 1536 cores.

Thus, switching to G2 with the right configuration allowed us to run our experiments faster or, alternatively, to run larger experiments in the same amount of time.

If you are not familiar with this concept, here is a simple explanation: most machine learning algorithms have parameters to tune, which are often called hyperparameters to distinguish them from the model parameters that are produced as a result of the learning algorithm.

However, when faced with a complex model where training each configuration is time consuming and there are many hyperparameters to tune, it can be prohibitively costly to perform such exhaustive grid searches.
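To make the combinatorial cost concrete, here is a small hypothetical example; the hyperparameter names and values are illustrative, not the ones Netflix tuned.

from itertools import product

# A modest, hypothetical hyperparameter grid.
grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'hidden_units': [256, 512, 1024],
    'regularization': [1e-5, 1e-4, 1e-3],
    'dropout': [0.0, 0.3, 0.5],
}

# Exhaustive grid search trains (and evaluates) one model per combination.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 3**4 = 81 full trainings, before any repetition or cross-validation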

Gaussian Processes are a very effective way to perform regression, and while they can have trouble scaling to large problems, they work well when there is a limited amount of data, as is the case when performing hyperparameter optimization.
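As a minimal sketch of how a Gaussian Process can guide the search (using scikit-learn rather than whatever library the Netflix team used; the observed values below are made up):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# A handful of (hyperparameter, validation metric) observations; the numbers are made up.
observed_log_lr = np.log10([[1e-3], [1e-2], [1e-1], [5e-1]])
observed_metric = np.array([0.71, 0.78, 0.75, 0.60])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(observed_log_lr, observed_metric)

# Predict mean and uncertainty over candidate learning rates and pick a promising one.
candidates = np.linspace(-3, 0, 50).reshape(-1, 1)
mean, std = gp.predict(candidates, return_std=True)
best = candidates[np.argmax(mean + std)]   # crude optimistic (UCB-like) criterion
print("next learning rate to try:", 10 ** best[0])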

We’ve squeezed high performance from our GPU but we only have 1–2 GPU cards per machine, so we would like to make use of the distributed computing power of the AWS cloud to perform the hyperparameter tuning for all configurations, such as different models per international region.

Scalable multi-node deep learning training using GPUs in the AWS Cloud

A key barrier to the wider adoption of deep neural networks on industrial-size datasets is the time and resources required to train them.

AlexNet, which won the 2012 ImageNet Large Scale Visual Recognition Competition (ILSVRC) and kicked off the current boom in deep neural networks, took nearly a week to train across the 1.2-million-image, 1000-category dataset.

While GPU performance has increased significantly since 2012, reducing training times from weeks to hours, Machine Learning (ML) practitioners seek opportunities to further lower model training times. At the same time, to drive higher prediction accuracy, models are getting larger and more complex, thus increasing the demand for compute resources.

We train the model using a standard training schedule of 90 epochs to a top-1 validation accuracy greater than 75.5% in about 50 minutes, using just 8 P3.16xlarge instances (64 V100 GPUs).

With Amazon EC2 P3 instances, customers also have the flexibility to train a variety of model types (CNNs, RNNs, and GANs), and we expect the high performance computing architecture that we outline here to scale efficiently across different frameworks and model types.

This translates to a wall-clock training time of 47 minutes with MXNet and 50 minutes with TensorFlow to churn through 90 epochs and hit top-1 validation accuracy of 75.75% and 75.54%, respectively, on the ImageNet dataset. The time indicated here for both frameworks includes the time for training and checkpointing at each epoch.

Multi-node training throughput: We trained using mixed precision on 8 P3.16xlarge instances (64 V100 GPUs) with a batch size of 256 per GPU (an aggregate batch size of ~16k) and observed near-linear scaling, hitting about 41k images/second with TensorFlow and 44k images/second with MXNet. For TensorFlow, we used the distributed training framework Horovod instead of the native parameter server approach because of its better scaling efficiency [6].
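The benchmark itself used TensorFlow 1.9 with Horovod 0.13; as a rough sketch of the Horovod data-parallel pattern (shown here with the present-day Keras binding, so the details differ from the benchmark code):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None)

# Scale the base learning rate by the number of workers (0.1 per GPU, as described below),
# and wrap the optimizer so gradients are averaged across workers via allreduce.
opt = tf.keras.optimizers.SGD(learning_rate=0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)
# model.fit(...) would also include hvd.callbacks.BroadcastGlobalVariablesCallback(0)
# so that every worker starts from identical weights.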

For MXNet we used the native parameter server approach, and the scaling efficiency of MXNet was computed using a kvstore of type ‘device’ on a single node and ‘dist_device_sync’ on multiple nodes.
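For reference, these kvstore types are selected by name in MXNet; a minimal sketch follows (the distributed store only works inside a job launched with MXNet's distributed launcher, which sets the DMLC_* environment variables):

import os
import mxnet as mx

# Single-node, multi-GPU training: aggregate gradients on the GPUs themselves.
kv = mx.kvstore.create('device')

# In a multi-node job launched with MXNet's distributed launcher, switch to the
# synchronous, GPU-side distributed store instead.
if 'DMLC_ROLE' in os.environ:
    kv = mx.kvstore.create('dist_device_sync')

# The kvstore is then handed to the training loop, e.g. Module.fit(..., kvstore=kv).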

The compute cluster loads data from the global namespace, augments it (using crops, flips, or blurs), and then performs the forward pass, back propagation, gradient synchronization, and weight updates.

MXNet uses the parameter server approach, where separate processes act as parameter servers to aggregate gradients from each worker node and perform weight updates.

As we scaled from one GPU to 64 GPUs (eight P3.16xlarge instances), we linearly scaled the learning rate by the number of GPUs used (0.1 for one GPU to 6.4 for 64 GPUs) while keeping the number of images per GPU constant at 256 (mini-batch size of 256 for one GPU to 16,384 for 64 GPUs).

To combat optimization instability with the large learning rate, we used a warmup scheme [3] where the learning rate is gradually, linearly scaled up from 0.001 to 6.4 over 10 epochs.
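As a minimal sketch of that schedule (the warmup numbers come from the text above; the function name is hypothetical and the flat post-warmup value is a simplification of the full schedule):

def learning_rate_at(epoch, base_lr=0.1, num_gpus=64,
                     warmup_start=0.001, warmup_epochs=10):
    """Linear-scaling rule with gradual warmup, as described above."""
    peak_lr = base_lr * num_gpus                      # 0.1 * 64 = 6.4
    if epoch < warmup_epochs:
        # Ramp linearly from warmup_start up to the peak over the warmup epochs.
        frac = epoch / warmup_epochs
        return warmup_start + (peak_lr - warmup_start) * frac
    return peak_lr

for e in (0, 5, 10, 30):
    print(e, round(learning_rate_at(e), 3))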

The performance numbers reported earlier use a stack that includes these major components: Ubuntu 16.04, NVIDIA Driver 396, CUDA 9.2, cuDNN 7.1, NCCL 2.2, OpenMPI 3.1.1, Intel MKL and MKLDNN, TensorFlow 1.9 and Horovod 0.13, and MXNet 1.3b (current master, soon to be released as v1.3).

The results of this work demonstrate that AWS can be used for rapid training of deep learning networks, using a performant, flexible, and scalable architecture. The implementation described in this blog post has room for further optimization. A single Amazon EC2 P3 instance with 8 NVIDIA V100 GPUs can train ResNet50 with ImageNet data in about three hours (NVIDIA, Fast.AI) using SuperConvergence and other advanced optimization techniques. We believe we can further lower the time-to-train across a distributed configuration by applying similar techniques.

Another area where we can extract more performance is improving the scaling efficiency. Lastly, in keeping with our desire to support all popular frameworks equally, future efforts will be focused on replicating similar results on other frameworks, such as PyTorch and Chainer.

Fast CNN Tuning with AWS GPU Instances and SigOpt

By Steven Tartakovsky, Michael McCourt, and Scott Clark of SigOpt

Compared with traditional machine learning models, neural networks are computationally more complex and introduce many additional parameters.

MXNet is a deep learning framework that machine learning engineers and data scientists can use to quickly create sophisticated deep learning models.

In complex machine learning models and data processing pipelines, like the NLP CNN described in this post, many parameters determine how effective a predictive model will be.

Choosing these parameters, fitting the model, and determining how well the model performs is a time-consuming, trial-and-error process called hyperparameter optimization or, more generally, model tuning.

Although you need domain expertise to prepare data, generate features, and select metrics, you don’t need special knowledge of the problem domain for hyperparameter tuning.

SigOpt can significantly speed up and reduce the cost of this tuning step compared to standard hyperparameter tuning approaches like random search and grid search.
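The typical SigOpt workflow is a suggest/observe loop. Roughly (this sketch follows SigOpt's classic Python client; the parameter definitions, loop length, and training stub are placeholders, not the configuration used in the post):

from sigopt import Connection

def train_and_evaluate(assignments):
    # Placeholder: train the CNN with assignments['learning_rate'] and
    # assignments['dropout'], then return the validation accuracy.
    return 0.5

conn = Connection(client_token="YOUR_SIGOPT_API_TOKEN")
experiment = conn.experiments().create(
    name="CNN sentiment tuning (sketch)",
    parameters=[
        dict(name="learning_rate", type="double", bounds=dict(min=1e-4, max=1e-1)),
        dict(name="dropout", type="double", bounds=dict(min=0.0, max=0.7)),
    ],
)

for _ in range(60):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    accuracy = train_and_evaluate(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id, value=accuracy)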

To show how these tools can get you faster results, we ran them on a sentiment analysis task. We used an open dataset of 10,622 labeled movie reviews from Rotten Tomatoes to predict whether the review is positive (4 or 5) or negative (1 or 2).

A single NVIDIA K80 GPU increases training speed approximately 50x compared to the standard distributed CPU workflow. We tune hyperparameters in several categories. In the preprocessing step, we embed all of the words in the dataset into a lower-dimensional space of a certain size (similar to what word2vec does).
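A minimal sketch of that embedding step using MXNet's Gluon API (the vocabulary size, embedding size, and token ids below are arbitrary illustrations, not the tuned values):

import mxnet as mx
from mxnet.gluon import nn

vocab_size, embedding_size = 20000, 300        # illustrative, not the tuned values
embedding = nn.Embedding(vocab_size, embedding_size)
embedding.initialize()

# One padded review represented as word indices; each index maps to a 300-d vector.
token_ids = mx.nd.array([[12, 845, 3, 0]])
vectors = embedding(token_ids)
print(vectors.shape)                            # (1, 4, 300)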

However, for datasets comprising longer sentences, such as CR (where the maximum sentence length is 105, versus 36 to 56 on the other sentiment datasets used here), the optimal region size may be larger.

We recognize that manual and grid search over hyperparameters is sub-optimal, and note that our suggestions here may also inform hyperparameter ranges to explore in random search or Bayesian optimization frameworks.

Data scientists and machine learning engineers often implement complex models and model training configurations from the literature (or open source), then try applying them to some new data and become frustrated by the suboptimal predictive capacity.

Under the Basic scenario (without architecture tuning), SigOpt reached 80.4% accuracy on the validation set after 240 model trainings, and 81.0% in the Complex scenario with 400 model trainings.

Random search attained only 79.9% accuracy after 2400 model trainings, and 80.1% accuracy after 4000 model trainings.

SigOpt got an additional 5% improvement in performance compared with the default settings, and achieved these results with far fewer trials than grid and random search.

Using SigOpt and GPUs provided results with only $11 of compute cost, while using random search on standard infrastructure was over 400 times more time consuming and expensive.

The GPU instance was a p2.xlarge instance with a single NVIDIA K80 GPU, which cost $0.90 per hour and delivered an average training speed of 3 seconds per epoch.

We analyzed how well SigOpt performs against random and grid search by fixing the number of model tuning attempts and seeing how each optimization method performed, on average.

Namely, the middle range of values that SigOpt reported for this experiment was [79.39, 80.30], while the middle range for random search was [79.17, 79.76], a difference of about 0.6% in the medians.

At a very high level, while you can configure complex processes, simulations, machine learning pipelines, and neural networks and determine how well they perform, it’s expensive and time-consuming to evaluate configuration choices.

How to Train TensorFlow Models Using GPUs

In recent years, there has been significant progress in the field of machine learning.

Much of this progress can be attributed to the increasing use of graphics processing units (GPUs) to accelerate the training of machine learning models.

In particular, the extra computational power has led to the popularization of deep learning: the use of complex, multi-level neural networks to create models capable of feature detection from large amounts of unlabeled training data.

GPUs are great for deep learning because the types of calculations they were designed to perform are the same as those encountered in deep learning.

Images, videos, and other graphics are represented as matrices so that when you perform any operation, such as a zoom-in effect or a camera rotation, all you are doing is applying some mathematical transformation to a matrix.

In practice, this means that GPUs, compared to central processing units (CPUs), are more specialized at performing matrix operations and several other types of advanced mathematical transformations.

If you would like a particular operation to run on a device of your choice instead of using the defaults, you can use with tf.device to create a device context.
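For example, here is a minimal sketch (assuming TensorFlow 2's eager execution; the tutorial-era code this post draws on would wrap the same ops in a tf.Session):

import tensorflow as tf

# Place these operations explicitly on the first GPU (assuming one is available).
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

print(c)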

For benchmarking purposes, we will use a convolutional neural network (CNN) for recognizing images that is provided as part of the TensorFlow tutorials.

How to use FLOYD to train your models (Deep Learning)

Let's discuss how we can use FloydHub to train our models simply and quickly. In this video you will learn how to set up your account and train your ...

How to Train Your Models in the Cloud

Let's discuss whether you should train your models locally or in the cloud. I'll go through several dedicated GPU options, then compare three cloud options: AWS ...

How to train a deep learning model for free with Google Cloud GPUs?

I am using Google Cloud GPUs to train a deep learning model for free! Also in this video, you will get a comparison between training with Google Cloud GPUs ...

Getting started with the AWS Deep Learning AMI

Twitter: @julsimon - What is the AWS Deep Learning AMI? - Running ...

Google Colaboratory for free GPU model training (Deep learning)

Here is a guide for training your model online for free. You can train your machine learning and deep learning model online and develop deep learning ...

Deep Learning for Data Scientists: Using Apache MXNet and R on AWS

Learning Objectives: - Deploy a data science environment in minutes with the AWS Deep Learning AMI - Get started with Apache MXNet on R - Train and ...

Lecture 11 | Detection and Segmentation

In Lecture 11 we move beyond image classification, and show how convolutional networks can be applied to other core computer vision tasks. We show how ...

Building Distributed TensorFlow Using Both GPU and CPU on Kubernetes [I] - Zeyu Zheng

Building Distributed TensorFlow Using Both GPU and CPU on Kubernetes [I] - Zeyu Zheng & Huizhi Zhao, Caicloud Big Data and Machine Learning have ...

Mask RCNN with Keras and Tensorflow (pt.2) Real time Mask RCNN

In this video we will write code to do real-time Mask RCNN with the help of OpenCV. GitHub code: ...

PipelineAI: High Performance Distributed TensorFlow AI + GPU + Model Optimizing Predictions

Highlights: We will each build an end-to-end, continuous TensorFlow AI model training and deployment pipeline on our own GPU-based cloud ...