
Distributed Neural Networks with GPUs in the AWS Cloud

by Alex Chen, Justin Basilico, and Xavier Amatriain

As we have described previously on this blog, at Netflix we are constantly innovating by looking for better ways to find the best movies and TV shows for our members.

This involves designing and implementing architectures that can execute these techniques using a reasonable amount of resources in a reasonable amount of time.

The first successful instance of large-scale Deep Learning made use of 16,000 CPU cores across 1,000 machines to train an Artificial Neural Network in a matter of days.

Given our well-known approach and leadership in cloud computing, we set out to implement a large-scale Neural Network training system that leveraged both the advantages of GPUs and the AWS cloud.

We also wanted to avoid needing special machines in a dedicated data center and instead leverage the full, on-demand computing power we can obtain from AWS.

In architecting our approach for leveraging computing power in the cloud, we sought to strike a balance that would make it fast and easy to train Neural Networks by looking at the entire training process.

In our solution, we take the approach of using GPU-based parallelism for training and using distributed computation for handling hyperparameter tuning and different configurations.

However, as explained above, training a single instance actually implies training and testing several models, each corresponding to a different combination of hyperparameters.

Given that you are likely to have thousands of cores available in a single GPU instance, it is very convenient if you can squeeze the most out of that GPU and avoid costly cross-machine communication.

Note that one of the reasons we did not need to address level 3 distribution is because our model has millions of parameters (compared to the billions in the original paper by Ng).

We approached this by first getting a proof-of-concept to work on our own development machines and then addressing the issue of how to scale and use the cloud as a second stage.

While we tried to uncover the root cause, we worked our way around the issue by reimplementing the npps functions as custom CUDA kernels, e.g. replacing the nppsMulC_32f_I function with a hand-written kernel. Replacing all npps functions in this way reduced the total training time on the cg1 instance from over 20 hours to just 47 minutes when training on 4 million samples.
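As a minimal sketch (not the original code from the post), a custom kernel performing the operation of nppsMulC_32f_I, an in-place multiply of a float buffer by a constant, could look like this; the kernel name and launch configuration are illustrative assumptions:

```cuda
// Sketch of a hand-written replacement for nppsMulC_32f_I:
// multiply each element of a device float buffer by a constant, in place.
__global__ void mulc_32f_i(float c, float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= c;
}

// Host-side launcher mirroring the npps-style call:
void launch_mulc(float c, float *d_buf, int n) {
    const int threads = 256;                       // illustrative block size
    const int blocks = (n + threads - 1) / threads;
    mulc_32f_i<<<blocks, threads>>>(c, d_buf, n);
}
```

A kernel this simple is memory-bound, so its performance depends almost entirely on how efficiently it streams the buffer through global memory.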

NVreg_CheckPCIConfigSpace is a parameter of the nvidia-current kernel module. We tested the effect of changing this parameter using a benchmark that calls MulC repeatedly (128x1000 times).
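As an illustration of how such a module parameter is typically set (a generic sketch, not necessarily the exact commands used in the post; the module name and sysfs path are assumptions based on the module name given above):

```shell
# Reload the driver module with the parameter disabled (requires root):
sudo rmmod nvidia-current
sudo modprobe nvidia-current NVreg_CheckPCIConfigSpace=0

# Verify the current value via sysfs:
cat /sys/module/nvidia_current/parameters/NVreg_CheckPCIConfigSpace
```

Note that reloading the module only works when no process is holding the GPU, so this is usually done before starting any CUDA workload.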

On our cg1.4xlarge instances, disabling accesses to PCI configuration space had a spectacular effect on the original npps functions, decreasing the runtime by 95%.

First, we could look at optimizing our code by applying a kernel fusion trick that combines several computation steps into one kernel to reduce memory accesses.
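As a hypothetical illustration of the trick (the kernel names and the specific operations are assumptions, not code from the post): consider a buffer that is first scaled and then shifted. Unfused, each step is its own kernel and the buffer makes two round trips through global memory; fused, it makes one.

```cuda
// Unfused: two kernel launches, two full passes over global memory.
__global__ void mulc(float c, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= c;
}
__global__ void addc(float b, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += b;
}

// Fused: one kernel launch, each element is read and written once.
__global__ void mulc_addc(float c, float b, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] * c + b;
}
```

For memory-bound element-wise operations like these, halving the number of passes over the buffer roughly halves the runtime.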

While our initial work was done using cg1.4xlarge EC2 instances, we were interested in moving to the new EC2 GPU g2.2xlarge instance type, which has a GRID K520 GPU (GK104 chip) with 1536 cores.

Thus, switching to G2 with the right configuration allowed us to run our experiments faster, or alternatively to run larger experiments in the same amount of time.

If you are not familiar with this concept, here is a simple explanation: most machine learning algorithms have parameters to tune, which are often called hyperparameters to distinguish them from the model parameters produced as a result of the learning algorithm.

However, when faced with a complex model where training each one is time consuming and there are many hyperparameters to tune, it can be prohibitively costly to perform such exhaustive grid searches.
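To make the cost concrete, here is a hypothetical sketch of an exhaustive grid search in C++; the function names and the toy evaluate() objective are illustrative stand-ins for a full train-and-test cycle:

```cpp
#include <vector>

// evaluate() stands in for a complete train-and-test run, which is
// exactly the expensive step when each model takes hours to train.
double evaluate(double learning_rate, double regularization) {
    // Toy validation error with a known minimum at (0.1, 0.01).
    double a = learning_rate - 0.1, b = regularization - 0.01;
    return a * a + b * b;
}

struct Best { double lr, reg, err; };

Best grid_search() {
    std::vector<double> lrs  = {0.01, 0.1, 1.0};
    std::vector<double> regs = {0.001, 0.01, 0.1};
    Best best = {0.0, 0.0, 1e30};
    for (double lr : lrs)
        for (double reg : regs) {            // one full training run per cell
            double err = evaluate(lr, reg);
            if (err < best.err) best = {lr, reg, err};
        }
    return best;                             // 3 x 3 = 9 runs in total
}
```

With three values for each of two hyperparameters this is already 9 full training runs, and each additional hyperparameter multiplies the count again, which is why exhaustive search quickly becomes prohibitive.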

Gaussian Processes are a very effective way to perform regression, and while they can have trouble scaling to large problems, they work well when there is a limited amount of data, as is the case when performing hyperparameter optimization.
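As a hypothetical sketch of the regression step (not the tuner used in the post), here is a 1D Gaussian Process posterior mean with an RBF kernel and just two observations, small enough to invert the covariance matrix by hand; a general n-point implementation would instead solve K alpha = y with a Cholesky factorization:

```cpp
#include <cmath>

// RBF (squared-exponential) kernel with unit length-scale and variance.
static double rbf(double x1, double x2) {
    double d = x1 - x2;
    return std::exp(-0.5 * d * d);
}

// GP posterior mean at x_star given two observations (x1,y1), (x2,y2).
double gp_mean(double x_star, double x1, double y1, double x2, double y2) {
    const double noise = 1e-6;                   // small jitter for stability
    // Covariance matrix K + noise*I:
    double k11 = rbf(x1, x1) + noise, k12 = rbf(x1, x2);
    double k22 = rbf(x2, x2) + noise;
    // Solve (K + noise*I) * alpha = y via the explicit 2x2 inverse:
    double det = k11 * k22 - k12 * k12;
    double a1  = ( k22 * y1 - k12 * y2) / det;
    double a2  = (-k12 * y1 + k11 * y2) / det;
    // Predictive mean: dot product of k_star with alpha.
    return rbf(x_star, x1) * a1 + rbf(x_star, x2) * a2;
}
```

At an observed point the posterior mean essentially reproduces the observation, and between points it interpolates smoothly; a real tuner would also use the posterior variance to decide which hyperparameter setting is most promising to try next.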

We’ve squeezed high performance from our GPU but we only have 1–2 GPU cards per machine, so we would like to make use of the distributed computing power of the AWS cloud to perform the hyperparameter tuning for all configurations, such as different models per international region.
