Paper Repro: Deep Neuroevolution

In this post, we reproduce the recent Uber paper “Deep Neuroevolution: Genetic Algorithms are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning”, which showed, surprisingly, that simple genetic algorithms sometimes perform better than far more sophisticated reinforcement learning algorithms on well-studied problems such as Atari games.

OK, now we can move on to the fun stuff. The core idea is simple enough to be worth restating: instead of generating a single network randomly and improving its weights using gradient descent, we randomly generate a whole population of networks, constituting a “generation”, and evaluate each of them on the task; the best performers are kept as parents, and the next generation is produced by adding random noise (mutations) to copies of their weights. A minimal sketch of this loop appears after the next paragraph.

Uber is very clear that they stopped after 1 billion frames, not after a fixed number of generations, so we will make sure to keep track of the number of frames that we see and stop once we’ve passed 1 billion.
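To make that concrete, here is a minimal sketch of the training loop; the population size, truncation size, and mutation power are placeholder values in the ballpark of the paper's, and evaluate is assumed to return both the episode score and the number of frames it consumed, so that we can enforce the 1-billion-frame budget:

```python
import copy
import random

import torch

POPULATION_SIZE = 1000   # placeholder values, roughly the scale of the paper
TRUNCATION = 20          # number of top networks kept as parents
MUTATION_POWER = 0.002   # std-dev of the Gaussian noise added on mutation
FRAME_BUDGET = 1_000_000_000

def mutate(net):
    """Return a copy of `net` with Gaussian noise added to every weight."""
    child = copy.deepcopy(net)
    with torch.no_grad():
        for param in child.parameters():
            param.add_(torch.randn_like(param) * MUTATION_POWER)
    return child

def train(make_random_net, evaluate):
    """`evaluate(net)` plays the game and returns (score, frames_used)."""
    population = [make_random_net() for _ in range(POPULATION_SIZE)]
    frames_seen = 0
    while frames_seen < FRAME_BUDGET:
        scored = []
        for net in population:
            score, frames = evaluate(net)
            frames_seen += frames  # stop on total frames, not generations
            scored.append((score, net))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        parents = [net for _, net in scored[:TRUNCATION]]
        # Next generation: mutated copies of randomly chosen parents.
        population = [mutate(random.choice(parents))
                      for _ in range(POPULATION_SIZE)]
    return scored[0][1]  # best network of the last evaluated generation
```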

As the quote above implies, we absolutely need to implement this compression: not because it is cool, but because we will have workers on several different machines passing neural networks to each other, and serializing the nets compactly is crucial for transferring them quickly from one machine to another.

Well, I wasn't sure exactly which numbers are allowable as random seeds, but this makes sure that whatever I pass fits in a 32-bit signed integer, which I felt was almost certainly acceptable, while bigger numbers might not be.
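For concreteness, a sketch of that kind of clamp (assuming we simply draw seeds in the 32-bit signed range):

```python
import random

# Draw seeds in [0, 2**31), so they always fit in a 32-bit signed integer.
seed = random.randrange(2 ** 31)
```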

The big problem is that each random state is already around 5 kB in size (by contrast, a random seed is only 4 bytes, more than 1,000 times smaller), which meant that my system quickly became very slow due to the sheer amount of data being shuffled around.
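Here is a minimal sketch of the compression idea, with hypothetical names: a network is fully described by the seed that initialized it plus the seed of every mutation applied to it, so any worker that replays the same seed list rebuilds bit-for-bit the same weights:

```python
import torch

MUTATION_POWER = 0.002  # must match the value used during training

def reconstruct(seeds, make_net):
    """Rebuild a network from its compressed form: a list of 32-bit seeds.

    The first seed determines the random initialization; each subsequent
    seed determines one mutation.
    """
    torch.manual_seed(seeds[0])
    net = make_net()  # assumed to draw its initial weights from torch's RNG
    with torch.no_grad():
        for seed in seeds[1:]:
            torch.manual_seed(seed)
            for param in net.parameters():
                param.add_(torch.randn_like(param) * MUTATION_POWER)
    return net
```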

Training a neural network using a genetic algorithm differs from training it using gradient descent in the following way: with gradient descent we do a lot of math on a single network, while with a genetic algorithm we do much less math per network (just a forward pass, no backward pass) but on many networks.
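This difference shows up directly in the code: evaluating an individual only ever requires inference, so we can wrap it in torch.no_grad() and skip gradient bookkeeping entirely. A sketch, assuming a classic Gym-style reset/step environment:

```python
import torch

def fitness(net, env):
    """Evaluate a network with forward passes only: no gradients needed.

    `env` is assumed to follow the classic Gym API, where step() returns
    (observation, reward, done, info).
    """
    obs = env.reset()
    total_reward, done = 0.0, False
    with torch.no_grad():  # no backward pass will ever be needed
        while not done:
            action = net(torch.as_tensor(obs, dtype=torch.float32)).argmax().item()
            obs, reward, done, _ = env.step(action)
            total_reward += reward
    return total_reward
```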

Another advantage of this architecture is that it allows us to put the master and the workers on different machines, which is useful for keeping costs low: we can put the master on a not very powerful but very reliable machine, while we can put the workers on machines that are very powerful but not as reliable.

Specifically, the function you want the workers to run should live in a .py file (see the sketch below). Your master code can then add a job to the queue using job = queue.enqueue(...) with the function in question and the arguments it should take, and it can later get the result using job.result, which will be None if the function hasn't returned yet.
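A minimal version of this setup might look as follows (the module and function names are mine; rq requires the job function to live in an importable .py file, which is what the sentence above is getting at):

```python
# tasks.py -- hypothetical module; the job function must be importable
# by the rq workers.
def evaluate_network(seeds):
    """Rebuild the network from its seed list, play one episode,
    and return the score (body elided in this sketch)."""
    ...
```

```python
# master.py
import time

from redis import Redis
from rq import Queue

from tasks import evaluate_network

queue = Queue(connection=Redis())  # assumes a reachable Redis server

job = queue.enqueue(evaluate_network, [123456789])
while job.result is None:  # stays None until the worker has returned
    time.sleep(0.1)        # in real code, poll a whole batch of jobs
print(job.result)
```

The workers themselves are then started on each machine with the rq worker command, pointed at the same Redis server.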

There is even an instance with 96 CPUs (the m5.24xlarge), but it is worth noting that its CPUs are slightly slower than the c5.18xlarge's and its cost per CPU is slightly higher, because it also comes with a lot more memory, which we don’t really need (the c5.18xlarge has 144 GB of memory and the m5.24xlarge has 384 GB, but as I recall I never used more than 5 GB on any machine I used).

Well, a single c5.18xlarge costs $3.06 per hour in the US-West region (other regions should be in the same ballpark), and Uber claims that they were able to train their networks in about an hour using the equivalent of 10 c5.18xlarge instances, so training a single network should cost us about 10 × $3.06 ≈ $30.60, assuming all goes smoothly.

Additionally, the price you actually pay can be less predictable than with regular on-demand instances: the spot price of instances fluctuates as demand increases and decreases, and you are always paying the current spot price as long as it is lower than your maximum bid.

Note, however, that because I trained on three games (Frostbite, Breakout, and Space Invaders) and made many errors along the way, this whole experiment cost me around $115, so be extra careful, or make sure you’re willing to spend around $100 before you start.

You will be taken to the spot instance request page. Many of the parameters can be kept at their defaults. Optionally, you can set your maximum price (the default is to bid the On-Demand price, which is what I would recommend anyway), and you can reserve your spot instance for a fixed duration (this way you are guaranteed that your instance won’t be preempted for the next, say, 3 hours, but you might pay more than with an ordinary spot instance).

This can be done by installing mkl-service (conda install mkl-service) and putting a couple of lines at the top of your worker file, before you import PyTorch; a sketch is shown below.
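With mkl-service installed, the pinning looks like this (using exactly one thread per worker is my assumption; the important part is that it happens before PyTorch is imported):

```python
# These lines must run before PyTorch is imported, or MKL will already
# have spawned its thread pool.
import mkl
mkl.set_num_threads(1)

import torch  # safe to import now that MKL is limited to one thread
```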

Finally, one or two hours later (depending on how many machines you chose to use), we have results! Because evaluation is partly random (not to mention the potential for random number generation within the individual Atari games), I decided to run the game 30 times and to display the best, worst, and median result.

This seems to explain the extremely large discrepancy between the median score and the best score as I go through the generations: it is not that agents are very sensitive to small changes in their weights, it is that they are very sensitive to small changes in their starting point! (Note that Uber’s graphs of “median” scores do not use this definition, which is why I can’t really compare them to what I found: in Uber’s definition, it is the median, across multiple runs, of the best agent’s score.)

However, I am a bit concerned about its strategy, which consists, at least at first, of simply staying far to the left and trying to shoot the mothership: this looks more like a weird local optimum than a real skill the agent has acquired.

I think, however, that there are significant issues with the GA approach that Uber doesn’t point out, partly because they weren’t able to evaluate their algorithms using a human-start database (which I think would have exposed the brittleness of the various agents), and it would be interesting to try to fix these.
