AI News, Deep Reinforcement Learning: Playing CartPole through Asynchronous Advantage Actor Critic (A3C) with tf.keras and eager execution

By Raymond Yuan, Software Engineering Intern

In this tutorial we will learn how to train a model that is able to win at the simple game CartPole using deep reinforcement learning.

Reinforcement learning is an area of machine learning in which agents learn to take actions within an environment in order to maximize some reward.

In the process, we'll build practical experience and develop intuition around several core concepts, following a basic workflow from setting up the environment and establishing a baseline through defining and training the model. Audience: this tutorial is targeted at anyone interested in reinforcement learning.

While we won't go too deeply into the basics of machine learning, we'll cover topics such as policy and value networks at a high level.

The starting state (cart position, cart velocity, pole angle, and pole velocity at the tip) is randomly initialized, with each component drawn uniformly from between -0.05 and +0.05.
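To make the setup concrete, here is a minimal sketch of creating the CartPole environment with OpenAI Gym and inspecting that starting state (this assumes the classic gym API and the CartPole-v0 environment name from the era of the original tutorial; newer gym/gymnasium releases return observations slightly differently):

import gym

env = gym.make('CartPole-v0')
state = env.reset()

# state is a 4-element array:
# [cart position, cart velocity, pole angle, pole velocity at tip],
# each component drawn uniformly from [-0.05, 0.05].
print(state)
print(env.observation_space)  # Box(4,)
print(env.action_space)       # Discrete(2): push the cart left (0) or right (1)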

For example, your model may seem to be performing well when you see high scores being returned, but in reality those high scores may not reflect a good algorithm; they may simply be the result of random actions.

In a classification example, we can establish baseline performance by simply analyzing the class distribution and predicting our most common class.
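For CartPole, the analogous baseline is an agent that acts completely at random. The sketch below estimates the average score such an agent achieves, which is the number our trained agent has to beat (the helper name run_random_baseline is illustrative, not from the tutorial):

import gym

def run_random_baseline(episodes=100):
    env = gym.make('CartPole-v0')
    scores = []
    for _ in range(episodes):
        env.reset()
        done, score = False, 0.0
        while not done:
            # Sample a uniformly random action; no learning involved.
            _, reward, done, _ = env.step(env.action_space.sample())
            score += reward
        scores.append(score)
    return sum(scores) / len(scores)

print('Average random-agent score:', run_random_baseline())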

At a high level, the A3C algorithm uses an asynchronous updating scheme that operates on fixed-length segments of experience (a set number of timesteps at a time).

Each worker performs the following workflow cycle: fetch the latest global network parameters, interact with its own copy of the environment, compute the loss and gradients, and push the updates back to the global network. With this training configuration, we expect to see a roughly linear speed-up with the number of agents.

As such, while the agent is playing the game, whenever it sees a certain state (or similar states), it computes the probability of each available action given that state, and then samples an action according to this probability distribution.
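A minimal tf.keras sketch of this idea is shown below: a small actor-critic model maps a state to action logits (the policy head) and a state-value estimate (the value head), and we sample an action from the softmax of the logits. The layer sizes are illustrative rather than the tutorial's exact architecture; with TensorFlow 2.x eager execution is on by default, while under TF 1.x you would first call tf.enable_eager_execution().

import numpy as np
import tensorflow as tf

class ActorCriticModel(tf.keras.Model):
    def __init__(self, num_actions=2):
        super(ActorCriticModel, self).__init__()
        self.dense = tf.keras.layers.Dense(100, activation='relu')
        self.policy_logits = tf.keras.layers.Dense(num_actions)  # actor head
        self.values = tf.keras.layers.Dense(1)                   # critic head

    def call(self, inputs):
        x = self.dense(inputs)
        return self.policy_logits(x), self.values(x)

model = ActorCriticModel(num_actions=2)
state = np.random.uniform(-0.05, 0.05, size=(1, 4)).astype(np.float32)
logits, value = model(tf.convert_to_tensor(state))
probs = tf.nn.softmax(logits).numpy()[0]               # probability of each action
action = int(tf.random.categorical(logits, 1)[0, 0])   # sample from that distribution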

To delve into the mathematics more formally, policy gradients are a special case of the more general score function gradient estimator.

Then, using the log-derivative trick, we can work out how to update our network's parameters so that sampled actions lead to higher rewards, ending up with ∇_θ E_x[f(x)] = E_x[f(x) ∇_θ log p(x; θ)].

In English, this equation says that shifting θ in the direction of the gradient will increase our expected score under the reward function f.
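For completeness, here is the short derivation behind that identity (the log-derivative trick), written in the same notation, where p(x; θ) is the distribution our parameterized policy induces:

∇_θ E_x[f(x)] = ∇_θ ∫ p(x; θ) f(x) dx
             = ∫ ∇_θ p(x; θ) f(x) dx
             = ∫ p(x; θ) ∇_θ log p(x; θ) f(x) dx
             = E_x[f(x) ∇_θ log p(x; θ)]

The middle step uses ∇_θ p = p ∇_θ log p, which is just the chain rule applied to log p.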

(We implement with threads for the sake of simplicity and clarity of example.) In addition, to make keeping track of things easier, we'll also implement a Memory class.
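A bare-bones version of such a Memory class might look like the following; the exact fields in the tutorial's implementation may differ, but the idea is simply to buffer the (state, action, reward) triples a worker collects between updates:

class Memory:
    def __init__(self):
        self.states = []
        self.actions = []
        self.rewards = []

    def store(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def clear(self):
        self.states = []
        self.actions = []
        self.rewards = []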

The losses are calculated as:

Value Loss: L = Σ (R − V(s))²
Policy Loss: L = −log(𝛑(s)) · A(s)

where R is the discounted reward, V our value function (of an input state), 𝛑 our policy function (also of an input state), and A our advantage function, A = R − V(s).
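As a rough sketch of how these losses can be computed with tf.keras and eager execution (reusing the ActorCriticModel and Memory sketches above; the discount factor gamma and the bootstrapping of the final value are standard A3C choices and may differ in detail from the tutorial's exact code):

import tensorflow as tf

def compute_loss(model, memory, bootstrap_value, gamma=0.99):
    # Discounted returns R, accumulated backwards from the bootstrap value.
    discounted_rewards = []
    running_reward = bootstrap_value
    for reward in memory.rewards[::-1]:
        running_reward = reward + gamma * running_reward
        discounted_rewards.append(running_reward)
    discounted_rewards.reverse()

    logits, values = model(tf.convert_to_tensor(memory.states, dtype=tf.float32))
    values = tf.squeeze(values)

    # Advantage A = R - V(s); the value loss pushes V(s) towards R.
    advantage = tf.convert_to_tensor(discounted_rewards, dtype=tf.float32) - values
    value_loss = tf.reduce_mean(tf.square(advantage))

    # Policy loss: -log pi(a|s) * A, with the advantage treated as a constant.
    log_probs = tf.nn.log_softmax(logits)
    actions_onehot = tf.one_hot(memory.actions, logits.shape[1], dtype=tf.float32)
    log_pi_a = tf.reduce_sum(actions_onehot * log_probs, axis=1)
    policy_loss = -tf.reduce_mean(log_pi_a * tf.stop_gradient(advantage))

    return value_loss + policy_loss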

The worker agent repeats this cycle: reset its network parameters to those of the global network, interact with its environment, compute the loss, and then apply the resulting gradients to the global network.
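A condensed version of that worker loop under eager execution might look like the sketch below. It assumes the ActorCriticModel, Memory, and compute_loss sketches from earlier, plus a shared global_model and optimizer that have already been built; the real tutorial code also handles score reporting, stopping criteria, and saving the best model.

import gym
import tensorflow as tf

def worker_loop(global_model, optimizer, update_freq=20, max_episodes=1000):
    env = gym.make('CartPole-v0')
    local_model = ActorCriticModel(num_actions=env.action_space.n)
    local_model(tf.convert_to_tensor([env.reset()], dtype=tf.float32))  # build weights
    memory = Memory()

    for _ in range(max_episodes):
        # 1. Reset the local network to the current global parameters.
        local_model.set_weights(global_model.get_weights())
        state, done, step = env.reset(), False, 0
        while not done:
            # 2. Interact with the environment, sampling actions from the policy.
            logits, _ = local_model(tf.convert_to_tensor([state], dtype=tf.float32))
            action = int(tf.random.categorical(logits, 1)[0, 0])
            next_state, reward, done, _ = env.step(action)
            memory.store(state, action, reward)
            state, step = next_state, step + 1

            if done or step % update_freq == 0:
                # 3. Compute the loss on the collected segment of experience.
                bootstrap = 0.0 if done else float(
                    local_model(tf.convert_to_tensor([state], dtype=tf.float32))[1])
                with tf.GradientTape() as tape:
                    loss = compute_loss(local_model, memory, bootstrap)
                grads = tape.gradient(loss, local_model.trainable_weights)
                # 4. Apply the local gradients to the global network, then resync.
                optimizer.apply_gradients(zip(grads, global_model.trainable_weights))
                local_model.set_weights(global_model.get_weights())
                memory.clear()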

Lecture 14 | Deep Reinforcement Learning

In Lecture 14 we move from supervised learning to reinforcement learning (RL), in which an agent must learn to interact with an environment in order to ...

Google DeepMind AI Learns to Walk and Run

Main source: If there are any technical mistakes, I'd be happy to hear about ...

How Microsoft AI defeated Ms Pacman: Build 2018

Microsoft Research developed a model called Hybrid Reward Architecture for scaling reinforcement learning to tasks that have extremely complex value ...

Lecture 16 | Adversarial Examples and Adversarial Training

In Lecture 16, guest lecturer Ian Goodfellow discusses adversarial examples in deep learning. We discuss why deep networks and other machine learning ...

Visualizing Rewards in Reinforcement Learning

In this video, we will have a super-cool visualization which will help you build intuition behind rewards. This is the second of a 15-video miniseries on Deep RL.

A Guide to DeepMind's StarCraft AI Environment

I'm going to go through the steps necessary to install and run the StarCraft II Environment that DeepMind recently open-sourced! I'll discuss DeepMind's RL ...

Randomized ensemble reinforcement learning in Mario AI

0:00 Start learning using action filters 1+2
2:29 Learning progress / behaviour after 500 episodes
4:29 After 990 episodes (the hopping is gone)
5:18 Using a ...

Counterfactual Multi-Agent Policy Gradients

Many real-world problems, such as network packet routing and the coordination of autonomous vehicles, are naturally modelled as cooperative multi-agent ...