
Attacking Machine Learning with Adversarial Examples

Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake.

At OpenAI, we think adversarial examples are a good aspect of security to work on because they represent a concrete problem in AI safety that can be addressed in the short term, and because fixing them is difficult enough that it requires a serious research effort.

(Though we'll need to explore many aspects of machine learning security to achieve our goal of building safe, widely distributed AI.)

To get an idea of what adversarial examples look like, consider this demonstration from Explaining and Harnessing Adversarial Examples: starting with an image of a panda, the attacker adds a small perturbation calculated to cause the image to be recognized, with high confidence, as a gibbon.
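The panda/gibbon demonstration uses the fast gradient sign method (FGSM) from that paper. Here is a minimal sketch of the mechanics on a toy logistic classifier with hypothetical weights (not the ImageNet network from the demo): compute the gradient of the loss with respect to the input, then step a small amount epsilon in the direction of its sign.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)   # hypothetical model weights
x = rng.normal(size=100)   # the clean input (the "panda")

def predict(x):
    """Probability of the true class under the toy logistic model."""
    return 1.0 / (1.0 + np.exp(-w @ x))

# For loss L = -log p with p = sigmoid(w @ x), the input gradient is
# dL/dx = -w * (1 - p). FGSM perturbs x by epsilon times the *sign* of
# that gradient, so every pixel moves only a tiny, bounded amount.
eps = 0.05
p = predict(x)
grad_loss_x = -w * (1.0 - p)
x_adv = x + eps * np.sign(grad_loss_x)

print(predict(x), predict(x_adv))  # confidence in the true class drops
```

The sign function is what keeps the perturbation imperceptible: no single input dimension changes by more than epsilon, yet the small changes all push the loss in the same direction.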

For example, attackers could target autonomous vehicles by using stickers or paint to create an adversarial stop sign that the vehicle would interpret as a 'yield' or other sign, as discussed in Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples.

When we think about the study of AI safety, we usually think about some of the most difficult problems in that field — how can we ensure that sophisticated reinforcement learning agents that are significantly more intelligent than human beings behave in ways that their designers intended?

This creates a model whose surface is smoothed in the directions an adversary will typically try to exploit, making it difficult for them to discover adversarial input tweaks that lead to incorrect categorization.

(Distillation was originally introduced in Distilling the Knowledge in a Neural Network as a technique for model compression, where a small model is trained to imitate a large one, in order to obtain computational savings.) Yet even these specialized algorithms can easily be broken by giving more computational firepower to the attacker.

In other words, they look at a picture of an airplane, they test which direction in picture space makes the probability of the “cat” class increase, and then they give a little push (in other words, they perturb the input) in that direction.

If the model’s output is “99.9% airplane, 0.1% cat”, then a little tiny change to the input gives a little tiny change to the output, and the gradient tells us which changes will increase the probability of the “cat” class.
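The "little push" described above can be sketched as iterated gradient ascent on the probability of the target class. The linear softmax model, class labels, step size, and iteration count below are all hypothetical stand-ins for a real image classifier; the point is only the mechanics of following the gradient toward "cat".

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 50))   # hypothetical classes: 0=airplane, 1=cat, 2=frog
x = rng.normal(size=50)
x = x + 0.5 * W[0]             # bias the starting input toward "airplane"

def probs(x):
    """Softmax class probabilities of the toy linear model."""
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

# For a linear softmax, the gradient of log p_target with respect to x is
# W[target] - sum_k p_k * W[k]; repeatedly stepping along it is the
# "little push" toward the cat class described in the text.
target = 1
for _ in range(200):
    p = probs(x)
    grad = W[target] - p @ W
    x = x + 0.01 * grad        # small step up the cat-probability surface

print(probs(x))                # the cat class now has the highest probability
```

Each individual step barely changes the image, which is why the final adversarial example can look unchanged to a human while the model's output flips completely.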

Let’s run a thought experiment to see how well we could defend our model against adversarial examples by running it in “most likely class” mode instead of “probability mode.” The attacker no longer knows where to go to find inputs that will be classified as cats, so we might have some defense.
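The thought experiment amounts to wrapping the classifier so callers only ever see an argmax label, never the probabilities whose gradients guide the attack. A minimal sketch, with a hypothetical two-class linear model:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(2, 10))   # hypothetical classes: 0=airplane, 1=cat

def probs(x):
    """Softmax probabilities (what the attacker wishes they could see)."""
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def hard_label(x):
    """'Most likely class' mode: expose only the argmax, not the scores."""
    return int(np.argmax(probs(x)))

x = rng.normal(size=10)
x_nudged = x + 1e-6

# A tiny perturbation shifts the probabilities slightly (useful signal)...
print(probs(x) - probs(x_nudged))
# ...but leaves the hard label unchanged, so the attacker sees no gradient.
print(hard_label(x), hard_label(x_nudged))
```

This illustrates why the defense only masks gradients rather than removing the vulnerability: the decision boundary is still in the same place, the attacker just has to find it without probability feedback.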

Defense strategies that perform gradient masking typically produce a model that is very smooth in specific directions and in neighborhoods of the training points, which makes it harder for the adversary to find gradients pointing toward input perturbations that damage the model.

Neither algorithm was explicitly designed to perform gradient masking, but gradient masking is apparently a defense that machine learning algorithms can invent relatively easily when they are trained to defend themselves and not given specific instructions about how to do so.

Lecture 16 | Adversarial Examples and Adversarial Training

In Lecture 16, guest lecturer Ian Goodfellow discusses adversarial examples in deep learning. We discuss why deep networks and other machine learning ...

TensorFlow Tutorial #11 Adversarial Examples

How to fool a neural network into misclassifying images by adding a little 'specialized' noise. Demonstrated on the Inception model.

Generative Adversarial Nets - Fresh Machine Learning #2

This episode of Fresh Machine Learning is all about a relatively new concept called a Generative Adversarial Network. A model continuously tries to fool another ...

Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Training

Dilin Wang, Yihao Feng and Qiang Liu. Bayesian Deep Learning Workshop, NIPS 2016, December 10, 2016, Centre Convencions Internacional Barcelona, ...

Adversarial attacks and defenses - NIPS 2017

Gaussian Mixture Models - The Math of Intelligence (Week 7)

We're going to predict customer churn using a clustering technique called the Gaussian Mixture Model! This is a probability distribution that consists of multiple ...

Generative Adversarial Networks (LIVE)

We're going to build a GAN to generate some images using Tensorflow. This will help you grasp the architecture and intuition behind adversarial approaches to ...

Backpropagation in 5 Minutes (tutorial)

Let's discuss the math behind back-propagation. We'll go over the 3 terms from Calculus you need to understand it (derivatives, partial derivatives, and the chain ...

Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks

Nicolas Papernot (The Pennsylvania State University). Presented at the ...