AI News

A recent paper with an innocent-sounding title is probably the biggest news in neural networks since the invention of the backpropagation algorithm.

The recent paper 'Intriguing properties of neural networks' by Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow and Rob Fergus, a team that includes authors from Google's deep learning research project, outlines two pieces of news about the way neural networks behave that run counter to what we believed - and one of them is frankly astonishing.

For example, in a face recognizer a neuron might respond strongly to an image that has an eye or a nose - but notice there is no reason that the features should correspond to the neat labels that humans use.

That is, if you train a network to recognize a cat using a particular set of cat photos the network will, as long as it has been trained properly, have the ability to recognize a cat photo it hasn't seen before.

To create the slightly perturbed version you would simply modify each pixel value; as long as the change was small, the cat photo would look exactly the same to a human - and presumably to a neural network.

What the researchers did was to invent an optimization algorithm that starts from a correctly classified example and tries to find a small perturbation in the pixel values that drives the output of the network to another classification.
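The paper's actual method is a box-constrained optimization (L-BFGS) on the network's loss, but the flavor of the idea can be sketched against a toy stand-in classifier: repeatedly step the input in the direction that changes the class score, while clipping the perturbation so the example stays close to the original. All the numbers below are invented for illustration; this is not the paper's algorithm.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-in for a trained classifier: P(cat) = sigmoid(w . x + b)
w = np.array([2.0, -3.0, 1.0])
b = 0.0
x = np.array([0.5, -0.2, 0.3])                 # correctly classified: P(cat) ~ 0.87

x_adv = x.copy()
for _ in range(25):
    p = sigmoid(w @ x_adv + b)
    grad = p * (1 - p) * w                     # dP(cat)/dx for this model
    x_adv -= 0.05 * np.sign(grad)              # small step against the cat score
    x_adv = np.clip(x_adv, x - 0.5, x + 0.5)   # stay close to the original

print(sigmoid(w @ x + b))                      # ~0.87 - "cat"
print(sigmoid(w @ x_adv + b))                  # ~0.25 - no longer "cat"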

'The above observations suggest that adversarial examples are somewhat universal and not just the results of overfitting to a particular model or to the specific selection of the training set.' This is perhaps the most remarkable part of the result.

Change the situation just a little, though, and ask: what does it matter if a self-driving car that uses a deep neural network misclassifies a view of a pedestrian standing in front of the car as a clear road?

(The volume that is not near the surface drops exponentially with increasing dimension.) Given that the decision boundaries of a deep neural network live in a very high dimensional space, it seems reasonable that most correctly classified examples are going to be close to a decision boundary - hence the ability to find a misclassified example close to the correct one: you simply have to work out the direction to the closest boundary.


Part 2: TensorFlow tutorial - Building a small neural network based image classifier

In this TensorFlow tutorial, we shall build a convolutional neural network based image classifier using TensorFlow.

To demonstrate how to build a convolutional neural network based image classifier, we shall build a 6 layer neural network that will identify and separate images of dogs from those of cats.

A neuron takes an input (say x), does some computation on it (say, multiplies it by a variable w and adds another variable b) to produce a value (say, z = wx + b).

Networks which have many hidden layers tend to be more accurate and are called deep networks; hence, machine learning algorithms which use these deep networks are called deep learning.

Typically, all the neurons in one layer do a similar kind of mathematical operation, and that's how a layer gets its name (except for the input and output layers, which do little computation).

In order to calculate the dot product, it's mandatory for the 3rd dimension of the filter to be the same as the number of channels in the input.

When we calculate the dot product, it's an element-wise multiplication of a 5*5*3 sized chunk of the input with the 5*5*3 sized filter, summed to give a single output value.

We shall slide the convolutional filter over the whole input image to calculate this output across the image, as shown by the schematic below; in this case, we slide our window by 1 pixel at a time.

If you concatenate all these outputs in 2D, we shall have an output activation map of size 28*28 (can you think of why it's 28*28, going from 32*32 with a filter of 5*5 and a stride of 1?).

So, it’s a standard practice to add zeros on the boundary of the input layer such that the output is the same size as input layer.

So, in this example, if we add a padding of size 2 on both sides of the input layer, the size of the output layer will be 32*32*6, which also works out nicely from an implementation standpoint.

Then the output size will be: (N-F+2P)/S + 1. The pooling layer is mostly used immediately after the convolutional layer to reduce the spatial size (only width and height, not depth).

The most common form of pooling is max pooling, where we take a filter of size F*F and apply the maximum operation over each F*F sized part of the image.

Then the output size w2*h2*d2 will be: w2 = (w1-F)/S + 1, h2 = (h1-F)/S + 1, d2 = d1. The most common pooling is done with a filter of size 2*2 and a stride of 2.
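These size formulas are easy to check in code; here is a minimal sketch, with the example numbers taken from the 32*32 input and 5*5 filter discussed above:

def conv_output_size(n, f, p, s):
    # (N - F + 2P) / S + 1, from the formula above
    return (n - f + 2 * p) // s + 1

def pool_output_size(w, f, s):
    # (W - F) / S + 1 per spatial dimension; depth is unchanged
    return (w - f) // s + 1

print(conv_output_size(32, 5, 0, 1))   # 28: no padding shrinks the map
print(conv_output_size(32, 5, 2, 1))   # 32: padding of 2 keeps the size
print(pool_output_size(28, 2, 2))      # 14: 2*2 max pooling with stride 2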

For example, when we are trying to build the classifier between dog and cat, we are looking for parameters such that the output layer gives the probability of dog as 1 (or at least higher than cat) for all images of dogs, and the probability of cat as 1 (or at least higher than dog) for all images of cats.

So, let's say we start with some initial values of the parameters and feed 1 training image of a dog (in reality, multiple images are fed together), and we calculate the output of the network as 0.1 for it being a dog and 0.9 for it being a cat.

There is a variable used to govern how fast we change the parameters of the network during training; it's called the learning rate. If you think about it, we want to maximise the total correct classifications by the network, i.e. minimise the cost.

So, we keep an eye on the cost and we keep doing many iterations of forward and backward propagation (tens of thousands, sometimes) till the cost stops decreasing.

Let’s say \(y_{prediction}\) is the vector containing the output of the network for all the training images and \(y_{actual}\) is the vector containing actual values(also called ground truth) of these labeled images.

So, we define the cost as half the sum of these squared distances over all the images: $$ cost = 0.5 \sum_{i=0}^{n} (y_{actual,i} - y_{prediction,i})^2 $$ This is a very simple example of a cost, but in actual training we use much more complicated cost measures, like the cross-entropy cost.
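As a quick numeric illustration of this cost (the values below are made up):

import numpy as np

y_actual = np.array([1.0, 0.0, 1.0, 1.0])       # ground truth for 4 images
y_prediction = np.array([0.9, 0.2, 0.6, 0.8])   # network outputs

# Squared-error cost, as in the formula above
cost = 0.5 * np.sum((y_actual - y_prediction) ** 2)
print(cost)   # 0.5 * (0.01 + 0.04 + 0.16 + 0.04) = 0.125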

In a production set-up, when we get a new image of a dog/cat to classify, we load this model into the same network architecture and calculate the probability of the new image being a cat/dog.

Rather, let's say we have a total of 1600 images; we divide them into small batches, say of size 16 or 32, called the batch size.

ii) Shape function: if you have a multi-dimensional Tensor in TF, you can get its shape; for the batch above, the output will be: array([ 16, 128, 128,   3], dtype=int32). You can reshape this to a new 2D Tensor of shape [16, 128*128*3] = [16, 49152].
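A minimal sketch of both operations, assuming TensorFlow 1.x as used in this tutorial:

import tensorflow as tf  # TensorFlow 1.x

batch = tf.placeholder(tf.float32, shape=[16, 128, 128, 3])
print(batch.get_shape())             # (16, 128, 128, 3)

flat = tf.reshape(batch, [16, 128 * 128 * 3])
print(flat.get_shape())              # (16, 49152)

# tf.shape(batch) gives the shape as a Tensor you can evaluate in a session:
# session.run(tf.shape(batch)) -> array([ 16, 128, 128,   3], dtype=int32)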

iii) Softmax: a function that converts a vector containing real values to a same-shaped vector of real values in the range (0, 1), whose sum is 1.

We shall apply the softmax function to the output of our convolutional neural network in order to convert the output to the probability for each class.
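A minimal sketch of softmax (the logits here are made-up numbers):

import numpy as np

def softmax(z):
    # Subtracting the max is a standard trick for numerical stability;
    # it doesn't change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0])   # raw network outputs for [dog, cat]
print(softmax(logits))          # [0.731..., 0.268...], sums to 1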

We have used 2000 images each of dogs and cats from the Kaggle dataset, but you could use any n image folders on your computer which contain different kinds of objects.

Typically, in the first convolutional layer, you pass n images of size width*height*num_channels, so the input has the shape [n width height num_channels].

filter = trainable variables defining the filter. If your filter is of size filter_size, the input fed has num_input_channels, and you have num_filters filters in your current layer, then the filter will have the following shape: [filter_size filter_size num_input_channels num_filters].

strides = defines how much you move your filter when doing the convolution.

Finally, we use ReLU as our activation function, which simply takes the output of max_pool and applies ReLU using tf.nn.relu. All these operations are done in a single convolution layer.
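Put together, such a layer might look like the sketch below (TensorFlow 1.x; the initializer values are illustrative choices, not prescribed by the tutorial):

import tensorflow as tf  # TensorFlow 1.x

def create_convolutional_layer(input, num_input_channels, filter_size, num_filters):
    # Trainable filter: [filter_size, filter_size, num_input_channels, num_filters]
    weights = tf.Variable(tf.truncated_normal(
        [filter_size, filter_size, num_input_channels, num_filters], stddev=0.05))
    biases = tf.Variable(tf.constant(0.05, shape=[num_filters]))

    # Slide the filter over the input with stride 1; 'SAME' zero-pads so the
    # output keeps the input's width and height
    layer = tf.nn.conv2d(input=input, filter=weights,
                         strides=[1, 1, 1, 1], padding='SAME')
    layer += biases

    # 2*2 max pooling with a stride of 2 halves the width and height
    layer = tf.nn.max_pool(value=layer, ksize=[1, 2, 2, 1],
                           strides=[1, 2, 2, 1], padding='SAME')

    # ReLU activation
    return tf.nn.relu(layer)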

We will use a simple cost calculated using the TensorFlow function softmax_cross_entropy_with_logits, which takes the output of the last fully connected layer and the actual labels to calculate the cross-entropy, whose average gives us the cost.

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost) As you know, if we run the optimizer operation inside session.run(), then in order to calculate the value of the cost, the whole network will have to be run, and we will pass the training images in a feed_dict. (Does that make sense?)

In order to calculate the cost, the whole network (3 convolutional + 1 flattening + 2 fully connected layers) will have to be executed to produce layer_fc2 (which is required to calculate cross_entropy, hence the cost).
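A sketch of how the cost and optimizer fit together (TensorFlow 1.x; layer_fc2 here is a simplified stand-in for the tutorial's conv + fully connected stack, and y_true is assumed to be the one-hot labels placeholder):

import tensorflow as tf  # TensorFlow 1.x

x = tf.placeholder(tf.float32, shape=[None, 128, 128, 3], name='x')
y_true = tf.placeholder(tf.float32, shape=[None, 2], name='y_true')

# Stand-in for the real network: flatten and map straight to 2 class logits
flat = tf.reshape(x, [-1, 128 * 128 * 3])
layer_fc2 = tf.layers.dense(flat, 2)

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=layer_fc2,
                                                        labels=y_true)
cost = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost)

# Each call runs the whole graph on one batch:
# session.run(optimizer, feed_dict={x: x_batch, y_true: y_true_batch})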

acc = session.run(accuracy, feed_dict=feed_dict_train) Since the training images along with their labels are used for training, the training accuracy will in general be higher than the validation accuracy.

After you are done with training, you will notice that there are many new files in the folder. The file dogs-cats-model.meta contains the complete network graph, and we can use this to recreate the graph later.

To classify a new image, we pre-process the input image in the same way as in training, get hold of y_pred on the graph, and pass it the new image in a feed_dict.
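A minimal restore-and-predict sketch (TensorFlow 1.x; it assumes the graph's input and output tensors were named 'x' and 'y_pred' when the network was built, and uses a random array as a stand-in for a real pre-processed image):

import numpy as np
import tensorflow as tf  # TensorFlow 1.x

session = tf.Session()
# Recreate the graph from the .meta file and restore the trained weights
saver = tf.train.import_meta_graph('dogs-cats-model.meta')
saver.restore(session, tf.train.latest_checkpoint('./'))

graph = tf.get_default_graph()
x = graph.get_tensor_by_name('x:0')
y_pred = graph.get_tensor_by_name('y_pred:0')

image = np.random.rand(1, 128, 128, 3).astype(np.float32)  # stand-in image
probabilities = session.run(y_pred, feed_dict={x: image})
print(probabilities)   # per-class probabilities for the new image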

Google's Artificial Brain Learns to Find Cat Videos

By Liat Clark, Wired UK

When computer scientists at Google's mysterious X lab built a neural network of 16,000 computer processors with one billion connections and let it browse YouTube, it did what many web users might do – it began to look for cats.

The simulation was exposed to 10 million randomly selected YouTube video thumbnails over the course of three days and, after being presented with a list of 20,000 different items, it began to recognize pictures of cats using a "deep learning" algorithm.

Picking up on the most commonly occurring images featured on YouTube, the system achieved 81.7 percent accuracy in detecting human faces, 76.7 percent accuracy when identifying human body parts and 74.8 percent accuracy when identifying cats.

"Starting with these learned features, we trained it to obtain 15.8 percent accuracy in recognizing 20,000 object categories, a leap of 70 percent relative improvement over the previous state-of-the-art [networks]."

"The idea is that instead of having teams of researchers trying to find out how to find edges, you instead throw a ton of data at the algorithm and you let the data speak and have the software automatically learn from the data,"

How do we 'train' neural networks?

This is part 1 of my planned series on optimization algorithms used for ‘training’ in Machine Learning and Neural Networks in particular.

I do not claim my explanation to be a full, deep introduction to neural networks and, in fact, hope that you're already familiar with these concepts.

If you want a better understanding of what is going on in neural networks, I provide a list of resources to learn from at the end of my post.

We're going to consider a slightly modified version of the simplest model of a neuron, the 'Perceptron' proposed by Frank Rosenblatt in 1957.

The formula is the following: if our number is greater than zero then we take that number as is; if it's less than zero then we take zero instead. In other words, ReLU(x) = max(0, x).

You should remember that the way we store images in a computer is by representing them as an array of numbers, each number indicating the brightness of a given pixel.

So, the way to pass an image to a neuron is to take that 2D array (or 3D in the case of colored images), flatten it into a row to get a 1D vector, and pass all those numbers to the neuron.

Unfortunately, this makes our network dependent on the image size: we're only going to be able to process images of a given size defined by the network.

As you can see, these neurons are stacked in 3 fully connected layers; that is, each neuron from one layer is connected to each neuron in the following layer.

The need for non-linearity comes from the fact that we connect neurons together, and the fact that a linear function of a linear function is itself a linear function.

So, if we didn't have a non-linear function applied in each neuron, the neural network would be a linear function, and thus no more powerful than a single neuron.

Having said all that, we're ready to define a function corresponding to the network I drew above. As a result, we have some kind of function that takes some numbers and outputs another number between 0 and 1.
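The original post shows this function as a diagram; a minimal sketch of what such a function could look like in code (the 2-2-1 architecture and the choice of ReLU plus sigmoid are assumptions, picked so the network has exactly the 6 weights mentioned later):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def network(x, w):
    # 2 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid), no biases
    w_hidden = w[:4].reshape(2, 2)   # 4 weights into the hidden layer
    w_out = w[4:].reshape(2, 1)      # 2 weights into the output neuron
    hidden = relu(x @ w_hidden)
    return sigmoid(hidden @ w_out)

x = np.array([[0.5, -1.2]])          # one example with 2 features
w = np.random.randn(6)               # the 6 trainable parameters
print(network(x, w))                 # a number between 0 and 1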

It is actually not so important what formula this function has; what is important is that we have a complex non-linear function parametrized by some weights, in the sense that we can change the function by changing the weights.

The intuitive way to do it is: take each training example, pass it through the network to get a number, subtract it from the actual number we wanted to get, and square it (because negative errors are just as bad as positive ones). Summed over all training examples, this gives the loss:

$$ L = \sum_{i} (y_i - \hat{y}_i)^2 $$

where \(y_i\) stands for the number we want to get from the network, \(\hat{y}_i\) (y with a hat) is the number we actually got by passing our example through the network, and \(i\) is the index of a training example.

To compute the loss function, we would go over each training example in our dataset, compute \(\hat{y}\) for that example, and then compute the function defined above.

We can rewrite this formula, changing \(\hat{y}\) to the actual function of our network, to see more deeply the connection between the loss function and the neural network.

Mathematically it can be written as the following: $$ f(x + \epsilon) - f(x) \approx f'(x) \cdot \epsilon $$ which means: how much our function changes (the left term) approximately equals the derivative of that function with respect to some variable x, multiplied by how much we changed that variable.

Going back to our simple function f(x) = x, we said that its derivative is 1, which means that if we take some step epsilon in the positive direction, the function output will change by 1 multiplied by our step epsilon, which is just epsilon.

It's easy to check that now, if we start at some value of x and make some step epsilon, the amount our function changes is not going to be exactly equal to the formula given above.

Now, the gradient is a vector of partial derivatives, whose elements contain the derivatives with respect to each variable on which the function depends.

With the simple functions we've been considering so far, this vector contains only one element, because we've only been using functions which take one input.

Instead, we want to minimize our function, so we can take a step in the opposite, negative direction, to be sure that our function will decrease, at least a little bit.

The derivative now equals -4, which means that if we take a small step in the positive direction, our function will change proportionally to -4; thus it will decrease.

The variables with respect to which we're going to be taking our derivatives are the weights w, since these are the values we want to change to improve our network.

If we compute the gradient of the loss function w.r.t. our weights and take small steps in the opposite direction of the gradient, our loss will gradually decrease until it converges to some local minimum.
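A minimal gradient descent sketch on a simple quadratic f(x) = x**2 (just an illustration; at x = -2 its derivative is -4, matching the example above):

x = -2.0
learning_rate = 0.1
for step in range(50):
    gradient = 2 * x                 # derivative of x**2 at the current x
    x -= learning_rate * gradient    # step opposite the gradient
print(x)                             # very close to 0, the minimum of f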

The formula we defined above for a very simple neural network was hard enough to get stuck trying to find all the derivatives and we only had 6 parameters.

The second way to go about it, and, in fact, the easiest to implement, is to approximate the derivative with a formula we know from calculus: $$ f'(x) \approx \frac{f(x + \epsilon) - f(x)}{\epsilon} $$ While it is super easy to implement, it is far too computationally expensive to do in practice.
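To see how easy it is, here is a toy sketch that trains a one-parameter model with this numerical approximation (the model, data and learning rate are all made up):

def loss(w, x, y):
    # squared error of a one-parameter model: prediction = w * x
    return (y - w * x) ** 2

def numeric_grad(f, w, eps=1e-6):
    # f'(w) ~ (f(w + eps) - f(w)) / eps
    return (f(w + eps) - f(w)) / eps

x, y = 2.0, 3.0
w = 0.0
learning_rate = 0.05
for step in range(100):
    g = numeric_grad(lambda w_: loss(w_, x, y), w)
    w -= learning_rate * g           # gradient descent step
print(w)                             # approaches y / x = 1.5

The catch is that a real network has thousands or millions of weights, and this approximation needs an extra forward pass per weight per step, which is why back-propagation is used instead.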

Discussing this algorithm goes beyond the scope of this post, but if you want to learn more about it, go to the last section of this post, where I list a number of resources to learn more about neural networks.

The idea that we can take some function, then take some derivatives, and end up with an algorithm that can tell a dog from a cat in an image seemed a bit surreal to me.

Machine Learning is Fun Part 8: How to Intentionally Trick Neural Networks

We've already talked about the basic process of training a neural network to classify photos. But what if, instead of tweaking the weights of the layers of the neural network, we tweaked the input image itself until we got the answer we want?

But let's use back-propagation to adjust the input image instead of the neural network layers. At the end of this, we'll have an image that fools the neural network without changing anything inside the neural network itself.

They'll show up as discolored spots or wavy areas. To prevent these obvious distortions, we can add a simple constraint to our algorithm.

Just make sure you have Python 3 and Keras installed before you run it. When we run it, it properly detects our image as a Persian cat. Now let's trick it into thinking that this cat is a toaster by tweaking the image until it fools the neural network.

Keras doesn't have a built-in way to train against the input image instead of training the neural network layers, so I had to get a little tricky and code the training step manually.
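The manual step might look something like this sketch (assuming Keras 2.x on the TensorFlow 1.x backend, as was current for the original post; original_image, the change limits and the 0.80 confidence target are illustrative, and 859 is the ImageNet class index for "toaster"):

import numpy as np
from keras import backend as K
from keras.applications import inception_v3

model = inception_v3.InceptionV3()
input_layer = model.layers[0].input
output_layer = model.layers[-1].output

class_to_fake = 859                   # ImageNet class index for "toaster"
cost = output_layer[0, class_to_fake]
gradient = K.gradients(cost, input_layer)[0]
grab_cost_and_gradient = K.function([input_layer, K.learning_phase()],
                                    [cost, gradient])

hacked_image = np.copy(original_image)   # the pre-processed cat photo (assumed)
max_above = original_image + 0.01        # "don't change pixels much" constraint
max_below = original_image - 0.01

cost_value = 0.0
while cost_value < 0.80:              # until the model is 80% sure it's a toaster
    cost_value, grads = grab_cost_and_gradient([hacked_image, 0])
    hacked_image += grads * 0.1       # adjust the image, not the weights
    hacked_image = np.clip(hacked_image, max_below, max_above)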

In the real world, no company is going to let you download their trained neural network’s code, so that means we can’t attack them… Right?

But there are some things we do know so far. Since we don't have any final answers yet, it's worth thinking about the scenarios where you are using neural networks, so that you can at least lessen the risk that this kind of attack would damage your business.

For example, if you have a single machine learning model as the only line of defense to grant access to a restricted resource and assume it can’t be fooled, that’s probably a bad idea.

Generative Adversarial Networks (GANs) - Computerphile

Artificial Intelligence where neural nets play against each other and improve enough to generate something new. Rob Miles explains GANs. One of the papers ...

How good is your fit? - Ep. 21 (Deep Learning SIMPLIFIED)

A good model follows the “Goldilocks” principle in terms of data fitting. Models that underfit data will have poor accuracy, while models that overfit data will fail to ...

12a: Neural Nets

NOTE: These videos were recorded in Fall 2015 to update the Neural Nets portion of the class. MIT 6.034 Artificial Intelligence, Fall 2010 View the complete ...

How computers learn to recognize objects instantly | Joseph Redmon

Ten years ago, researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible. Today, computer vision ...

Lecture 5 | Convolutional Neural Networks

In Lecture 5 we move from fully-connected neural networks to convolutional neural networks. We discuss some of the key historical milestones in the ...

Lecture 16: Dynamic Neural Networks for Question Answering

Lecture 16 addresses the question "Can all NLP tasks be seen as question answering problems?". Key phrases: Coreference Resolution, Dynamic Memory ...

Lecture 8: Recurrent Neural Networks and Language Models

Lecture 8 covers traditional language models, RNNs, and RNN language models. Also reviewed are important training problems and tricks, RNNs for other ...

Lecture 7 | Training Neural Networks II

Lecture 7 continues our discussion of practical issues for training neural networks. We discuss different update rules commonly used to optimize neural networks ...

The art of neural networks | Mike Tyka | TEDxTUM

Did you know that art and technology can produce fascinating results when combined? Mike Tyka, who is both artist and computer scientist, talks about the ...

Recurrent Neural Network - The Math of Intelligence (Week 5)

Recurrent neural networks let us learn from sequential data (time series, music, audio, video frames, etc ). We're going to build one from scratch in numpy ...