AI News, NIPS Proceedings β


Part of: Advances in Neural Information Processing Systems 27 (NIPS 2014)

We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have.

This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions.

A visual proof that neural nets can compute any function

One of the most striking facts about neural networks is that they can compute any function at all.

No matter what the function, there is guaranteed to be a neural network so that for every possible input, $x$, the value $f(x)$ (or some close approximation) is output from the network, e.g.:

For instance, here's a network computing a function with $m = 3$ inputs and $n = 2$ outputs:

What's more, this universality theorem holds even if we restrict our networks to have just a single layer intermediate between the input and the output neurons - a so-called single hidden layer.

One of the original papers proving the result is Approximation by superpositions of a sigmoidal function, by George Cybenko (1989).

Another important early paper is Multilayer feedforward networks are universal approximators, by Kurt Hornik, Maxwell Stinchcombe, and Halbert White (1989).

Consider, for example, the problem of translating a piece of text from one language into another. That can be thought of as computing a function* *Actually, computing one of many functions, since there are often many acceptable translations of a given piece of text.

Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting.

Two caveats

Before explaining why the universality theorem is true, I want to mention two caveats to the informal statement 'a neural network can compute any function'.

To make this statement more precise, suppose we're given a function $f(x)$ which we'd like to compute to within some desired accuracy $\epsilon > 0$.

The guarantee is that by using enough hidden neurons we can always find a neural network whose output $g(x)$ satisfies $|g(x) - f(x)| < \epsilon$ for all inputs $x$.

If a function is discontinuous, i.e., makes sudden, sharp jumps, then in general it won't be possible to approximate it using a neural net.

Summing up, a more precise statement of the universality theorem is that neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision.

In this chapter we'll actually prove a slightly weaker version of this result, using two hidden layers instead of one.

In the problems I'll briefly outline how the explanation can, with a few tweaks, be adapted to give a proof which uses only a single hidden layer.

Universality with one input and one output

To understand why the universality theorem is true, let's start by understanding how to construct a neural network which approximates a function with just one input and one output:

To build insight into how to construct a network to compute $f$, let's start with a network containing just a single hidden layer, with two hidden neurons, and an output layer containing a single output neuron:

In the diagram below, click on the weight, $w$, and drag the mouse a little ways to the right to increase $w$.

As we learnt earlier in the book, what's being computed by the hidden neuron is $\sigma(wx + b)$, where $\sigma(z) \equiv 1/(1+e^{-z})$ is the sigmoid function.

But for the proof of universality we will obtain more insight by ignoring the algebra entirely, and instead manipulating and observing the shape shown in the graph.

This won't just give us a better feel for what's going on, it will also give us a proof* *Strictly speaking, the visual approach I'm taking isn't what's traditionally thought of as a proof.

Occasionally, there will be small gaps in the reasoning I present: places where I make a visual argument that is plausible, but not quite rigorous.

We can simplify our analysis quite a bit by increasing the weight so much that the output really is a step function, to a very good approximation.

It's easy to analyze the sum of a bunch of step functions, but rather more difficult to reason about what happens when you add up a bunch of sigmoid shaped curves.
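To make this concrete, here's a small sketch in Python with NumPy (the chapter itself uses no code, so the function names and the particular weights are just my own illustration) of a single hidden neuron's output $\sigma(wx+b)$, showing how a modest weight gives a gradual sigmoid curve while a very large weight gives, to a very good approximation, a step:

import numpy as np

def sigmoid(z):
    """The sigmoid function, sigma(z) = 1 / (1 + e^{-z})."""
    z = np.clip(z, -60, 60)          # avoid overflow warnings in exp
    return 1.0 / (1.0 + np.exp(-z))

def hidden_output(x, w, b):
    """Output of a single sigmoid hidden neuron, sigma(w*x + b)."""
    return sigmoid(w * x + b)

x = np.linspace(0, 1, 6)
# Modest weight: a gradual sigmoid transition around x = 0.5 ...
print(hidden_output(x, w=8, b=-4))
# ... very large weight: effectively a 0/1 step at x = 0.5.
print(hidden_output(x, w=1000, b=-500))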

With a little work you should be able to convince yourself that the position of the step is proportional to $b$, and inversely proportional to $w$.

It will greatly simplify our lives to describe hidden neurons using just a single parameter, $s$, which is the step position, $s = -b/w$.

As noted above, we've implicitly set the weight $w$ on the input to be some large value - big enough that the step function is a very good approximation.

We can easily convert a neuron parameterized in this way back into the conventional model, by choosing the bias $b = -w s$.
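As a sketch of that conversion (again just illustrative Python, with hypothetical helper names), here's a hidden neuron described only by its step point $s$, with the bias recovered as $b = -ws$ and the weight fixed at a large value:

import numpy as np

def sigmoid(z):
    z = np.clip(z, -60, 60)          # avoid overflow warnings in exp
    return 1.0 / (1.0 + np.exp(-z))

def step_neuron(x, s, w=1000):
    """Hidden neuron described by its step point s = -b/w.

    Choosing the bias b = -w*s recovers the conventional (w, b) form; with w
    large, the output is, to a very good approximation, a step function that
    jumps from 0 to 1 at x = s.
    """
    b = -w * s
    return sigmoid(w * x + b)

# For s = 0.2 and w = 1000 this gives b = -200, matching the worked
# example later in the chapter.
print(step_neuron(np.array([0.1, 0.19, 0.21, 0.9]), s=0.2))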

In particular, we'll suppose the hidden neurons are computing step functions parameterized by step points $s_1$ (top neuron) and $s_2$ (bottom neuron).

Here, $a_1$ and $a_2$ are the outputs from the top and bottom hidden neurons, respectively* *Note, by the way, that the output from the whole network is $\sigma(w_1 a_1+w_2 a_2 + b)$, where $b$ is the bias on the output neuron.

We're going to focus on the weighted output from the hidden layer right now, and only later will we think about how that relates to the output from the whole network..

You'll see that the graph changes shape when this happens, since we have moved from a situation where the top hidden neuron is the first to be activated to a situation where the bottom hidden neuron is the first to be activated.

Similarly, try manipulating the step point $s_2$ of the bottom hidden neuron, and get a feel for how this changes the combined output from the hidden neurons.

You'll notice, by the way, that we're using our neurons in a way that can be thought of not just in graphical terms, but in more conventional programming terms, as a kind of if-then-else statement, e.g.:
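(A rough sketch in Python, assuming the idealized step-function limit; the helper name and the specific numbers are just for illustration.)

def step_contribution(x, s, output_weight):
    """Idealized (large-weight) reading of one hidden neuron's contribution
    to the weighted output of the hidden layer:

        if x >= s:  contribute output_weight
        else:       contribute 0
    """
    return output_weight if x >= s else 0.0

# A pair of such neurons with output weights +0.8 and -0.8 and step points
# 0.4 and 0.6 contributes 0.8 only for 0.4 <= x < 0.6: a "bump".
print(step_contribution(0.5, 0.4, 0.8) + step_contribution(0.5, 0.6, -0.8))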

In particular, we can divide the interval $[0, 1]$ up into a large number, $N$, of subintervals, and use $N$ pairs of hidden neurons to set up peaks of any desired height.

Apologies for the complexity of the diagram: I could hide the complexity by abstracting away further, but I think it's worth putting up with a little complexity, for the sake of getting a more concrete feel for how these networks work.

I didn't say it at the time, but what I plotted is actually the function \begin{eqnarray} f(x) = 0.2+0.4 x^2+0.3x \sin(15 x) + 0.05 \cos(50 x), \tag{113}\end{eqnarray} plotted over $x$ from $0$ to $1$, and with the $y$ axis taking values from $0$ to $1$.

The solution is to design a neural network whose hidden layer has a weighted output given by $\sigma^{-1} \circ f(x)$, where $\sigma^{-1}$ is just the inverse of the $\sigma$ function.

If we can do this, then the output from the network as a whole will be a good approximation to $f(x)$* *Note that I have set the bias on the output neuron to $0$.

How well you're doing is measured by the average deviation between the goal function and the function the network is actually computing.

It's only a coarse approximation, but we could easily do much better, merely by increasing the number of pairs of hidden neurons, allowing more bumps.

So, for instance, for the second hidden neuron $s = 0.2$ becomes $b = -1000 \times 0.2 = -200$.

So, for instance, whatever value is chosen for the first $h$, the output weights from the top two hidden neurons are $h$ and $-h$, respectively.
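Putting the pieces together, here's a minimal end-to-end sketch of the construction for the goal function of equation (113), assuming the step-point parameterization above; the helper names, the choice of $N = 50$ bump pairs, and $w = 1000$ are my own illustrative choices, not anything fixed by the argument:

import numpy as np

def sigmoid(z):
    z = np.clip(z, -60, 60)                       # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

def sigma_inv(y):
    """Inverse sigmoid, sigma^{-1}(y) = log(y / (1 - y))."""
    return np.log(y / (1.0 - y))

def f(x):
    """The goal function of equation (113)."""
    return 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x) + 0.05 * np.cos(50 * x)

def bump_network(x, N=50, w=1000):
    """Single-hidden-layer network built from N pairs of step neurons.

    Each pair contributes a bump of height sigma^{-1}(f(midpoint)) over one
    of N equal subintervals of [0, 1], so the weighted hidden output
    approximates sigma^{-1}(f(x)); the output neuron (bias 0) then applies
    sigma, giving an approximation to f(x).
    """
    edges = np.linspace(0.0, 1.0, N + 1)
    weighted_output = np.zeros_like(x, dtype=float)
    for left, right in zip(edges[:-1], edges[1:]):
        h = sigma_inv(f(0.5 * (left + right)))            # desired bump height
        weighted_output += h * sigmoid(w * (x - left))    # output weight  h
        weighted_output -= h * sigmoid(w * (x - right))   # output weight -h
    return sigmoid(weighted_output)

x = np.linspace(0.0, 1.0, 1000)
# Average deviation between the goal function and the network's output;
# increasing N (more bumps) shrinks it.
print("mean deviation:", np.mean(np.abs(bump_network(x) - f(x))))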

Just as in our earlier discussion, as the input weight gets larger the output approaches a step function.

Here, we assume the weight on the $x$ input has some large value - I've used $w_1 = 1000$ - and the weight $w_2 = 0$.

Of course, it's also possible to get a step function in the $y$ direction, by making the weight on the $y$ input very large (say, $w_2 = 1000$), and the weight on the $x$ equal to $0$, i.e., $w_1 = 0$:

The number on the neuron is again the step point, and in this case the little $y$ above the number reminds us that the step is in the $y$ direction.

But do keep in mind that the little $y$ marker implicitly tells us that the $y$ weight is large, and the $x$ weight is $0$.

That reminds us that they're producing $y$ step functions, not $x$ step functions, and so the weight is very large on the $y$ input, and zero on the $x$ input, not vice versa.
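As a short sketch of the two-input case (illustrative names again): an $x$ step neuron has a large weight on the $x$ input and zero weight on the $y$ input, and a $y$ step neuron is the reverse:

import numpy as np

def sigmoid(z):
    z = np.clip(z, -60, 60)
    return 1.0 / (1.0 + np.exp(-z))

def x_step(x, y, s, w=1000):
    """Two-input neuron stepping in the x direction: w1 = w (large), w2 = 0."""
    return sigmoid(w * x + 0.0 * y - w * s)

def y_step(x, y, s, w=1000):
    """Two-input neuron stepping in the y direction: w1 = 0, w2 = w (large)."""
    return sigmoid(0.0 * x + w * y - w * s)

# At (x, y) = (0.7, 0.1) with step point 0.5: the x step has fired (~1),
# the y step has not (~0).
print(x_step(0.7, 0.1, s=0.5), y_step(0.7, 0.1, s=0.5))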

Adding an $x$-direction bump of height $h$ to a $y$-direction bump of the same height gives a weighted output of about $2h$ over the central rectangle, where the two bumps overlap, and about $h$ over the surrounding plateau. If we choose the threshold appropriately - say, a value of $3h/2$, which is sandwiched between the height of the plateau and the height of the central tower - we could squash the plateau down to zero, and leave just the tower standing.

This is a bit tricky, so if you think about this for a while and remain stuck, here's two hints: (1) To get the output neuron to show the right kind of if-then-else behaviour, we need the input weights (all $h$ or $-h$) to be large;

Even for this relatively modest value of $h$, we get a pretty good tower function.
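Here's a sketch of the whole tower construction, assuming the step neurons above; the helper names and the choice $h = 10$ are illustrative:

import numpy as np

def sigmoid(z):
    z = np.clip(z, -60, 60)
    return 1.0 / (1.0 + np.exp(-z))

def step(u, s, w=1000):
    """Large-weight sigmoid neuron: approximately 1 for u >= s, else 0."""
    return sigmoid(w * (u - s))

def tower(x, y, x1, x2, y1, y2, h=10.0):
    """Approximate tower: ~1 on the rectangle [x1, x2] x [y1, y2], ~0 outside.

    Two x-step neurons and two y-step neurons feed a second-layer neuron with
    weights h, -h, h, -h and bias -3h/2.  The weighted input to that neuron is
    about 2h inside the rectangle, about h on the surrounding strips, and
    about 0 elsewhere, so the 3h/2 threshold squashes everything but the tower.
    """
    weighted = (h * step(x, x1) - h * step(x, x2)
                + h * step(y, y1) - h * step(y, y2)) - 1.5 * h
    return sigmoid(weighted)

print(tower(0.5, 0.5, 0.4, 0.6, 0.4, 0.6))   # inside the rectangle: ~1
print(tower(0.5, 0.9, 0.4, 0.6, 0.4, 0.6))   # outside: ~0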

To make the respective roles of the two sub-networks clear I've put them in separate boxes, below: each box computes a tower function, using the technique described above.

In particular, by making the weighted output from the second hidden layer a good approximation to $\sigma^{-1} \circ f$, we ensure the output from our network will be a good approximation to any desired function, $f$.

The $s_1, t_1$ and so on are step points for neurons - that is, all the weights in the first layer are large, and the biases are set to give the step points $s_1, t_1, s_2, \ldots$.

Of course, such a function can be regarded as just $n$ separate real-valued functions, $f^1(x_1, \ldots, x_m), f^2(x_1, \ldots, x_m)$, and so on.

As a hint, try working in the case of just two input variables, and showing that: (a) it's possible to get step functions not just in the $x$ or $y$ directions, but in an arbitrary direction;

(b) by adding up many of the constructions from part (a) it's possible to approximate a tower function which is circular in shape, rather than rectangular;

To do part (c) it may help to use ideas from a bit later in this chapter.

Recall that in a sigmoid neuron the inputs $x_1, x_2, \ldots$ result in the output $\sigma(\sum_j w_j x_j + b)$, where $w_j$ are the weights, $b$ is the bias, and $\sigma$ is the sigmoid function:

That is, we'll assume that if our neuron has inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$ and bias $b$, then the output is $s(\sum_j w_j x_j + b)$.
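As a sketch of why the construction still goes through, assuming (as the argument needs) an activation whose limits as $z \rightarrow \pm\infty$ exist and differ; tanh is used here purely as an example:

import numpy as np

def s(z):
    """Stand-in activation: any function whose limits as z -> +/- infinity
    exist and differ will do; tanh (limits -1 and +1) is just an example."""
    return np.tanh(z)

def step_like(x, step_point, w=1000):
    """With a large weight, s(w * (x - step_point)) again approximates a step
    function (here jumping from -1 to +1 at the step point), so the same
    bump-and-tower constructions go through after rescaling and shifting."""
    return s(w * (x - step_point))

print(step_like(np.array([0.1, 0.3]), step_point=0.2))   # ~[-1, +1]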

It should be pretty clear that if we add all these bump functions up we'll end up with a reasonable approximation to $\sigma^{-1} \circ f(x)$, except within the windows of failure.

Suppose that instead of using the approximation just described, we use a set of hidden neurons to compute an approximation to half our original goal function, i.e., to $\sigma^{-1} \circ f(x) / 2$.

And suppose we use another set of hidden neurons to compute an approximation to $\sigma^{-1} \circ f(x)/ 2$, but with the bases of the bumps shifted by half the width of a bump:
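Here's a rough, self-contained sketch of that two-set construction, reusing the same step-neuron machinery as the earlier sketches; the grid size and helper names are illustrative, and edge effects near $0$ and $1$ are ignored:

import numpy as np

def sigmoid(z):
    z = np.clip(z, -60, 60)
    return 1.0 / (1.0 + np.exp(-z))

def sigma_inv(y):
    return np.log(y / (1.0 - y))

def f(x):
    """The goal function of equation (113)."""
    return 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x) + 0.05 * np.cos(50 * x)

def half_approx(x, shift, N=50, w=1000):
    """Step-neuron approximation to sigma^{-1}(f(x)) / 2 using N bumps, with
    the bump edges shifted by `shift` bump-widths (0 or 0.5 below)."""
    width = 1.0 / N
    edges = np.linspace(0.0, 1.0, N + 1) + shift * width
    out = np.zeros_like(x, dtype=float)
    for left, right in zip(edges[:-1], edges[1:]):
        mid = np.clip(0.5 * (left + right), 0.0, 1.0)
        h = 0.5 * sigma_inv(f(mid))                       # half-height bump
        out += h * (sigmoid(w * (x - left)) - sigmoid(w * (x - right)))
    return out

x = np.linspace(0.0, 1.0, 1000)
# Two half-size approximations to sigma^{-1}(f(x)) / 2, the second with its
# bump edges shifted by half a bump width, so the (small) windows of failure
# of the two halves are staggered; their sum, passed through sigma,
# still approximates f.
output = sigmoid(half_approx(x, 0.0) + half_approx(x, 0.5))
print("mean deviation:", np.mean(np.abs(output - f(x))))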

Although the result isn't directly useful in constructing networks, it's important because it takes off the table the question of whether any particular function is computable using a neural network.

As argued in Chapter 1, deep networks have a hierarchical structure which makes them particularly well adapted to learn the hierarchies of knowledge that seem to be useful in solving real-world problems.

Put more concretely, when attacking problems such as image recognition, it helps to use a system that understands not just individual pixels, but also increasingly more complex concepts: from edges to simple geometric shapes, all the way up through complex, multi-object scenes.

In later chapters, we'll see evidence suggesting that deep networks do a better job than shallow networks at learning such hierarchies of knowledge.

Summing up: universality tells us that neural networks can compute any function; and empirical evidence suggests that deep networks are the networks best adapted to learn the functions useful in solving many real-world problems.

