AI News, NVGaze artificial intelligence

A.I. Artificial Intelligence(2001)

To get one thing straight, Kubrick decided Spielberg would be the better man for directing it, and I think this was a very wise decision, many of the ideas are pure Kubrick, but Spielberg has the neccassary attributes to direct such a film, and great credit has to go to Kubrick for handing it to him.Haley Joel Osment is amazing, the robot/human emotion must be amazingly difficult to pull off effectively, but Osment does it with such relative ease to the point where you do believe he is a robot, not that he is just acting as a robot.

You may hate it, you may love it, but no matter what, it will effect your emotions in some way and you will discuss the film afterwards.This film will be truely judged in 20 years or so, when it can be assessed purely as a film, as with 'Blade Runner', '2001', and even 'The Thing', it will get better with age.

Using neural nets to recognize handwritten digits

Simple intuitions about how we recognize shapes - 'a 9 has a loop at the top, and a vertical stroke in the bottom right' - turn out to be not so simple to express algorithmically.

As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power.

But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent.

Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron.

A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output: In the example shown the perceptron has three inputs, $x_1, x_2, x_3$.

The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value.

To put it in more precise algebraic terms: \begin{eqnarray} \mbox{output} & = & \left\{ \begin{array}{ll} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold} \end{array} \right.

And it should seem plausible that a complex network of perceptrons could make quite subtle decisions: In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence.

The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively.

Using the bias instead of the threshold, the perceptron rule can be rewritten: \begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } w\cdot x + b \leq 0 \\ 1 & \mbox{if } w\cdot x + b > 0 \end{array} \right.

This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$: To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$.

Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram: One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron.

(If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as marked: Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons.

In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs: This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand.

Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$.

Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want.

In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$.

We'll depict sigmoid neurons in the same way we depicted perceptrons: Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$.

Instead, it's $\sigma(w \cdot x+b)$, where $\sigma$ is called the sigmoid function* *Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons.

\tag{3}\end{eqnarray} To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias $b$ is \begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)}.

In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

var data = d3.range(sample).map(function(d){ return { x: x1(d), y: s(x1(d))};

var y = d3.scale.linear() .domain([0, 1]) .range([height, 0]);

}) var graph ='#sigmoid_graph') .append('svg') .attr('width', width + m[1] + m[3]) .attr('height', height + m[0] + m[2]) .append('g') .attr('transform', 'translate(' + m[3] + ',' + m[0] + ')');

var xAxis = d3.svg.axis() .scale(x) .tickValues(d3.range(-4, 5, 1)) .orient('bottom') graph.append('g') .attr('class', 'x axis') .attr('transform', 'translate(0, ' + height + ')') .call(xAxis);

var yAxis = d3.svg.axis() .scale(y) .tickValues(d3.range(0, 1.01, 0.2)) .orient('left') .ticks(5) graph.append('g') .attr('class', 'y axis') .call(yAxis);

graph.append('text') .attr('class', 'x label') .attr('text-anchor', 'end') .attr('x', width/2) .attr('y', height+35) .text('z');

graph.append('text') .attr('x', (width / 2)) .attr('y', -10) .attr('text-anchor', 'middle') .style('font-size', '16px') .text('sigmoid function');

var data = d3.range(sample).map(function(d){ return { x: x1(d), y: s(x1(d))};

var y = d3.scale.linear() .domain([0,1]) .range([height, 0]);

}) var graph ='#step_graph') .append('svg') .attr('width', width + m[1] + m[3]) .attr('height', height + m[0] + m[2]) .append('g') .attr('transform', 'translate(' + m[3] + ',' + m[0] + ')');

var xAxis = d3.svg.axis() .scale(x) .tickValues(d3.range(-4, 5, 1)) .orient('bottom') graph.append('g') .attr('class', 'x axis') .attr('transform', 'translate(0, ' + height + ')') .call(xAxis);

var yAxis = d3.svg.axis() .scale(y) .tickValues(d3.range(0, 1.01, 0.2)) .orient('left') .ticks(5) graph.append('g') .attr('class', 'y axis') .call(yAxis);

graph.append('text') .attr('class', 'x label') .attr('text-anchor', 'end') .attr('x', width/2) .attr('y', height+35) .text('z');

graph.append('text') .attr('x', (width / 2)) .attr('y', -10) .attr('text-anchor', 'middle') .style('font-size', '16px') .text('step function');

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w\cdot x+b$ was positive or negative* *Actually, when $w \cdot x +b = 0$ the perceptron outputs $0$, while the step function outputs $1$.

The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron.

In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by \begin{eqnarray} \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \tag{5}\end{eqnarray} where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial b$ denote partial derivatives of the $\mbox{output}$ with respect to $w_j$ and $b$, respectively.

While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias.

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)\begin{eqnarray} \sigma(z) \equiv \frac{1}{1+e^{-z}} \nonumber\end{eqnarray}$('#margin_387419264610_reveal').click(function() {$('#margin_387419264610').toggle('slow', function() {});});?

In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$.

The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5)\begin{eqnarray} \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b \nonumber\end{eqnarray}$('#margin_727997094331_reveal').click(function() {$('#margin_727997094331').toggle('slow', function() {});});

It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated.

But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least $0.5$ as indicating a '9', and any output less than $0.5$ as indicating 'not a 9'.

Exercises Sigmoid neurons simulating perceptrons, part I $\mbox{}$ Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$.

Show that the behaviour of the network doesn't change.Sigmoid neurons simulating perceptrons, part II $\mbox{}$ Suppose we have the same setup as the last problem - a network of perceptrons.

Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network.

Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$.

Suppose we have the network: As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons.

The term 'hidden' perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than 'not an input or an output'.

For example, the following four-layer network has two hidden layers: Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons.

If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$.

The output layer will contain just a single neuron, with output values of less than $0.5$ indicating 'input image is not a 9', and values greater than $0.5$ indicating 'input image is a 9 '.

A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments.

So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons.

The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.

A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$.

The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons.

In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$ - we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions).

Assume that the first $3$ layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least $0.99$, and incorrect outputs have activation less than $0.01$.

We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications.

To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students).

For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network.

We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.

\tag{6}\end{eqnarray} Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$.

If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \|

This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on.

And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way.

We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local 'shape' of the valley, and therefore how our ball should roll.

So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction.

Calculus tells us that $C$ changes as follows: \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2.

To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors.

We denote the gradient vector by $\nabla C$, i.e.: \begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T.

In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols.

With these definitions, the expression (7)\begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 \nonumber\end{eqnarray}$('#margin_512380394946_reveal').click(function() {$('#margin_512380394946').toggle('slow', function() {});});

\tag{9}\end{eqnarray} This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do.

In particular, suppose we choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{10}\end{eqnarray} where $\eta$ is a small, positive parameter (known as the learning rate).

\nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10)\begin{eqnarray} \Delta v = -\eta \nabla C \nonumber\end{eqnarray}$('#margin_48762573303_reveal').click(function() {$('#margin_48762573303').toggle('slow', function() {});});.

to compute a value for $\Delta v$, then move the ball's position $v$ by that amount: \begin{eqnarray} v \rightarrow v' = v -\eta \nabla C.

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}$('#margin_560455937071_reveal').click(function() {$('#margin_560455937071').toggle('slow', function() {});});

In practical implementations, $\eta$ is often varied so that Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}$('#margin_157848846275_reveal').click(function() {$('#margin_157848846275').toggle('slow', function() {});});

Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v, \tag{12}\end{eqnarray} where the gradient $\nabla C$ is the vector \begin{eqnarray} \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T.

\tag{13}\end{eqnarray} Just as for the two variable case, we can choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{14}\end{eqnarray} and we're guaranteed that our (approximate) expression (12)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}$('#margin_869505431896_reveal').click(function() {$('#margin_869505431896').toggle('slow', function() {});});

This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule \begin{eqnarray} v \rightarrow v' = v-\eta \nabla C.

The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters.

But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\|

Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives* *Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$.

The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \|

In other words, our 'position' now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$.

Writing out the gradient descent update rule in terms of components, we have \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}.

In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs.

Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is, \begin{eqnarray} \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}\end{eqnarray} where the second sum is over the entire set of training data.

Swapping sides we get \begin{eqnarray} \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}}, \tag{19}\end{eqnarray} confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those, \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\ b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}\end{eqnarray} where the sums are over all the training examples $X_j$ in the current mini-batch.

And, in a similar way, the mini-batch update rules (20)\begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \nonumber\end{eqnarray}$('#margin_801900730537_reveal').click(function() {$('#margin_801900730537').toggle('slow', function() {});});

and (21)\begin{eqnarray} b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l} \nonumber\end{eqnarray}$('#margin_985072620111_reveal').click(function() {$('#margin_985072620111').toggle('slow', function() {});});

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election.

For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6,000$ speedup in estimating the gradient!

Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient.

In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \partial C_x / \partial b_l$.

Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, $20$.

In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space.

I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions.

We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000 image validation set.

We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm.

When I refer to the 'MNIST training data' from now on, I'll be referring to our 50,000 image data set, not the original 60,000 image data set* *As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology.

for x, y in zip(sizes[:-1], sizes[1:])]

So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code: net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean $0$ and standard deviation $1$.

Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.

The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is: \begin{eqnarray} a' = \sigma(w a + b).

(This is called vectorizing the function $\sigma$.) It's easy to verify that Equation (22)\begin{eqnarray} a' = \sigma(w a + b) \nonumber\end{eqnarray}$('#margin_587961844970_reveal').click(function() {$('#margin_587961844970').toggle('slow', function() {});});

gives the same result as our earlier rule, Equation (4)\begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)} \nonumber\end{eqnarray}$('#margin_529530134232_reveal').click(function() {$('#margin_529530134232').toggle('slow', function() {});});, for computing the output of a sigmoid neuron.

in component form, and verify that it gives the same result as the rule (4)\begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)} \nonumber\end{eqnarray}$('#margin_192120789771_reveal').click(function() {$('#margin_192120789771').toggle('slow', function() {});});

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output* *It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector.

Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.

All the method does is applies Equation (22)\begin{eqnarray} a' = \sigma(w a + b) \nonumber\end{eqnarray}$('#margin_23850917661_reveal').click(function() {$('#margin_23850917661').toggle('slow', function() {});});


for k in xrange(0, n, mini_batch_size)]

self.update_mini_batch(mini_batch, eta)

print "Epoch {0}: {1} / {2}".format(

j, self.evaluate(test_data), n_test)

print "Epoch {0} complete".format(j)

This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch.

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]

nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

for w, nw in zip(self.weights, nabla_w)]

for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line delta_nabla_b, delta_nabla_w = self.backprop(x, y)

The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the $\sigma$ function, and self.cost_derivative, which I won't describe here.

for x, y in zip(sizes[:-1], sizes[1:])]


for k in xrange(0, n, mini_batch_size)]

self.update_mini_batch(mini_batch, eta)

print "Epoch {0}: {1} / {2}".format(

j, self.evaluate(test_data), n_test)

print "Epoch {0} complete".format(j)

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]

nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

for w, nw in zip(self.weights, nabla_w)]

for b, nb in zip(self.biases, nabla_b)]

activations = [x] # list to store all the activations, layer by layer

zs = [] # list to store all the z vectors, layer by layer

# l = 1 means the last layer of neurons, l = 2 is the

delta =[-l+1].transpose(), delta) * sp

for (x, y) in test_data]

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of $\eta = 3.0$, >>>

As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

At least in this case, using more hidden neurons helps us get better results* *Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse.

Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks..

(If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like $\eta = 1.0$ (and perhaps fine tune to $3.0$), which is close to our earlier experiments.

Exercise Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively.

The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays): """mnist_loader~~~~~~~~~~~~A library to load the MNIST image data.

In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems: sophisticated algorithm $\leq$ simple learning algorithm + good training data.

We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either 'Yes, it's a face' or 'No, it's not a face'.

The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels.

It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts.

Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped down language with no ability to make such calls.

CHI '19- Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems

Early warning dashboards in higher education analyze student data to enable early identification of underperforming students, allowing timely interventions by faculty and staff.

To understand perceptions regarding the ethics and impact of such learning analytics applications, we conducted a multi-stakeholder analysis of an early-warning dashboard deployed at the University of Michigan through semi-structured interviews with the system's developers, academic advisors (the primary users), and students.

To understand perceptions regarding the ethics and impact of such learning analytics applications, we conducted a multi-stakeholder analysis of an early-warning dashboard deployed at the University of Michigan through semi-structured interviews with the system's developers, academic advisors (the primary users), and students.

A.I. Artificial Intelligence

The film languished in protracted development for years, partly because Kubrick felt computer-generated imagery was not advanced enough to create the David character, whom he believed no child actor would convincingly portray.

In the late 22nd century, rising sea levels from global warming have wiped out coastal cities, reducing the world's population.

In Madison, David, a prototype Mecha child capable of experiencing love, is given to Henry Swinton and his wife Monica, whose son Martin contracted a rare disease and has been placed in suspended animation.

Monica feels uneasy with David, but eventually warms to him and activates his imprinting protocol, causing him to have an enduring childlike love for her.

Martin becomes jealous of David, and taunts him to perform worrisome acts, such as cutting off locks of Monica's hair while she is sleeping.

At a pool party, one of Martin's friends pokes David with a knife, triggering his self-protection programming, grabbing onto Martin and falling into the pool.

On the way there, Monica has a change of heart, and tells David to run away, avoiding any humans and seeking out unregistered Mecha to protect him.

Joe agrees to help David find Blue Fairy, whom David remembers from The Adventures of Pinocchio, and believes can turn him into a real boy, allowing Monica to love him.

Before David can explain, Joe is captured via electromagnet by authorities, but Joe sets the copter's controls to submerge, and tells David to remember him.

David spends a 'perfect day' with Monica, and as she falls asleep in the evening, she tells David that she has always loved him: 'the everlasting moment he had been waiting for', the narrator says.

Kubrick began development on an adaptation of 'Super-Toys Last All Summer Long' in the late 1970s, hiring the story's author, Brian Aldiss, to write a film treatment.

Bob Shaw briefly served as writer, leaving after six weeks due to Kubrick's demanding work schedule, and Ian Watson was hired as the new writer in March 1990.

In early 1994, the film was in pre-production with Christopher 'Fangorn' Baker as concept artist, and Sara Maitland assisting on the story, which gave it 'a feminist fairy-tale focus'.[7]

'We tried to construct a little boy with a movable rubber face to see whether we could make it look appealing,' producer Jan Harlan reflected.

Spielberg copied Kubrick's obsessively secretive approach to filmmaking by refusing to give the complete script to cast and crew, banning press from the set, and making actors sign confidentiality agreements.

for a family film, no action figures were created, although Hasbro released a talking Teddy following the film's release in June 2001.[15]

in widescreen and full-screen 2-disc special editions featuring an eight-part documentary detailing the film's development, production, music and visual effects.

The bonus features also included interviews with Haley Joel Osment, Jude Law, Frances O'Connor, Steven Spielberg, and John Williams, two teaser trailers for the film's original theatrical release and an extensive photo gallery featuring production sills and Stanley Kubrick's original storyboards.[33]

The film was first released on Blu-ray Disc in Japan by Warner Home Video on December 22, 2010, followed shortly after with a U.S release by Paramount Home Media Distribution (current owners of the DreamWorks catalog) on April 5, 2011.

This Blu-ray featured the film newly remastered in high-definition and incorporated all the bonus features previously included on the 2-disc special-edition DVD.[34]

A.I went on to gross $78.62 million in US totals as well as $157.31 million in foreign countries, coming to a worldwide total of $235.93 million.[35]

Based on 192 reviews collected by Rotten Tomatoes, 73% of critics gave the film positive notices with a score of 6.6/10.

The website's critical consensus reads, 'A curious, not always seamless, amalgamation of Kubrick's chilly bleakness and Spielberg's warm-hearted optimism.

Of the film's ending, he wondered how it might have been had Kubrick directed the film: 'That is one of the 'ifs' of film history—at least the ending indicates Spielberg adding some sugar to Kubrick's wine.

The actual ending is overly sympathetic and moreover rather overtly engineered by a plot device that does not really bear credence.

Maltin, on the other hand, gives the film two stars out of four in his Movie Guide, writing: '[The] intriguing story draws us in, thanks in part to Osment's exceptional performance, but takes several wrong turns;

to Solaris (1972), and praised both 'Kubrick for proposing that Spielberg direct the project and Spielberg for doing his utmost to respect Kubrick's intentions while making it a profoundly personal work.'[43]

Film critic Armond White, of the New York Press, praised the film noting that 'each part of David's journey through carnal and sexual universes into the final eschatological devastation becomes as profoundly philosophical and contemplative as anything by cinema's most thoughtful, speculative artists – Borzage, Ozu, Demy, Tarkovsky.'[44]

Dubbing it Spielberg's 'first boring movie', LaSalle also believed the robots at the end of the film were aliens, and compared Gigolo Joe to the 'useless' Jar Jar Binks, yet praised Robin Williams for his portrayal of a futuristic Albert Einstein.[47][failed verification]

Spielberg responded to some of the criticisms of the film, stating that many of the 'so called sentimental' elements of A.I., including the ending, were in fact Kubrick's and the darker elements were his own.[49]

Plus, quite a few critics in America misunderstood the film, thinking for instance that the Giacometti-style beings in the final 20 minutes were aliens (whereas they were robots of the future who had evolved themselves from the robots in the earlier part of the film) and also thinking that the final 20 minutes were a sentimental addition by Spielberg, whereas those scenes were exactly what I wrote for Stanley and exactly what he wanted, filmed faithfully by Spielberg.'[52]

[Despite Mr. Watson's reference to worldwide box office of 4th, the movie actually finished 16th worldwide among 2001 movie releases.][53]

Upon rewatching the film many years after its release, BBC film critic Mark Kermode apologized to Spielberg in an interview in January 2013 for 'getting it wrong' on the film when he first viewed it in 2001.

Time for the Neural Network to Learn

I will defer to this great textbook (online and free!) for the detailed math (if you want to understand neural networks more deeply, definitely check it out).

We then move this error backwards through our model via the same weights and connections that we use for forward propagating our signal (so instead of Activation 1, now we have Error1 — the error attributable to the top blue neuron).

We can now state the objective of backpropagation in a similar manner: We want to calculate the error attributable to each neuron (I will just refer to this error quantity as the neuron’s error because saying “attributable” again and again is no fun) starting from the layer closest to the output all the way back to the starting layer of our model.

This makes intuitive sense — if a particular neuron has a much larger error than all the other ones, then tweaking the weights and bias of our offending neuron will have a greater impact on our model’s total error than fiddling with any of the other neurons.

So basically backpropagation allows us to calculate the error attributable to each neuron and that in turn allows us to calculate the partial derivatives and ultimately the gradient so that we can utilize gradient descent.

Just like in real life, the lazy ones that play it safe (low and infrequent activations) skate by blame free while the neurons that do the most work get blamed and have their weights and biases modified.