AI News: Artificial Neural Networks/Activation Functions

There are a number of common activation functions in use with neural networks.

The output is a value, A1, if the input sum is above a certain threshold, and A0 if the input sum is below that threshold.

These kinds of step activation functions are useful for binary classification schemes.

In other words, when we want to classify an input pattern into one of two groups, we can use a binary classifier with a step activation function.
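As a rough Python sketch of such a unit (illustrative only; the name step_activation, the values A0 and A1, and the sample numbers are assumptions made for this example):

```python
import numpy as np

def step_activation(weighted_sum, threshold=0.0, a0=0.0, a1=1.0):
    """Return A1 if the weighted input sum exceeds the threshold, else A0."""
    return a1 if weighted_sum > threshold else a0

# A tiny binary classifier: it "fires" (outputs 1) when the weighted sum of
# the inputs plus the bias crosses the threshold.
x = np.array([0.5, -1.2, 3.0])      # input pattern
w = np.array([0.8, 0.1, 0.4])       # weights
b = -0.5                            # bias
print(step_activation(np.dot(w, x) + b))   # -> 1.0
```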

Each identifier would be a small network that would output a 1 if a particular input feature is present, and a 0 otherwise.

Combining multiple feature detectors into a single network would allow a very complicated clustering or classification problem to be solved.

A linear combination is where the weighted-sum input of the neuron, plus a linearly dependent bias, becomes the system output.

In these cases, the sign of the output is considered to be equivalent to the 1 or 0 of the step-function systems, which makes the two methods effectively equivalent.

This is called the log-sigmoid because a sigmoid can also be constructed using the hyperbolic tangent function instead of this relation, in which case it would be called a tan-sigmoid.

Sigmoid functions in this respect are very similar to the input-output relationships of biological neurons, although not exactly the same.

Sigmoid functions are also prized because their derivatives are easy to calculate, which is helpful for calculating the weight updates in certain training algorithms.
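For instance, here is a small Python/NumPy sketch (function names are illustrative assumptions) of the log-sigmoid and tan-sigmoid together with their derivatives; the identities sigma'(x) = sigma(x)(1 - sigma(x)) and tanh'(x) = 1 - tanh(x)^2 are what make the weight updates cheap to compute:

```python
import numpy as np

def log_sigmoid(x):
    """Logistic (log-sigmoid) activation: maps the reals to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def log_sigmoid_grad(x):
    """Derivative expressed through the function's own output."""
    s = log_sigmoid(x)
    return s * (1.0 - s)

def tan_sigmoid(x):
    """Hyperbolic-tangent (tan-sigmoid) activation: maps the reals to (-1, 1)."""
    return np.tanh(x)

def tan_sigmoid_grad(x):
    t = np.tanh(x)
    return 1.0 - t * t

x = np.linspace(-4, 4, 5)
print(log_sigmoid(x), log_sigmoid_grad(x))
print(tan_sigmoid(x), tan_sigmoid_grad(x))
```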

The softmax activation function is useful predominantly in the output layer of a classification (or clustering) system.

Understanding Activation Functions in Neural Networks

Recently, a colleague of mine asked me a few questions like “why do we have so many activation functions?”, “why is that one works better than the other?”, ”how do we know which one to use?”, “is it hardcore maths?” and so on.

So I thought, why not write an article on it for those who are familiar with neural networks only at a basic level and are therefore wondering about activation functions and their “why-how-mathematics!”.

Simply put, it calculates a “weighted sum” of its input, adds a bias and then decides whether it should be “fired” or not ( yeah right, an activation function does this, but let’s go with the flow for a moment ).

Because we learnt from biology that this is the way the brain works, and the brain is a working testimony of an awesome and intelligent system.

Its purpose is to check the Y value produced by a neuron and decide whether outside connections should consider this neuron as “fired” or not.

You would want the network to activate only 1 neuron and others should be 0 ( only then would you be able to say it classified properly/identified the class ).

And then if more than 1 neuron activates, you could find which neuron has the “highest activation” and so on ( better than max, a softmax, but let’s leave that for now ).

But..since there are intermediate activation values for the output, learning can be smoother and easier ( less wiggly ), and the chances of more than 1 neuron being 100% activated are lower compared to the step function while training ( also depending on what you are training and the data ).

Ok, so we want something to give us intermediate ( analog ) activation values rather than saying “activated” or not ( binary ).

A linear function is a straight-line function where the activation is proportional to the input ( which is the weighted sum from the neuron ).

We can definitely connect a few neurons together and if more than 1 fires, we could take the max ( or softmax) and decide based on that.

If there is an error in prediction, the changes made by back propagation are constant and do not depend on the change in input, delta(x)!

That activation in turn goes into the next level as input and the second layer calculates weighted sum on that input and it in turn, fires based on another linear activation function.

No matter how many layers we have, if all are linear in nature, the final activation function of last layer is nothing but just a linear function of the input of first layer!

No matter how we stack, the whole network is still equivalent to a single layer with linear activation ( a combination of linear functions in a linear manner is still another linear function ).
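A short NumPy sketch (illustrative) makes this concrete: two stacked linear layers collapse into a single equivalent linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 -> 2

x = rng.normal(size=3)

# Two linear layers stacked, with no non-linear activation in between.
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping as a single linear layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))   # True: the stack is still linear
```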

It tends to bring the activations to either side of the curve ( above x = 2 and below x = -2 for example).

Another advantage of this activation function is, unlike linear function, the output of the activation function is always going to be in range (0,1) compared to (-inf, inf) of linear function.

The network refuses to learn further or is drastically slow ( depending on use case and until gradient /computation gets hit by floating point value limits ).
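A quick numerical sketch (assuming the logistic sigmoid) shows why: once the input moves away from zero, the gradient becomes vanishingly small, so the weight updates all but disappear.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0, 20.0]:
    # The gradient peaks at 0.25 for x = 0 and collapses towards 0 as |x| grows.
    print(f"x = {x:5.1f}   sigmoid'(x) = {sigmoid_grad(x):.2e}")
```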

Imagine a network with randomly initialized ( or normalised ) weights, where almost 50% of the network yields 0 activation because of the characteristic of ReLu ( output 0 for negative values of x ).

That means, those neurons which go into that state will stop responding to variations in error/ input ( simply because gradient is 0, nothing changes ).
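For reference, a minimal ReLU sketch in Python/NumPy (illustrative): for any negative pre-activation both the output and the gradient are exactly zero, which is the "dead neuron" behaviour described above.

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 for negative inputs."""
    return (x > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.] -- neurons stuck at 0 receive no update
```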

When you know the function you are trying to approximate has certain characteristics, you can choose an activation function which will approximate the function faster leading to faster training process.

For example, a sigmoid works well for a classifier ( see the graph of sigmoid, doesn’t it show the properties of an ideal classifier? ).

Activation function

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.

A standard computer chip circuit can be seen as a digital network of activation functions that can be 'ON' (1) or 'OFF' (0), depending on input.

This is similar to the behavior of the linear perceptron in neural networks.

However, only nonlinear activation functions allow such networks to compute nontrivial problems using only a small number of nodes[citation needed].

In artificial neural networks this function is also called the transfer function.

In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell.[1]

In its simplest form, this function is binary—that is, either the neuron is firing or not.

The function looks like $\phi(v_{i}) = U(v_{i})$, where $U$ is the Heaviside step function.

In this case many neurons must be used in computation beyond linear separation of categories.

A line of positive slope may be used to reflect the increase in firing rate that occurs as input current increases.

Such a function would be of the form $\phi(v_{i}) = \mu v_{i}$, where $\mu$ is the slope.

This activation function is linear, and therefore has the same problems as the binary function.

In addition, networks constructed using this model have unstable convergence because neuron inputs along favored paths tend to increase without bound, as this function is not normalizable.

All problems mentioned above can be handled by using a normalizable sigmoid activation function.

One realistic model stays at zero until input current is received, at which point the firing frequency increases quickly at first, but gradually approaches an asymptote at 100% firing rate.

Mathematically, this looks like $\phi(v_{i}) = U(v_{i})\tanh(v_{i})$,

where the hyperbolic tangent function can be replaced by any sigmoid function.

This behavior is realistically reflected in the neuron, as neurons cannot physically fire faster than a certain rate.

This model runs into problems, however, in computational networks as it is not differentiable, a requirement to calculate backpropagation.

The final model, then, that is used in multilayer perceptrons is a sigmoidal activation function in the form of a hyperbolic tangent.

Two forms of this function are commonly used:

$\phi(v_{i}) = \tanh(v_{i})$, whose range is normalized from -1 to 1, and $\phi(v_{i}) = (1+\exp(-v_{i}))^{-1}$, which is vertically translated to normalize from 0 to 1.
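As a small illustrative sketch (not stated in the text above, but easy to verify numerically), the two forms are related by a simple shift and rescaling, $\sigma(v) = (\tanh(v/2) + 1)/2$:

```python
import numpy as np

def phi_tanh(v):
    """Hyperbolic tangent form: range (-1, 1)."""
    return np.tanh(v)

def phi_logistic(v):
    """Logistic form: range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

v = np.linspace(-3, 3, 7)
# The logistic form is a shifted, rescaled tanh: sigma(v) = (tanh(v/2) + 1) / 2.
print(np.allclose(phi_logistic(v), (np.tanh(v / 2.0) + 1.0) / 2.0))   # True
```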

The latter model is often considered more biologically realistic, but it runs into theoretical and experimental difficulties with certain types of computational problems.

A special class of activation functions known as radial basis functions (RBFs) is used in RBF networks, which are extremely efficient as universal function approximators.

These activation functions can take many forms, but they are usually found as one of three functions:

Here $c_{i}$ is the vector representing the function center, and $a$ and $\sigma$ are parameters affecting the spread of the radius.

Support vector machines (SVMs) can effectively utilize a class of activation functions that includes both sigmoids and RBFs.

In this case, the input is transformed to reflect a decision boundary hyperplane based on a few training inputs, $x$, called support vectors.

The activation function for the hidden layer of these machines is referred to as the inner product kernel, $K(v_{i}, x) = \phi(v_{i})$.

The support vectors are represented as the centers in RBFs with the kernel equal to the activation function, but they take a unique form in the perceptron, where the parameters $\beta_{0}$ and $\beta_{1}$ must satisfy certain conditions for convergence.

These machines can also accept arbitrary-order polynomial activation functions.

Activation functions come in several types.

Some desirable properties in an activation function include:

The following table compares the properties of several activation functions that are functions of one fold x from the previous layer or layers:

The following table lists activation functions that are not functions of a single fold x from the previous layer or layers:


Fundamentals of Deep Learning – Activation Functions and When to Use Them?

The Internet provides access to a plethora of information today.

When our brain is fed with a lot of information simultaneously, it tries hard to understand and classify the information between useful and not-so-useful information.

Let us go through these activation functions, how they work and figure out which activation functions fits well into what kind of  problem statement.

Before I delve into the details of activation functions, let’s do a little review of what are neural networks and how they function.

A neural network is a very powerful machine learning mechanism which basically mimics how a human brain learns.

The brain receives the stimulus from the outside world, does the processing on the input, and then generates the output.

As the task gets complicated multiple neurons form a complex network, passing information among themselves.

The black circles in the picture above are neurons. Each neuron is characterized by its weight, bias and activation function.

A linear equation is simple to solve but is limited in its capacity to solve complex problems. A neural network without an activation function is essentially just a linear regression model.

The activation function does the non-linear transformation to the input, making the network capable of learning and performing more complex tasks.

We would want our neural networks to work on complicated tasks like language translations and image classifications.

Activation functions make the back-propagation possible since the gradients are supplied along with the error to update the weights and biases.

If the value Y is above a given threshold value then activate the neuron else leave it deactivated.

When we simply need to say yes or no for a single class, step function would be the best choice, as it would either activate the neuron or leave it to zero.

The function is more theoretical than practical, since in most cases we would be classifying the data into multiple classes rather than just a single class.

This makes the step function not so useful, since during back-propagation the gradients of the activation functions are used in the error calculations that improve and optimize the results.

The gradient of the step function is zero, so those calculations reduce to zero and improvement of the model doesn't really happen.
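A short numerical sketch (assuming a unit step at zero) makes the problem concrete: away from the jump, the finite-difference gradient of the step function is exactly zero, so gradient descent has nothing to work with.

```python
import numpy as np

def step(x):
    return np.where(x > 0, 1.0, 0.0)

def numerical_grad(f, x, eps=1e-4):
    """Central finite-difference estimate of the derivative."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(numerical_grad(step, x))   # [0. 0. 0. 0.] -- no signal for weight updates
```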

We saw the problem with the step function: with the gradient being zero, it was impossible to update the weights during backpropagation.

Now if each layer has a linear transformation, no matter how many layers we have the final output is nothing but a linear transformation of the input.

Our choice of using sigmoid or tanh would basically depend on the requirement of gradient in the problem statement.

First things first, the ReLU function is non linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.

But ReLU also falls prey to gradients moving towards zero. If you look at the negative side of the graph, the gradient is zero, which means that for activations in that region the gradient is zero and the weights are not updated during back propagation.

With the leaky ReLU, the gradient on the left side of the graph is non-zero, so we would no longer encounter dead neurons in that region.

The parametrised ReLU function is used when the leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.
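Here is a brief Python/NumPy sketch of both variants (names and the sample slope values are illustrative): leaky ReLU uses a small fixed slope on the negative side, while parametric ReLU treats that slope as a learnable parameter alpha.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Small fixed slope for x < 0, so the gradient never becomes exactly zero."""
    return np.where(x > 0, x, slope * x)

def parametric_relu(x, alpha):
    """Same shape as leaky ReLU, but alpha is learned during training."""
    return np.where(x > 0, x, alpha * x)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(leaky_relu(z))              # [-0.04 -0.01  0.    1.    4.  ]
print(parametric_relu(z, 0.2))    # [-0.8  -0.2   0.    1.    4.  ]
```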

The softmax function is also a type of sigmoid function but is handy when we are trying to handle classification problems.

The softmax function would squeeze the outputs for each class between 0 and 1 and would also divide by the sum of the outputs.

The softmax function is ideally used in the output layer of the classifier where we are actually trying to attain the probabilities to define the class of each input.
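A minimal softmax sketch in Python/NumPy (illustrative; the max subtraction is only for numerical stability and does not change the result):

```python
import numpy as np

def softmax(logits):
    """Exponentiate, then divide by the sum so the outputs form a probability distribution."""
    shifted = logits - np.max(logits)       # for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])          # raw outputs of the final layer
probs = softmax(scores)
print(probs)             # e.g. [0.659 0.242 0.099]
print(probs.sum())       # 1.0 -- each entry is in (0, 1) and they sum to 1
```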

Now that we have seen so many activation  functions, we need some logic / heuristics to know which activation function should be used in which situation.

However depending upon the properties of the problem we might be able to make a better choice for easy and quicker convergence of the network.

In this article I have discussed the various types of activation functions and what are the types of problems one might encounter while using each of them.


A visual proof that neural nets can compute any function

One of the most striking facts about neural networks is that they can compute any function at all.

No matter what the function, there is guaranteed to be a neural network so that for every possible input, $x$, the value $f(x)$ (or some close approximation) is output from the network, e.g.:

For instance, here's a network computing a function with $m = 3$ inputs and $n = 2$ outputs:

What's more, this universality theorem holds even if we restrict our networks to have just a single layer intermediate between the input and the output neurons - a so-called single hidden layer.

For instance, one of the original papers proving the result is Approximation by superpositions of a sigmoidal function, by George Cybenko (1989).

Another important early paper is Multilayer feedforward networks are universal approximators, by Kurt Hornik, Maxwell Stinchcombe, and Halbert White (1989).

Again, that can be thought of as computing a function ( actually, computing one of many functions, since there are often many acceptable translations of a given piece of text ).

Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting.

Two caveats Before explaining why the universality theorem is true, I want to mention two caveats to the informal statement 'a neural network can compute any function'.

To make this statement more precise, suppose we're given a function $f(x)$ which we'd like to compute to within some desired accuracy $\epsilon > 0$.

The guarantee is that by using enough hidden neurons we can always find a neural network whose output $g(x)$ satisfies $|g(x) - f(x)| < \epsilon$ for every input $x$.

If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to approximate using a neural net.

Summing up, a more precise statement of the universality theorem is that neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision.

In this chapter we'll actually prove a slightly weaker version of this result, using two hidden layers instead of one.

In the problems I'll briefly outline how the explanation can, with a few tweaks, be adapted to give a proof which uses only a single hidden layer.

Universality with one input and one output To understand why the universality theorem is true, let's start by understanding how to construct a neural network which approximates a function with just one input and one output:

To build insight into how to construct a network to compute $f$, let's start with a network containing just a single hidden layer, with two hidden neurons, and an output layer containing a single output neuron:

In the diagram below, click on the weight, $w$, and drag the mouse a little ways to the right to increase $w$.

As we learnt earlier in the book, what's being computed by the hidden neuron is $\sigma(wx + b)$, where $\sigma(z) \equiv 1/(1+e^{-z})$ is the sigmoid function.

But for the proof of universality we will obtain more insight by ignoring the algebra entirely, and instead manipulating and observing the shape shown in the graph.

This won't just give us a better feel for what's going on, it will also give us a proof ( strictly speaking, the visual approach I'm taking isn't what's traditionally thought of as a proof ).

Occasionally, there will be small gaps in the reasoning I present: places where I make a visual argument that is plausible, but not quite rigorous.

We can simplify our analysis quite a bit by increasing the weight so much that the output really is a step function, to a very good approximation.

It's easy to analyze the sum of a bunch of step functions, but rather more difficult to reason about what happens when you add up a bunch of sigmoid shaped curves.

With a little work you should be able to convince yourself that the position of the step is proportional to $b$, and inversely proportional to $w$.

It will greatly simplify our lives to describe hidden neurons using just a single parameter, $s$, which is the step position, $s = -b/w$.

As noted above, we've implicitly set the weight $w$ on the input to be some large value - big enough that the step function is a very good approximation.

We can easily convert a neuron parameterized in this way back into the conventional model, by choosing the bias $b = -w s$.
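A tiny Python sketch of this conversion (illustrative; w = 1000 is just "some large weight"): pick a large $w$, set $b = -w s$, and the sigmoid neuron's output is, to a very good approximation, a step at $x = s$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step_neuron(x, s, w=1000.0):
    """Hidden neuron parameterized by its step point s: bias chosen as b = -w * s."""
    b = -w * s
    return sigmoid(w * x + b)

x = np.linspace(0, 1, 11)
print(np.round(step_neuron(x, s=0.4), 3))
# ~0 for x < 0.4, ~1 for x > 0.4 (and 0.5 exactly at the step point x = s).
```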

In particular, we'll suppose the hidden neurons are computing step functions parameterized by step points $s_1$ (top neuron) and $s_2$ (bottom neuron).

Here, $a_1$ and $a_2$ are the outputs from the top and bottom hidden neurons, respectively. Note, by the way, that the output from the whole network is $\sigma(w_1 a_1 + w_2 a_2 + b)$, where $b$ is the bias on the output neuron.

We're going to focus on the weighted output from the hidden layer right now, and only later will we think about how that relates to the output from the whole network.

You'll see that the graph changes shape when this happens, since we have moved from a situation where the top hidden neuron is the first to be activated to a situation where the bottom hidden neuron is the first to be activated.

Similarly, try manipulating the step point $s_2$ of the bottom hidden neuron, and get a feel for how this changes the combined output from the hidden neurons.

You'll notice, by the way, that we're using our neurons in a way that can be thought of not just in graphical terms, but in more conventional programming terms, as a kind of if-then-else statement, e.g.:

In particular, we can divide the interval $[0, 1]$ up into a large number, $N$, of subintervals, and use $N$ pairs of hidden neurons to set up peaks of any desired height.

Apologies for the complexity of the diagram: I could hide the complexity by abstracting away further, but I think it's worth putting up with a little complexity, for the sake of getting a more concrete feel for how these networks work.

I didn't say it at the time, but what I plotted is actually the function \begin{eqnarray} f(x) = 0.2+0.4 x^2+0.3x \sin(15 x) + 0.05 \cos(50 x), \tag{113}\end{eqnarray} plotted over $x$ from $0$ to $1$, and with the $y$ axis taking values from $0$ to $1$.

The solution is to design a neural network whose hidden layer has a weighted output given by $\sigma^{-1} \circ f(x)$, where $\sigma^{-1}$ is just the inverse of the $\sigma$ function.

If we can do this, then the output from the network as a whole will be a good approximation to $f(x)$ ( note that I have set the bias on the output neuron to $0$ ).

How well you're doing is measured by the average deviation between the goal function and the function the network is actually computing.

It's only a coarse approximation, but we could easily do much better, merely by increasing the number of pairs of hidden neurons, allowing more bumps.
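The construction can be sketched in a few lines of Python (an illustrative approximation of the idea, not the interactive demo from the chapter): divide $[0, 1]$ into $N$ subintervals, use one pair of step neurons per subinterval to make a bump whose height is $\sigma^{-1}(f)$ sampled there, and feed the weighted sum through the output sigmoid.

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -500.0, 500.0)   # avoid overflow warnings for huge |z|
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_inv(y):
    return np.log(y / (1.0 - y))

def f(x):
    # The goal function from the chapter.
    return 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x) + 0.05 * np.cos(50 * x)

def bump_network(x, n_pairs=50, w=1000.0):
    """Approximate f with N pairs of step neurons feeding a sigmoid output neuron."""
    edges = np.linspace(0.0, 1.0, n_pairs + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    heights = sigmoid_inv(f(mids))          # target weighted output per subinterval
    weighted = np.zeros_like(x)
    for left, right, h in zip(edges[:-1], edges[1:], heights):
        # A pair of step neurons with output weights +h and -h makes one bump.
        weighted += h * (sigmoid(w * (x - left)) - sigmoid(w * (x - right)))
    return sigmoid(weighted)                # output neuron, bias 0

x = np.linspace(0.01, 0.99, 200)
# Coarse but recognisable approximation; the error shrinks as n_pairs grows.
print(np.max(np.abs(bump_network(x) - f(x))))
```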

So, for instance, for the second hidden neuron $s = 0.2$ becomes $b = -1000 \times 0.2 = -200$.

So, for instance, the value you've chosen above for the first $h$ determines the output weights from the top two hidden neurons, which are $h$ and $-h$, respectively.

Just as in our earlier discussion, as the input weight gets larger the output approaches a step function.

Here, we assume the weight on the $x$ input has some large value - I've used $w_1 = 1000$ - and the weight $w_2 = 0$.

Of course, it's also possible to get a step function in the $y$ direction, by making the weight on the $y$ input very large (say, $w_2 = 1000$), and the weight on the $x$ equal to $0$, i.e., $w_1 = 0$:

The number on the neuron is again the step point, and in this case the little $y$ above the number reminds us that the step is in the $y$ direction.

But do keep in mind that the little $y$ marker implicitly tells us that the $y$ weight is large, and the $x$ weight is $0$.

That reminds us that they're producing $y$ step functions, not $x$ step functions, and so the weight is very large on the $y$ input, and zero on the $x$ input, not vice versa.

If we choose the threshold appropriately - say, a value of $3h/2$, which is sandwiched between the height of the plateau and the height of the central tower - we could squash the plateau down to zero, and leave just the tower standing.

This is a bit tricky, so if you think about this for a while and remain stuck, here are two hints: (1) To get the output neuron to show the right kind of if-then-else behaviour, we need the input weights (all $h$ or $-h$) to be large;

Even for this relatively modest value of $h$, we get a pretty good tower function.

To make the respective roles of the two sub-networks clear I've put them in separate boxes, below: each box computes a tower function, using the technique described above.

In particular, by making the weighted output from the second hidden layer a good approximation to $\sigma^{-1} \circ f$, we ensure the output from our network will be a good approximation to any desired function, $f$.

The $s_1, t_1$ and so on are step points for neurons - that is, all the weights in the first layer are large, and the biases are set to give the step points $s_1, t_1, s_2, \ldots$.

Of course, such a function can be regarded as just $n$ separate real-valued functions, $f^1(x_1, \ldots, x_m), f^2(x_1, \ldots, x_m)$, and so on.

As a hint, try working in the case of just two input variables, and showing that: (a) it's possible to get step functions not just in the $x$ or $y$ directions, but in an arbitrary direction;

(b) by adding up many of the constructions from part (a) it's possible to approximate a tower function which is circular in shape, rather than rectangular;

To do part (c) it may help to use ideas from a bit later in this chapter.

Recall that in a sigmoid neuron the inputs $x_1, x_2, \ldots$ result in the output $\sigma(\sum_j w_j x_j + b)$, where $w_j$ are the weights, $b$ is the bias, and $\sigma$ is the sigmoid function:

That is, we'll assume that if our neuron has inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$ and bias $b$, then the output is $s(\sum_j w_j x_j + b)$.

It should be pretty clear that if we add all these bump functions up we'll end up with a reasonable approximation to $\sigma^{-1} \circ f(x)$, except within the windows of failure.

Suppose that instead of using the approximation just described, we use a set of hidden neurons to compute an approximation to half our original goal function, i.e., to $\sigma^{-1} \circ f(x) / 2$.

And suppose we use another set of hidden neurons to compute an approximation to $\sigma^{-1} \circ f(x)/ 2$, but with the bases of the bumps shifted by half the width of a bump:

Although the result isn't directly useful in constructing networks, it's important because it takes off the table the question of whether any particular function is computable using a neural network.

As argued in Chapter 1, deep networks have a hierarchical structure which makes them particularly well adapted to learn the hierarchies of knowledge that seem to be useful in solving real-world problems.

Put more concretely, when attacking problems such as image recognition, it helps to use a system that understands not just individual pixels, but also increasingly more complex concepts: from edges to simple geometric shapes, all the way up through complex, multi-object scenes.

In later chapters, we'll see evidence suggesting that deep networks do a better job than shallow networks at learning such hierarchies of knowledge.

Empirical evidence suggests that deep networks are the networks best adapted to learn the functions useful in solving many real-world problems.

Activation Functions in Neural Networks (Sigmoid, ReLU, tanh, softmax)

Activation Functions in Neural Networks are used to contain the output between fixed values ...

Neural Network Calculation (Part 2): Activation Functions & Basic Calculation

In this part we see how to calculate one section of a neural network. This calculation will be repeated many times to ..

Activation Functions

In a neural network, the output value of a neuron is almost always transformed in some way using a function. A trivial choice would be a linear transformation ...

Deep Learning with Tensorflow - Activation Functions


Derivative of the sigmoid activation function, 9/2/2015

Activation Function using Sigmoid & ReLU using TensorFlow

Impact of Bias on the Sigmoid Activation function

OptimizersLossesAndMetrics - Keras

Here I go over the nitty-gritty parts of models, including the optimizers, the losses and the metrics. I first go over the usage of optimizers. Optimizers are ...

Sigmoid function

A sigmoid function is a mathematical function having an "S" shape (sigmoid curve). Often, sigmoid function refers to the special case of the logistic function ...

Neural network tutorial: The back-propagation algorithm (Part 1)

In this video we will derive the back-propagation algorithm as is used for neural networks. I use the sigmoid transfer function because it is the most common, but ...