# AI News, Artificial Neural Networks/Activation Functions

## Artificial Neural Networks/Activation Functions

There are a number of common activation functions in use with neural networks.

The output is a certain value, A1, if the input sum is above a certain threshold and A0 if the input sum is below a certain threshold.

These kinds of step activation functions are useful for binary classification schemes.

In other words, when we want to classify an input pattern into one of two groups, we can use a binary classifier with a step activation function.
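A minimal sketch of such a binary classifier is shown here. The names `step` and `classify`, the default threshold of 0, and the example weights are illustrative assumptions, not taken from the text. The text does not say what happens exactly at the threshold, so this sketch arbitrarily returns A0 there.

```python
def step(input_sum, threshold=0.0, a1=1.0, a0=0.0):
    """Return a1 if the input sum is above the threshold, otherwise a0.

    Behaviour exactly at the threshold is an assumption (a0 is returned).
    """
    return a1 if input_sum > threshold else a0


def classify(inputs, weights, bias=0.0):
    """Binary classifier: weighted sum of the inputs followed by a step activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(total)


# Hypothetical weights, chosen only for illustration.
print(classify([1.0, 0.5], [0.6, -0.4]))  # weighted sum is 0.4, above the threshold
print(classify([0.2, 1.0], [0.6, -0.4]))  # weighted sum is -0.28, below the threshold
```

The two calls land in different groups because their weighted sums fall on opposite sides of the threshold.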

Each identifier would be a small network that would output a 1 if a particular input feature is present, and a 0 otherwise.

Combining multiple feature detectors into a single network would allow a very complicated clustering or classification problem to be solved.

A linear combination is where the weighted sum input of the neuron plus a linearly dependent bias becomes the system output.

In these cases, the sign of the output is considered to be equivalent to the 1 or 0 of the step function systems, which allows the two methods to be treated as equivalent under certain conditions.

This is called the log-sigmoid because a sigmoid can also be constructed using the hyperbolic tangent function instead of this relation, in which case it would be called a tan-sigmoid.

Sigmoid functions in this respect are very similar to the input-output relationships of biological neurons, although not exactly the same.

Sigmoid functions are also prized because their derivatives are easy to calculate, which is helpful for calculating the weight updates in certain training algorithms.
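To illustrate why the derivative is considered easy to calculate, the sketch below assumes the log-sigmoid is the logistic function 1 / (1 + e^(-x)), whose derivative can be written in terms of the function's own output as s(x) * (1 - s(x)). The function names are illustrative, and the finite-difference check is only there to confirm the closed form.

```python
import math


def log_sigmoid(x):
    """Logistic (log-sigmoid) function, assumed form: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def log_sigmoid_derivative(x):
    """Derivative expressed through the function's own output: s * (1 - s)."""
    s = log_sigmoid(x)
    return s * (1.0 - s)


def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only to sanity-check the closed form."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 1.5):
    print(x, log_sigmoid_derivative(x), numerical_derivative(log_sigmoid, x))
```

Because the derivative reuses the value already computed in the forward pass, a training algorithm needs only one extra multiplication per neuron.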

The softmax activation function is useful predominantly in the output layer of a clustering system.

## Understanding Activation Functions in Neural Networks

Recently, a colleague of mine asked me a few questions like “why do we have so many activation functions?”, “why is it that one works better than another?”, “how do we know which one to use?”, “is it hardcore maths?” and so on.

So I thought, why not write an article on it for those who are familiar with neural networks only at a basic level and are therefore wondering about activation functions and their “why-how-mathematics!”.

Simply put, a neuron calculates a “weighted sum” of its input, adds a bias and then decides whether it should be “fired” or not ( yeah right, an activation function does this, but let’s go with the flow for a moment ).

Because we learnt from biology that this is the way the brain works, and the brain is a working testimony of an awesome and intelligent system.

The activation function checks the Y value produced by a neuron and decides whether outside connections should consider this neuron as “fired” or not.

You would want the network to activate only 1 neuron and others should be 0 ( only then would you be able to say it classified properly/identified the class ).

And then if more than 1 neuron activates, you could find which neuron has the “highest activation” and so on ( better than max, a softmax, but let’s leave that for now ).

But since there are intermediate activation values for the output, learning can be smoother and easier ( less wiggly ), and the chance of more than 1 neuron being 100% activated is lower than with a step function while training ( also depending on what you are training and the data ).

Ok, so we want something to give us intermediate ( analog ) activation values rather than saying “activated” or not ( binary ).

A straight line function is one where the activation is proportional to the input ( which is the weighted sum from the neuron ).

We can definitely connect a few neurons together and if more than 1 fires, we could take the max ( or softmax) and decide based on that.

If there is an error in prediction, the changes made by back propagation are constant and do not depend on the change in input, delta(x)!

That activation in turn goes into the next level as input and the second layer calculates weighted sum on that input and it in turn, fires based on another linear activation function.

No matter how many layers we have, if all are linear in nature, the final activation function of last layer is nothing but just a linear function of the input of first layer!

No matter how we stack, the whole network is still equivalent to a single layer with linear activation ( a combination of linear functions in a linear manner is still another linear function ).
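The collapse of stacked linear layers can be checked numerically. The sketch below uses two small layers with arbitrary, made-up weights (an assumption for illustration) and shows that applying them in sequence gives the same result as one layer with weights W2·W1 and bias W2·b1 + b2.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def linear_layer(w, b, x):
    """A layer with linear activation: output = W x + b."""
    return [s + bi for s, bi in zip(matvec(w, x), b)]


# Arbitrary example weights and biases (hypothetical values).
w1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
w2, b2 = [[2.0, 0.0], [1.0, 3.0]], [0.3, 0.4]


def two_layers(x):
    """Two linear layers stacked on top of each other."""
    return linear_layer(w2, b2, linear_layer(w1, b1, x))


# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
w_eq = matmul(w2, w1)
b_eq = [s + bi for s, bi in zip(matvec(w2, b1), b2)]


def one_layer(x):
    """Single linear layer that reproduces the two-layer stack."""
    return linear_layer(w_eq, b_eq, x)


print(two_layers([1.0, -2.0]))
print(one_layer([1.0, -2.0]))
```

The same argument extends to any number of layers, since each additional linear layer can be folded into the previous one in the same way.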

It tends to bring the activations to either side of the curve ( above x = 2 and below x = -2 for example).

Another advantage of this activation function is, unlike linear function, the output of the activation function is always going to be in range (0,1) compared to (-inf, inf) of linear function.

The network refuses to learn further or is drastically slow ( depending on use case and until gradient /computation gets hit by floating point value limits ).
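A rough way to see this slowdown, assuming the passage refers to the sigmoid's flat regions: the sigmoid's gradient never exceeds 0.25 and shrinks quickly as |x| grows. The sketch below also multiplies the gradients of several layers together, which is a deliberate simplification (weights are ignored) meant only to show how quickly the product can shrink.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Gradient of the sigmoid: s * (1 - s), at most 0.25 (reached at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient flattens out as the input moves away from zero.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  gradient = {sigmoid_grad(x):.6f}")


def chained_gradient(pre_activations):
    """Product of sigmoid gradients along a chain of layers (weights ignored)."""
    product = 1.0
    for x in pre_activations:
        product *= sigmoid_grad(x)
    return product


# Even in the best case (every input at 0), ten layers scale the signal by 0.25 ** 10.
print(chained_gradient([0.0] * 10))
```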

Imagine a network with randomly initialized weights ( or normalised ) and almost 50% of the network yields 0 activation because of the characteristic of ReLU ( output 0 for negative values of x ).

That means, those neurons which go into that state will stop responding to variations in error/ input ( simply because gradient is 0, nothing changes ).
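The "almost 50%" figure can be reproduced with a small simulation, under the assumption that the weighted sums reaching the ReLU units are symmetric around zero (here drawn from a standard normal distribution with a fixed seed). The sample size and names are illustrative.

```python
import random


def relu(x):
    """ReLU: passes positive values through, outputs 0 otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


random.seed(0)  # fixed seed so the run is repeatable
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

inactive = sum(1 for x in pre_activations if relu(x) == 0.0)
fraction_inactive = inactive / len(pre_activations)
print(f"fraction of units outputting 0: {fraction_inactive:.3f}")

# Units with a negative input also have zero gradient, so they receive no update.
print(relu_grad(-3.0), relu_grad(3.0))
```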

When you know the function you are trying to approximate has certain characteristics, you can choose an activation function which will approximate the function faster leading to faster training process.

For example, a sigmoid works well for a classifier ( see the graph of sigmoid, doesn’t it show the properties of an ideal classifier? ).

## Activation function

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.

A standard computer chip circuit can be seen as a digital network of activation functions that can be 'ON' (1) or 'OFF' (0), depending on input.

This is similar to the behavior of the linear perceptron in neural networks.

However, only nonlinear activation functions allow such networks to compute nontrivial problems using only a small number of nodes.[1]

In artificial neural networks this function is also called the transfer function.

In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell.[2]

In its simplest form, this function is binary—that is, either the neuron is firing or not.

The function looks like $\phi(v_i) = U(v_i)$, where $U$ is the Heaviside step function.

In this case many neurons must be used in computation beyond linear separation of categories.

A line of positive slope may be used to reflect the increase in firing rate that occurs as input current increases.

Such a function would be of the form $\phi(v_i) = \mu v_i$, where $\mu$ is the slope.

This activation function is linear, and therefore has the same problems as the binary function.

In addition, networks constructed using this model have unstable convergence because neuron inputs along favored paths tend to increase without bound, as this function is not normalizable.

All problems mentioned above can be handled by using a normalizable sigmoid activation function.

One realistic model stays at zero until input current is received, at which point the firing frequency increases quickly at first, but gradually approaches an asymptote at 100% firing rate.

Mathematically, this looks like $\phi(v_i) = U(v_i)\tanh(v_i)$, where the hyperbolic tangent function can be replaced by any sigmoid function.

This behavior is realistically reflected in the neuron, as neurons cannot physically fire faster than a certain rate.

This model runs into problems, however, in computational networks as it is not differentiable, a requirement to calculate backpropagation.

The final model, then, that is used in multilayer perceptrons is a sigmoidal activation function in the form of a hyperbolic tangent.

Two forms of this function are commonly used: $\phi(v_i) = \tanh(v_i)$, whose range is normalized from -1 to 1, and $\phi(v_i) = (1+\exp(-v_i))^{-1}$, which is vertically translated to normalize from 0 to 1.
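The relationship between the two forms can be made explicit with a standard identity (added here as a side note, it is not stated in the source). Starting from the definition of the hyperbolic tangent:

$$
\tanh\!\left(\frac{v_i}{2}\right) = \frac{e^{v_i/2} - e^{-v_i/2}}{e^{v_i/2} + e^{-v_i/2}} = \frac{1 - e^{-v_i}}{1 + e^{-v_i}}
$$

$$
\frac{1 + \tanh(v_i/2)}{2} = \frac{(1 + e^{-v_i}) + (1 - e^{-v_i})}{2\,(1 + e^{-v_i})} = \frac{1}{1 + e^{-v_i}} = (1+\exp(-v_i))^{-1}
$$

So the logistic form is the hyperbolic tangent shifted upward and rescaled, both in its output range and in its input.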

The latter model is often considered more biologically realistic, but it runs into theoretical and experimental difficulties with certain types of computational problems.

A special class of activation functions known as radial basis functions (RBFs) is used in RBF networks, which are extremely efficient as universal function approximators.

These activation functions can take many forms, but they are usually found as one of three functions:

where $c_i$ is the vector representing the function center and $a$ and $\sigma$ are parameters affecting the spread of the radius.

Support vector machines (SVMs) can effectively utilize a class of activation functions that includes both sigmoids and RBFs.

In this case, the input is transformed to reflect a decision boundary hyperplane based on a few training inputs called support vectors $x$.

The activation function for the hidden layer of these machines is referred to as the inner product kernel, $K(v_i, x) = \phi(v_i)$.

The support vectors are represented as the centers in RBFs with the kernel equal to the activation function, but they take a unique form in the perceptron, where $\beta_0$ and $\beta_1$ must satisfy certain conditions for convergence.

These machines can also accept arbitrary-order polynomial activation functions.

Activation function having types:

Some desirable properties in an activation function include:

The following table compares the properties of several activation functions that are functions of one fold x from the previous layer or layers:

The following table lists activation functions that are not functions of a single fold x from the previous layer or layers:


## Fundamentals of Deep Learning – Activation Functions and When to Use Them?

When our brain is fed with a lot of information simultaneously, it tries hard to understand and classify the information between useful and not-so-useful information.

Let us go through these activation functions, see how they work, and figure out which activation function fits which kind of problem statement.

Before I delve into the details of activation functions, let's do a little review of what neural networks are and how they function.

A neural network is a very powerful machine learning mechanism which basically mimics how a human brain learns.

The brain receives the stimulus from the outside world, does the processing on the input, and then generates the output.

As the task gets complicated multiple neurons form a complex network, passing information among themselves.

The black circles in the picture above are neurons. Each neuron is characterized by its weight, bias and activation function.

A linear equation is simple to solve but is limited in its capacity to solve complex problems. A neural network without an activation function is essentially just a linear regression model.

The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.

We would want our neural networks to work on complicated tasks like language translations and image classifications.

Activation functions make the back-propagation possible since the gradients are supplied along with the error to update the weights and biases.

If the value Y is above a given threshold value then activate the neuron else leave it deactivated.

When we simply need to say yes or no for a single class, step function would be the best choice, as it would either activate the neuron or leave it to zero.

The function is more theoretical than practical, since in most cases we would be classifying the data into multiple classes rather than just a single class.

This makes the step function not so useful, since during back-propagation the gradients of the activation functions are sent for error calculations to improve and optimize the results.

The gradient of the step function reduces it all to zero and improvement of the models doesn't really happen.

We saw the problem with the step function: with the gradient being zero, it was impossible to update the weights during backpropagation.

Now if each layer has a linear transformation, no matter how many layers we have the final output is nothing but a linear transformation of the input.

Our choice of using sigmoid or tanh would basically depend on the requirement of gradient in the problem statement.

First things first, the ReLU function is non linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.

But ReLU also falls prey to gradients moving towards zero. If you look at the negative side of the graph, the gradient is zero, which means that for activations in that region the weights are not updated during back propagation.

With the leaky ReLU, the gradient on the left side of the graph is non-zero, so we would no longer encounter dead neurons in that region.
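As a minimal sketch of the leaky variant: negative inputs are scaled by a small slope instead of being set to zero, so their gradient is small but non-zero. The slope value `alpha=0.01` is a commonly used default and an assumption here, as are the function names.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient of leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha


for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

If `alpha` is treated as a trainable parameter rather than a fixed constant, the same function becomes the parametrised ReLU.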

The parametrised ReLU function is used when the leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.

The softmax function generalizes the sigmoid function to multiple classes, which makes it handy when we are trying to handle classification problems.

The softmax function would squeeze the outputs for each class between 0 and 1 and would also divide by the sum of the outputs.
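A small sketch of this computation follows. In the usual definition each raw output is exponentiated before dividing by the sum of the exponentials, which is what keeps every value between 0 and 1. Subtracting the maximum score first is a common numerical-stability trick and does not change the result. The example scores are made up.

```python
import math


def softmax(scores):
    """Softmax: exponentiate each score, then divide by the sum of the exponentials."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical raw outputs for three classes
print([round(p, 3) for p in probs])
print(sum(probs))
```

The outputs can be read as class probabilities: they are all positive, they add up to 1, and the largest raw score receives the largest share.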

The softmax function is ideally used in the output layer of the classifier where we are actually trying to attain the probabilities to define the class of each input.

Now that we have seen so many activation functions, we need some logic / heuristics to know which activation function should be used in which situation.

However, depending upon the properties of the problem, we might be able to make a better choice for easy and quicker convergence of the network.

In this article I have discussed the various types of activation functions and what are the types of problems one might encounter while using each of them.

## Multi-Layer Neural Networks with Sigmoid Function— Deep Learning for Rookies (2)

Welcome back to my second post of the series Deep Learning for Rookies (DLFR), by yours truly, a rookie ;) Feel free to refer back to my first post here or my blog if you find it hard to follow.

You’ll be able to brag about your understanding soon ;) Last time, we introduced the field of Deep Learning and examined a simple neural network: the perceptron... or a dinosaur... ok, seriously, a single-layer perceptron.

After all, most problems in the real world are non-linear, and as individual humans, you and I are pretty darn good at the decision-making of linear or binary problems like should I study Deep Learning or not without needing a perceptron.

Fast forward almost two decades to 1986: Geoffrey Hinton, David Rumelhart, and Ronald Williams published the paper “Learning representations by back-propagating errors”.

If you are completely new to DL, you should remember Geoffrey Hinton, who plays a pivotal role in the progress of DL.

Remember that we iterated the importance of designing a neural network so that the network can learn from the difference between the desired output (what the fact is) and actual output (what the network returns) and then send a signal back to the weights and ask the weights to adjust themselves?

Secondly, when we multiply each of the m features with a weight (w1, w2, …, wm) and sum them all together, this is a dot product.

So here is a takeaway for now: the procedure of how input values are forward propagated into the hidden layer, and then from the hidden layer to the output, is the same as in Graph 1.

One thing to remember is: If the activation function is linear, then you can stack as many hidden layers in the neural network as you wish, and the final output is still a linear combination of the original input data.

So basically, a small change in any weight in the input layer of our perceptron network could possibly lead one neuron to suddenly flip from 0 to 1, which could again affect the hidden layer’s behavior, and then affect the final outcome.

Non-linear just means that the output we get from the neuron, which is the dot product of some inputs x (x1, x2, …, xm) and weights w (w1, w2, …,wm) plus bias and then put into a sigmoid function, cannot be represented by a linear combination of the input x (x1, x2, …,xm).

This non-linear activation function, when used by each neuron in a multi-layer neural network, produces a new “representation” of the original data, and ultimately allows for non-linear decision boundary, such as XOR.
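To make the XOR claim concrete, here is a sketch of a network with two hidden sigmoid neurons and one sigmoid output neuron. The weights are hand-picked for illustration (an assumption, they are not trained and not taken from the article), chosen so that the hidden neurons behave roughly like OR and NAND and the output neuron like AND.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias):
    """Dot product of inputs and weights, plus bias, passed through a sigmoid."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)


def xor_network(x1, x2):
    """Two hidden neurons and one output neuron with hand-picked weights."""
    h1 = neuron([x1, x2], [20.0, 20.0], -10.0)    # roughly "x1 OR x2"
    h2 = neuron([x1, x2], [-20.0, -20.0], 30.0)   # roughly "NOT (x1 AND x2)"
    return neuron([h1, h2], [20.0, 20.0], -30.0)  # roughly "h1 AND h2"


for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_network(a, b)))
```

The hidden layer is what produces the new representation: in the (h1, h2) space the four input points become linearly separable, so a single output neuron can finish the job.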

If our output value is on the lower flat area on the two corners, then it’s false or 0 since it’s not right to say the weather is both hot and cold or neither hot nor cold (ok, I guess the weather could be neither hot nor cold…you get what I mean though…right?).

You can memorize these takeaways since they’re facts, but I encourage you to google a bit on the internet and see if you can understand the concept better (it is natural that we take some time to understand these concepts).

From the XOR example above, you’ve seen that adding two hidden neurons in 1 hidden layer could reshape our problem into a different space, which magically created a way for us to classify XOR with a ridge.

Now, the computer can’t really “see” a digit like we humans do, but if we dissect the image into an array of 784 numbers like [0, 0, 180, 16, 230, …, 4, 77, 0, 0, 0], then we can feed this array into our neural network.

So if the neural network thinks the handwritten digit is a zero, then we should get an output array of [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], the first output in this array that senses the digit to be a zero is “fired” to be 1 by our neural network, and the rest are 0.

If the neural network thinks the handwritten digit is a 5, then we should get [0, 0, 0, 0, 0, 1, 0, 0, 0, 0].
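The output arrays above are what is usually called a one-hot encoding. A small sketch, with illustrative helper names (`one_hot`, `decode`) and made-up example activations, shows how a digit maps to such an array and how a network's output can be mapped back to a digit by picking the most strongly "fired" position.

```python
def one_hot(digit, num_classes=10):
    """Return the desired output array for a digit: 1 at its index, 0 elsewhere."""
    if not 0 <= digit < num_classes:
        raise ValueError("digit must be between 0 and num_classes - 1")
    return [1 if i == digit else 0 for i in range(num_classes)]


def decode(output):
    """Return the index of the largest value in the network's output array."""
    return max(range(len(output)), key=lambda i: output[i])


print(one_hot(0))
print(one_hot(5))

# Hypothetical, imperfect network output: the strongest activation is at index 5.
print(decode([0.02, 0.01, 0.05, 0.10, 0.03, 0.91, 0.04, 0.02, 0.08, 0.01]))
```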

Remember we mentioned that neural networks become better by repetitively training themselves on data so that they can adjust the weights in each layer of the network to get the final results/actual output closer to the desired output?

For the sake of argument, let’s imagine the following case in Graph 14, which I borrow from Michael Nielsen’s online book: After training the neural network with rounds and rounds of labeled data in supervised learning, assume the first 4 hidden neurons learned to recognize the patterns above in the left side of Graph 14.

Then, if we feed the neural network an array of a handwritten digit zero, the network should correctly trigger the top 4 hidden neurons in the hidden layer while the other hidden neurons are silent, and then again trigger the first output neuron while the rest are silent.

If you train the neural network with a new set of randomized weights, it might produce the following network instead (compare Graph 15 with Graph 14), since the weights are randomized and we never know which one will learn which or what pattern.
