# AI News, Artificial Neural Networks/Activation Functions

## Artificial Neural Networks/Activation Functions

There are a number of common activation functions in use with neural networks.

The output is a certain value, A1, if the input sum is above a certain threshold and A0 if the input sum is below a certain threshold.

These kinds of step activation functions are useful for binary classification schemes.

In other words, when we want to classify an input pattern into one of two groups, we can use a binary classifier with a step activation function.
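As a minimal sketch of the step activation described above, the following Python snippet thresholds a weighted sum. The function names, the default threshold of 0, and the AND-gate weights are illustrative assumptions, not part of the original text. The text does not say what happens exactly at the threshold, so returning A0 there is also an assumption.

```python
def step(weighted_sum, threshold=0.0, a1=1.0, a0=0.0):
    """Return a1 if the input sum is above the threshold, otherwise a0.

    Returning a0 when the sum equals the threshold is an assumption.
    """
    return a1 if weighted_sum > threshold else a0


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the step function."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(total)


# Hypothetical weights that make the neuron act as a two-class (AND) classifier.
and_weights, and_bias = [1.0, 1.0], -1.5
outputs = [neuron([a, b], and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0.0, 0.0, 0.0, 1.0]
```

Only the input pattern (1, 1) pushes the weighted sum above the threshold, so the neuron sorts the four patterns into two groups.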

Each identifier would be a small network that would output a 1 if a particular input feature is present, and a 0 otherwise.

Combining multiple feature detectors into a single network would allow a very complicated clustering or classification problem to be solved.

A linear combination is where the weighted sum input of the neuron plus a linearly dependent bias becomes the system output.

In these cases, the sign of the output is considered to be equivalent to the 1 or 0 of the step function systems, which enables the two methods to be equivalent if

This is called the log-sigmoid because a sigmoid can also be constructed using the hyperbolic tangent function instead of this relation, in which case it would be called a tan-sigmoid.

Sigmoid functions in this respect are very similar to the input-output relationships of biological neurons, although not exactly the same.

Sigmoid functions are also prized because their derivatives are easy to calculate, which is helpful for calculating the weight updates in certain training algorithms.
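To illustrate why the derivative is considered easy to calculate, here is a small sketch assuming the standard log-sigmoid 1 / (1 + exp(-x)). Its derivative can be written using only the function's own output, s * (1 - s). The helper names and the numerical check are illustrative additions.

```python
import math


def sigmoid(x):
    """Log-sigmoid: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative written in terms of the sigmoid output itself."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only to sanity-check the formula."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x = 1.3
print(sigmoid_derivative(x), numerical_derivative(sigmoid, x))
```

Because the derivative reuses the value already computed in the forward pass, a training algorithm needs no extra exponential to obtain it.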

The softmax activation function is useful predominantly in the output layer of a clustering system.

## Understanding Activation Functions in Neural Networks

Recently, a colleague of mine asked me a few questions like “why do we have so many activation functions?”, “why is it that one works better than the other?”, “how do we know which one to use?”, “is it hardcore maths?” and so on.

So I thought, why not write an article on it for those who are familiar with neural networks only at a basic level and are therefore wondering about activation functions and their “why-how-mathematics!”.

Simply put, it calculates a “weighted sum” of its input, adds a bias and then decides whether it should be “fired” or not ( yeah right, an activation function does this, but let’s go with the flow for a moment ).

Because we learnt from biology that that’s the way the brain works, and the brain is a working testimony of an awesome and intelligent system.

To check the Y value produced by a neuron and decide whether outside connections should consider this neuron as “fired” or not.

You would want the network to activate only 1 neuron and others should be 0 ( only then would you be able to say it classified properly/identified the class ).

And then if more than 1 neuron activates, you could find which neuron has the “highest activation” and so on ( better than max, a softmax, but let’s leave that for now ).

But since there are intermediate activation values for the output, learning can be smoother and easier ( less wiggly ), and the chances of more than 1 neuron being 100% activated are lower when compared to the step function while training ( also depending on what you are training and the data ).

Ok, so we want something to give us intermediate ( analog ) activation values rather than saying “activated” or not ( binary ).

A straight line function where activation is proportional to input ( which is the weighted sum from the neuron ).

We can definitely connect a few neurons together and if more than 1 fires, we could take the max ( or softmax) and decide based on that.

If there is an error in prediction, the changes made by back propagation are constant and do not depend on the change in input delta(x)!

That activation in turn goes into the next level as input and the second layer calculates weighted sum on that input and it in turn, fires based on another linear activation function.

No matter how many layers we have, if all are linear in nature, the final activation function of last layer is nothing but just a linear function of the input of first layer!

No matter how we stack, the whole network is still equivalent to a single layer with linear activation ( a combination of linear functions in a linear manner is still another linear function ).
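The collapse of stacked linear layers can be checked with a few lines of plain Python. This is only a sketch: the 2x2 weights and biases below are made-up values, and the helper names are hypothetical.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]


# Hypothetical weights and biases for two layers with linear activation.
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.5, 0.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 2.0]


def two_linear_layers(x):
    hidden = add(matvec(W1, x), b1)
    return add(matvec(W2, hidden), b2)


# The same mapping expressed as ONE linear layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)


def single_layer(x):
    return add(matvec(W, x), b)


x = [3.0, -2.0]
print(two_linear_layers(x), single_layer(x))  # [-2.0, 12.0] [-2.0, 12.0]
```

Both calls give the same result for any input, which is why adding more linear layers buys no extra expressive power.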

It tends to bring the activations to either side of the curve ( above x = 2 and below x = -2 for example).

Another advantage of this activation function is, unlike linear function, the output of the activation function is always going to be in range (0,1) compared to (-inf, inf) of linear function.

The network refuses to learn further or is drastically slow ( depending on use case and until gradient /computation gets hit by floating point value limits ).
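A rough way to see this slowdown, assuming the activation under discussion is the sigmoid: during back propagation the gradient is multiplied by the sigmoid's slope at every layer it passes through, and that slope is at most 0.25 and close to 0 at the flat ends of the curve. The chain below ignores the weights entirely, which is a simplifying assumption made only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Slope of the sigmoid at x; at most 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_gradient(pre_activations):
    """Product of the sigmoid slopes along a chain of layers (weights ignored)."""
    product = 1.0
    for x in pre_activations:
        product *= sigmoid_grad(x)
    return product


print(chained_gradient([0.0, 0.0, 0.0]))  # best case for three layers: 0.015625
print(chained_gradient([5.0, 5.0, 5.0]))  # saturated units: far smaller still
```

Even in the best case the factor shrinks geometrically with depth, and saturated units make it shrink much faster.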

Imagine a network with randomly initialized ( or normalised ) weights where almost 50% of the network yields 0 activation because of the characteristic of ReLU ( output 0 for negative values of x ).

That means, those neurons which go into that state will stop responding to variations in error/ input ( simply because gradient is 0, nothing changes ).
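The effect can be sketched for a single weight. Everything here is a toy assumption: one input, one weight, a plain gradient-descent step, and the convention that the ReLU gradient at exactly 0 is taken as 0.

```python
def relu(x):
    """ReLU: pass positive values through, output 0 otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """1 for positive inputs, 0 otherwise (0 at x == 0 is an assumed convention)."""
    return 1.0 if x > 0 else 0.0


def updated_weight(weight, x, upstream_grad, lr=0.1):
    """One gradient-descent step for a one-weight neuron: out = relu(weight * x)."""
    pre_activation = weight * x
    grad = upstream_grad * relu_grad(pre_activation) * x
    return weight - lr * grad


# Negative pre-activation: the gradient is 0, so the weight does not move.
print(updated_weight(-0.5, 2.0, upstream_grad=1.0))  # -0.5
# Positive pre-activation: the weight is updated as usual.
print(updated_weight(0.5, 2.0, upstream_grad=1.0))
```

A neuron whose pre-activation stays negative for every input receives no updates at all, which is the non-responsive state described above.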

When you know the function you are trying to approximate has certain characteristics, you can choose an activation function which will approximate the function faster leading to faster training process.

For example, a sigmoid works well for a classifier ( see the graph of sigmoid, doesn’t it show the properties of an ideal classifier? ).

## Activation function

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.

A standard computer chip circuit can be seen as a digital network of activation functions that can be 'ON' (1) or 'OFF' (0), depending on input.

This is similar to the behavior of the linear perceptron in neural networks.

However, only nonlinear activation functions allow such networks to compute nontrivial problems using only a small number of nodes.[1]

In artificial neural networks this function is also called the transfer function.

In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell.[2]

In its simplest form, this function is binary—that is, either the neuron is firing or not.

The function looks like

$$\phi(v_i) = U(v_i)$$

where $U$ is the Heaviside step function.

In this case many neurons must be used in computation beyond linear separation of categories.

A line of positive slope may be used to reflect the increase in firing rate that occurs as input current increases.

Such a function would be of the form

$$\phi(v_i) = \mu v_i$$

where $\mu$ is the slope of the line.

This activation function is linear, and therefore has the same problems as the binary function.

In addition, networks constructed using this model have unstable convergence because neuron inputs along favored paths tend to increase without bound, as this function is not normalizable.

All problems mentioned above can be handled by using a normalizable sigmoid activation function.

One realistic model stays at zero until input current is received, at which point the firing frequency increases quickly at first, but gradually approaches an asymptote at 100% firing rate.

Mathematically, this looks like

$$\phi(v_i) = U(v_i)\tanh(v_i)$$

where the hyperbolic tangent function can be replaced by any sigmoid function.

This behavior is realistically reflected in the neuron, as neurons cannot physically fire faster than a certain rate.

This model runs into problems, however, in computational networks as it is not differentiable, a requirement to calculate backpropagation.

The final model, then, that is used in multilayer perceptrons is a sigmoidal activation function in the form of a hyperbolic tangent.

Two forms of this function are commonly used:

$$\phi(v_i) = \tanh(v_i)$$

whose range is normalized from -1 to 1, and

$$\phi(v_i) = (1 + \exp(-v_i))^{-1}$$

which is vertically translated to normalize from 0 to 1.
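As a short worked check, the two forms differ by a rescaling as well as a shift. Writing $\sigma$ for the logistic form (a symbol introduced here only for convenience):

$$\sigma(v_i) = (1 + \exp(-v_i))^{-1}, \qquad \tanh(v_i) = \frac{1 - \exp(-2v_i)}{1 + \exp(-2v_i)} = \frac{2}{1 + \exp(-2v_i)} - 1 = 2\,\sigma(2v_i) - 1$$

So the hyperbolic tangent is the logistic function with its input doubled, its output stretched by a factor of 2, and then shifted down by 1. This maps the range (0, 1) onto (-1, 1).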

The latter model is often considered more biologically realistic, but it runs into theoretical and experimental difficulties with certain types of computational problems.

A special class of activation functions known as radial basis functions (RBFs) is used in RBF networks, which are extremely efficient as universal function approximators.

These activation functions can take many forms, but they are usually found as one of three functions:

Here $c_i$ is the vector representing the function center, and $a$ and $\sigma$ are parameters affecting the spread of the radius.
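The formulas themselves did not survive in this copy of the text. As a hedged sketch, the three radial basis functions most commonly written with a center $c_i$ and spread parameters $a$ and $\sigma$ are the Gaussian, the multiquadric and the inverse multiquadric. That these are the three forms intended here is an assumption, and the Python function names are hypothetical.

```python
import math


def squared_distance(v, c):
    """Squared Euclidean distance between input v and center c."""
    return sum((x - y) ** 2 for x, y in zip(v, c))


def gaussian_rbf(v, c, sigma=1.0):
    """exp(-||v - c||^2 / (2 sigma^2)); equals 1 at the center, decays with distance."""
    return math.exp(-squared_distance(v, c) / (2.0 * sigma ** 2))


def multiquadric_rbf(v, c, a=1.0):
    """sqrt(||v - c||^2 + a^2); grows with distance from the center."""
    return math.sqrt(squared_distance(v, c) + a ** 2)


def inverse_multiquadric_rbf(v, c, a=1.0):
    """1 / sqrt(||v - c||^2 + a^2); largest at the center."""
    return 1.0 / math.sqrt(squared_distance(v, c) + a ** 2)


center = [1.0, 2.0]
print(gaussian_rbf([1.0, 2.0], center))  # 1.0 at the center
print(gaussian_rbf([3.0, 2.0], center))  # smaller further away
```

In each case the output depends only on the distance between the input and the center, which is what makes the function radial.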

Support vector machines (SVMs) can effectively utilize a class of activation functions that includes both sigmoids and RBFs.

In this case, the input is transformed to reflect a decision boundary hyperplane based on a few training inputs called support vectors $x$.

The activation function for the hidden layer of these machines is referred to as the inner product kernel,

$$K(v_i, x) = \phi(v_i)$$

The support vectors are represented as the centers in RBFs with the kernel equal to the activation function, but they take a unique form in the perceptron, in which the parameters $\beta_0$ and $\beta_1$ must satisfy certain conditions for convergence.

These machines can also accept arbitrary-order polynomial activation functions where

Activation function having types:

Some desirable properties in an activation function include:

The following table compares the properties of several activation functions that are functions of one fold x from the previous layer or layers:

The following table lists activation functions that are not functions of a single fold x from the previous layer or layers:


## Fundamentals of Deep Learning – Activation Functions and When to Use Them?

The Internet provides access to a plethora of information today.

When our brain is fed with a lot of information simultaneously, it tries hard to understand and classify the information between useful and not-so-useful information.

Let us go through these activation functions, see how they work and figure out which activation function fits well into what kind of problem statement.

Before I delve into the details of activation functions, let’s do a little review of what neural networks are and how they function.

A neural network is a very powerful machine learning mechanism which basically mimics how a human brain learns.

The brain receives the stimulus from the outside world, does the processing on the input, and then generates the output.

As the task gets complicated multiple neurons form a complex network, passing information among themselves.

The black circles in the picture above are neurons. Each neuron is characterized by its weight, bias and activation function.

A linear equation is simple to solve but is limited in its capacity to solve complex problems. A neural network without an activation function is essentially just a linear regression model.

The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.

We would want our neural networks to work on complicated tasks like language translations and image classifications.

Activation functions make the back-propagation possible since the gradients are supplied along with the error to update the weights and biases.

If the value Y is above a given threshold value then activate the neuron else leave it deactivated.

When we simply need to say yes or no for a single class, the step function would be the best choice, as it would either activate the neuron or leave it at zero.

The function is more theoretical than practical, since in most cases we would be classifying the data into multiple classes rather than just a single class.

This makes the step function not so useful, since during back-propagation the gradients of the activation functions are sent for error calculations to improve and optimize the results.

The gradient of the step function reduces it all to zero, and improvement of the models doesn’t really happen.

We saw the problem with the step function: with the gradient being zero, it was impossible to update the weights during backpropagation.

Now if each layer has a linear transformation, no matter how many layers we have the final output is nothing but a linear transformation of the input.

Our choice of using sigmoid or tanh would basically depend on the requirement of gradient in the problem statement.

First things first, the ReLU function is non linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.

But ReLU also falls prey to gradients moving towards zero. If you look at the negative side of the graph, the gradient is zero, which means that for activations in that region the weights are not updated during back propagation.

So in this case the gradient of the left side of the graph is non zero and so we would no longer encounter dead neurons in that region.

The parametrised ReLU function is used when the leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.
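The difference between the two variants can be sketched as follows. The default slope of 0.01 for the leaky version is a commonly used value assumed here, and the function names are illustrative. In the parametrised version the same slope is treated as a learnable parameter, so it needs a gradient of its own.

```python
def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha instead of 0."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient with respect to the input: never exactly zero when alpha > 0."""
    return 1.0 if x > 0 else alpha


def prelu_alpha_grad(x):
    """Gradient of the output with respect to alpha (used when alpha is learned).

    On the negative side the output is alpha * x, so the gradient is x;
    on the positive side alpha plays no role, so the gradient is 0.
    """
    return 0.0 if x > 0 else x


print(leaky_relu(-2.0), leaky_relu_grad(-2.0))  # -0.02 0.01
print(prelu_alpha_grad(-2.0))                   # -2.0
```

With a fixed alpha this is the leaky ReLU. Updating alpha during training using `prelu_alpha_grad` gives the parametrised form.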

The softmax function is also a type of sigmoid function but is handy when we are trying to handle classification problems.

The softmax function would squeeze the outputs for each class between 0 and 1 and would also divide by the sum of the outputs.

The softmax function is ideally used in the output layer of the classifier where we are actually trying to attain the probabilities to define the class of each input.
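A minimal softmax sketch in plain Python follows. Subtracting the maximum score before exponentiating is a standard numerical-stability trick added here. It does not change the result, and the example scores are made up.

```python
import math


def softmax(scores):
    """Exponentiate each score and divide by the sum, giving values in (0, 1) that sum to 1."""
    shift = max(scores)  # avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of a three-class output layer.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score receives the largest probability, and because the outputs sum to 1 they can be read as class probabilities for the input.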

Now that we have seen so many activation functions, we need some logic / heuristics to know which activation function should be used in which situation.

However, depending upon the properties of the problem, we might be able to make a better choice for easy and quicker convergence of the network.

In this article I have discussed the various types of activation functions and what are the types of problems one might encounter while using each of them.

## Activation Functions: Neural Networks

As you can see, the function is a line, or linear. Therefore, the output of the function will not be confined to any range.

Equation: f(x) = x

Range: (-infinity, infinity)

It doesn’t help with the complexity or various parameters of the usual data that is fed to neural networks.

Nonlinearity makes it easy for the model to generalize or adapt to a variety of data and to differentiate between outputs.

Therefore, it is especially used for models where we have to predict the probability as an output. Since the probability of anything exists only in the range of 0 to 1, sigmoid is the right choice.

But the issue is that all the negative values become zero immediately, which decreases the ability of the model to fit or train from the data properly.

That means any negative input given to the ReLU activation function turns the value into zero immediately in the graph, which in turn affects the resulting graph by not mapping the negative values appropriately.
