AI News, NIPS Proceedingsβ

How transferable are features in deep neural networks?
Part of: Advances in Neural Information Processing Systems 27 (NIPS 2014)

Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs.

Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected.

We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features.

Types of artificial neural networks

In particular, artificial neural networks are inspired by the behaviour of neurons and the electrical signals they convey between input (such as from the eyes or nerve endings in the hand), processing, and output from the brain (such as reacting to light, touch, or heat).

The way neurons semantically communicate is an area of ongoing research.[1][2][3][4] Most artificial neural networks bear only some resemblance to their more complex biological counterparts, but are very effective at their intended tasks (e.g. classification or segmentation).

In a probabilistic neural network, the probability density function (PDF) of each class is first approximated from the training examples. Then, using the PDF of each class, the class probability of a new input is estimated, and Bayes' rule is employed to allocate it to the class with the highest posterior probability.[10] It was derived from the Bayesian network[11] and a statistical algorithm called Kernel Fisher discriminant analysis.[12] It is used for classification and pattern recognition.
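As a rough illustration of the procedure just described, here is a minimal sketch (mine, not from the article; the function name pnn_classify, the Gaussian Parzen window, and the sigma parameter are assumptions for demonstration) that estimates each class's PDF from training points and applies Bayes' rule:

import numpy as np

def pnn_classify(x_new, X_train, y_train, sigma=1.0):
    # Return the class with the highest estimated posterior probability for x_new.
    classes = np.unique(y_train)
    posteriors = []
    for c in classes:
        Xc = X_train[y_train == c]
        # Parzen-window estimate of p(x_new | class c) using Gaussian kernels
        sq_dists = np.sum((Xc - x_new) ** 2, axis=1)
        likelihood = np.mean(np.exp(-sq_dists / (2.0 * sigma ** 2)))
        prior = len(Xc) / len(X_train)         # empirical class prior
        posteriors.append(likelihood * prior)  # Bayes' rule, up to a shared normalizer
    return classes[int(np.argmax(posteriors))]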

This approach has been implemented using a perceptron network whose connection weights were trained with backpropagation (supervised learning).[13] In a convolutional neural network (CNN, also called ConvNet, or shift-invariant or space-invariant artificial neural network[14][15]), the unit connectivity pattern is inspired by the organization of the visual cortex.

Unit response can be approximated mathematically by a convolution operation.[16] They are variations of multilayer perceptrons that use minimal preprocessing.[17] They have wide applications in image and video recognition, recommender systems[18] and natural language processing.[19]

Regulatory feedback networks started as a model to explain brain phenomena found during recognition, including network-wide bursting and difficulty with similarity found universally in sensory recognition.[20] This approach can also perform classification mathematically equivalent to that of feedforward methods and is used as a tool to create and modify networks.[21][22]

Radial basis functions are functions that have a distance criterion with respect to a center.

In classification problems the output layer is typically a sigmoid function of a linear combination of hidden layer values, representing a posterior probability.

A common solution is to associate each data point with its own centre, although this can expand the linear system to be solved in the final layer and requires shrinkage techniques to avoid overfitting.

All three approaches use a non-linear kernel function to project the input data into a space where the learning problem can be solved using a linear model.

Alternatively, if 9-NN classification is used and the closest 9 points are considered, then the effect of the surrounding 8 positive points may outweigh the closest (negative) point.

The Euclidean distance is computed from the new point to the center of each neuron, and a radial basis function (RBF) (also called a kernel function) is applied to the distance to compute the weight (influence) for each neuron.
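Putting the pieces above together, here is a minimal sketch (assumed, not from the article; rbf_predict, the Gaussian kernel, and the width parameter are illustrative choices) of RBF-network inference: Euclidean distances to the centres, a radial basis function applied to each distance, and a sigmoid of the linear combination of hidden activations.

import numpy as np

def rbf_predict(x_new, centres, out_weights, bias=0.0, width=1.0):
    dists = np.linalg.norm(centres - x_new, axis=1)      # Euclidean distance to each centre
    hidden = np.exp(-(dists ** 2) / (2.0 * width ** 2))  # Gaussian RBF applied to the distance
    z = hidden @ out_weights + bias                      # linear combination of hidden values
    return 1.0 / (1.0 + np.exp(-z))                      # sigmoid output as a posterior probability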

In recurrent neural networks used for supervised learning in discrete time settings, training sequences of real-valued input vectors become sequences of activations of the input nodes, one input vector at a time.

At each time step, each non-input unit computes its current activation as a nonlinear function of the weighted sum of the activations of all units from which it receives connections.
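A minimal sketch of this recurrence (my own, with tanh as the nonlinearity and the weight matrices W_in, W_rec, W_out as assumed names):

import numpy as np

def rnn_forward(inputs, W_in, W_rec, W_out):
    # inputs: sequence of input vectors with shape (T, n_in)
    h = np.zeros(W_rec.shape[0])
    outputs = []
    for x_t in inputs:                       # one input vector per time step
        h = np.tanh(W_in @ x_t + W_rec @ h)  # nonlinear function of the weighted sum of incoming activations
        outputs.append(W_out @ h)
    return np.array(outputs), h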

To minimize total error, gradient descent can be used to change each weight in proportion to its derivative with respect to the error, provided the non-linear activation functions are differentiable.

The standard method is called 'backpropagation through time' or BPTT, a generalization of back-propagation for feedforward networks.[23][24] A more computationally expensive online variant is called 'Real-Time Recurrent Learning' or RTRL.[25][26] Unlike BPTT, this algorithm is local in time but not local in space.[27][28] An online hybrid between BPTT and RTRL with intermediate complexity exists,[29][30] with variants for continuous time.[31]

A major problem with gradient descent for standard RNN architectures is that error gradients vanish exponentially quickly with the size of the time lag between important events.[32][33] The long short-term memory (LSTM) architecture overcomes these problems.[34]
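A toy numerical illustration of the vanishing-gradient effect (assumed for demonstration; the matrix size, the 0.1 scale of the recurrent weights, and the random hidden states are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
W_rec = rng.normal(scale=0.1, size=(20, 20))   # recurrent weights with small spectral radius
grad = np.ones(20)                             # gradient arriving at the latest hidden state
for t in range(50):                            # backpropagate through 50 time steps
    h = np.tanh(rng.normal(size=20))           # a stand-in hidden state at this step
    grad = W_rec.T @ (grad * (1.0 - h ** 2))   # chain rule through tanh and the recurrence
    if (t + 1) % 10 == 0:
        print(t + 1, np.linalg.norm(grad))     # the norm shrinks roughly geometrically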

In reinforcement learning settings, no teacher provides target signals. Instead, a fitness function, reward function, or utility function is occasionally used to evaluate performance; the network influences its own input stream through output units connected to actuators that affect the environment.

In simple recurrent networks, context units connect from the hidden layer or the output layer with a fixed weight of one.[35] At each time step, the input is propagated in a standard feedforward fashion, and then a backpropagation-like learning rule is applied (not performing gradient descent).

Echo state networks (ESN) are good at reproducing certain time series.[36] A variant for spiking neurons is known as liquid state machines.[37] The long short-term memory (LSTM)[34] avoids the vanishing gradient problem.
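A minimal echo state network sketch (my own, not from the article; the reservoir size, scaling, and ridge penalty are assumptions): a fixed random reservoir is driven by the input, and only the linear readout is fitted, here with ridge regression, to reproduce a target time series.

import numpy as np

def esn_fit(inputs, targets, n_reservoir=200, ridge=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, (n_reservoir, inputs.shape[1]))
    W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep the spectral radius below 1
    states, h = [], np.zeros(n_reservoir)
    for u in inputs:
        h = np.tanh(W_in @ u + W @ h)                  # reservoir update; these weights stay fixed
        states.append(h.copy())
    S = np.array(states)
    # Train only the readout with ridge-regularized least squares
    W_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_reservoir), S.T @ targets)
    return W_in, W, W_out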

LSTM RNNs outperformed other RNNs and other sequence learning methods such as HMMs in applications such as language learning[38] and connected handwriting recognition.[39] Bi-directional RNNs, or BRNNs, use a finite sequence to predict or label each element of the sequence based on both the past and the future context of the element.[40] This is done by adding the outputs of two RNNs: one processing the sequence from left to right, the other from right to left.
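A minimal bi-directional sketch (assumed, reusing the simple recurrence from earlier with illustrative names): the same kind of RNN is run once left to right and once right to left, and the two per-step outputs are added.

import numpy as np

def simple_rnn(inputs, W_in, W_rec, W_out):
    h, outs = np.zeros(W_rec.shape[0]), []
    for x_t in inputs:
        h = np.tanh(W_in @ x_t + W_rec @ h)
        outs.append(W_out @ h)
    return np.array(outs)

def brnn_outputs(inputs, fwd_params, bwd_params):
    forward = simple_rnn(inputs, *fwd_params)                # left-to-right pass
    backward = simple_rnn(inputs[::-1], *bwd_params)[::-1]   # right-to-left pass, realigned
    return forward + backward                                # each step sees past and future context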

Because neural networks suffer from local minima, starting with the same architecture and training procedure but using different random initial weights often gives vastly different results.[citation needed] A committee of machines (CoM) tends to stabilize the result.

The CoM is similar to the general machine learning bagging method, except that the necessary variety of machines in the committee is obtained by training from different starting weights rather than training on different randomly selected subsets of the training data.
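A minimal committee-of-machines sketch (my own; the scikit-learn MLPClassifier, the layer size, and the committee size are illustrative choices): the same architecture is trained several times from different random initial weights, and the predicted probabilities are averaged.

import numpy as np
from sklearn.neural_network import MLPClassifier

def train_committee(X, y, n_members=5):
    # Same architecture each time; only the random initial weights differ.
    return [MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=seed).fit(X, y)
            for seed in range(n_members)]

def committee_predict(committee, X):
    avg_proba = np.mean([m.predict_proba(X) for m in committee], axis=0)  # average the members' outputs
    return committee[0].classes_[np.argmax(avg_proba, axis=1)]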

Spiking neural networks (SNNs) are also a form of pulse computer.[46] SNNs with axonal conduction delays exhibit polychronization, and hence could have a very large memory capacity.[47] SNNs, together with the temporal correlations of neural assemblies in such networks, have been used to model figure/ground separation and region linking in the visual system.

The neocognitron uses multiple types of units (originally two, called simple and complex cells) as a cascading model for use in pattern recognition tasks.[49][50][51] Local features are extracted by S-cells, whose deformation is tolerated by C-cells.

Local features in the input are integrated gradually and classified at higher layers.[52] Among the various kinds of neocognitron[53] are systems that can detect multiple patterns in the same input by using back propagation to achieve selective attention.[54] It has been used for pattern recognition tasks and inspired convolutional neural networks.[55]

Dynamic neural networks address nonlinear multivariate behaviour and include (learning of) time-dependent behaviour, such as transient phenomena and delay effects.

Instead of just adjusting the weights in a network of fixed topology,[56] Cascade-Correlation begins with a minimal network, then automatically trains and adds new hidden units one by one, creating a multi-layer structure.

It is done by creating a specific memory structure, which assigns each new pattern to an orthogonal plane using adjacently connected hierarchical arrays.[57] The network offers real-time pattern recognition and high scalability.

Hierarchical temporal memory (HTM) combines and extends approaches used in Bayesian networks and in spatial and temporal clustering algorithms, while using a tree-shaped hierarchy of nodes that is common in neural networks.

Transfer Learning: The Art of Using Pre-trained Models in Deep Learning

Neural networks are a different breed of models compared to other supervised machine learning algorithms.

So I am picking up a concept touched on by Tim Urban in one of his recent articles on waitbutwhy.com. Tim explains that before language was invented, every generation of humans had to re-invent knowledge for themselves, and this is how knowledge grew from one generation to the next.

So, transfer learning by passing on weights is the equivalent of the language used to disseminate knowledge over generations in human evolution.

Instead of building a model from scratch to solve a similar problem, you use a model trained on another problem as a starting point.

You can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model from Google, built on ImageNet data) and use it to identify objects in your own images.

A pre-trained model may not be 100% accurate in your application, but it saves the huge effort required to re-invent the wheel.
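As a minimal sketch of this idea (assumed, not from the article; it uses the Keras applications API with published ImageNet weights, and example.jpg is a placeholder file name), a pre-trained Inception model can be used for prediction directly, with no training from scratch:

import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input, decode_predictions
from keras.preprocessing import image

model = InceptionV3(weights='imagenet')               # weights learned on ImageNet, reused as-is

img = image.load_img('example.jpg', target_size=(299, 299))   # placeholder image path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
print(decode_predictions(model.predict(x), top=3))    # the transferred knowledge does the work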

This was an image classification problem where we were given 4591 images in the training dataset and 1200 images in the test dataset.

To keep the architecture simple, after flattening the input image [224 x 224 x 3] into a [150528]-dimensional vector, I used three hidden layers with 500, 500 and 500 neurons respectively.
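A rough Keras sketch of this MLP (assumed; the relu activations, the optimizer, and the 16-way softmax matching the 16 categories mentioned later are my choices):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(500, activation='relu', input_dim=224 * 224 * 3),   # 150528 flattened inputs
    Dense(500, activation='relu'),
    Dense(500, activation='relu'),
    Dense(16, activation='softmax'),                           # one output per class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])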

Increasing the number of hidden layers and neurons caused a single epoch to take 20 seconds on my Titan X GPU with 12 GB of VRAM.

I used 3 convolutional blocks, each block following the same architecture. The result obtained after the final convolutional block was flattened into a [256]-dimensional vector and passed into a single hidden layer with 64 neurons.
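A rough Keras sketch of such a network (the exact block configuration is not reproduced above, so the filter counts and kernel sizes here are assumptions; the 16-way softmax again matches the categories mentioned later):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),  # block 1
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation='relu'),                             # block 2
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation='relu'),                             # block 3
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),        # single hidden layer after flattening
    Dense(16, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])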

Though the accuracy improved in comparison to the MLP output, the time taken to run a single epoch also increased, to 21 seconds.

The only change that I made to the existing VGG16 architecture was replacing the softmax layer with 1000 outputs by one with the 16 categories suitable for our problem, and re-training the dense layer.
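A minimal Keras sketch of that change (assumed; the 1024-unit dense layer is an illustrative size, and the convolutional base is kept frozen so that only the new dense head is trained):

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False                    # reuse the pre-trained convolutional features as-is

x = Flatten()(base.output)
x = Dense(1024, activation='relu')(x)          # retrained dense layer (size assumed)
out = Dense(16, activation='softmax')(x)       # 16 categories instead of the original 1000
model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])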

Also, the biggest benefit of using the VGG16 pre-trained model was that the dense layer took almost negligible time to train, while giving greater accuracy.

So I moved forward with this approach of using a pre-trained model, and the next step was to fine-tune my VGG16 model to suit this problem.

By using pre-trained models which have previously been trained on large datasets, we can directly use the weights and architecture obtained and apply the learning to our own problem statement.

If the problem at hand is very different from the one on which the model was trained, the prediction we get would be very inaccurate. For example, a model previously trained for speech recognition would work horribly if we tried to use it to identify objects.

The ImageNet dataset has been widely used to build various architectures since it is large enough (1.2M images) to create a generalized model. The problem statement is to train a model that can correctly classify the images into 1,000 separate object categories.

These 1,000 image categories represent object classes that we come across in our day-to-day lives, such as species of dogs, cats, various household objects, vehicle types etc.

These pre-trained networks demonstrate a strong ability to generalize to images outside the ImageNet dataset via transfer learning.

The diagram below should help you decide how to proceed with using a pre-trained model in your case.

In this case, all we do is modify the dense layers and the final softmax layer to output 2 categories instead of 1,000.

Since the new dataset has low similarity, it is important to retrain and customize the higher layers according to the new dataset.

The small size of the dataset is compensated for by the fact that the initial layers are kept pre-trained (they have previously been trained on a large dataset) and their weights are frozen.

    train_img.append(temp_img)

# converting train images to an array and applying mean-subtraction preprocessing
train_img = np.array(train_img)
train_img = preprocess_input(train_img)

# applying the same procedure to the test dataset
test_img = []
for i in range(len(test)):

# Extracting features from the train dataset using the VGG16 pre-trained model
features_train = model.predict(train_img)

# Extracting features from the test dataset using the VGG16 pre-trained model
features_test = model.predict(test_img)

# flattening the layers to conform to MLP input
train_x = features_train.reshape(49000, 25088)

# converting the target variable to an array
train_y = np.asarray(train['label'])

# performing one-hot encoding for the target variable
train_y = pd.get_dummies(train_y)
train_y = np.array(train_y)

# creating training and validation sets
from sklearn.model_selection import train_test_split
X_train,

Freeze the weights of the first few layers: here we freeze the weights of the first 8 layers of the VGG16 network, while we retrain the subsequent layers. This is because the first few layers capture universal features like curves and edges that are also relevant to our new problem; a minimal sketch of this freezing step follows the code fragment below.

    return model

train_y = np.asarray(train['label'])
le = LabelEncoder()
train_y = le.fit_transform(train_y)
train_y = to_categorical(train_y)
train_y = np.array(train_y)

from sklearn.model_selection import train_test_split
X_train,
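A minimal sketch of the freezing step described above (assumed; how the "first 8 layers" are counted against the Keras VGG16 layer list, the new head, and num_classes are my choices for illustration):

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

num_classes = 16                                 # set to the number of classes in your problem
vgg = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in vgg.layers[:8]:
    layer.trainable = False                      # early layers keep their universal curve/edge features

x = Flatten()(vgg.output)
out = Dense(num_classes, activation='softmax')(x)   # new classifier head
model = Model(inputs=vgg.input, outputs=out)
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])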

There are various architectures that people have tried on different types of datasets, and I strongly encourage you to go through these architectures and apply them to your own problem statements.

NIPS: Oral Session 4 - Jason Yosinski

How transferable are features in deep neural networks? Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on ...

Transfer Learning | Lecture 9

Let's talk about the fastest and easiest way you can build a deep learning model, without worrying too much about how much data you have. Deep Learning ...

Lecture 7 | Training Neural Networks II

Lecture 7 continues our discussion of practical issues for training neural networks. We discuss different update rules commonly used to optimize neural networks ...

Lecture 4: Word Window Classification and Neural Networks

Lecture 4 introduces single and multilayer neural networks, and how they can be used for classification purposes. Key phrases: Neural networks. Forward ...

MIT 6.S094: Deep Reinforcement Learning for Motion Planning

This is lecture 2 of course 6.S094: Deep Learning for Self-Driving Cars taught in Winter 2017. This lecture introduces types of machine learning, the neuron as a ...

Dr. Yann LeCun, "How Could Machines Learn as Efficiently as Animals and Humans?"

Brown Statistics, NESS Seminar and Charles K. Colver Lectureship Series. Deep learning has caused revolutions in computer perception and natural language ...

Can Depression Be Cured? New Research on Depression and its Treatments

Four medical researchers at the forefront of developing treatments for depression present new findings in a special conference held at the Library's John W.

Dense Associative Memories and Deep Learning

Dense Associative Memories are generalizations of Hopfield nets to higher order (higher than quadratic) interactions between the spins/neurons. I will describe ...

Word Embedding Explained and Visualized - word2vec and wevi

This is a talk I gave at Ann Arbor Deep Learning Event (a2-dlearn) hosted by Daniel Pressel et al. I gave an introduction to the working mechanism of the ...

The Ethics and Governance of AI: Opening Event

Chapter 1: 0:04 - Joi Ito
Chapter 2: 1:03:27 - Jonathan Zittrain
Chapter 3: 2:32:59 - Panel 1
Chapter 4: 3:19:13 - Panel 2
More information at: ...