AI News, Efficient probabilistic inference in generic neural networks trained with non-probabilistic feedback

We trained generic feedforward or recurrent neural networks on nine probabilistic psychophysical tasks that are commonly studied in the experimental and computational literature.

Networks were trained to minimize mean squared error or cross-entropy in tasks with continuous or categorical outputs, respectively.
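As a minimal sketch of the two training objectives named above, the functions below compute mean squared error for continuous outputs and binary cross-entropy for categorical outputs. The function names, and the restriction to two categories, are illustrative assumptions rather than details taken from the study.

```python
import math

def mean_squared_error(targets, outputs):
    """Average squared difference, used here for continuous outputs."""
    return sum((t - y) ** 2 for t, y in zip(targets, outputs)) / len(targets)

def cross_entropy(labels, probs, eps=1e-12):
    """Average binary cross-entropy; labels are 0/1, probs are P(label = 1)."""
    total = 0.0
    for c, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= c * math.log(p) + (1 - c) * math.log(1.0 - p)
    return total / len(labels)
```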

To make sure that optimal performance in our tasks cannot be easily mimicked by heuristic, non-probabilistic models, we also calculated the performance of non-probabilistic reference models that did not take the reliabilities of the inputs into account (Fig.2c, d, blue).

Representation of posterior uncertainty is evident in the Kalman filtering task, where accurate encoding of the posterior mean at a particular moment already requires the encoding of the posterior mean and the posterior width at the previous moment and the optimal integration of these with the current sensory information in the recurrent activity of the network.
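The dependence on the previous posterior mean and width can be made concrete with a one-dimensional Kalman update. This is a sketch under assumed random-walk dynamics with Gaussian noise; the parameter names and the example numbers are hypothetical, not the task settings used in the study.

```python
def kalman_step(mean, var, obs, process_var, obs_var):
    """One update of a 1-D Kalman filter with random-walk dynamics (assumed)."""
    # Predict: the mean carries over, the uncertainty grows by the process noise.
    prior_mean = mean
    prior_var = var + process_var
    # The weight on the new observation depends on the previous posterior width.
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

# Hypothetical sequence of (observation, observation variance) pairs.
mean, var = 0.0, 1.0
for obs, obs_var in [(1.0, 1.0), (1.2, 0.25), (0.8, 4.0)]:
    mean, var = kalman_step(mean, var, obs, process_var=0.1, obs_var=obs_var)
```

The new mean cannot be computed without the previous variance, because the previous variance sets the weight given to the current observation.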

The combined coordinate transformation-cue combination (CT + CC) network also illustrates the generic nature of the representations learned by the hidden layers of our networks: the hidden layer of a network trained on the coordinate transformation task can be combined, without modification, with an additional input population to perform a different task, i.e., cue combination in this example.

It has been argued that truly Bayesian computation requires that the components of the Bayesian computation, i.e., sensory likelihoods and the prior, be individually meaningful to the brain (ref. 25).

[…] (g₁, g₂) = (5, 5) and g = (25, 25), and tested it on all gain combinations of the form (g₁, g₂), where g₁, g₂ ∈ {5, 10, 15, 20, 25}, with up to fivefold gain differences between the two input populations (note that these gains are higher than those used in the main simulations to make the optimal combination rule approximately linear).
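To illustrate what an approximately linear optimal combination rule looks like over this gain grid, the sketch below assumes that the reliability of each cue is proportional to its gain (as for Poisson-like population codes), so the optimal weight on cue 1 is g₁ / (g₁ + g₂). That proportionality and the helper names are assumptions made for illustration.

```python
GAINS = (5, 10, 15, 20, 25)

def optimal_weights(g1, g2):
    """Reliability-weighted cue weights, assuming reliability is proportional to gain."""
    total = g1 + g2
    return g1 / total, g2 / total

def combine(s1, s2, g1, g2):
    """Linear combination of two single-cue estimates s1 and s2."""
    w1, w2 = optimal_weights(g1, g2)
    return w1 * s1 + w2 * s2

# Weights for every test combination, including the two training combinations
# (5, 5) and (25, 25), which both give equal weights.
grid = {(g1, g2): optimal_weights(g1, g2) for g1 in GAINS for g2 in GAINS}
```

Under this assumption the two training conditions never show the network unequal weights, so correct behaviour at, say, (5, 25) has to come from generalization.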

To demonstrate that the trained networks performed qualitatively correct probabilistic inference, we set up cue conflict conditions similar to those used in psychophysical studies (ref. 2): we presented slightly different stimuli to the two input populations and manipulated the degree of conflict between the cues.


2, are shown in Fig.4a both for the network and for the optimal rule.

The successful generalization performance of the neural networks is a result of two factors.


, as a function of the input gain g, the mean input μ to the hidden unit for unit gain, and the mean μ


to variations in g: where the prime represents the derivative with respect to g, and numerically minimized T

var with respect to μ, μ

, subject to the constraint that the mean response across different gains be equal to a positive constant K.

var was minimized for a negative mean input μ, positive μ

var values.

causes a large proportion of the input distribution to be above the threshold for low gains.

The negativity of the mean input μ implies that as the gain g increases, the distribution of the total input to the unit shifts to the left (Fig.5b, top) and becomes wider, causing a smaller proportion of the distribution to remain above the threshold (represented by the dashed line in Fig.5b), hence decreasing the probability that the neuron will have a non-zero response (Fig.5c, top).
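The sparsification argument can be checked numerically with a toy calculation. The sketch assumes the total input to a rectified unit is Gaussian with mean g·input_mean + bias_mean and variance g·input_sd² + bias_sd², with the threshold at zero. The Gaussian form, the linear growth of variance with gain, and all parameter values are illustrative assumptions and not the paper's fitted model.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_active(g, input_mean=-0.1, bias_mean=1.0, input_sd=0.5, bias_sd=1.0):
    """Probability that the total input exceeds a threshold of zero at gain g."""
    mean = g * input_mean + bias_mean             # shifts left as g grows
    sd = math.sqrt(g * input_sd ** 2 + bias_sd ** 2)  # widens as g grows
    return normal_cdf(mean / sd)

gains = [5, 10, 15, 20, 25]
probs = [prob_active(g) for g in gains]
```

With a negative input mean and a positive bias mean, the fraction of the distribution above threshold falls steadily as the gain increases, which is the sparsification described above.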

[…] and negative μ causes the sparsification of the hidden unit responses with increasing g.

We demonstrate this sparsification mechanism for a network trained on the coordinate transformation task in Fig.5f–i.

[…] are close to 0, the probability of non-zero responses as a function of g stays roughly constant (Fig.5c, bottom), causing the mean response to increase with g (Fig.5e, bottom).

On the basis of our simple mean-field model, we therefore predicted that for those tasks where the net input to the output unit is approximately g-invariant, there should be a positive correlation between the sparsity of hidden unit responses and the input gain and no (or only a weak) correlation between the mean hidden unit response and the input gain.

On the other hand, in tasks such as causal inference, where the net input to the output unit has a strong g-dependence, we predicted a positive correlation between the mean hidden unit response and the input gain and no (or only a weak) correlation between the sparsity of hidden unit responses and the input gain.

The difference between these two types of tasks (g-invariant and g-dependent) was also reflected in the tuning functions that developed in the hidden layers of the networks.

For approximately g-invariant tasks, such as coordinate transformation, increasing the input gain g sharpens the tuning of the hidden units (Fig.7a), whereas for g-dependent tasks, such as causal inference, input gain acts more like a multiplicative factor scaling the tuning functions without changing their shape (Fig.7b).

We finally emphasize that these results depend on the linear read-out of hidden layer responses.

A well-known theoretical result can explain the inefficiency of random networks (ref. 39): the approximation error of neural networks with adjustable hidden units scales as O(1/n) with n denoting the number of hidden units, whereas for networks with fixed hidden units, as in our random networks, the scaling is much worse: O(1/n^(2/d)), where d is the dimensionality of the problem, suggesting that they need exponentially more neurons than fully trained networks in order to achieve the same level of performance.
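A short calculation shows why the second rate implies exponentially more units. Inverting the two error bounds while ignoring constant factors (a simplifying assumption) gives roughly 1/ε units for adjustable hidden units and ε^(-d/2) units for fixed ones, for a target error ε. The function names are hypothetical.

```python
def units_needed_adjustable(eps):
    """Units needed if error scales as 1/n (constants ignored)."""
    return 1.0 / eps

def units_needed_fixed(eps, d):
    """Units needed if error scales as 1/n^(2/d) (constants ignored)."""
    return eps ** (-d / 2.0)

# Target error of 10% for a few problem dimensionalities.
table = {d: (units_needed_adjustable(0.1), units_needed_fixed(0.1, d))
         for d in (2, 4, 8, 16)}
```

For a 10% error target the adjustable case needs on the order of 10 units regardless of d, while the fixed case grows by a factor of 10 for every two added dimensions.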

So far, we have only considered feedforward networks with undifferentiated neurons.

Besides providing a good fit to the learning curve of the subject (Fig.9c), the neural networks also correctly predicted the progression of the models that best fit the subject’s data, i.e., early on in the training the QUAD model, then the LIN model (Fig.9d).

[…] n*, required to achieve a given level of performance (15% information loss for visual search, 10% fractional RMSE, or information loss for the other tasks) as a function of the total number of input units, d, in our generic networks.

[…] n* with d was better than O(d), i.e., sublinear, in all our tasks (Fig.10b and Supplementary Table 1).

[…], leading to an estimate of O(1) hidden units in terms of d.

[…] n* with d was approximately constant over the range of d values tested, suggesting smoothness properties similar to a d-dimensional standard Gaussian.

[…] n* on log d yields a slope of 0.56 (R[…]

The efficiency of our generic networks contrasts sharply with the inefficiency of the manually crafted networks in earlier PPC studies (refs. 7, 9–13): except for the linear cue combination task, these hand-crafted networks used a quadratic expansion, which requires at least O(d^2) hidden units.
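The quadratic cost is easy to verify by counting. The sketch assumes one hidden unit per distinct pairwise product of the d inputs, squares included. That counting convention is an assumption made for illustration; other conventions differ only by a constant factor.

```python
def quadratic_terms(d):
    """Number of distinct products x_i * x_j with i <= j among d inputs."""
    return d * (d + 1) // 2

# Doubling d roughly quadruples the number of quadratic terms.
counts = {d: quadratic_terms(d) for d in (10, 20, 40, 80)}
```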
