AI News, NIPS Proceedingsβ

Part of: Advances in Neural Information Processing Systems 27 (NIPS 2014)

We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video.

Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both.

Convolutional neural network

In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks, most commonly applied to analyzing visual imagery.

CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.[1] They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.[2][3] Convolutional networks were inspired by biological processes[4] in that the connectivity pattern between neurons resembles the organization of the animal visual cortex.

Because images have very large input sizes, with each pixel a relevant variable, a very high number of neurons would be necessary even in a shallow (as opposed to deep) architecture.

The convolution operation solves this problem by reducing the number of free parameters, allowing the network to be deeper with fewer parameters.[8] For instance, regardless of image size, tiling regions of size 5 x 5, each with the same shared weights, requires only 25 learnable parameters.
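To make the shared-weights arithmetic concrete, here is a minimal sketch (PyTorch is an assumed framework choice, not something the article prescribes) showing that the layer's 25 weights are independent of the input image size:

```python
import torch
import torch.nn as nn

# One 5x5 filter on a single-channel input: 5 * 5 = 25 shared weights (bias disabled).
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5, bias=False)
print(sum(p.numel() for p in conv.parameters()))   # 25

# The same 25 weights slide over any image size; only the output size changes.
for size in (32, 256):
    x = torch.randn(1, 1, size, size)
    print(conv(x).shape)   # (1, 1, size - 4, size - 4)
```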

In this way, weight sharing alleviates the vanishing and exploding gradient problems encountered when training traditional multi-layer neural networks with many layers by backpropagation[citation needed].

Convolutional networks may include local or global pooling layers[clarification needed], which combine the outputs of neuron clusters at one layer into a single neuron in the next layer.[9][10] For example, max pooling uses the maximum value from each of a cluster of neurons at the prior layer.[11] Another example is average pooling, which uses the average value from each of a cluster of neurons at the prior layer[citation needed].
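As a toy illustration of the two pooling variants just described (NumPy and the 4x4 values are illustrative assumptions):

```python
import numpy as np

def pool2x2(x, op):
    h, w = x.shape
    # Group the map into non-overlapping 2x2 clusters, then reduce each cluster.
    clusters = x.reshape(h // 2, 2, w // 2, 2)
    return op(clusters, axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 6.],
                 [2., 2., 7., 8.]])

print(pool2x2(fmap, np.max))    # max pooling:     [[4. 2.] [2. 8.]]
print(pool2x2(fmap, np.mean))   # average pooling: [[2.5  1.  ] [1.25 6.5 ]]
```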

Work by Hubel and Wiesel in the 1950s and 1960s showed that cat and monkey visual cortexes contain neurons that individually respond to small regions of the visual field.

Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field[citation needed].

Their 1968 paper[12] identified two basic visual cell types in the brain: simple cells, whose output is maximized by straight edges at particular orientations within their receptive field, and complex cells, which have larger receptive fields and whose output is insensitive to the exact position of the edges in the field.

The neocognitron[13] was introduced in 1980.[11][14] The neocognitron does not require units located at multiple network positions to have the same trainable weights.

This idea appears in 1986 in the book version of the original backpropagation paper.[15]:Figure 14 Neocognitrons were developed in 1988 for temporal signals.[clarification needed][16] Their design was improved in 1998,[17] generalized in 2003[18] and simplified in the same year.[19]

LeNet-5, a pioneering 7-level convolutional network by LeCun et al. from 1998, classifies digits and was applied by several banks to recognize hand-written numbers on checks.

Similarly, a shift invariant neural network was proposed for image character recognition in 1988.[2][3] The architecture and training algorithm were modified in 1991[20] and applied for medical image processing[21] and automatic detection of breast cancer in mammograms.[22]

A different convolution-based design was proposed in 1988[23] for application to decomposition of one-dimensional electromyography convolved signals via de-convolution.

This design was modified in 1989 to other de-convolution-based designs.[24][25] The feed-forward architecture of convolutional neural networks was extended in the neural abstraction pyramid[26] by lateral and feedback connections.

Following the 2005 paper that established the value of GPGPU for machine learning,[27] several publications described more efficient ways to train convolutional neural networks using GPUs.[28][29][30][31] In 2011, they were refined and implemented on a GPU, with impressive results.[9] In 2012, Ciresan et al. significantly improved on the best performance in the literature for multiple image databases, including the MNIST database, the NORB database, the HWDB1.0 dataset (Chinese characters), the CIFAR10 dataset (60,000 32x32 labeled RGB images),[11] and the ImageNet dataset.[32]

While traditional multilayer perceptron (MLP) models were successfully used for image recognition[example needed], due to the full connectivity between nodes they suffer from the curse of dimensionality and thus do not scale well to higher-resolution images.

For example, in CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully connected neuron in a first hidden layer of a regular neural network would have 32*32*3 = 3,072 weights.

Also, such network architecture does not take into account the spatial structure of data, treating input pixels which are far apart in the same way as pixels that are close together.

Weight sharing dramatically reduces the number of free parameters learned, thus lowering the memory requirements for running the network and allowing the training of larger, more powerful networks.
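A back-of-the-envelope comparison, in plain Python, of the parameter counts with and without weight sharing (the 64-filter figure is an arbitrary illustrative choice):

```python
# Plain-Python parameter counts for a CIFAR-10-sized input (32x32 pixels, 3 channels).
in_h, in_w, in_c = 32, 32, 3

# Fully connected: one hidden neuron is wired to every input value.
weights_per_fc_neuron = in_h * in_w * in_c
print(weights_per_fc_neuron)        # 3072, matching the figure in the text

# Convolutional: a 5x5 filter spanning all 3 channels shares its weights everywhere.
weights_per_filter = 5 * 5 * in_c
print(weights_per_filter)           # 75, independent of image size

# Even a bank of 64 such filters (an arbitrary illustrative number) stays small.
print(64 * weights_per_filter)      # 4800
```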

The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume.

During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter.

Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.

When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume because such a network architecture does not take the spatial structure of the data into account.

Convolutional networks exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume.

Since all neurons in a single depth slice share the same parameters, the forward pass in each depth slice of the CONV layer can be computed as a convolution of the neurons' weights with the input volume (hence the name: convolutional layer).

The result of this convolution is an activation map, and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume.
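A minimal sketch of this forward pass (PyTorch is an assumed choice; the filter count and sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # input volume: depth 3, spatial size 32x32

# 16 learnable filters, each 3x5x5: a 5x5 receptive field over the full input depth.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)

out = conv(x)
# 16 activation maps stacked along the depth dimension of the output volume.
print(out.shape)   # torch.Size([1, 16, 32, 32])
```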

The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters and amount of computation in the network, and hence to also control overfitting.

The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.

Average pooling was often used historically but has recently fallen out of favor compared to max pooling, which works better in practice.[34] Due to the aggressive reduction in the size of the representation, the trend is towards using smaller filters[35] or discarding the pooling layer altogether.[36]

Region of Interest pooling (also known as RoI pooling) is a variant of max pooling in which the output size is fixed and the input rectangle is a parameter.[37] Pooling is an important component of convolutional neural networks for object detection based on the Fast R-CNN[38] architecture.

Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.

In stochastic pooling,[43] the conventional deterministic pooling operations are replaced with a stochastic procedure, where the activation within each pooling region is picked randomly according to a multinomial distribution, given by the activities within the pooling region.
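A toy NumPy sketch of this idea, assuming non-negative (e.g. post-ReLU) activations; it illustrates the sampling rule and is not the reference implementation from [43]:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_pool2x2(x):
    h, w = x.shape
    # Gather each non-overlapping 2x2 region into a row of four activations.
    regions = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    out = np.empty(len(regions))
    for i, region in enumerate(regions):
        p = region / region.sum()          # multinomial probabilities from the activities
        out[i] = rng.choice(region, p=p)   # sample one activation per region
    return out.reshape(h // 2, w // 2)

# Assumes non-negative activations with a nonzero sum in every pooling region.
fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 6.],
                 [2., 2., 7., 8.]])
print(stochastic_pool2x2(fmap))
```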

Since the degree of model overfitting is determined by both its power and the amount of training it receives, providing a convolutional network with more training examples can reduce overfitting.

Since these networks are usually trained with all available data, one approach is to either generate new data from scratch (if possible) or perturb existing data to create new ones.

For example, input images could be asymmetrically cropped by a few percent to create new examples with the same label as the original.[45] One of the simplest methods to prevent overfitting of a network is to stop the training before overfitting has had a chance to occur.
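A minimal NumPy sketch of the cropping perturbation described above (the shift amounts and zero-padding are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, max_shift=2):
    """Crop a few pixels off randomly chosen sides, then zero-pad back to size."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, max_shift + 1))
    left = int(rng.integers(0, max_shift + 1))
    cropped = img[top:h - (max_shift - top), left:w - (max_shift - left)]
    pad = ((top, max_shift - top), (left, max_shift - left)) + ((0, 0),) * (img.ndim - 2)
    return np.pad(cropped, pad)   # same shape and label as the original image

img = rng.random((32, 32, 3))
print(random_crop(img).shape)     # (32, 32, 3)
```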

Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth.

Limiting the number of parameters restricts the predictive power of the network directly, reducing the complexity of the function that it can perform on the data, and thus limits the amount of overfitting.

A simple form of added regularization is weight decay, which adds an additional error, proportional to the sum of the weights (L1 norm) or the squared magnitude (L2 norm) of the weight vector, to the error at each node.
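A sketch of weight decay written as an explicit penalty term added to the loss (PyTorch is an assumed choice; the coefficient `lam` is a hypothetical value):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)        # stand-in for any network
criterion = nn.MSELoss()
lam = 1e-4                      # hypothetical regularization strength

x, y = torch.randn(8, 10), torch.randn(8, 1)
data_loss = criterion(model(x), y)

l1 = sum(p.abs().sum() for p in model.parameters())      # sum of weights (L1)
l2 = sum((p ** 2).sum() for p in model.parameters())     # squared magnitude (L2)

loss = data_loss + lam * l2     # use `l1` instead for an L1 penalty
loss.backward()
```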

Humans, by contrast, can recognize a new shape from a different viewpoint after seeing it only once.[47] Currently, the common way to deal with this problem is to train the network on transformed data in different orientations, scales, lighting, etc.

The pose relative to retina is the relationship between the coordinate frame of the retina and the intrinsic features' coordinate frame.[48] Thus, one way of representing something is to embed the coordinate frame within it.

The vectors of neuronal activity that represent pose ('pose vectors') allow spatial transformations to be modeled as linear operations, which makes it easier for the network to learn the hierarchy of visual entities and generalize across viewpoints.
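As a toy illustration of spatial transformations modeled as linear operations, here is a 2-D pose vector under a hypothetical viewpoint rotation (the dimensionality and angle are made up for the example):

```python
import numpy as np

theta = np.pi / 6                        # hypothetical 30-degree viewpoint change
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # the change acts as a linear map

pose = np.array([1.0, 0.0])              # toy 2-D pose vector of a visual entity
print(T @ pose)                          # transformed pose: one matrix multiply
```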

In 2012 an error rate of 0.23 percent on the MNIST database was reported.[11] Another paper on using CNNs for image classification reported that the learning process was 'surprisingly fast'; in the same paper, the best published results as of 2011 were achieved on the MNIST database and the NORB database.[9] When applied to facial recognition, CNNs achieved a large decrease in error rate.[50] Another paper reported a 97.6 percent recognition rate on '5,600 still images of more than 10 subjects'.[4] CNNs were used to assess video quality in an objective way after manual training; the resulting system had a very low root mean square error.[51]

The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object classification and detection, with millions of images and hundreds of object classes.

The winner GoogLeNet[53] (the foundation of DeepDream) increased the mean average precision of object detection to 0.439329, and reduced classification error to 0.06656, the best result to date.

The performance of convolutional neural networks on the ImageNet tests was close to that of humans.[54] The best algorithms still struggle with objects that are small or thin, such as a small ant on the stem of a flower or a person holding a quill in their hand.

Humans, for example, are not good at classifying objects into fine-grained categories such as the particular breed of dog or species of bird, whereas convolutional neural networks handle this well.

One approach is to treat space and time as equivalent dimensions of the input and perform convolutions in both time and space.[56][57] Another way is to fuse the features of two convolutional neural networks, one for the spatial and one for the temporal stream.[58][59][60]
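A minimal sketch of the first approach, treating time as a third convolution axis (PyTorch is an assumed choice; the clip and kernel sizes are illustrative):

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)

# A 3x3x3 kernel convolves along time as well as the two spatial axes.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

print(conv3d(clip).shape)   # torch.Size([1, 64, 16, 112, 112])
```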

CNN models are effective for various NLP problems and achieved excellent results in semantic parsing,[63] search query retrieval,[64] sentence modeling,[65] classification,[66][67] prediction[68] and other traditional NLP tasks.[69]

CNNs have been used in drug discovery.

In 2015, Atomwise introduced AtomNet, the first deep learning neural network for structure-based rational drug design.[70] The system trains directly on 3-dimensional representations of chemical interactions.

Similar to how image recognition networks learn to compose smaller, spatially proximate features into larger, complex structures,[71] AtomNet discovers chemical features, such as aromaticity, sp3 carbons and hydrogen bonding.

Subsequently, AtomNet was used to predict novel candidate biomolecules for multiple disease targets, most notably treatments for the Ebola virus[72] and multiple sclerosis.[73]

CNNs have been used in the game of checkers.

The learning process did not use prior human professional games, but rather focused on a minimal set of information contained in the checkerboard: the location and type of pieces, and the piece differential[clarify].

Ultimately, the program (Blondie24) was tested on 165 games against players and ranked in the highest 0.4%.[74][75] It also earned a win against the program Chinook at its 'expert' level of play.[76]

CNNs have been used in computer Go.

In December 2014, Clark and Storkey published a paper showing that a CNN trained by supervised learning from a database of human professional games could outperform GNU Go and win some games against Monte Carlo tree search Fuego 1.1 in a fraction of the time it took Fuego to play.[77] Later it was announced that a large 12-layer convolutional neural network had correctly predicted the professional move in 55% of positions, equalling the accuracy of a 6 dan human player.

When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GNU Go in 97% of games, and matched the performance of the Monte Carlo tree search program Fuego simulating ten thousand playouts (about a million positions) per move.[78]

A couple of CNNs for choosing moves to try ('policy network') and evaluating positions ('value network') driving MCTS were used by AlphaGo, the first program to beat the best human player at the time.[79]

For many applications, little training data is available.

Other deep reinforcement learning models preceded it.[82]

Convolutional deep belief networks (CDBN) have a structure very similar to that of convolutional neural networks and are trained similarly to deep belief networks.

A time delay neural network allows speech signals to be processed time-invariantly, analogous to the translation invariance offered by CNNs.[85] They were introduced in the early 1980s.

Lecture 14 | Deep Reinforcement Learning

In Lecture 14 we move from supervised learning to reinforcement learning (RL), in which an agent must learn to interact with an environment in order to ...

Graph neural networks: Variations and applications

Many real-world tasks require understanding interactions between a set of entities. Examples include interacting atoms in chemical molecules, people in social ...

MIT 6.S094: Introduction to Deep Learning and Self-Driving Cars

This is lecture 1 of course 6.S094: Deep Learning for Self-Driving Cars, taught in Winter 2017.

Unit 7 Panel: Vision and Audition

MIT RES.9-003 Brains, Minds and Machines Summer Course, Summer 2015. Instructor: Josh ..

Resilience and Security in Cyber-Physical Systems: Self-Driving Cars and Smart Devices

The future will be defined by autonomous computer systems that are tightly integrated with the environment, also known as Cyber-Physical systems (CPS).