AI News, From Colah’s Blog: Neural Networks, Manifolds, and Topology

From Colah’s Blog, “Neural Networks, Manifolds, and Topology” is an interesting blog post that explores the links between machine learning (in this case, neural networks) and aspects of mathematics, in particular topology.

‘Recently, there’s been a great deal of excitement and interest in deep neural networks because they’ve achieved breakthrough results in areas such as computer vision [1]. However, there remain a number of concerns about them.

While it is challenging to understand the behavior of deep neural networks in general, it turns out to be much easier to explore low-dimensional deep neural networks – networks that only have a few neurons in each layer.

This perspective will allow us to gain deeper intuition about the behavior of neural networks and observe a connection linking neural networks to an area of mathematics called topology.

A number of interesting things follow from this, including fundamental lower bounds on the complexity of a neural network capable of classifying certain datasets.’

‘K-Nearest Neighbor Layers: I’ve also begun to think that linear separability may be a huge, and possibly unreasonable, amount to demand of a neural network.

Conclusion: Topological properties of data, such as links, may make it impossible to linearly separate classes using low-dimensional networks, regardless of depth.’

Neural Networks, Manifolds, and Topology

The obvious way to visualize the behavior of a neural network – or any classification algorithm, for that matter – is to simply look at how it classifies every possible data point.
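
As a minimal illustration of that idea (my own sketch, not from the post), one can evaluate a classifier on a dense grid of points and color each point by its predicted class; the `predict` function below is a hypothetical stand-in for any two-class model.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for a trained two-class classifier:
# here a fixed line in the plane decides the class.
def predict(points):
    w, b = np.array([1.0, -2.0]), 0.3
    return (points @ w + b > 0).astype(int)

# Evaluate the classifier at every point of a dense grid.
xs, ys = np.meshgrid(np.linspace(-2, 2, 300), np.linspace(-2, 2, 300))
grid = np.column_stack([xs.ravel(), ys.ravel()])
classes = predict(grid).reshape(xs.shape)

# Color the plane by predicted class to reveal the decision regions.
plt.contourf(xs, ys, classes, levels=[-0.5, 0.5, 1.5], cmap="coolwarm", alpha=0.5)
plt.title("Decision regions of a toy classifier")
plt.show()
```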

A tanh layer \(\tanh(Wx+b)\) consists of a linear transformation by the “weight” matrix \(W\), a translation by the vector \(b\), and point-wise application of \(\tanh\). We can visualize this as a continuous transformation of the input space. The story is much the same for other standard layers, consisting of an affine transformation followed by pointwise application of a monotone activation function.
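
As a sketch of that picture (mine, with a randomly chosen \(W\) and \(b\) purely for illustration), the code below pushes a grid of points in the plane through a tanh layer; interpolating from the identity to the full layer gives the continuous deformation described above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))   # weight matrix (random, for illustration only)
b = rng.normal(size=2)        # bias vector

def tanh_layer(x, t=1.0):
    """Interpolate between the identity (t=0) and tanh(Wx + b) (t=1)."""
    return (1 - t) * x + t * np.tanh(x @ W.T + b)

# A grid of points in the plane, shown before and after the layer.
xs, ys = np.meshgrid(np.linspace(-1, 1, 21), np.linspace(-1, 1, 21))
pts = np.column_stack([xs.ravel(), ys.ravel()])

for t, style in [(0.0, "k."), (1.0, "r.")]:
    out = tanh_layer(pts, t)
    plt.plot(out[:, 0], out[:, 1], style, markersize=2)

plt.title("Grid before (black) and after (red) a tanh layer")
plt.show()
```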

(Andrej Karpathy has made a nice demo based on ConvnetJS that allows you to interactively explore networks with this sort of visualization of training!) Each layer stretches and squishes space, but it never cuts, breaks, or folds it.

As mentioned previously, classification with a sigmoid unit or a softmax layer is equivalent to trying to find a hyperplane (or in this case a line) that separates \(A\) and \(B\) in the final representation.

Since we’re dealing with something homeomorphic to the original dataset, \(A\) is still surrounded by \(B\), and collapsing on any axis would cause some points of \(A\) and \(B\) to mix, making them impossible to distinguish.

To get a better sense of what’s going on, let’s consider an even simpler dataset that’s 1-dimensional:

\[A = [-\frac{1}{3}, \frac{1}{3}]\]

\[B = [-1, -\frac{2}{3}] \cup [\frac{2}{3}, 1]\]

Without using a layer of two or more hidden units, we can’t classify this dataset.
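
To make that claim concrete, here is a minimal sketch (my own illustration, with hand-picked rather than trained weights) of a network with two hidden tanh units that does separate \(A\) from \(B\). A single hidden unit cannot do this: its output is monotone in \(x\), so it can never be high on the middle interval and low on both outer ones.

```python
import numpy as np

# Hand-picked weights (illustration only): two tanh units form a "bump"
# that is high on A = [-1/3, 1/3] and low on B = [-1, -2/3] U [2/3, 1].
k = 10.0

def classify(x):
    h1 = np.tanh(k * (x + 0.5))   # switches on for x > -0.5
    h2 = np.tanh(k * (x - 0.5))   # switches on for x > +0.5
    score = h1 - h2 - 1.0         # positive only on the middle interval
    return score > 0              # True -> class A, False -> class B

xs_A = np.linspace(-1/3, 1/3, 50)
xs_B = np.concatenate([np.linspace(-1, -2/3, 50), np.linspace(2/3, 1, 50)])
assert classify(xs_A).all() and not classify(xs_B).any()
```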

Sometimes when we see a link, it isn’t immediately obvious whether it’s an unlink (a bunch of things that are tangled together, but can be separated by continuous deformation) or not.

(Question: Can all unlinks be classified by a network with only 3 units, theoretically?) From this knot perspective, our continuous visualization of the representations produced by a neural network isn’t just a nice animation; it’s a procedure for untangling links.

Theorem: There is an ambient isotopy between the input and a network layer’s representation if: a) \(W\) isn’t singular, b) we are willing to permute the neurons in the hidden layer, and c) there is more than 1 hidden unit.
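
A quick numerical illustration (mine, not from the post) of why condition (a) matters: when \(W\) is non-singular, a tanh layer can be inverted exactly, so it is a continuous bijection onto its image and therefore a homeomorphism.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 2))                   # random weights, illustration only
b = rng.normal(size=2)
assert np.linalg.matrix_rank(W) == 2          # condition (a): W isn't singular

def layer(x):
    return np.tanh(W @ x + b)

def layer_inverse(y):
    # tanh is strictly monotone, so arctanh inverts it; W inverts because it is non-singular.
    return np.linalg.solve(W, np.arctanh(y) - b)

x = rng.normal(size=2)
assert np.allclose(layer_inverse(layer(x)), x)
```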

I imagine there is probably interest in programs automatically discovering such ambient isotopies and automatically proving the equivalence of certain links, or that certain links are separable.

This doesn’t bode well for neural networks. The sort of links we’ve talked about so far don’t seem likely to turn up in real-world data, but there are higher-dimensional generalizations.

All \(n\)-dimensional manifolds can be untangled in \(2n+2\) dimensions [6]. For example, with \(n = 1\) this says that knotted or linked circles in 3 dimensions can always be untangled once we are allowed to move them through a fourth dimension. (I know very little about knot theory and really need to learn more about what’s known regarding dimensionality and links.

If we know a manifold can be embedded in \(n\)-dimensional space, rather than knowing the dimensionality of the manifold, what limit do we have?) The natural thing for a neural net to do, the very easy route, is to try to pull the manifolds apart naively and stretch the parts that are tangled as thin as possible.

We know these things happen [7]. Contractive penalties, penalizing the derivatives of the layers at data points, are the natural way to fight this [8]. Since this sort of local minimum is absolutely useless from the perspective of trying to solve topological problems, topological problems may provide a nice motivation to explore fighting these issues.
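
As a sketch of what a contractive penalty looks like in practice (my own illustration, following the usual contractive-autoencoder formulation rather than anything specific in the post): for a tanh layer \(h = \tanh(Wx + b)\), the penalty is the squared Frobenius norm of the layer’s Jacobian at each data point, added to the ordinary training loss.

```python
import numpy as np

def contractive_penalty(X, W, b):
    """Mean squared Frobenius norm of the Jacobian of h = tanh(Wx + b) over the rows of X.

    For a tanh layer the Jacobian at x is diag(1 - h**2) @ W, so its squared
    Frobenius norm is sum_ij (1 - h_i**2)**2 * W_ij**2.
    """
    H = np.tanh(X @ W.T + b)          # (batch, hidden)
    damp = (1.0 - H ** 2) ** 2        # (batch, hidden)
    return np.mean(damp @ (W ** 2).sum(axis=1))

# Adding this term to the classification loss discourages the layer from
# stretching space violently in the neighborhood of the training data.
```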

In particular, in an optimization problem where local minima are a big problem, picking an architecture that can’t genuinely solve the problem seems like a recipe for bad performance. The more I think about standard neural network layers – that is, an affine transformation followed by a point-wise activation function – the more disenchanted I feel.

The thing that feels natural to me is to learn a vector field giving the direction in which we want to shift the manifold, and then to deform space based on it. One could learn the vector field at fixed points (just take some fixed points from the training set to use as anchors) and interpolate in some manner.

Such a vector field might be of the form:

\[F(x) = \frac{v_0f_0(x) + v_1f_1(x)}{1+f_0(x)+f_1(x)}\]

where \(v_0\) and \(v_1\) are vectors and \(f_0(x)\) and \(f_1(x)\) are \(n\)-dimensional Gaussians.
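
A minimal sketch of this vector field and the resulting deformation (my own illustration; the anchor points, Gaussian width, and shift vectors are all made up):

```python
import numpy as np

# Two anchor points with associated shift vectors v_0 and v_1 (values invented for illustration).
anchors = np.array([[-0.5, 0.0], [0.5, 0.0]])
shifts = np.array([[0.0, 0.5], [0.0, -0.5]])
sigma = 0.3

def gaussian(x, mu):
    return np.exp(-np.sum((x - mu) ** 2, axis=-1) / (2 * sigma ** 2))

def F(x):
    """F(x) = (v0*f0(x) + v1*f1(x)) / (1 + f0(x) + f1(x))."""
    f = np.stack([gaussian(x, mu) for mu in anchors], axis=-1)    # (..., 2)
    return (f @ shifts) / (1.0 + f.sum(axis=-1, keepdims=True))

def deform(x):
    """Shift each point along the vector field."""
    return x + F(x)

pts = np.random.default_rng(0).uniform(-1, 1, size=(5, 2))
print(deform(pts))
```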

I think a nice approach is to classify each element of the mini-batch based on the classes of other elements of the mini-batch, giving each one a weight of 1/(distance from classification target) [9]. Sadly, even with a sophisticated architecture, using k-NN only gets down to 4–5% test error – and using simpler architectures gets worse results.
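
A rough sketch of that idea (my own reading, not code from the post): predict each element of a mini-batch from the labels of the other elements, weighting every neighbor by the reciprocal of its distance in representation space.

```python
import numpy as np

def knn_layer_predictions(reps, labels, n_classes):
    """Predict each mini-batch element from the other elements' labels,
    weighting every neighbor by 1 / distance in representation space."""
    dists = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)        # an element never votes for itself
    weights = 1.0 / (dists + 1e-12)
    votes = np.zeros((len(reps), n_classes))
    for c in range(n_classes):
        votes[:, c] = weights[:, labels == c].sum(axis=1)
    return votes.argmax(axis=1)

# Toy usage: six 2-dimensional representations and their labels.
reps = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 2.0], [2.1, 2.0], [2.0, 2.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(knn_layer_predictions(reps, labels, n_classes=2))   # expect [0 0 0 1 1 1]
```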

How kNN algorithm works

In this video I describe how the k Nearest Neighbors algorithm works, and provide a simple example using 2-dimensional data and k = 3.

Lecture 2 | Image Classification

Lecture 2 formalizes the problem of image classification. We discuss the inherent difficulties of image classification, and introduce data-driven approaches.

Synthetic Gradients Tutorial - How to Speed Up Deep Learning Training

Synthetic Gradients were introduced in 2016 by Max Jaderberg and other researchers at DeepMind. They are designed to let parts of a network update without waiting for a full backpropagation pass, and they ...

Lecture 13 | Generative Models

In Lecture 13 we move beyond supervised learning, and discuss generative modeling as a form of unsupervised learning. We cover the autoregressive ...

Curse of Dimensionality Two - Georgia Tech - Machine Learning

Watch on Udacity: Check out the full course for free ..

Machine Learning - Dimensionality Reduction - Feature Extraction & Selection

Enroll in the course for free at: Machine Learning can be an incredibly beneficial tool to ..

Gradient Boosted Large Margin Nearest Neighbors

Quadratic Assignment on Different Data Models

Quadratic assignment is a very general problem in theoretical computer science. It includes graph matching, the traveling salesman problem, and the ...

Dealing with data: classification, clustering and ranking

This talk will be focused on the following three pieces of work that we have done: (1)   How to utilize unlabeled data in classification? In many real-world ...

Unsupervised Machine Learning MNIST Handwritten Digits with Isomap

Learn how to use Isomap manifold learning to perform dimensionality reduction on MNIST handwritten digit dataset. Following along using freely available ...
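
A minimal sketch of that kind of workflow (my own, assuming scikit-learn and using its small built-in digits dataset as a stand-in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

# Load the 8x8 handwritten digits (a small stand-in for MNIST) and embed
# the 64-dimensional images into 2 dimensions with Isomap.
X, y = load_digits(return_X_y=True)
embedding = Isomap(n_neighbors=5, n_components=2).fit_transform(X)
print(embedding.shape)   # (1797, 2)
```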