AI News, Kernels and Quantum Gravity Part 3: Coherent States

Kernels and Quantum Gravity Part 3: Coherent States

This would not be a pedagogic machine learning blog if I did not go into some overly abstract formalism… Here we introduce the kernel formalism using the language of coherent states: We define a space of labels $X$ (that is isomorphic to $\mathbb{R}^{N}$ or, more generally, is just a locally compact space), and an abstract Hilbert space $\mathcal{H}$. We seek a map between the two, $x \in X \mapsto |x\rangle \in \mathcal{H}$, such that

$$\int_X |x\rangle\langle x|\, d\mu(x) = A.$$

So rather than introduce an expression for the kernel $K(x, x') = \langle x | x' \rangle$, we introduce an operator $A$ acting on the Hilbert space that encapsulates the fact that our basis set is non-orthogonal (and perhaps overcomplete).

   The main difference is that in other fields, one (usually) tries to use prior knowledge of the problem to actually find the solution, and does not just guess random kernels and cross-validate (although there are important cases where it does seem like this, such as in quantum chemical Density Functional Theory).

In physics, we may think of the labels as the classical variables of phase space and the Hilbert space as the space of quantum mechanical wavefunctions (see also this recent paper on Reproducing Kernel Banach Spaces with the ℓ1 Norm).

More importantly, for understanding machine learning, we will see the mathematical formulation of Frame Quantization and the attempts to capture the mathematics of coherent states under a single mathematical formalism (and how and when this is doable).

Understanding Convolution in Deep Learning

There are already some blog posts regarding convolution in deep learning, but I found all of them highly confusing, with unnecessary mathematical details that do not further understanding in any meaningful way.

The second part of this blog post includes advanced concepts and aims to further and enhance the understanding of convolution for deep learning researchers and specialists.

Convolutions are heavily used in physics and engineering to simplify complex equations, and in the second part — after a short mathematical development of convolution — we will relate and integrate ideas between these fields of science and deep learning to gain a deeper understanding of convolution.

We mix two buckets of information: The first bucket is the input image, which has a total of three matrices of pixels — one matrix each for the red, green, and blue color channels;

The second bucket is the convolution kernel, a single matrix of floating point numbers where the pattern and the size of the numbers can be thought of as a recipe for how to intertwine the input image with the kernel in the convolution operation.

One way to apply convolution is to take an image patch from the input image of the size of the kernel — here we have a 100×100 image and a 3×3 kernel, so we would take 3×3 patches — and then do an element-wise multiplication between the image patch and the convolution kernel.
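
As a minimal sketch of that step (toy data; a box-blur kernel stands in for a real learned kernel):

```python
import numpy as np

# One convolution step: take a kernel-sized patch, multiply it
# element-wise with the kernel, and sum to get one output pixel.
image = np.random.rand(100, 100)    # stand-in for the 100x100 input image
kernel = np.ones((3, 3)) / 9.0      # assumed kernel: a simple 3x3 box blur

patch = image[0:3, 0:3]             # first 3x3 image patch
output_pixel = np.sum(patch * kernel)
```

Sliding the patch across every position of the image produces the full feature map.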

In one project I wanted to build a fashion image search with deep autoencoders: You upload an image of a fashion item and the autoencoder should find images that contain clothes with similar style.

My colleague Jannek Thomas preprocessed the data and applied a Sobel edge detector (similar to the kernel above) to filter everything out of the image except the outlines of the shape of an object — this is why the application of convolution is often called filtering, and the kernels are often called filters (a more exact definition of this filtering process will follow below).
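
For the curious, here is what such a filter looks like concretely: a sketch using the standard Sobel kernel pair on a stand-in image (the actual preprocessing pipeline may have differed):

```python
import numpy as np
from scipy.signal import convolve2d

# The standard Sobel kernels; convolving with them approximates
# horizontal and vertical intensity gradients, i.e. edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

image = np.random.rand(32, 32)                # stand-in for a fashion photo
gx = convolve2d(image, sobel_x, mode='same')  # responds to vertical edges
gy = convolve2d(image, sobel_y, mode='same')  # responds to horizontal edges
edges = np.hypot(gx, gy)                      # gradient magnitude = outlines
```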

The resulting feature map from the edge detector kernel will be very helpful if you want to differentiate between different types of clothes, because only relevant shape information remains.

Other kernels sharpen the image (more details) or blur the image (fewer details), and each feature map may help our algorithm to do better on its task (details, like 3 instead of 2 buttons on your jacket, might be important).

Feature engineering is so difficult because for each type of data and each type of problem, different features do well: Knowledge of feature engineering for image tasks will be quite useless for time series data;

For example, a 32×16×16 kernel applied to a 256×256 image would produce 32 feature maps of size 241×241 (this is the standard size; the size may vary from implementation to implementation;
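
The arithmetic behind those numbers, assuming "valid" convolution (no padding, stride 1): a kernel of width $k$ slid over an image of width $n$ yields a feature map of width $n - k + 1$, so here

$$(256 - 16 + 1) \times (256 - 16 + 1) = 241 \times 241,$$

with one such map per kernel, 32 in total.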

Once we have learned our hierarchical features, we simply pass them to a fully connected, simple neural network that combines them in order to classify the input image into classes.

To develop the concept of convolution further, we make use of the convolution theorem, which relates convolution in the time/space domain — where convolution features an unwieldy integral or sum — to a mere element-wise multiplication in the frequency/Fourier domain.
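
A quick numerical sanity check of the theorem (1-D toy signals; the only subtlety is zero-padding both FFTs to the full output length so the circular convolution matches the linear one):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # toy signal
k = np.array([0.25, 0.5, 0.25])         # toy kernel

direct = np.convolve(x, k)              # time/space-domain convolution

n = len(x) + len(k) - 1                 # full linear-convolution length
fft_based = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(direct, fft_based)   # element-wise product in the Fourier
                                        # domain == convolution in time domain
```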

Time is one-dimensional (one second after the other), images are two-dimensional (pixels have rows and columns), videos are three-dimensional (pixels have rows and columns, and images come one after another).

This is very apparent from the next image and its log Fourier transforms (applying the log to the real values decreases the differences in pixel intensity in the image — we see information more easily this way).

This is an important insight: Due to the convolution theorem, we can imagine that convolutional nets operate on images in the Fourier domain and from the images above we now know that images in that domain contain a lot of information about orientation.

Thus convolutional nets should be better than traditional algorithms when it comes to rotated images and this is indeed the case (although convolutional nets are still very bad at this when we compare them to human vision).

If we transform the original image with a Fourier transform and then multiply it by a circle padded by zeros (zeros=black) in the Fourier domain, we filter out all high frequency values (they will be set to zero, due to the zero padded values).

Note that the filtered image still has the same striped pattern, but its quality is much worse now — this is how JPEG compression works (although a different but similar transform is used, the discrete cosine transform): we transform the image, keep only certain frequencies, and transform back to the spatial image domain;
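
A sketch of that circular low-pass mask (a random array stands in for the striped image; the radius r is a free parameter):

```python
import numpy as np

def low_pass(image, r):
    """Zero out all frequencies outside a circle of radius r."""
    f = np.fft.fftshift(np.fft.fft2(image))   # move low frequencies to center
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= r ** 2
    f *= mask                                  # the zero-padded circle
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

smoothed = low_pass(np.random.rand(64, 64), r=8)  # toy example image
```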

It turns out that this is exactly one part of the convolution for the diffusion equation solution: One part is simply the initial concentrations of a certain fluid in a certain area — or in image terms — the initial image with its initial pixel intensities.
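
For reference, the standard form of that statement in one dimension, with diffusion constant $D$: the solution is the initial concentration profile $u_0$ convolved with a Gaussian propagator $G_t$,

$$u(x, t) = (G_t * u_0)(x) = \int G_t(x - y)\, u_0(y)\, dy, \qquad G_t(x) = \frac{1}{\sqrt{4 \pi D t}}\, e^{-x^2 / 4 D t}.$$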

We can imagine the operation of convolution as a two part diffusion process: Firstly, there is strong diffusion where pixel intensities change (from black to white, or from yellow to blue, etc.) and secondly, the diffusion process in an area is regulated by the probability distribution of the convolution kernel.

However, if you take a tiny piece of fluid, say a tiny drop of water, you still have millions of water molecules in that tiny drop of water, and while a single molecule behaves stochastically according to the probability distribution of the propagator, a whole bunch of molecules have quasi-deterministic behavior — this is an important interpretation from statistical mechanics and thus also for diffusion in fluid mechanics.

In quantum mechanics a particle can be in a superposition, where it has two or more properties that usually exclude each other in our empirical world: For example, in quantum mechanics a particle can be at two places at the same time — that is, a single object in two places.

If we have entangled particles (spooky action at a distance), a few particles can hold hundreds or even millions of different states at the same time — this is the power promised by quantum computers.

So if we use this interpretation for deep learning, we can think that the pixels in an image are in a superposition state, so that in each image patch, each pixel is in 9 positions at the same time (if our kernel is 3×3).

Once we apply the convolution we make a measurement and the superposition of each pixel collapses into a single position as described by the probability distribution of the convolution kernel, or in other words: For each pixel, we choose one of the 9 pixels at random (with the probabilities given by the kernel), and the resulting pixel value is the average over all these random choices.
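
A small Monte Carlo sketch of that interpretation (a uniform 3×3 kernel is assumed; any kernel whose entries are non-negative and sum to one works):

```python
import numpy as np

# A normalized kernel is a probability distribution over the neighborhood;
# the convolution output equals the expected pixel value under it.
kernel = np.ones((3, 3)) / 9.0                # uniform distribution
patch = np.arange(9, dtype=float).reshape(3, 3)

expectation = np.sum(kernel * patch)          # deterministic convolution step
samples = np.random.choice(patch.ravel(), size=100_000, p=kernel.ravel())

assert abs(samples.mean() - expectation) < 0.05  # random choices average out
```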

Cross-correlation is an operation which takes a small piece of information (a few seconds of a song) to filter a large piece of information (the whole song) for similarity (similar techniques are used on YouTube to automatically tag videos for copyright infringements).

While cross-correlation seems unwieldy, there is a trick with which we can easily relate it to convolution in deep learning: For images, we can simply turn the search image upside down to perform cross-correlation through convolution.

When we perform convolution of an image of a person with an upside-down image of a face, the result will be an image with one or multiple bright pixels at the location where the face was matched with the person.
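
The flip trick is easy to verify numerically (random arrays stand in for the person image and the face template):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

image = np.random.rand(16, 16)     # stand-in for the image of a person
template = np.random.rand(4, 4)    # stand-in for the face template

# Cross-correlation with a template == convolution with the template
# flipped in both axes ("turned upside down").
xcorr = correlate2d(image, template, mode='full')
conv = convolve2d(image, template[::-1, ::-1], mode='full')

assert np.allclose(xcorr, conv)
```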

There are versions which require different padding schemes: Some implementations wrap the kernel around itself and require only padding for the kernel, and yet other implementations perform divide-and-conquer steps and require no padding at all.

It is imaginable that the bright pixels from the cross-correlation will be redirected to units which detect faces (the Google Brain project has some units in its architecture which are dedicated to faces, cats etc.;

Two important statistical models for time series data are the weighted moving average and the autoregressive models which can be combined into the ARIMA model (autoregressive integrated moving average model).

ARIMA models are rather weak when compared to models like long short-term memory (LSTM) recurrent neural networks, but ARIMA models are extremely robust when you have low dimensional data (1–5 dimensions).

The Gaussian smoothing kernel can be interpreted as a weighted average of the pixels in each pixel’s neighborhood, or in other words, the pixels are averaged in their neighborhood (pixels “blend in”, edges are smoothed).

While a single kernel cannot create both autoregressive and weighted moving average features, we usually have multiple kernels, and in combination all these kernels might contain some features which are like a weighted moving average model and some which are like an autoregressive model.
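
A sketch of the moving-average half of that claim: a weighted moving average is literally a 1-D convolution, so a kernel can encode it directly (toy random-walk series; the weights are an arbitrary choice summing to one):

```python
import numpy as np

series = np.cumsum(np.random.randn(200))       # toy random-walk time series
weights = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # moving-average kernel

smoothed = np.convolve(series, weights, mode='valid')  # weighted moving average
```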

We developed the concept of convolution via Fourier transforms and saw that Fourier transforms contain a lot of information about the orientation of an image. With the powerful convolution theorem we then developed an interpretation of convolution as the diffusion of information across pixels.

Seth Lloyd: Quantum Machine Learning

Seth Lloyd visited the Quantum AI Lab at Google LA to give a tech talk on "Quantum Machine Learning." This talk took place on January 29, 2014. Speaker Info: ...

Inside a Neural Network - Computerphile

Just what is happening inside a Convolutional Neural Network? Dr Mike Pound shows us the images in between the input and the result. How Blurs & Filters ...

10-701 Machine Learning Fall 2014 - Lecture 6

Topics: reproducing kernel Hilbert space, kernel perceptron algorithm and analysis. Lecturer: Geoff Gordon ...

Linear transformations and matrices | Essence of linear algebra, chapter 3

Matrices can be thought of as transforming space, and understanding how this works is crucial for understanding many other ideas that follow in linear algebra.

Support Vector Machines - The Math of Intelligence (Week 1)

Support Vector Machines are a very popular type of machine learning model used for classification when you have a small dataset. We'll go through when to use ...

Lecture 1 | Composition operators on the Dirichlet space of the disk | Hervé Queffélec | Лекториум

Lecture 1 | Course: Workshop and Winter School «Spaces of Analytic Functions and Singular Integrals (SAFSI2014)» | Lecturer: Hervé Queffélec | Organizer: ...

The Tensor Algebra Compiler

Linear algebra is a work-horse of numerical computing. Tensor algebra is a generalization of linear algebra with applications in scientific computing, machine ...

Quantum Expanders

Aram Harrow, Massachusetts Institute of Technology. Complexity Meets Condensed Matter.

Topological Treatment of Neural Activity and the Quantum Question Order Effect

Seth Lloyd - Mechanical Engineering, MIT.