# AI News

## The building blocks of Deep Learning (deepdish, 21 Nov 2015)

Next, computing the derivative of the loss with respect to a single element of one of the inputs looks like (superscript omitted): $$\frac{ \partial L }{ \partial x _ i } = \sum _ j \frac{ \partial L }{ \partial z _ j } \frac{ \partial z _ j }{\partial x _ i }$$

It can also be written as $$\frac{ \partial L }{ \partial \mathbf{x} } = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } \quad\quad \left[ \mathbb{R}^ { A \times 1} = \mathbb{R} ^ {A \times B} \mathbb{R} ^ {B \times 1} \right]$$

The derivative $$\frac{ \partial L }{ \partial \mathbf{z} } \in \mathbb{R} ^ {B}$$ is something that needs to be given to the building block from the outside (this is the gradient being back-propagated).

All we need is to define the function $$\frac{ \partial L }{ \partial \mathbf{x} } = \mathrm{backward}\left(\mathbf{x}, \mathbf{z}, \frac{ \partial L }{ \partial \mathbf{z} }\right)$$

In our code examples, we will adopt this as well, meaning we will be defining: $$\left(\frac{ \partial L }{ \partial \mathbf{x}^1 }, \dots, \frac{ \partial L }{ \partial \mathbf{x}^n }\right) = \mathrm{backward}\left((\mathbf{x}^1, \dots, \mathbf{x}^n), \mathbf{z}, \frac{ \partial L }{ \partial \mathbf{z} }\right)$$

which could translate to something like this in Python (who uses pseudo-code anymore?).

As a concrete example, take the ReLU layer, $$\mathbf{z} = \max(\mathbf{0}, \mathbf{x})$$ (element-wise). For the backward pass, the Jacobian will be a diagonal matrix, with entries $$\frac{\partial z _ i}{\partial x _ i} = 1 _ {\{ x _ i > 0 \}}$$
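For the ReLU case, the forward/backward pair might look like this (a minimal sketch assuming NumPy; the function names are illustrative, not from the original post):

```python
import numpy as np

def relu_forward(x):
    # z = max(0, x), element-wise
    return np.maximum(0, x)

def relu_backward(x, z, dL_dz):
    # The Jacobian dz/dx is diagonal with entries 1_{x_i > 0},
    # so the matrix product with dL/dz reduces to an element-wise mask.
    return dL_dz * (x > 0)
```

Note that `z` goes unused here; it is kept only to match the generic `backward(x, z, dL/dz)` signature above.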

We can now write the gradient of the loss as $$\frac{ \partial L }{ \partial \mathbf{x} } = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } = \mathbf{1} _ {\{ \mathbf{x} > \mathbf{0} \}} \odot \frac{ \partial L }{ \partial \mathbf{z} },$$ where $$\odot$$ denotes element-wise multiplication.

Moving on to the dense (fully connected) layer where $$\mathbf{z} = \mathbf{W} ^ \intercal \mathbf{x} + \mathbf{b} \quad\quad (\mathbb{R}^{B \times 1} = \mathbb{R}^{B \times A} \mathbb{R}^{A \times 1} + \mathbb{R}^{B \times 1})$$

However, remember that we make no distinction between static and dynamic input, and from the point of view of our Dense node it simply looks like: $$\mathbf{z} = \mathbf{x} ^ 2 \mathbf{x} ^ 1 + \mathbf{x} ^ 3$$

The gradient with respect to the input is $$\frac{\partial L}{\partial \mathbf{x}} = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } = \mathbf{W} \frac{ \partial L }{ \partial \mathbf{z} } \quad\quad \left[ \mathbb{R}^{A \times 1} = \mathbb{R}^{A \times B} \mathbb{R}^{B \times 1} \right]$$

We know the bias will drop off, so we can write the output that we will be taking the Jacobian of as: $$\mathbf{z}’ = \left( \sum _ {j = 1} ^ A W _ {j, 1} x _ j, \dots, \sum _ {j = 1} ^ A W _ {j, B} x _ j \right)$$

Now, let’s compute the derivative of $$z’ _ i$$ (and thus $$z _ i$$) with respect to $$W _ {j, k}$$: $$\frac{\partial z _ i}{\partial W _ {j, k}} = \begin{cases} x _ j & \text{if } k = i \\ 0 & \text{otherwise} \end{cases}$$

The gradient with respect to the weights is $$\frac{\partial L}{\partial \mathbf{W}} = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{W} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } = \mathbf{x} \left( \frac{ \partial L }{ \partial \mathbf{z} } \right)^\intercal \quad\quad \left[ \mathbb{R}^{A \times B} = \mathbb{R}^{A \times 1} \mathbb{R}^{1 \times B} \right]$$

The gradient with respect to the bias is $$\frac{\partial L}{\partial \mathbf{b}} = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{b} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } = \frac{ \partial L }{ \partial \mathbf{z} } \quad\quad \left[ \mathbb{R}^{B \times 1} = \mathbb{R}^{B \times B} \mathbb{R}^{B \times 1} \right]$$
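The three dense-layer gradients above can be collected into a backward function (again a sketch assuming NumPy; the names are illustrative):

```python
import numpy as np

def dense_forward(W, x, b):
    # z = W^T x + b, with W of shape (A, B), x of shape (A,), b of shape (B,)
    return W.T @ x + b

def dense_backward(W, x, b, dL_dz):
    # Gradients from the derivation above:
    #   dL/dx = W dL/dz
    #   dL/dW = x (dL/dz)^T
    #   dL/db = dL/dz
    dL_dx = W @ dL_dz
    dL_dW = np.outer(x, dL_dz)
    dL_db = dL_dz
    return dL_dx, dL_dW, dL_db
```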

## Transmission-line matrix method

The transmission-line matrix (TLM) method is a space and time discretising method for computation of electromagnetic fields.

It is based on the analogy between the electromagnetic field and a mesh of transmission lines.

The TLM method allows the computation of complex three-dimensional electromagnetic structures and has proven to be one of the most powerful time-domain methods along with the finite difference time domain (FDTD) method.

The TLM method is based on Huygens' model of wave propagation and scattering and the analogy between field propagation and transmission lines.

Therefore, it considers the computational domain as a mesh of transmission lines, interconnected at nodes.

The figure on the right shows a simple example of a 2D TLM mesh with a voltage pulse of amplitude 1 V incident on the central node.

This pulse will be partially reflected and transmitted according to the transmission-line theory.

If we assume that each line has a characteristic impedance $$Z$$, then the incident pulse effectively sees three transmission lines in parallel with a total impedance of $$Z/3$$.

The reflection and transmission coefficients follow from transmission-line theory. The incident pulse sees a load of $$Z/3$$ at the node, so the reflection coefficient is $$R = \frac{Z/3 - Z}{Z/3 + Z} = -\frac{1}{2},$$ and the voltage transmitted onto each of the other three lines is $$T = 1 + R = \frac{1}{2}.$$ The energy injected into the node by the incident pulse is proportional to $$1^2/Z$$, while the total energy of the scattered pulses is proportional to $$\left(3 \times (1/2)^2 + (1/2)^2\right)/Z = 1/Z$$. Therefore, the energy conservation law is fulfilled by the model.

The next scattering event excites the neighbouring nodes according to the principle described above.

It can be seen that every node turns into a secondary source of a spherical wave.

These waves combine to form the overall waveform.

This is in accordance with Huygens' principle of light propagation.
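The scatter-and-connect iteration described above can be sketched in Python (a toy 2D mesh assuming NumPy; the port numbering and the absorbing boundaries are choices made for this sketch, not part of the method's definition):

```python
import numpy as np

# Port convention chosen for this sketch:
# 0 = north (row - 1), 1 = east (col + 1), 2 = south (row + 1), 3 = west (col - 1)

def scatter(incident):
    """One scattering event on every node of an (N, N, 4) pulse array.

    A pulse on one line sees the other three lines (impedance Z each)
    in parallel, i.e. Z/3, so it reflects with -1/2 and transmits 1/2
    onto each of the other three lines.
    """
    S = 0.5 * (np.ones((4, 4)) - 2.0 * np.eye(4))
    return incident @ S.T

def connect(scattered):
    """Scattered pulses become the neighbours' incident pulses.

    Pulses leaving the mesh are simply absorbed in this sketch.
    """
    incident = np.zeros_like(scattered)
    incident[:-1, :, 2] = scattered[1:, :, 0]   # northbound pulse -> south port of node above
    incident[1:, :, 0] = scattered[:-1, :, 2]   # southbound pulse -> north port of node below
    incident[:, :-1, 1] = scattered[:, 1:, 3]   # westbound pulse -> east port of node to the left
    incident[:, 1:, 3] = scattered[:, :-1, 1]   # eastbound pulse -> west port of node to the right
    return incident
```

Iterating `scatter` followed by `connect` propagates an initial pulse outward, every node acting as a secondary Huygens source.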

In order to present the TLM scheme we will use time and space discretisation. The time-step will be denoted by $$\Delta t$$ and the space discretisation intervals by $$\Delta x$$, $$\Delta y$$ and $$\Delta z$$.

The absolute time and space coordinates are therefore $$t = k\,\Delta t$$, $$x = l\,\Delta x$$, $$y = m\,\Delta y$$ and $$z = n\,\Delta z$$, where $$k = 0, 1, 2, \ldots$$ is the time instant and $$l, m, n$$ are the cell coordinates. In case $$\Delta x = \Delta y = \Delta z$$ the value $$\Delta l$$ will be used, which is the lattice constant.

In this case the following holds: $$\Delta t = \frac{\Delta l}{2\,c_0},$$ where $$c_0$$ is the free-space speed of light.

If we consider an electromagnetic field distribution in which the only non-zero components are $$E_x$$, $$E_y$$ and $$H_z$$ (i.e. a TE-mode distribution), then Maxwell's equations in Cartesian coordinates reduce to $$\frac{\partial H_z}{\partial y} = \varepsilon \frac{\partial E_x}{\partial t}, \qquad -\frac{\partial H_z}{\partial x} = \varepsilon \frac{\partial E_y}{\partial t}, \qquad \frac{\partial E_y}{\partial x} - \frac{\partial E_x}{\partial y} = -\mu \frac{\partial H_z}{\partial t}.$$ We can combine these equations to obtain $$\frac{\partial^2 H_z}{\partial x^2} + \frac{\partial^2 H_z}{\partial y^2} = \mu\varepsilon\, \frac{\partial^2 H_z}{\partial t^2}.$$ The figure on the right presents a structure referred to as a series node.

It describes a block of space of dimensions $$\Delta x$$, $$\Delta y$$ and $$\Delta z$$ that consists of four ports. $$L$$ and $$C$$ are the distributed inductance and capacitance of the transmission lines.

It is possible to show that a series node is equivalent to a TE-wave: more precisely, the mesh current $$I$$, the x-direction voltages (ports 1 and 3) and the y-direction voltages (ports 2 and 4) may be related to the field components $$H_z$$, $$E_x$$ and $$E_y$$.

If the voltages on the ports are considered and the polarity from the above figure holds, then Kirchhoff's voltage law can be written around the loop. Setting $$\Delta x = \Delta y = \Delta l$$, dividing both sides by $$\Delta x\,\Delta y$$, setting $$\Delta x = \Delta y = \Delta z = \Delta l$$ and substituting $$I = H_z\,\Delta z$$ gives an equation that reduces to one of Maxwell's equations when $$\Delta l \rightarrow 0$$.

Similarly, using the conditions across the capacitors on ports 1 and 4, the corresponding two other Maxwell equations can be derived. Having these results, it is possible to compute the scattering matrix of a series node.

The incident voltage pulse on port 1 at time-step $$k$$ is denoted as $$_k V_1^i$$. Replacing the four line segments from the above figure with their Thévenin equivalents, it is possible to derive the reflected voltage pulse on each port. If all incident waves as well as all reflected waves are collected in one vector, then this relation may be written down for all ports in matrix form: $$_k\mathbf{V}^r = \mathbf{S}\;{}_k\mathbf{V}^i,$$ where $$_k\mathbf{V}^i$$ and $$_k\mathbf{V}^r$$ are the incident and the reflected pulse amplitude vectors.

For a series node the scattering matrix $$\mathbf{S}$$ can be written down explicitly. In order to describe the connection between adjacent nodes by a mesh of series nodes, look at the figure on the right. As the incident pulse on a node in time-step $$k+1$$ is the scattered pulse from an adjacent node in time-step $$k$$, the corresponding connection equations are derived. By modifying the scattering matrix $$\mathbf{S}$$, inhomogeneous and lossy materials can be modelled.

By adjusting the connection equations it is possible to simulate different boundaries.

Apart from the series node described above, there is also the shunt TLM node, which represents a TM-mode field distribution. The only non-zero components of such a wave are $$E_z$$, $$H_x$$ and $$H_y$$.

With similar considerations as for the series node the scattering matrix of the shunt node can be derived.

Most problems in electromagnetics require a three-dimensional grid.

As we now have structures that describe TE and TM-field distributions, intuitively it seems possible to define a combination of shunt and series nodes providing a full description of the electromagnetic field.

Such attempts have been made, but because of the complexity of the resulting structures they proved to be not very useful.

Using the analogy that was presented above leads to calculation of the different field components at physically separated points.

This causes difficulties in providing simple and efficient boundary definitions.

A solution to these problems was provided by Johns in 1987, when he proposed the structure known as the symmetrical condensed node (SCN), presented in the figure on the right.

It consists of 12 ports because two field polarisations are to be assigned to each of the 6 sides of a mesh cell.

The topology of the SCN cannot be analysed using Thevenin equivalent circuits.

More general energy and charge conservation principles are to be used.

The electric and the magnetic fields on the sides of SCN node $$(l, m, n)$$ at time instant $$k$$ may be summarised in 12-dimensional vectors. They can be linked with the incident and scattered amplitude vectors, where $$Z_F = \sqrt{\frac{\mu}{\varepsilon}}$$ is the field impedance, $$_k\mathbf{a}_{l,m,n}$$ is the vector of the amplitudes of the incident waves to the node, and $$_k\mathbf{b}_{l,m,n}$$ is the vector of the scattered amplitudes. The relation between the incident and scattered waves is given by the matrix equation $$_k\mathbf{b}_{l,m,n} = \mathbf{S}\;{}_k\mathbf{a}_{l,m,n}.$$ The scattering matrix $$\mathbf{S}$$ can be calculated; for the symmetrical condensed node with ports defined as in the figure, an explicit form is obtained. The connection between different SCNs is done in the same manner as for the 2D nodes.

The George Green Institute for Electromagnetics Research (GGIEMR) has open-sourced an efficient implementation of 3D-TLM, capable of parallel computation by means of MPI, named GGITLM and available online.[1]

## Introduction To Neural Networks Part 2 - A Worked Example

Our goal is to build and train a neural network that can identify whether a new 2×2 image has the stairs pattern.

That means our network could have a single output node that predicts the probability that an incoming image represents stairs.

However, we’ll choose to interpret the problem as a multi-class classification problem – one where our output layer has two nodes that represent “probability of stairs” and “probability of something else”.

Our measure of success might be something like accuracy rate, but to implement backpropagation (the fitting procedure) we need to choose a convenient, differentiable loss function like cross entropy.

Each image is 2 pixels wide by 2 pixels tall, each pixel representing an intensity between 0 (white) and 255 (black).

If we label the pixel intensities as $$x_1$$, $$x_2$$, $$x_3$$, $$x_4$$, we can represent each image as a numeric vector which we can feed into our neural network.

To make the optimization process a bit simpler, we’ll treat the bias terms as weights for an additional input node which we’ll fix equal to 1.

Finally, we’ll squash each incoming signal to the hidden layer with a sigmoid function and we’ll squash each incoming signal to the output layer with the softmax function to ensure the predictions for each sample are in the range [0, 1] and sum to 1.

And for each weight matrix, the term $$w^l_{ij}$$ represents the weight from the $$i$$th node in layer $$l$$ to the $$j$$th node in layer $$l+1$$.

Since keeping track of notation is tricky and critical, we will supplement our algebra with this sample of training data, together with the matrices that go along with our neural network graph.

Since we have a set of initial predictions for the training samples we’ll start by measuring the model’s current performance using our loss function, cross entropy.

For example, if we were doing a 3-class prediction problem with target $$\mathbf{y} = [0, 1, 0]$$, then the predictions $$\hat{\mathbf{y}} = [0, 0.5, 0.5]$$ and $$\hat{\mathbf{y}} = [0.25, 0.5, 0.25]$$ would both have cross entropy $$-\log(0.5) \approx 0.69$$.

In light of this, let's concentrate on calculating the partial derivative of $$CE_1$$, the cross entropy of the first training sample, with respect to a single weight $$w$$: "How much will $$CE_1$$ change with respect to a small change in $$w$$?".

If we can calculate this, we can calculate the same partial for every other training sample and weight, and then average the partials across samples to determine the overall expected change in $$CE$$ with respect to a small change in each weight.

Now we have expressions that we can easily use to compute how cross entropy of the first training sample should change with respect to a small change in each of the weights.
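Putting the pieces together, the forward pass and the cross-entropy loss might be sketched as follows (assuming NumPy; layer sizes are illustrative and the bias terms are folded in as fixed inputs of 1, as described above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, W2):
    # X: (n_samples, 1 + 4) -- a column of 1s (bias) plus the 4 pixel intensities
    H = sigmoid(X @ W1)                          # hidden layer
    H = np.column_stack([np.ones(len(H)), H])    # bias node for the output layer
    return softmax(H @ W2)                       # (n_samples, 2) class probabilities

def cross_entropy(Y, Y_hat):
    # mean over samples of -sum_k y_k * log(y_hat_k)
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))
```

With all-zero weights every sample gets the prediction [0.5, 0.5], whose cross entropy is $$-\log(0.5)$$, matching the example above.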

## Softmax function

In mathematics, the softmax function, or normalized exponential function,[1]:198 is a generalization of the logistic function that 'squashes' a K-dimensional vector $$\mathbf{z}$$ of arbitrary real values to a K-dimensional vector $$\sigma(\mathbf{z})$$ of real values in the range (0, 1) that add up to 1.

The function is given by $$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \ldots, K.$$ In probability theory, the output of the softmax function can be used to represent a categorical distribution – that is, a probability distribution over K different possible outcomes.

In fact, it is the gradient-log-normalizer of the categorical probability distribution. The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression),[1]:206–209 multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.[2] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the j'th class given a sample vector $$\mathbf{x}$$ and weighting vectors $$\mathbf{w}_k$$ is: $$P(y = j \mid \mathbf{x}) = \frac{e^{\mathbf{x}^{\mathsf{T}} \mathbf{w}_j}}{\sum_{k=1}^{K} e^{\mathbf{x}^{\mathsf{T}} \mathbf{w}_k}}.$$ This can be seen as the composition of K linear functions $$\mathbf{x} \mapsto \mathbf{x}^{\mathsf{T}}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^{\mathsf{T}}\mathbf{w}_K$$ and the softmax function itself.

If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax of that is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175].

This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value.

But note: softmax is not scale invariant, so if the input were [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3] (which sums to 1.6) the softmax would be [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153].

This shows that for values between 0 and 1 softmax in fact de-emphasizes the maximum value (note that 0.169 is not only less than 0.475, it is also less than the initial value of 0.4).
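Both numeric examples can be reproduced with a few lines of Python (assuming NumPy):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

print(np.round(softmax([1, 2, 3, 4, 1, 2, 3]), 3).tolist())
# → [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]
print(np.round(softmax([0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3]), 3).tolist())
# → [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]
```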

The softmax function is often used in the final layer of a neural network-based classifier.

Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Since the function maps a vector and a specific index $$i$$ to a real value, the derivative needs to take the index into account: $$\frac{\partial \sigma(\mathbf{z})_i}{\partial z_j} = \sigma(\mathbf{z})_i \left( \delta_{ij} - \sigma(\mathbf{z})_j \right).$$ Here, the Kronecker delta is used for simplicity (cf. the derivative of the sigmoid function, which is expressed via the function itself).
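The softmax Jacobian $$J_{ij} = \sigma(\mathbf{z})_i(\delta_{ij} - \sigma(\mathbf{z})_j)$$ can be verified against finite differences (a sketch assuming NumPy; the test vector and step size are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # J[i, j] = s_i * (delta_ij - s_j) = diag(s) - s s^T
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

# central finite-difference check on an arbitrary test vector
z = np.array([1.0, -0.5, 2.0])
J = softmax_jacobian(z)
eps = 1e-6
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    numeric = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
    assert np.allclose(J[:, j], numeric, atol=1e-6)
```

Each column of the Jacobian sums to zero, since the softmax outputs always sum to 1.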

Softmax (sigmoidal) normalization is a way of reducing the influence of extreme values in the data without removing them from the dataset. It is useful given outlier data, which we wish to include in the dataset while still preserving the significance of data within a standard deviation of the mean.

The nonlinear transformation can use the logistic sigmoid function[4] or the hyperbolic tangent function, tanh.[4] The sigmoid function limits the range of the normalized data to values between 0 and 1.

The sigmoid function is almost linear near the mean and has smooth nonlinearity at both extremes, ensuring that all data points are within a limited range.

This ensures that optimization and numerical integration algorithms can continue to rely on the derivative to estimate changes to the output (normalized value) that will be produced by changes to the input in the region near any linearisation point.


## labelling with arrows in an automated way

In adding-labels-to-a-formula there is a TikZ scheme that puts rounded boxes around parts of an equation so that they can be labelled. I think it looks quite nice for short formulas (equation (1)), but not for short ones that have long label text (equation (2)).

I see this is possible with tikz too, and copying from the beamer arrows page something like the following can be produced
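One arguably easier route is the tikzmark library's \tikzmarknode, which removes most of the boilerplate of the beamer-arrows approach (a sketch: the node names, label text, and arrow offsets are placeholders, and two LaTeX runs are needed for the overlay coordinates to settle):

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{tikz}
\usetikzlibrary{tikzmark}

\begin{document}
\begin{equation}
  E = \tikzmarknode{m}{m} \, \tikzmarknode{c}{c^2}
\end{equation}
% Arrows drawn in a separate overlay picture, pointing back at the marked nodes
\begin{tikzpicture}[overlay, remember picture]
  \draw[<-] (m.south) |- ++(-2, -0.6) node[left] {mass};
  \draw[<-] (c.south) |- ++( 2, -0.9) node[right] {speed of light};
\end{tikzpicture}
\end{document}
```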

Is there an easier way to do this than the multi-part code in the example taken from the beamer arrows page above?