# AI News, Sequence to Sequence (seq2seq) Recurrent Neural Network (RNN) for Time Series Prediction

## Sequence to Sequence (seq2seq) Recurrent Neural Network (RNN) for Time Series Prediction

The goal of this project is to get users to try out and experiment with the seq2seq neural network architecture.

Seq2seq architectures are normally used for more sophisticated purposes than signal prediction, such as language modeling, but this project is a useful tutorial to work through before moving on to more complicated problems.

You can find the original French version of this project in the French Git branch: https://github.com/guillaume-chevalier/seq2seq-signal-prediction/tree/francais

Although I made a '.py' Python version of this tutorial available within the repository, it is more convenient to run the code inside the notebook.

The notebook application (IDE) will then open in your browser as a local server, and it will be possible to open the .ipynb notebook file and run code cells with CTRL+ENTER and SHIFT+ENTER. It is also possible to restart the kernel and run all cells at once from the menus.

Most of the time, you will have to edit the neural network's training parameters to succeed in the exercise, but at a certain point, changes to the architecture itself will be required.

A simple example would be to receive as input the past values of multiple stock market symbols, whose values evolve together in time, in order to predict the future values of all those symbols with the neural network.

Note that it would be possible to obtain better results with a smaller neural network, given better training hyperparameters, longer training, added dropout, and so on.

Here is a prediction made on the actual future values. The neural network has not been trained on the future values shown here, so this is a legitimate prediction, given a good enough model trained on the task:

Disclaimer: this prediction of the future values was really good, and you should not expect predictions to always be that good using as little data as was used here (side note: the other prediction charts in this project are all 'average' except this one).

Other, more creative input data could be sine waves (or other wave shapes such as saw waves or triangles, or two signals for cos and sin) representing the fluctuation of minutes, hours, days, weeks, months, years, moon cycles, and so on.

It is also interesting to know where bitcoin is most used: http://images.google.com/search?tbm=isch&q=bitcoin+heatmap+world

With all the above-mentioned examples, it would be possible to have all of this as input features at every time step: (BTC/USD, BTC/EUR, Dow_Jones, SP_500, hours, days, weeks, months, years, moons, meteo_USA, meteo_EUROPE, Twitter_sentiment).
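As a rough illustration of the wave-shaped inputs described above, here is a minimal sketch that encodes a repeating cycle as a pair of sin and cos signals. The function name `cyclic_features`, the hourly time steps, and the choice of daily and weekly periods are assumptions made for this example; they are not taken from the project's code.

```python
import numpy as np

def cyclic_features(timestamps_hours, period_hours):
    """Encode a repeating cycle as a (sin, cos) pair.

    Using both signals places the end of a cycle next to its start,
    which a single raw counter (0..23) would not do.
    """
    t = np.asarray(timestamps_hours, dtype=float)
    angle = 2.0 * np.pi * (t % period_hours) / period_hours
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

hours = np.arange(0, 48)                    # two days of hourly time steps (hypothetical)
daily = cyclic_features(hours, 24)          # shape (48, 2)
weekly = cyclic_features(hours, 24 * 7)     # shape (48, 2)

# One row of extra input features per time step.
features = np.concatenate([daily, weekly], axis=1)   # shape (48, 4)
```

These columns could then be concatenated with price features at each time step.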

Mean subtraction involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension.

It only makes sense to apply normalization (scaling the data dimensions to approximately the same range) if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.
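The two preprocessing steps discussed above can be sketched in NumPy as follows. The data matrix `X` (one example per row) and its feature scales are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data matrix: N = 1000 examples, D = 3 features on very different scales.
X = rng.normal(loc=[10.0, -3.0, 500.0], scale=[1.0, 0.1, 50.0], size=(1000, 3))

# Mean subtraction: center every feature around the origin.
X_centered = X - X.mean(axis=0)

# Normalization: divide each zero-centered feature by its standard deviation.
X_normalized = X_centered / X_centered.std(axis=0)
```

After these two steps every column has zero mean and unit standard deviation.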

Then, we can compute the covariance matrix that tells us about the correlation structure in the data: The (i,j) element of the data covariance matrix contains the covariance between i-th and j-th dimension of the data.

To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis: Notice that the columns of U are a set of orthonormal vectors (norm of 1, and orthogonal to each other), so they can be regarded as basis vectors.

This is also sometimes referred to as Principal Component Analysis (PCA) dimensionality reduction: After this operation, we would have reduced the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the most variance.

The geometric interpretation of this transformation is that if the input data is a multivariable gaussian, then the whitened data will be a gaussian with zero mean and identity covariance matrix.

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.
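The covariance, decorrelation, PCA and whitening steps described above can be put together in a short NumPy sketch. The synthetic data, the mixing matrix `A`, the choice to keep 2 of 3 dimensions, and the smoothing constant `1e-5` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data: N = 500 examples, D = 3 dimensions.
A = np.array([[2.0, 0.0, 0.0],
              [1.5, 0.5, 0.0],
              [0.3, 0.2, 0.1]])
X = rng.standard_normal((500, 3)) @ A.T

X = X - X.mean(axis=0)              # zero-center the data
cov = X.T @ X / X.shape[0]          # data covariance matrix, shape (D, D)

# SVD of the symmetric covariance matrix: the columns of U are the
# eigenvectors and S holds the eigenvalues in descending order.
U, S, _ = np.linalg.svd(cov)

X_rot = X @ U                       # decorrelated data (projection into the eigenbasis)
X_reduced = X @ U[:, :2]            # PCA: keep the 2 directions with the most variance
X_white = X_rot / np.sqrt(S + 1e-5) # whitening; the small constant avoids division by zero
```

Dividing by the square root of tiny eigenvalues is the step that amplifies noise, which is the weakness mentioned above.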

Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative.

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

The implementation for one weight matrix might look like W = 0.01* np.random.randn(D,H), where randn samples from a zero mean, unit standard deviation gaussian.

With this formulation, every neuron’s weight vector is initialized as a random vector sampled from a multi-dimensional gaussian, so the neurons point in random direction in the input space.

That is, the recommended heuristic is to initialize each neuron’s weight vector as: w = np.random.randn(n) / sqrt(n), where n is the number of its inputs.

The sketch of the derivation is as follows: Consider the inner product $$s = \sum_i^n w_i x_i$$ between the weights $$w$$ and input $$x$$, which gives the raw activation of a neuron before the non-linearity.

And since $$\text{Var}(aX) = a^2\text{Var}(X)$$ for a random variable $$X$$ and a scalar $$a$$, this implies that we should draw from unit gaussian and then scale it by $$a = \sqrt{1/n}$$, to make its variance $$1/n$$.
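A quick numerical experiment can make the variance argument concrete. The sketch below compares the variance of the raw activation $$s$$ for unscaled weights and for weights scaled by $$\sqrt{1/n}$$. The sizes (500 inputs, 2000 samples) are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                    # number of inputs (fan-in) of the neuron
x = rng.standard_normal((2000, n))         # hypothetical zero-mean, unit-variance inputs

w_naive = rng.standard_normal(n)                 # Var(w) = 1
w_scaled = rng.standard_normal(n) / np.sqrt(n)   # Var(w) = 1/n

# Variance of the raw activation s = sum_i w_i x_i over the input samples.
var_naive = (x @ w_naive).var()            # grows roughly like n
var_scaled = (x @ w_scaled).var()          # stays close to 1
```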

In Understanding the difficulty of training deep feedforward neural networks, Glorot et al. end up recommending an initialization of the form $$\text{Var}(w) = 2/(n_{in} + n_{out})$$ where $$n_{in}, n_{out}$$ are the number of units in the previous layer and the next layer.

A more recent paper on this topic, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be $$2.0/n$$.

This gives the initialization w = np.random.randn(n) * sqrt(2.0/n), and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.
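To see why the scale of the initialization matters for ReLU networks, the sketch below pushes random inputs through a stack of ReLU layers and measures the mean squared activation at the end. The helper `forward_mean_square`, the width of 512, and the depth of 10 are hypothetical choices for this experiment.

```python
import numpy as np

def forward_mean_square(init_std_fn, n=512, depth=10, batch=1000, seed=0):
    """Mean squared activation after `depth` ReLU layers of width `n`."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, n))
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * init_std_fn(n)
        h = np.maximum(0.0, h @ W)          # ReLU non-linearity
    return float(np.mean(h ** 2))

# sqrt(2/n) scaling keeps the activation scale roughly constant with depth.
he = forward_mean_square(lambda n: np.sqrt(2.0 / n))

# A fixed small standard deviation makes the activations shrink towards zero.
small = forward_mean_square(lambda n: 0.01)
```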

Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it.

For ReLU non-linearities, some people like to use a small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient.

However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

A recently developed technique by Ioffe and Szegedy called Batch Normalization alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training.

In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we’ll soon see), and before non-linearities.

In L2 regularization, for every weight $$w$$ in the network we add the term $$\frac{1}{2} \lambda w^2$$ to the objective. It is common to see the factor of $$\frac{1}{2}$$ in front because then the gradient of this term with respect to the parameter $$w$$ is simply $$\lambda w$$ instead of $$2 \lambda w$$.

Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero.
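The weight-decay effect can be checked with a few lines of NumPy. This sketch looks only at the regularization term and ignores the data loss; the values of `lam` and `lr` are hypothetical.

```python
import numpy as np

lam = 0.1      # regularization strength (lambda), hypothetical value
lr = 0.5       # learning rate, hypothetical value
W = np.array([[1.0, -2.0],
              [0.5, 4.0]])

def l2_penalty(W):
    # L2 regularization term with the conventional factor of 1/2.
    return 0.5 * lam * np.sum(W ** 2)

grad = lam * W             # analytic gradient of the penalty with respect to W
W_new = W - lr * grad      # one gradient step on the penalty alone

# The step shrinks every weight by the same factor (1 - lr * lam).
shrink_factor = 1.0 - lr * lam
```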

L1 regularization is another relatively common form of regularization, where for each weight $$w$$ we add the term $$\lambda \mid w \mid$$ to the objective.

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.

In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $$\vec{w}$$ of every neuron to satisfy $$\Vert \vec{w} \Vert_2 < c$$, where $$c$$ is the chosen upper bound.

Vanilla dropout in an example 3-layer Neural Network performs dropout twice inside the train_step function: on the first hidden layer and on the second hidden layer. It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction.

Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched. There has been a large amount of research after the first introduction of dropout that tries to understand the source of its power in practice, and its relation to the other regularization techniques.

As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective.

For example, a binary classifier for each category independently would take the form of a loss summed over all categories $$j$$, where $$y_{ij}$$ is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score vector $$f_j$$ will be positive when the class is predicted to be present and negative otherwise.
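Returning to the inverted dropout scheme described earlier in this section, a minimal sketch could look like this. It is not the original listing: the function names `dropout_train` and `dropout_test` and the keep probability `p = 0.5` are assumptions for the example.

```python
import numpy as np

p = 0.5                          # probability of keeping a unit active (hypothetical value)
rng = np.random.default_rng(0)

def dropout_train(h, p, rng):
    # Inverted dropout: drop units AND rescale the survivors by 1/p at train time.
    mask = (rng.random(h.shape) < p) / p
    return h * mask

def dropout_test(h):
    # Nothing to do at test time: the scaling already happened during training.
    return h

h = np.ones((1000, 200))         # stand-in for hidden-layer activations
h_train = dropout_train(h, p, rng)
```

Because of the division by `p`, the expected value of each activation is the same at train and test time.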

A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 with the sigmoid function $$\sigma$$. Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is $$P(y = 0 \mid x) = 1 - P(y = 1 \mid x)$$.

The resulting loss expression can look scary but the gradient on $$f$$ is in fact extremely simple and intuitive: $$\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)$$ (as you can double check yourself by taking the derivatives).

The L2 norm squared would compute the loss for a single example of the form $$L_i = \Vert f - y_i \Vert_2^2$$, where $$f$$ is the predicted quantity and $$y_i$$ the true answer. The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation.
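The stated gradient can be double checked numerically. The sketch below assumes that the expression in question is the usual Bernoulli log-likelihood, $$L_i = \sum_j y_{ij} \log \sigma(f_j) + (1 - y_{ij}) \log(1 - \sigma(f_j))$$, with labels $$y_{ij}$$ equal to 1 or 0; that assumption, the scores `f`, and the labels `y` are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(f, y):
    # Assumed form of L_i: Bernoulli log-likelihood with labels in {0, 1}.
    return np.sum(y * np.log(sigmoid(f)) + (1 - y) * np.log(1 - sigmoid(f)))

f = np.array([0.3, -1.2, 2.0])   # hypothetical scores for three attributes
y = np.array([1.0, 0.0, 1.0])    # hypothetical 0/1 labels

analytic = y - sigmoid(f)        # the gradient stated in the text

# Centered finite differences, one score at a time.
eps = 1e-6
numeric = np.zeros_like(f)
for j in range(f.size):
    step = np.zeros_like(f)
    step[j] = eps
    numeric[j] = (log_likelihood(f + step, y) - log_likelihood(f - step, y)) / (2 * eps)
```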

For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss.

If you’re certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

## BMC Research Notes

We constructed the statistical model of ANN using the information on female participants in the Taiji study, T town, Wakayama, Japan.

We have reported several studies on osteoporosis based on the data of the Taiji study concerning BMD [8], risk factors affecting BMD [9] and determinants of bone loss [10].

Input variables consisted of eleven variables: age, weight, height, age at menopause, age at menarche, durations after menopause, body mass index (BMI), percent of body fat, fat mass, lean body mass, and lumbar (L2–L4) or femoral BMD values which were measured in 1993, respectively.

Output variables consisted of two variables: BMD at lumbar site (L2–L4) (LBMD) or BMD at proximal femur site (FBMD) values measured in 2003 and the BLR, respectively, calculated by the difference in BMD values from 1993 to 2003 divided by 10 (Additional file 2: Figure S1a, b).

The statistical comparison of each statistical model was performed using Akaike’s information criterion (AIC), Schwarz’s Bayesian information criterion (BIC), and multiple correlation coefficients (R2) values corrected by degrees of freedom (R2 values), respectively.

These parameters were the age, weight, height, age at menopause, age at menarche, durations after menopause, BMI, percent of body fat, fat mass, lean body mass, and lumbar (L2–L4) or femoral BMD values.

Since the bone mineral density is responsible for 70% of the bone strength and because the occurrence of bone fractures is thought to be associated with the bone strength [12], the discovery of a new tool in this study to predict future BMD values may be useful to reduce the bone fracture rate in post-osteoporosis women.

Because of the possible presence of different personal risks for the severity of osteoporosis in the future, our findings showing the ability to predict individual BMD in 10 years are thought to be useful as a tool of tailored medicine, which might contribute to some extent to the prevention of and decisions regarding early therapy for post-menopausal osteoporosis.

The advantages of ANN include their ability to extract hidden features from input information and their robustness against assumptions concerning the type of distribution of input data and against the influence of diagnostic noise.

In addition, the sensitivity, specificity, and c-index for the predicted diagnosis of osteoporosis of the LBMD in 10 years using this model were 80.0%, 90.5%, and 0.825, while the same values for FBMD were 80.6%, 93.3%, and 0.870, respectively.

It is concluded that the application of ANN to predict future BMD in advance of the first visit of a patient to an osteoporosis clinic may lead to early intervention to avoid possible fragile bone fractures due to severe post-menopausal osteoporosis.

## How to build your own Neural Network from scratch in Python

Motivation: As part of my personal journey to gain a better understanding of Deep Learning, I’ve decided to build a Neural Network from scratch without a deep learning library like TensorFlow.

Without delving into brain analogies, I find it easier to simply describe Neural Networks as a mathematical function that maps a given input to a desired output.

Neural Networks consist of several components, which the original article illustrates with the architecture of a 2-layer Neural Network (note that the input layer is typically excluded when counting the number of layers in a Neural Network). Creating a Neural Network class in Python is easy.

### Training the Neural Network

The output ŷ of a simple 2-layer Neural Network is $$\hat{y} = \sigma(W_2 \sigma(W_1 x + b_1) + b_2)$$. You might notice that in this equation, the weights W and the biases b are the only variables that affect the output ŷ.

Feedforward is just simple calculus, and for a basic 2-layer neural network, the output of the Neural Network is $$\hat{y} = \sigma(W_2 \sigma(W_1 x + b_1) + b_2)$$. Let’s add a feedforward function in our python code to do exactly that.

In order to know the appropriate amount to adjust the weights and biases by, we need to know the derivative of the loss function with respect to the weights and biases.

However, we can’t directly calculate the derivative of the loss function with respect to the weights and biases because the equation of the loss function does not contain the weights and biases.
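The feedforward and chain-rule steps described above can be sketched as a small NumPy class. This is an illustrative version, not the article's listing: it assumes a sum-of-squares loss, sigmoid activations, biases fixed at zero, 4 hidden units, a learning rate of 1.0, and a tiny made-up dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(a):
    # Derivative of the sigmoid, written in terms of its output a = sigmoid(z).
    return a * (1.0 - a)

class NeuralNetwork:
    def __init__(self, x, y, hidden=4, seed=0):
        rng = np.random.default_rng(seed)
        self.x, self.y = x, y
        self.w1 = rng.random((x.shape[1], hidden))   # input -> hidden weights
        self.w2 = rng.random((hidden, 1))            # hidden -> output weights

    def feedforward(self):
        self.layer1 = sigmoid(self.x @ self.w1)
        self.output = sigmoid(self.layer1 @ self.w2)
        return self.output

    def loss(self):
        # Sum-of-squares error between targets and current outputs.
        return float(np.sum((self.y - self.output) ** 2))

    def backprop(self, lr=1.0):
        # Chain rule: loss -> output -> pre-activation -> weights.
        delta2 = 2.0 * (self.output - self.y) * sigmoid_derivative(self.output)
        grad_w2 = self.layer1.T @ delta2
        delta1 = (delta2 @ self.w2.T) * sigmoid_derivative(self.layer1)
        grad_w1 = self.x.T @ delta1
        self.w1 -= lr * grad_w1
        self.w2 -= lr * grad_w2

# Hypothetical toy dataset: 4 examples, 3 input features, 1 target each.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

net = NeuralNetwork(X, y)
net.feedforward()
initial_loss = net.loss()
for _ in range(1500):
    net.feedforward()
    net.backprop()
net.feedforward()
final_loss = net.loss()
```

Comparing `initial_loss` and `final_loss` shows whether the updates are moving the weights in a useful direction.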

Although Deep Learning libraries such as TensorFlow and Keras make it easy to build deep nets without fully understanding the inner workings of a Neural Network, I find that it’s beneficial for aspiring data scientists to gain a deeper understanding of Neural Networks.

## A simple deep learning model for stock price prediction using TensorFlow

For a recent hackathon that we did at STATWORX, some of our team members scraped minutely S&P 500 data from the Google Finance API.

Having this data at hand, the idea of developing a deep learning model for predicting the S&P 500 index based on the 500 constituents' prices one minute earlier immediately came to mind.

Playing around with the data and building the deep learning model with TensorFlow was fun and so I decided to write my first Medium.com story: a little TensorFlow tutorial on predicting S&P 500 stock prices.

The dataset contains n = 41266 minutes of data ranging from April to August 2017 on 500 stocks as well as the total S&P 500 index price.

The data was already cleaned and prepared, meaning missing stock and index prices were LOCF’ed (last observation carried forward), so that the file did not contain any missing values.
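For readers unfamiliar with LOCF, here is a minimal pure-Python sketch of the idea. The function name `locf` and the sample price list are hypothetical; the article's actual cleaning code is not shown in this excerpt.

```python
import math

def locf(values):
    """Last observation carried forward: replace each missing value
    (None or NaN) with the most recent observed value.

    Missing values at the very start have nothing to carry forward
    and are left as None.
    """
    filled, last = [], None
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            v = last
        filled.append(v)
        last = v
    return filled

prices = [1.0, None, None, 2.5, float("nan"), 3.0]   # hypothetical minutely prices
cleaned = locf(prices)
```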

There are a lot of different approaches to time series cross validation, such as rolling forecasts with and without refitting or more elaborate concepts such as time series bootstrap resampling.

The latter involves repeated samples from the remainder of the seasonal decomposition of the time series in order to simulate samples that follow the same seasonal pattern as the original time series but are not exact copies of its values.

Most neural network architectures benefit from scaling the inputs, because the most common activation functions of the network’s neurons, such as tanh or sigmoid, output values in the [-1, 1] or [0, 1] interval respectively.

Nowadays, rectified linear unit (ReLU) activations are commonly used activations which are unbounded on the axis of possible activation values.
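A minimal sketch of scaling a time series to [-1, 1] is shown below. It assumes a chronological 80/20 train/test split and computes the minimum and maximum on the training part only, so that no information from the test period leaks into the scaling. The random-walk data and the helper name `scale` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.cumsum(rng.standard_normal((1000, 3)), axis=0)   # hypothetical price-like series

# Chronological split: the first 80% for training, the rest for testing.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Scaling statistics come from the training data only.
data_min, data_max = train.min(axis=0), train.max(axis=0)

def scale(x, low=-1.0, high=1.0):
    # Min-max scaling; assumes no column is constant in the training data.
    return low + (x - data_min) * (high - low) / (data_max - data_min)

train_scaled = scale(train)
test_scaled = scale(test)     # may fall outside [-1, 1] if the series keeps drifting
```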

Since neural networks are actually graphs of data and mathematical operations, TensorFlow is just perfect for neural networks and deep learning.

Check out this simple example (stolen from our deep learning introduction from our blog): In the figure above, two numbers are supposed to be added.

In the TensorFlow implementation of the toy example above, after the TensorFlow library is imported, two placeholders are defined using tf.placeholder().

We need two placeholders in order to fit our model: X contains the network's inputs (the stock prices of all S&P 500 constituents at time T = t) and Y the network's outputs (the index value of the S&P 500 at time T = t + 1).

The None argument indicates that at this point we do not yet know the number of observations that flow through the neural net graph in each batch, so we keep it flexible.

While placeholders are used to store input and target data in the graph, variables are used as flexible containers within the graph that are allowed to change during graph execution.

As a rule of thumb in multilayer perceptrons (MLPs, the type of networks used here), the second dimension of the previous layer is the first dimension in the current layer for weight matrices.

The biases dimension equals the second dimension of the current layer’s weight matrix, which corresponds to the number of neurons in this layer.
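The dimension rule for weights and biases can be illustrated with plain NumPy arrays instead of TensorFlow variables. The hidden layer sizes, the ReLU activation, and the scaled-gaussian initialization below are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 inputs (one per stock), hypothetical hidden sizes, one output (the index).
layer_sizes = [500, 1024, 512, 256, 128, 1]

weights, biases = [], []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    # The second dimension of one matrix is the first dimension of the next.
    weights.append(rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in))
    # One bias per neuron: length equals the second dimension of the weights.
    biases.append(np.zeros(n_out))

def forward(X):
    h = X
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)        # hidden layers with ReLU
    return h @ weights[-1] + biases[-1]       # linear output layer

batch = rng.standard_normal((32, 500))        # the batch size is free, like the None dimension
prediction = forward(batch)                   # shape (32, 1)
```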

The cost function of the network is used to generate a measure of deviation between the network’s predictions and the actual observed training targets.

The optimizer takes care of the necessary computations that are used to adapt the network’s weight and bias variables during training.

Those computations involve the calculation of so-called gradients, which indicate the direction in which the weights and biases have to be changed during training in order to minimize the network’s cost function.

Since neural networks are trained using numerical optimization techniques, the starting point of the optimization problem is one of the key factors in finding good solutions to the underlying problem.

During minibatch training, random data samples of n = batch_size are drawn from the training data and fed into the network.

The training of the network stops once the maximum number of epochs is reached or another stopping criterion defined by the user applies.
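The epoch and minibatch structure described above can be sketched as follows. This version reshuffles the training data once per epoch and slices it into consecutive batches, which is one common way to draw random batches; the generator name `minibatches`, the toy arrays, and the sizes are assumptions. The optimizer step itself is left out.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield shuffled (inputs, targets) batches covering the data once."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X_train = np.arange(1000, dtype=float).reshape(250, 4)   # hypothetical inputs
y_train = np.arange(250, dtype=float)                    # hypothetical targets

epochs, batch_size = 3, 64
seen_per_epoch = []
for epoch in range(epochs):
    seen = 0
    for batch_x, batch_y in minibatches(X_train, y_train, batch_size, rng):
        seen += len(batch_x)     # a real training loop would run one optimizer step here
    seen_per_epoch.append(seen)
```

Each epoch visits every training example exactly once; the last batch is smaller when the data size is not a multiple of `batch_size`.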

During the training, we evaluate the network's predictions on the test set (the data which is not learned, but set aside) for every 5th batch and visualize it.

The model quickly learns the shape and location of the time series in the test data and is able to produce an accurate prediction after some epochs.

Please note that there are tons of ways of further improving this result: design of layers and neurons, choosing different initialization and activation schemes, introduction of dropout layers of neurons, early stopping and so on.

## Can neural networks predict trended time series?

First, I should say that I am thinking of the common types of neural networks that are composed of neurons that use some type of sigmoid transfer function, although the arguments discussed here are applicable to other types of neural networks.

The neural network arranges several such neurons in a network, effectively passing the inputs through multiple (typically) nonlinear regressions and combining the results in the output node.

In principle we could make direct connections from the inputs to layers deeper in the network or the output directly (resulting in nonlinear-linear models) or feedback loops (resulting in recurrent networks).

In the original interactive example, the first plot shows the input-output values and the transfer function, with a cyan background marking the area of values that can be considered by the neuron given the selected weight and constant.

As a side note, although I do not see MLP as anything to do with simulating biological networks, the sigmoid-type transfer functions are partly inspired by the stimulated or not states of biological neurons.

As Eq. (1) suggests, in a neural network the output of a neuron is multiplied by a weight and shifted by a constant, so it is relatively easy to achieve output values much greater than the bounds of a single neuron.

Nevertheless, the outputs saturate and reach a minimum/maximum value and cannot decrease/increase perpetually, unless non-squashing neurons are used as well (this is for example a case where direct connections to a linear output become useful).

Suppose we want to predict the future values of a deterministic upward trend with no noise, of the form: $$y_t = x_t$$ and $$x_t = (1, 2, 3, 4, \ldots)$$.

The network is able to provide a very good fit in the training set and for most of test set A, but as the values increase (test set B) we can see that the network starts to saturate (the individual nodes reach the upper bounds of the values they can output and eventually the whole network) and the predicted trend tapers off.
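The saturation effect is easy to reproduce with a single sigmoid neuron whose output is rescaled by a weight and a constant. The function `neuron` and all parameter values below are hypothetical, picked only to make the bound visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w_in, b_in, w_out, b_out):
    # A sigmoid neuron whose output is multiplied by a weight and shifted by a constant.
    return w_out * sigmoid(w_in * x + b_in) + b_out

x = np.array([0.0, 10.0, 100.0, 1000.0])     # an ever-increasing input
out = neuron(x, w_in=0.05, b_in=-2.0, w_out=50.0, b_out=0.0)

# The output can exceed the (0, 1) range of the raw sigmoid, but it can never
# pass w_out + b_out, however large the input becomes.
upper_bound = 50.0 + 0.0
```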

One would argue that with careful scaling of data (see good fit in test set A) it is possible to predict trends, but that implies that one knows the range that the future values would be in, to accommodate them with appropriate scaling.

Past research that I have been part of has shown that using differences is reliable and effective (for example see the specifications of neural networks here and here), even though there are unresolved problems with differencing.
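For the noise-free trend example used above, differencing can be sketched in a few lines: the differenced series is constant, so it stays inside a fixed range no matter how far the trend continues, and forecasts are recovered by cumulating predicted differences. The stand-in "model" here simply predicts the mean difference, which is an assumption made to keep the example short.

```python
import numpy as np

y = np.arange(1, 21, dtype=float)     # deterministic upward trend: 1, 2, ..., 20
dy = np.diff(y)                       # first differences: all equal to 1.0

# Stand-in for a trained model: predict the average difference for 5 steps ahead.
predicted_diffs = np.full(5, dy.mean())

# Undo the differencing: add the cumulated differences to the last observed value.
forecast = y[-1] + np.cumsum(predicted_diffs)
```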

Predicting Multiple Discrete Values with Multinomials, Neural Networks and { nnet } - ML with R

Using R and the multinom function from the { nnet } package we can easily predict discrete / factors of more than 2 levels. With the help of Repeated Cross ...

Data Predictor Using Neural Networks

In this project, I built a program using neural networks in MATLAB for predicting the pollution in a lake near a chemical plant in Saudi Arabia. I received the daily ...

Data Mining- Forecasting using Neural Networks in RStudio

The main concept of this Data Mining project is to forecast the Closing prices of the stock market based on the past data sets. Note: Watch with Sub-titles :)

Regression forecasting and predicting - Practical Machine Learning Tutorial with Python p.5

In this video, make sure you define the X's like so. I flipped the last two lines by mistake: X = np.array(df.drop(['label'],1)) X = preprocessing.scale(X) X_lately ...

BigQuery and Cloud Machine Learning: advancing neural network predictions (Google Cloud Next '17)

The real value of BigQuery is not its speed. It's the power of "democratizing enterprise data." Because of BigQuery's scalability, you can isolate any workload on ...

Neural Networks - The Math of Intelligence #4

Have you ever wondered what the math behind neural networks looks like? What gives them such incredible power? We're going to cover 4 different neural ...

Samy Bengio: A note of caution about training sequence prediction/generation models

Recorded at the Music, Art & Machine Intelligence 2016 workshop in San Francisco. Samy introduced a new approach to training neural networks called ...

Recurrent neural network predicting the Fibonacci sequence

A recurrent neural network learning to predict the Fibonacci sequence. Input is a single number, so the recurrent layer needs to remember previous inputs to ...

Neural Networks

This short video shows how changes in values of features and weights can trigger or not the purchase by a customer, given the values and the threshold, ...

Neural Network Model - Deep Learning with Neural Networks and TensorFlow

Welcome to part three of Deep Learning with Neural Networks and TensorFlow, and part 45 of the Machine Learning tutorial series. In this tutorial, we're going to ...