# AI News, Sequence to Sequence (seq2seq) Recurrent Neural Network (RNN) for Time Series Prediction

## Sequence to Sequence (seq2seq) Recurrent Neural Network (RNN) for Time Series Prediction

The goal of this project of mine is to bring users to try and experiment with the seq2seq neural network architecture.

Normally, seq2seq architectures may be used for other more sophisticated purposes than for signal prediction, let's say, language modeling, but this project is an interesting tutorial in order to then get to more complicated stuff.

You can find the French, original, version of this project in the French Git branch: https://github.com/guillaume-chevalier/seq2seq-signal-prediction/tree/francais Except the fact I made available an '.py' Python version of this tutorial within the repository, it is more convenient to run the code inside the notebook.

It is then that the notebook application (IDE) will open in your browser as a local server and it will be possible to open the .ipynb notebook file and to run code cells with CTRL+ENTER and SHIFT+ENTER, it is also possible to restart the kernel and run all cells at once with the menus.

Most of the time, you will have to edit the neural networks' training parameter to succeed in doing the exercise, but at a certain point, changes in the architecture itself will be asked and required.

A simple example would be to receive as an argument the past values of multiple stock market symbols in order to predict the future values of all those symbols with the neural network, which values are evolving together in time.

Note that it would be possible to obtain better results with a smaller neural network, provided better training hyperparameters and a longer training, adding dropout, and on.

Here is a prediction made on the actual future values, the neural network has not been trained on the future values shown here and this is a legitimate prediction, given a well-enough model trained on the task:

Disclaimer: this prediction of the future values was really good and you should not expect predictions to be always that good using as few data as actually (side note: the other prediction charts in this project are all 'average' except this one).

Other more creative input data could be sine waves (or other-type-shaped waves such as saw waves or triangles or two signals for cos and sin) representing the fluctuation of minutes, hours, days, weeks, months, years, moon cycles, and on.

It is also interesting to know where is the bitcoin most used: http://images.google.com/search?tbm=ischq=bitcoin+heatmap+world With all the above-mentionned examples, it would be possible to have all of this as input features, at every time steps: (BTC/USD, BTC/EUR, Dow_Jones, SP_500, hours, days, weeks, months, years, moons, meteo_USA, meteo_EUROPE, Twitter_sentiment).

It involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension.

It only makes sense to apply this preprocessing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In

case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.

Then, we can compute the covariance matrix that tells us about the correlation structure in the data: The (i,j) element of the data covariance matrix contains the covariance between i-th and j-th dimension of the data.

To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis: Notice that the columns of U are a set of orthonormal vectors (norm of 1, and orthogonal to each other), so they can be regarded as basis vectors.

This is also sometimes refereed to as Principal Component Analysis (PCA) dimensionality reduction: After this operation, we would have reduced the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the most variance.

The geometric interpretation of this transformation is that if the input data is a multivariable gaussian, then the whitened data will be a gaussian with zero mean and identity covariance matrix.

One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.

Note that we do not know what the final value of every weight should be in the trained network, but with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative.

The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

The implementation for one weight matrix might look like W = 0.01* np.random.randn(D,H), where randn samples from a zero mean, unit standard deviation gaussian.

With this formulation, every neuron’s weight vector is initialized as a random vector sampled from a multi-dimensional gaussian, so the neurons point in random direction in the input space.

That is, the recommended heuristic is to initialize each neuron’s weight vector as: w = np.random.randn(n) / sqrt(n), where n is the number of its inputs.

The sketch of the derivation is as follows: Consider the inner product $$s = \sum_i^n w_i x_i$$ between the weights $$w$$ and input $$x$$, which gives the raw activation of a neuron before the non-linearity.

And since $$\text{Var}(aX) = a^2\text{Var}(X)$$ for a random variable $$X$$ and a scalar $$a$$, this implies that we should draw from unit gaussian and then scale it by $$a = \sqrt{1/n}$$, to make its variance $$1/n$$.

In this paper, the authors end up recommending an initialization of the form $$\text{Var}(w) = 2/(n_{in} + n_{out})$$ where $$n_{in}, n_{out}$$ are the number of units in the previous layer and the next layer.

A more recent paper on this topic, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be $$2.0/n$$.

This gives the initialization w = np.random.randn(n) * sqrt(2.0/n), and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.

Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it.

For ReLU non-linearities, some people like to use small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient.

However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

A recently developed technique by Ioffe and Szegedy called Batch Normalization alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training.

In the implementation, applying this technique usually amounts to insert the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we’ll soon see), and before non-linearities.

It is common to see the factor of $$\frac{1}{2}$$ in front because then the gradient of this term with respect to the parameter $$w$$ is simply $$\lambda w$$ instead of $$2 \lambda w$$.

Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero.

L1 regularization is another relatively common form of regularization, where for each weight $$w$$ we add the term $$\lambda \mid w \mid$$ to the objective.

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.

In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $$\vec{w}$$ of every neuron to satisfy $$\Vert \vec{w} \Vert_2 &lt; Vanilla dropout in an example 3-layer Neural Network would be implemented as follows: In the code above, inside the train_step function we have performed dropout twice: on the first hidden layer and on the second hidden layer. It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction. Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched. Inverted dropout looks as follows: There has a been a large amount of research after the first introduction of dropout that tries to understand the source of its power in practice, and its relation to the other regularization techniques. As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective. For example, a binary classifier for each category independently would take the form: where the sum is over all categories \(j$$, and $$y_{ij}$$ is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score vector $$f_j$$ will be positive when the class is predicted to be present and negative otherwise.

A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as: Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is $$P(y = 0 \mid x; The expression above can look scary but the gradient on \(f$$ is in fact extremely simple and intuitive: $$\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)$$ (as you can double check yourself by taking the derivatives).

The L2 norm squared would compute the loss for a single example of the form: The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation.

For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss.

If you’re certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

## A simple deep learning model for stock price prediction using TensorFlow

For a recent hackathon that we did at STATWORX, some of our team members scraped minutely SP 500 data from the Google Finance API.

Having this data at hand, the idea of developing a deep learning model for predicting the SP 500 index based on the 500 constituents prices one minute ago came immediately on my mind.

Playing around with the data and building the deep learning model with TensorFlow was fun and so I decided to write my first Medium.com story: a little TensorFlow tutorial on predicting SP 500 stock prices.

The dataset contains n = 41266 minutes of data ranging from April to August 2017 on 500 stocks as well as the total SP 500 index price.

The data was already cleaned and prepared, meaning missing stock and index prices were LOCF’ed (last observation carried forward), so that the file did not contain any missing values.

quick look at the SP time series using pyplot.plot(data[&#39;SP500&#39;]): Note: This is actually the lead of the SP 500 index, meaning, its value is shifted 1 minute into the future.

There are a lot of different approaches to time series cross validation, such as rolling forecasts with and without refitting or more elaborate concepts such as time series bootstrap resampling.

The latter involves repeated samples from the remainder of the seasonal decomposition of the time series in order to simulate samples that follow the same seasonal pattern as the original time series but are not exact copies of its values.

Because most common activation functions of the network’s neurons such as tanh or sigmoid are defined on the [-1, 1] or [0, 1] interval respectively.

Nowadays, rectified linear unit (ReLU) activations are commonly used activations which are unbounded on the axis of possible activation values.

Since neural networks are actually graphs of data and mathematical operations, TensorFlow is just perfect for neural networks and deep learning.

Check out this simple example (stolen from our deep learning introduction from our blog): In the figure above, two numbers are supposed to be added.

The following code implements the toy example from above in TensorFlow: After having imported the TensorFlow library, two placeholders are defined using tf.placeholder().

We need two placeholders in order to fit our model: X contains the network&#39;s inputs (the stock prices of all SP 500 constituents at time T = t) and Y the network&#39;s outputs (the index value of the SP 500 at time T = t + 1).

The None argument indicates that at this point we do not yet know the number of observations that flow through the neural net graph in each batch, so we keep if flexible.

While placeholders are used to store input and target data in the graph, variables are used as flexible containers within the graph that are allowed to change during graph execution.

As a rule of thumb in multilayer perceptrons (MLPs, the type of networks used here), the second dimension of the previous layer is the first dimension in the current layer for weight matrices.

The biases dimension equals the second dimension of the current layer’s weight matrix, which corresponds the number of neurons in this layer.

The cost function of the network is used to generate a measure of deviation between the network’s predictions and the actual observed training targets.

The optimizer takes care of the necessary computations that are used to adapt the network’s weight and bias variables during training.

Those computations invoke the calculation of so called gradients, that indicate the direction in which the weights and biases have to be changed during training in order to minimize the network’s cost function.

Since neural networks are trained using numerical optimization techniques, the starting point of the optimization problem is one the key factors to find good solutions to the underlying problem.

During minibatch training random data samples of n = batch_size are drawn from the training data and fed into the network.

During the training, we evaluate the networks predictions on the test set — the data which is not learned, but set aside — for every 5th batch and visualize it.

The model quickly learns the shape und location of the time series in the test data and is able to produce an accurate prediction after some epochs.

Please note that there are tons of ways of further improving this result: design of layers and neurons, choosing different initialization and activation schemes, introduction of dropout layers of neurons, early stopping and so on.

## Stock market prediction

Stock market prediction is the act of trying to determine the future value of a company stock or other financial instrument traded on an exchange.

The efficient market hypothesis posits that stock prices are a function of information and rational expectations, and that newly revealed information about a company's prospects is almost immediately reflected in the current stock price.

A number of empirical tests support the notion that the theory applies generally, as most portfolios managed by professional stock predictors do not outperform the market average return after accounting for the managers' fees.

Another meaning of fundamental analysis is beyond bottom-up company analysis, it refers to top-down analysis from first analyzing the global economy, followed by country analysis and then sector analysis, and finally the company level analysis.

Outputs from the individual 'low' and 'high' networks can also be input into a final network that would also incorporate volume, intermarket data or statistical summaries of prices, leading to a final ensemble output that would trigger buying, selling, or market directional change.

function approximation) using outputs in the form of buy(y=+1) and sell(y=-1) results in better predictive reliability than a quantitative output such as low or high price.[2] This is explained by the fact that an ANN can predict class better than a quantitative value as in function approximation—since ANNs occasionally learn more about the noise in the input data.

introduced a method to identify online precursors for stock market moves, using trading strategies based on search volume data provided by Google Trends.[3] Their analysis of Google search volume for 98 terms of varying financial relevance, published in Scientific Reports,[4] suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets.[5][6][7][8][9][10][11][12] Out of these terms, three were significant at the 5% level (|z|

In a study published in Scientific Reports in 2013,[13] Helen Susannah Moat, Tobias Preis and colleagues demonstrated a link between changes in the number of views of English Wikipedia articles relating to financial topics and subsequent large stock market moves.[14] The use of Text Mining together with Machine Learning algorithms received more attention in the last years,[15] with the use of textual content from Internet as input to predict price changes in Stocks and other financial markets.

Finance and Google Finance were used as news feeding in a Text mining process, to forecast the Stocks price movements from Dow Jones Industrial Average.[18] Using new statistical analysis tools of complexity theory, researchers at the New England Complex Systems Institute (NECSI) performed research on predicting stock market crashes.[19][20][21] It has long been thought that market crashes are triggered by panics that may or may not be justified by external news.

## Can neural networks predict trended time series?

First, I should say that I am thinking of the common types of neural networks that are comprised by neurons that use some type of sigmoid transfer function, although the arguments discussed here are applicable to other types of neural networks.

The neural network arranges several such neurons in a network, effectively passing the inputs through multiple (typically) nonlinear regressions and combining the results in the output node.

In principle we could make direct connections from the inputs to layers deeper in the network or the output directly (resulting in nonlinear-linear models) or feedback loops (resulting in recurrent networks).

In the following interactive example you can choose: The first plot shows the input-output values, the plot of the transfer function and with cyan background the area of values that can be considered by the neuron given selected weight and constant.

As a side note, although I do not see MLP as anything to do with simulating biological networks, the sigmoid-type transfer functions are partly inspired by the stimulated or not states of biological neurons.

(1) suggests, in a neural network the output of a neuron is multiplied by a weight and shifted by a constant, so it is relatively easy to achieve output values much greater than the bounds of a single neuron.

and reach a minimum/maximum value and cannot decrease/increase perpetually, unless non-squashing neurons are used as well (this is for example a case where direct connections to a linear output become useful).

Suppose we want to predict the future values of a deterministic upward trend with no noise, of the form: yt = xt and xt = (1, 2, 3, 4, &#8230;).

The network is able to provide a very good fit in the training set and for most of test set A, but as the values increase (test set B) we can see that the networks starts to saturate (the individual nodes reach the upper bounds of the values they can output and eventually the whole network) and the predicted trend tapers off.

One would argue that with careful scaling of data (see good fit in test set A) it is possible to predict trends, but that implies that one knows the range that the future values would be in, to accommodate them with appropriate scaling.

Past research that I have been part of has shown that using differences is reliable and effective (for example see the specifications of neural networks here and here), even though there are unresolved problems with differencing.

## The Box–Jenkins analysis and neural networks: prediction and time series modelling

Two approaches, namely the Box–Jenkins (BJ) approach and the artificial neural networks (ANN) approach were combined to model time series data of water consumption in Kuwait.

It is interesting to note that the lagged or delayed variables obtained from the BJ approach and used in neural networks provide a better ANN model than the one obtained either blindly in blackbox mode as has been suggested or from traditional known methods.

Predicting Multiple Discrete Values with Multinomials, Neural Networks and { nnet } - ML with R

Using R and the multinom function from the { nnet } package we can easily predict discrete / factors of more than 2 levels. With the help of Repeated Cross Validation, we understand the importance...

Data Predictor Using Neural Networks

In this project , I built a program using neural networks in MATLAB for predicting the pollution in a lake near chemical plant in Saudi Arabia. I received the daily measured pollution for...

Recurrent neural network predicting the Fibonacci sequence

A recurrent neural network learning to predict the Fibonacci sequence. Input is a single number, so the recurrent layer needs to remember previous inputs to produce the correct output when input is 1.

Regression forecasting and predicting - Practical Machine Learning Tutorial with Python p.5

In this video, make sure you define the X's like so. I flipped the last two lines by mistake: X = np.array(df.drop(['label'],1)) X = preprocessing.scale(X) X_lately = X[-forecast_out:] X...

BigQuery and Cloud Machine Learning: advancing neural network predictions (Google Cloud Next '17)

The real value of BigQuery is not its speed. It's the power of "democratizing enterprise data." Because of BigQuery's scalability, you can isolate any workload on BigQuery from others. That...

Neural Networks - The Math of Intelligence #4

Have you ever wondered what the math behind neural networks looks like? What gives them such incredible power? We're going to cover 4 different neural networks in this video to develop an intuition...

Recursive Neural Tensor Nets - Ep. 11 (Deep Learning SIMPLIFIED)

Certain patterns are innately hierarchical, like the underlying parse tree of a natural language sentence. A Recursive Neural Tensor Network (RNTN) is a powerful tool for deciphering and labelling...

How to Make a Text Summarizer - Intro to Deep Learning #10

I'll show you how you can turn an article into a one-sentence summary in Python with the Keras machine learning library. We'll go over word embeddings, encoder-decoder architecture, and the...

Techniques to improve the accuracy of your Predictive Models

Code from talk: The Heritage Health Prize was a two year predictive analytics competition..

Neural Network Model - Deep Learning with Neural Networks and TensorFlow

Welcome to part three of Deep Learning with Neural Networks and TensorFlow, and part 45 of the Machine Learning tutorial series. In this tutorial, we're going to be heading (falling) down the...