
Machine Learning for Intraday Stock Price Prediction 1: Linear Models

This is the first in a series of posts on applying machine learning to intraday stock price/return prediction.

Let's look at the individual points in the graph above: there are more than 200,000 data points there, but we will look at just the first few to understand what it is that we want to predict.

Please note that it's typically better to predict returns rather than price differences, because models and techniques designed to predict returns scale more easily across different securities.

A $0.13 price difference on a $154 stock is not much, but the same $0.13 means a great deal more for a $20 stock.
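To make that concrete, here is the arithmetic using the prices from the example above:

```python
# Same dollar move, very different returns at different price levels.
move = 0.13
for price in (154.0, 20.0):
    print(f"${move:.2f} on a ${price:.0f} stock = {move / price:.2%} return")
# -> about 0.08% for the $154 stock vs. 0.65% for the $20 stock
```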

In this post, though, we will use only features derived from market data to predict the next 1-minute price change.

(We will explore news/text data in a separate post in the future.) The feature set can be broadly classified into two categories: features describing the current market snapshot, and features describing recent history.

Taken together, these features aim to capture both the current market conditions and the recent past.

While the price of AAPL might range from 153 to 155 in a day, the volume over the last 5 minutes could range from 100 to 1,000,000.
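Features on such different scales are usually standardized before fitting. A minimal sketch with made-up values (the feature names here are illustrative, not the post's actual feature set):

```python
import numpy as np

def zscore(col):
    # Standardize a feature to zero mean and unit variance.
    return (col - col.mean()) / col.std()

price = np.array([153.2, 154.1, 154.8])
volume_5min = np.array([1.2e5, 9.8e5, 3.4e3])
price_z, volume_z = zscore(price), zscore(volume_5min)  # now on comparable scales
```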

However, a model that always outputs 0 is of absolutely no value - we want opinionated models that can be useful for trading or execution.

Therefore, it is necessary to compare the standard deviation of the predicted values h(x) with the standard deviation of y.
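A small helper illustrating that check (the arrays here are made-up predictions and targets):

```python
import numpy as np

def prediction_dispersion(h_x, y):
    """Ratio of prediction spread to target spread; a value near 0 means
    the model is effectively predicting the constant 0."""
    return np.std(h_x) / np.std(y)

print(prediction_dispersion(np.array([0.01, -0.02, 0.015]),
                            np.array([0.12, -0.30, 0.21])))
```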

Adding a weight penalty to the error term is a simple way to regularize the model - this helps stabilize training, and the model often generalizes better.
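For a linear model, this is ridge regression, which minimizes ||Xw - y||² + α||w||². A minimal sketch with scikit-learn (the α value is an assumption for illustration):

```python
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # larger alpha = stronger weight penalty
# model.fit(X_train, y_train); preds = model.predict(X_test)
```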

The following results were obtained using a 2-layer feed-forward neural network with hidden_size1=100 and hidden_size2=50.
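The post doesn't specify the framework; as one hedged sketch, an equivalent architecture in scikit-learn:

```python
from sklearn.neural_network import MLPRegressor

# hidden_layer_sizes mirrors hidden_size1=100, hidden_size2=50 from the post;
# the remaining settings are assumptions for illustration.
model = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=500, random_state=0)
# model.fit(X_train, y_train); preds = model.predict(X_test)
```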

Boston Home Prices Prediction and Evaluation

It is difficult to measure the quality of a given model without quantifying its performance over training and testing.

This is typically done with a performance metric, whether by calculating an error, a goodness of fit, or some other useful measurement.

The values for R2 range from 0 to 1, capturing the proportion of variance in the target variable that is explained by the predictions: a score of 0 means the model is no better than always predicting the mean, while 1 means a perfect fit.
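A quick sketch of computing R2 with scikit-learn, on made-up values:

```python
from sklearn.metrics import r2_score

y_true = [24.0, 21.6, 34.7, 33.4]  # made-up actual home prices
y_pred = [23.5, 22.0, 33.9, 34.1]  # made-up model predictions
print(r2_score(y_true, y_pred))    # 1.0 = perfect fit; 0 = no better than the mean
```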

A simple deep learning model for stock price prediction using TensorFlow

For a recent hackathon that we did at STATWORX, some of our team members scraped minute-by-minute S&P 500 data from the Google Finance API.

Having this data at hand, the idea of developing a deep learning model for predicting the S&P 500 index based on the 500 constituents' prices one minute earlier immediately came to mind.

Playing around with the data and building the deep learning model with TensorFlow was fun, so I decided to write my first Medium.com story: a little TensorFlow tutorial on predicting S&P 500 stock prices.

The dataset contains n = 41,266 minutes of data, ranging from April to August 2017, on 500 stocks as well as the total S&P 500 index price.

The data was already cleaned and prepared, meaning missing stock and index prices were LOCF’ed (last observation carried forward), so that the file did not contain any missing values.

A quick look at the S&P time series using pyplot.plot(data['SP500']): note that this is actually the lead of the S&P 500 index, meaning its value is shifted 1 minute into the future.
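For reference, that quick look boils down to something like this (the CSV file name is an assumption):

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("sp500.csv")   # assumed file name for the scraped dataset
plt.plot(data['SP500'])           # the column name comes from the tutorial
plt.xlabel("Minute")
plt.ylabel("S&P 500 index (lead)")
plt.show()
```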

There are a lot of different approaches to time series cross validation, such as rolling forecasts with and without refitting or more elaborate concepts such as time series bootstrap resampling.

The latter involves repeated samples from the remainder of the seasonal decomposition of the time series in order to simulate samples that follow the same seasonal pattern as the original time series but are not exact copies of its values.
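As one illustration of the rolling-forecast idea, here is a sketch using scikit-learn's TimeSeriesSplit on a stand-in series (not the S&P data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # stand-in time series
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Each fold trains on an expanding window and tests on the period after it.
    print(f"train up to t={train_idx[-1]}, test t={test_idx[0]}..{test_idx[-1]}")
```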

Scaling the data matters because the most common activation functions of the network's neurons, such as tanh or sigmoid, are defined on the [-1, 1] or [0, 1] interval, respectively.

Nowadays, rectified linear unit (ReLU) activations are commonly used; they are unbounded above on the axis of possible activation values.
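Either way, scaling the inputs into a suitable range is standard practice. A minimal sketch with scikit-learn's MinMaxScaler, on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data_train = np.array([[153.0, 1.0e5], [154.0, 9.8e5], [155.0, 3.4e3]])
data_test = np.array([[154.5, 2.0e5]])

scaler = MinMaxScaler(feature_range=(-1, 1))
train_scaled = scaler.fit_transform(data_train)  # fit on training data only
test_scaled = scaler.transform(data_test)        # reuse stats to avoid leakage
```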

Since neural networks are actually graphs of data and mathematical operations, TensorFlow is just perfect for neural networks and deep learning.

Check out this simple example (stolen from the deep learning introduction on our blog), in which two numbers are supposed to be added.

The following code implements the toy example from above in TensorFlow. After importing the TensorFlow library, two placeholders are defined using tf.placeholder().
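A sketch of that toy example in TensorFlow 1.x syntax (in TF 2.x, tf.placeholder lives under tf.compat.v1):

```python
import tensorflow as tf

# Define the graph: two placeholders and an addition operation.
a = tf.placeholder(dtype=tf.int8)
b = tf.placeholder(dtype=tf.int8)
c = tf.add(a, b)

# Run the graph, feeding concrete values into the placeholders.
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 5, b: 4}))  # prints 9
```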

We need two placeholders in order to fit our model: X contains the network's inputs (the stock prices of all S&P 500 constituents at time T = t) and Y the network's outputs (the index value of the S&P 500 at time T = t + 1).

The None argument indicates that at this point we do not yet know the number of observations that flow through the neural net graph in each batch, so we keep it flexible.
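Concretely, in TF 1.x syntax this looks something like:

```python
import tensorflow as tf

n_stocks = 500  # number of S&P 500 constituents used as inputs

# None leaves the batch dimension flexible, as described above.
X = tf.placeholder(dtype=tf.float32, shape=[None, n_stocks])
Y = tf.placeholder(dtype=tf.float32, shape=[None])
```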

While placeholders are used to store input and target data in the graph, variables are used as flexible containers within the graph that are allowed to change during graph execution.

As a rule of thumb in multilayer perceptrons (MLPs, the type of networks used here), the second dimension of the previous layer is the first dimension in the current layer for weight matrices.

The biases dimension equals the second dimension of the current layer's weight matrix, which corresponds to the number of neurons in this layer.
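In TF 1.x syntax, following that rule (the layer sizes here are assumptions for illustration):

```python
import tensorflow as tf

n_stocks, n_neurons_1, n_neurons_2 = 500, 1024, 512  # illustrative sizes

# First hidden layer: [inputs, neurons_1].
W_hidden_1 = tf.Variable(tf.random_normal([n_stocks, n_neurons_1]))
bias_hidden_1 = tf.Variable(tf.zeros([n_neurons_1]))

# Second hidden layer: first dim matches the previous layer's second dim.
W_hidden_2 = tf.Variable(tf.random_normal([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(tf.zeros([n_neurons_2]))

# Output layer: a single value, the index prediction.
W_out = tf.Variable(tf.random_normal([n_neurons_2, 1]))
bias_out = tf.Variable(tf.zeros([1]))
```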

The cost function of the network is used to generate a measure of deviation between the network’s predictions and the actual observed training targets.

The optimizer takes care of the necessary computations that are used to adapt the network’s weight and bias variables during training.

Those computations invoke the calculation of so-called gradients, which indicate the direction in which the weights and biases have to be changed during training in order to minimize the network's cost function.
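A minimal, self-contained sketch of that wiring in TF 1.x, with MSE as a common choice of deviation measure and a tiny stand-in graph (the tutorial's real output layer is built from the hidden layers above):

```python
import tensorflow as tf

# Tiny stand-in graph so the cost/optimizer wiring runs on its own.
X = tf.placeholder(tf.float32, shape=[None, 4])
Y = tf.placeholder(tf.float32, shape=[None])
W = tf.Variable(tf.random_normal([4, 1]))
b = tf.Variable(tf.zeros([1]))
out = tf.transpose(tf.matmul(X, W) + b)

mse = tf.reduce_mean(tf.squared_difference(out, Y))  # deviation measure (MSE)
opt = tf.train.AdamOptimizer().minimize(mse)         # computes & applies gradients
```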

Since neural networks are trained using numerical optimization techniques, the starting point of the optimization problem is one of the key factors in finding good solutions to the underlying problem.

During minibatch training, random data samples of n = batch_size are drawn from the training data and fed into the network.

During training, we evaluate the network's predictions on the test set (the data that is not learned, but set aside) after every 5th batch, and visualize them.
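A condensed sketch of that loop, assuming the graph nodes `opt`, `out`, `X`, `Y` from the snippets above and pre-split NumPy arrays `X_train`, `y_train`, `X_test`:

```python
import numpy as np
import tensorflow as tf

epochs, batch_size = 10, 256  # illustrative values, not the tutorial's settings

with tf.Session() as net:
    net.run(tf.global_variables_initializer())
    for e in range(epochs):
        # Shuffle the training data each epoch so batches are random draws.
        shuffle = np.random.permutation(len(y_train))
        X_train, y_train = X_train[shuffle], y_train[shuffle]
        for i in range(len(y_train) // batch_size):
            s = i * batch_size
            feed = {X: X_train[s:s + batch_size], Y: y_train[s:s + batch_size]}
            net.run(opt, feed_dict=feed)
            if i % 5 == 0:  # evaluate on the held-out test set every 5th batch
                pred = net.run(out, feed_dict={X: X_test})
```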

The model quickly learns the shape and location of the time series in the test data and is able to produce an accurate prediction after some epochs.

Please note that there are tons of ways to further improve this result: the design of layers and neurons, different initialization and activation schemes, the introduction of dropout layers, early stopping, and so on.

Machine learning tutorial: Create your first data science experiment in Azure Machine Learning Studio

The experiment will test an analytical model that predicts the price of an automobile based on different variables such as make and technical specifications.

When your model is ready, you can publish it as a web service so that others can send it new data and get predictions in return.

Sign in to Machine Learning Studio. In this machine learning tutorial, you'll follow five basic steps to build an experiment in Machine Learning Studio to create, train, and score your model. The first thing you need for machine learning is data.

The dataset includes entries for various individual automobiles, including information such as make, model, technical specifications, and price.

In this sample dataset, each instance of an automobile appears as a row, and the variables associated with each automobile appear as columns.

Given the variables for a specific automobile, we're going to try to predict the price in the far-right column (column 26, titled "price").

First we add a module that removes the normalized-losses column completely, and then we add another module that removes any row that has missing data.

But to start, let's try a small selection of features. This produces a filtered dataset containing only the features we want to pass to the learning algorithm we'll use in the next step.

Then we'll test the model - we'll give it a set of features for automobiles we're familiar with and see how close the model comes to predicting the known price.

After running, the experiment should now look something like this. Now that we've trained the model using 75 percent of our data, we can use it to score the other 25 percent to see how well our model functions.

The final experiment should look something like this. Now that you've completed the first machine learning tutorial and have your experiment set up, you can continue to improve the model and then deploy it as a predictive web service.

For a more extensive and detailed walkthrough of the process of creating, training, scoring, and deploying a model, see Develop a predictive solution by using Azure Machine Learning.