
Presenting “Hello, TensorFlow!”

If you've used software like R's neuralnet package or Python's PyBrain, you've seen interfaces where you typically define the number of neurons or 'hidden units' that you want in each layer, and the connections between them (and perhaps some other components) are set up automatically one way or another.

But if you're looking at the math you have to evaluate inside these neural net models, the 'neurons' hardly exist at all.

The weights on the connections certainly exist, and there are operations that connect values as they flow through the network, but in the end you don't really need 'neurons' to be a dominant abstraction in the system.

Focusing on (typically fixed-dimension) tensors does arguably make it more awkward to attempt esoteric techniques like dynamically changing the architecture of your neural net by adding neurons, for example.

(This comment was inspired by Let your Networks Grow, Neural Nets Back to the Future at ICML 2016.) The linear algebra you actually have to think about is not too tricky, but for our purposes I'm not going to use anything more than individual numbers, so there's no chance of being distracted by linear algebra.

The relevant point is that while some other software focuses on neurons and mostly ignores weights, in the case of TensorFlow we focus on weights and mostly ignore neurons as a top-level abstraction.

The advantage of graphs is that the software can manipulate them and 'reason about' how to answer questions you're asking before it commits to particular execution paths.
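
To make the graph-then-execute idea concrete, here is a minimal sketch using the TensorFlow 1.x graph API; the constant and variable values are illustrative.

```python
import tensorflow as tf

# Building these nodes performs no arithmetic yet; they just
# describe a computation graph.
x = tf.constant(1.0)
w = tf.Variable(0.8)
y = w * x  # another node in the graph

# Computation happens only when a session runs the graph.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y))  # 0.8
```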

Outside of deep learning software, Spark works with a kind of computation graph to manage distributed processing.

The TensorBoard tooling that comes with TensorFlow is a great logging and visualizing system that gives you access to a lot of internals that could otherwise be a lot of work to expose.

So if you develop a TensorFlow model on your laptop, you can (relatively) easily build a deployment system to serve that model in a serious way.

Tinker With a Neural Network Right Here in Your Browser. Don’t Worry, You Can’t Break It. We Promise.

Orange and blue are used throughout the visualization in slightly different ways, but in general orange shows negative values while blue shows positive values.

The data points (represented by small circles) are initially colored orange or blue, which correspond to negative one and positive one respectively.

10 misconceptions about Neural Networks

In quantitative finance neural networks are often used for time-series forecasting, constructing proprietary indicators, algorithmic trading, securities classification and credit risk modelling.

One reason why I believe current generation neural networks are not capable of sentience (a different concept to intelligence) is because I believe that biological neurons are much more complex than artificial neurons.

In the context of quantitative finance, I think it is important to remember this, because whilst it may sound cool to say that something is 'inspired by the brain', this statement may result in unrealistic expectations or fear.

The difference between a multiple linear regression and a perceptron is that a perceptron feeds the signal generated by a multiple linear regression into an activation function which may or may not be non-linear.
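
As a concrete sketch of that definition, here is a hypothetical perceptron in Python; the weights, bias, and sign activation are illustrative choices rather than a prescribed implementation.

```python
import numpy as np

# A perceptron: a multiple linear regression (w . x + b) whose result
# is fed through an activation function.
def perceptron(x, w, b, activation=np.sign):
    return activation(np.dot(w, x) + b)

# Illustrative call: two inputs with hand-picked weights.
print(perceptron(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))  # 1.0
```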

The input layer receives input patterns and the output layer could contain a list of classifications or output signals to which those input patterns may map.

Given a pattern, p, the objective of this network would be to minimize the error of the output signal, o_p, relative to some known target value for some given training pattern, t_p.

For example, if the neuron was supposed to map p to -1 but it mapped it to 1 then the error, as measured by sum-squared distance, of the neuron would be 4, (-1 - 1)^2. 

I think that one of the problems facing the use of deep neural networks for trading (in addition to the obvious risk of overfitting) is that the inputs into the neural network are almost always heavily pre-processed, meaning that there may be few features to actually extract because the inputs are already, to some extent, features.

Sum-squared error (SSE): \epsilon = \sum^{P_T}_{p=1} \big( t_p - o_p \big)^2. Given that the objective of the network is to minimize \epsilon, we can use an optimization algorithm to adjust the weights in the neural network.

The most common learning algorithm for neural networks is the gradient descent algorithm although other and potentially better optimization algorithms can be used. Gradient descent works by calculating the partial derivative of the error with respect to the weights for each layer in the neural network and then moving in the opposite direction to the gradient (because we want to minimize the error of the neural network).

Expressed mathematically, the update rule for the weights in the neural network (\textbf{v}) is given by v_i(t) = v_i(t - 1) + \Delta v_i(t), where \Delta v_i(t) = \eta \big( -\frac{\partial \epsilon}{\partial v_i} \big) and \frac{\partial \epsilon}{\partial v_i} = -2(t_p - o_p) \frac{\partial f}{\partial net_p} z_{i,p}, and where \eta is the learning rate, which controls how quickly or slowly the neural network converges.
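
A minimal sketch of this update rule for a single linear neuron (so the derivative of f with respect to the net input is 1); the data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 3))              # input patterns z_p
t = Z @ np.array([1.0, -2.0, 0.5])         # target values t_p
v = np.zeros(3)                            # weights
eta = 0.01                                 # learning rate

for epoch in range(500):
    o = Z @ v                              # output signals o_p
    grad = -2 * Z.T @ (t - o)              # dE/dv = -2 * sum_p (t_p - o_p) z_p
    v += eta * (-grad) / len(t)            # step against the gradient
```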

It is worth noting that the calculation of the partial derivative of f with respect to the net input signal for a pattern p represents a problem for any discontinuous activation function, since its derivative is undefined at the discontinuity.

That having been said I do agree that some practitioners like to treat neural networks as a 'black box' which can be thrown at any problem without first taking the time to understand the nature of the problem and whether or not neural networks are an appropriate choice.

Many modern day advances in the field of machine learning do not come from rethinking the way that perceptrons and optimization algorithms work but rather from being creative regarding how these components fit together.

Below I discuss some very interesting and creative neural network architectures which have been developed over time.

Recurrent Neural Networks - some or all connections flow backwards, meaning that feedback loops exist in the network.

Deep neural networks have become extremely popular in more recent years due to their unparalleled success in image and voice recognition problems. The number of deep neural network architectures is growing quite quickly but some of the most popular architectures include deep belief networks, convolutional neural networks, deep restricted Boltzmann machines, stacked auto-encoders, and many more.

Radial basis networks - although not a different type of architecture in the sense of perceptrons and connections, radial basis networks make use of radial basis functions as their activation functions; these are real-valued functions whose output depends on the distance from a particular point.

The most commonly used radial basis function is the Gaussian. Because radial basis functions can take on much more complex forms, they were originally used for performing function interpolation.
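
A sketch of a Gaussian radial basis function; the width parameter sigma is an illustrative assumption.

```python
import numpy as np

# Output depends only on the distance of x from the centre c.
def gaussian_rbf(x, c, sigma=1.0):
    return np.exp(-np.linalg.norm(x - c) ** 2 / (2 * sigma ** 2))

print(gaussian_rbf(np.array([0.0, 0.0]), np.array([1.0, 1.0])))  # ~0.37
```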

As such, quantitative analysts interested in using neural networks should probably test multiple neural network architectures and consider combining their outputs together in an ensemble to maximize their investment performance.

The reason why these questions are important is that if the neural network is too large (too small), the neural network could potentially overfit (underfit) the data, meaning that the network would not generalize well out of sample.

There are two popular approaches used in industry, namely early stopping and regularization, and then there is my personal favourite approach, global search. Early stopping involves splitting your training set into the main training set and a validation set.

This is the equivalent of adding a prior which essentially makes the neural network believe that the function it is approximating is smooth: \epsilon = \beta \sum^{P_T}_{p=1} \big( t_p - o_p \big)^2 + \alpha \sum^n_{j=1} v_j^2, where n is the number of weights in the neural network.
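
As a sketch, the regularized objective might be computed like this; the alpha and beta values are illustrative.

```python
import numpy as np

# Regularized objective: beta * SSE + alpha * ||v||^2 (weight decay).
def regularized_error(t, o, v, alpha=0.01, beta=1.0):
    return beta * np.sum((t - o) ** 2) + alpha * np.sum(v ** 2)
```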

This condition is typically either when the error of the network reaches an acceptable level of accuracy on the training set, when the error of the network on the validation set begins to deteriorate, or when the specified computational budget has been exhausted. The most common learning algorithm for neural networks is the backpropagation algorithm, which uses the stochastic gradient descent discussed earlier in this article.
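
Here is a minimal early-stopping sketch on a toy linear model; the data, learning rate, and patience threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
t = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
X_tr, t_tr = X[:150], t[:150]              # main training set
X_val, t_val = X[150:], t[150:]            # validation set

v = np.zeros(5)
best_val, bad, patience = np.inf, 0, 10
for epoch in range(1000):
    grad = -2 * X_tr.T @ (t_tr - X_tr @ v) / len(t_tr)
    v -= 0.05 * grad
    val_err = np.sum((t_val - X_val @ v) ** 2)
    if val_err < best_val:
        best_val, bad = val_err, 0
    else:
        bad += 1
        if bad >= patience:                # validation error deteriorating
            break
```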

Adjusting all the weights at once can result in a significant movement of the neural network in weight space; moreover, the gradient descent algorithm is quite slow and is susceptible to local minima.

Here is how they can be used to train neural networks:

Neural network vector representation - by encoding the neural network as a vector of weights, each representing the weight of a connection in the neural network, we can train neural networks using most meta-heuristic search algorithms.

The fitness function is calculated as the sum-squared error of the reconstructed neural network after completing one feedforward pass of the training data set.
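
A sketch of this encoding and fitness evaluation, assuming illustrative layer sizes, a tanh activation, and toy data.

```python
import numpy as np

# Encode a small feedforward network as one flat weight vector so a
# metaheuristic can search weight space directly.
sizes = [(3, 4), (4, 1)]                   # (inputs, outputs) per layer

def decode(vec):
    mats, i = [], 0
    for m, n in sizes:
        mats.append(vec[i:i + m * n].reshape(m, n))
        i += m * n
    return mats

def fitness(vec, X, t):
    # Sum-squared error after one feedforward pass of the training set.
    a = X
    for W in decode(vec):
        a = np.tanh(a @ W)
    return np.sum((t - a.ravel()) ** 2)

rng = np.random.default_rng(0)
X, t = rng.normal(size=(20, 3)), rng.normal(size=20)
print(fitness(rng.normal(size=16), X, t))  # 3*4 + 4*1 = 16 weights
```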

In addition to these population-based metaheuristic search algorithms, other algorithms have been used to train neural networks, including backpropagation with added momentum, differential evolution, Levenberg-Marquardt, simulated annealing, and many more.

Neural networks can use one of three learning strategies, namely a supervised learning strategy, an unsupervised learning strategy, or a reinforcement learning strategy. Supervised learning requires at least two data sets: a training set which consists of inputs with the expected output, and a testing set which consists of inputs without the expected output.

Reinforcement learning is based on the simple premise of rewarding neural networks for good behaviours and punishing them for bad behaviours. Because unsupervised and reinforcement learning strategies do not require that data be labelled, they can be applied to under-formulated problems where the correct output is not known.

Self-Organizing Maps are essentially a multi-dimensional scaling technique which constructs an approximation of the probability density function of some underlying data set, \textbf{Z}, whilst preserving the topological structure of that data set. This is done by mapping input vectors, \textbf{z}_i, in the data set, \textbf{Z}, to weight vectors, \textbf{v}_j, (neurons) in the feature map, \textbf{V}.

In the context of financial markets (and game playing), reinforcement learning strategies are particularly useful because the neural network learns to optimize a particular quantity, such as an appropriate measure of risk-adjusted return.

One of the inputs is the price of the security and we are using the Sigmoid activation function. However, most of the securities cost between $5 and $15 per share, and for such inputs the output of the Sigmoid function approaches 1.0.
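
A small sketch of that saturation problem; the price values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Unscaled prices saturate the sigmoid: every output is ~1.0, so the
# network cannot distinguish a $5 share from a $15 share.
prices = np.array([5.0, 10.0, 15.0])
print(sigmoid(prices))          # ~[0.9933, 0.99995, 0.9999997]
print(sigmoid(prices / 15.0))   # rescaled inputs remain informative
```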

Neural networks trained on unprocessed data produce models where 'the lights are on but nobody's home'.

Outlier removal - an outlier is a value that is much smaller or larger than most of the other values in some set of data.

Outliers can cause problems with statistical techniques like regression analysis and curve fitting because when the model tries to 'accommodate' the outlier, the performance of the model across all other data deteriorates. The illustration shows that trying to accommodate an outlier into the linear regression model results in a poor fit of the data set.

Remove redundancy - when two or more of the independent variables being fed into the neural network are highly correlated (multicollinearity), this can negatively affect the neural network's learning ability. Highly correlated inputs also mean that the amount of unique information presented by each variable is small, so the less significant input can be removed.

For example, fund managers wouldn't know how a neural network makes trading decisions, so it is impossible to assess the risks of the trading strategies learned by the neural network. Similarly, banks using neural networks for credit risk modelling would not be able to justify why a customer has a particular credit rating, which is a regulatory requirement. That having been said, state-of-the-art rule-extraction algorithms have been developed to vitrify (render transparent) some neural network architectures.

This is the difference between predicate and propositional logic. If we had a simple neural network which took Price (P), Simple Moving Average (SMA), and Exponential Moving Average (EMA) as inputs, and we extracted a trend-following strategy from the neural network in propositional logic, we might get rules like this:

Therefore, for traders, there is no way to determine the confidence of these results. Fuzzy logic overcomes this limitation by introducing a membership function which specifies how much a variable belongs to a particular domain.
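
A sketch of one possible membership function; the sigmoid shape and scale parameter are illustrative assumptions, not a prescribed form.

```python
import numpy as np

# Rather than "SMA > EMA" being strictly true or false, grade *how
# strongly* the market is trending, as a degree in (0, 1).
def trending_up(sma, ema, scale=1.0):
    return 1.0 / (1.0 + np.exp(-(sma - ema) / scale))

print(trending_up(101.0, 100.0))  # ~0.73: a mildly confident uptrend
```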

This article describes how to evolve security analysis decision trees using genetic programming. Decision tree induction is the term given to the process of extracting decision trees from neural networks.

Webpage - http://h2o.ai/
GitHub Repositories - https://github.com/h2oai

H2O is not strictly a package for machine learning; instead, it exposes an API for doing fast and scalable machine learning for smarter applications which use big data.

Webpage - https://azure.microsoft.com/en-us/services/machine-learning
GitHub Repositories - https://github.com/Azure?utf8=%E2%9C%93&query=MachineLearning

The machine learning / predictive analytics platform in Microsoft Azure is a fully managed cloud service that enables you to easily build, deploy, and share predictive analytics solutions.

This software basically allows you to drag and drop pre-built components (including machine learning models) and custom-built components which manipulate data sets into a process. This flow-chart is then compiled into a program and can be deployed as a web-service.

'We have designed it with the following functionality in mind: 1) Support for commonly used models and examples: convnets, MLPs, RNNs, LSTMs, autoencoders, 2) Tight integration with nervanagpu kernels for fp16 and fp32 (benchmarks) on Maxwell GPUs, 3) Basic automatic differentiation support, 4) Framework for visualization, and 5) Swappable hardware backends ...'

'A summary of core features includes: an N-dimensional array; routines for indexing, slicing, and transposing; an interface to C via LuaJIT; linear algebra routines; neural network and energy-based models; numeric optimization routines; fast and efficient GPU support; embeddable, with ports to iOS, Android and FPGA.' - Torch Webpage (November 2015).

It is built on NumPy, SciPy, and matplotlib, is open source, and exposes implementations of various machine learning models for classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.

Before committing to any one solution I would recommend doing a best-fit analysis to see which open source or proprietary machine learning package or software best matches your use-cases.

Despite this, they have a bad reputation due to the many unsuccessful attempts to use them in practice. In most cases, unsuccessful neural network implementations can be traced back to inappropriate neural network design decisions and general misconceptions about how they work.

For readers interested in getting more information, I have found the following books to be quite instructional when it comes to neural networks and their role in financial modelling and algorithmic trading. 

An Introduction to Implementing Neural Networks using TensorFlow

Neural networks and their deep versions are making tremendous breakthroughs in many fields such as image recognition, speech, and natural language processing.

As with every ML algorithm, neural networks follow the usual ML workflow of data preprocessing, model building, and model evaluation.

Our brain can look at an image and understand the complete picture in a few seconds. A computer, on the other hand, sees an image as just an array of numbers.

For example, a face always has a specific structure which is somewhat preserved in every human, such as the position of eyes, nose or the shape of our face.

Fast forward to 2012: a deep neural network architecture won the ImageNet challenge, a prestigious challenge to recognise objects from natural scenes.

Deep neural networks continued to dominate the subsequent ImageNet challenges, thus proving their usefulness for solving image problems.

The most popular libraries, to name a few, are:

Now that you understand how an image is stored and which common libraries are used, let us look at what TensorFlow has to offer.

Let's start with the official definition: “TensorFlow is an open source software library for numerical computation using dataflow graphs.

Nodes in the graph represent mathematical operations, while graph edges represent multi-dimensional data arrays (aka tensors) communicated between them.

The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.”

For example, when implementing scikit-learn, you first create an object of the desired algorithm, then build a model on the training set and get predictions on the test set, something like the sketch below. As I said earlier, TensorFlow follows a lazy approach.
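
Here is a minimal sketch of that eager scikit-learn pattern; the estimator choice and generated data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Create illustrative data and split it into train and test sets.
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create the estimator object, fit on train, predict on test.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)            # the model is built immediately
print(clf.score(X_test, y_test))     # predictions on the test set
```

With TensorFlow 1.x, by contrast, the analogous steps only execute once the graph is run inside a session.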

Note: We could have used a different neural network architecture to solve this problem, but for the sake of simplicity, we settle on a feed-forward multilayer perceptron with an in-depth implementation.

A typical implementation of a neural network would be as follows (see the sketch after this paragraph). Here we solve our deep learning practice problem – Identify the Digits. Let's take a moment to look at our problem statement.
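
A hedged sketch of such a typical TensorFlow 1.x implementation; the shapes, single-layer architecture, and hyperparameters are illustrative, not the article's exact code.

```python
import tensorflow as tf

# Define placeholders for the data and variables for the parameters.
x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 images
y = tf.placeholder(tf.float32, [None, 10])    # one-hot digit labels

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, W) + b

# Define the loss and the training operation, then run in a session.
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # per epoch: sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```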

The dataset contains a zipped file of all the images in the dataset and both the train.csv and test.csv have the name of corresponding train and test images.

Also, deployment of TensorFlow models is already supported, which makes it easier to use for industrial purposes, giving a fight to commercial libraries such as Deeplearning4j, H2O and Turi.

And to gain expertise in working with neural networks, don’t forget to try out our deep learning practice problem – Identify the Digits.

Why Deep Learning Is Suddenly Changing Your Life

Over the past four years, readers have doubtlessly noticed quantum leaps in the quality of a wide range of everyday technologies.

To gather up dog pictures, the app must identify anything from a Chihuahua to a German shepherd and not be tripped up if the pup is upside down or partially obscured, at the right of the frame or the left, in fog or snow, sun or shade.

Medical startups claim they’ll soon be able to use computers to read X-rays, MRIs, and CT scans more rapidly and accurately than radiologists, to diagnose cancer earlier and less invasively, and to accelerate the search for life-saving pharmaceuticals.

They’ve all been made possible by a family of artificial intelligence (AI) techniques popularly known as deep learning, though most scientists still prefer to call them by their original academic designation: deep neural networks.

Programmers have, rather, fed the computer a learning algorithm, exposed it to terabytes of data—hundreds of thousands of images or years’ worth of speech samples—to train it, and have then allowed the computer to figure out for itself how to recognize the desired objects, words, or sentences.

“You essentially have software writing software,” says Jen-Hsun Huang, CEO of graphics processing leader Nvidia, which began placing a massive bet on deep learning about five years ago.

What’s changed is that today computer scientists have finally harnessed both the vast computational power and the enormous storehouses of data—images, video, audio, and text files strewn across the Internet—that, it turns out, are essential to making neural nets work well.

“We’re now living in an age,” Chen observes, “where it’s going to be mandatory for people building sophisticated software applications.” People will soon demand, he says, “ ‘Where’s your natural-language processing version?’ ‘How do I talk to your app?’ ”

The increased computational power that is making all this possible derives not only from Moore’s law but also from the realization in the late 2000s that graphics processing units (GPUs) made by Nvidia—the powerful chips that were first designed to give gamers rich, 3D visual experiences—were 20 to 50 times more efficient than traditional central processing units (CPUs) for deep-learning computations.

Its chief financial officer told investors that “the vast majority of the growth comes from deep learning by far.” The term “deep learning” came up 81 times during the 83-minute earnings call.

“I think five years from now there will be a number of S&P 500 CEOs that will wish they’d started thinking earlier about their AI strategy.” Even the Internet metaphor doesn’t do justice to what AI with deep learning will mean, in Ng’s view.

Multi-Layer Neural Networks with Sigmoid Function— Deep Learning for Rookies (2)

Welcome back to my second post of the series Deep Learning for Rookies (DLFR), by yours truly, a rookie ;) Feel free to refer back to my first post here or my blog if you find it hard to follow.

You’ll be able to brag about your understanding soon ;) Last time, we introduced the field of Deep Learning and examined a simple neural network — a perceptron……or a dinosaur……ok, seriously, a single-layer perceptron.

After all, most problems in the real world are non-linear, and as individual humans, you and I are pretty darn good at the decision-making of linear or binary problems like should I study Deep Learning or not without needing a perceptron.

Fast forward almost two decades to 1986, Geoffrey Hinton, David Rumelhart, and Ronald Williams published a paper “Learning representations by back-propagating errors”, which introduced: If you are completely new to DL, you should remember Geoffrey Hinton, who plays a pivotal role in the progress of DL.

Remember that we iterated the importance of designing a neural network so that the network can learn from the difference between the desired output (what the fact is) and actual output (what the network returns) and then send a signal back to the weights and ask the weights to adjust themselves?

Secondly, when we multiply each of the m features (x1, x2, …, xm) with a weight (w1, w2, …, wm) and sum them all together, this is a dot product: x1w1 + x2w2 + … + xmwm. So here are the takeaways for now: The procedure of how input values are forward propagated into the hidden layer, and then from the hidden layer to the output, is the same as in Graph 1 (see the sketch below).
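
A sketch of that forward propagation; the inputs, weights, and sigmoid choice are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])       # m = 3 input features
W_hidden = np.full((3, 4), 0.1)      # input -> hidden weights
W_output = np.full((4, 1), 0.1)      # hidden -> output weights

h = sigmoid(x @ W_hidden)            # dot products, then activation
y = sigmoid(h @ W_output)            # same procedure, hidden -> output
print(y)
```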

One thing to remember is: If the activation function is linear, then you can stack as many hidden layers in the neural network as you wish, and the final output is still a linear combination of the original input data.
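
A quick numerical sketch of this collapse, with randomly generated weight matrices standing in for trained ones.

```python
import numpy as np

# Two linear "layers" are always equal to one linear layer, by
# associativity of matrix multiplication.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
x = rng.normal(size=3)

deep = (x @ W1) @ W2                 # two stacked linear layers
shallow = x @ (W1 @ W2)              # one equivalent layer
print(np.allclose(deep, shallow))    # True
```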

So basically, a small change in any weight in the input layer of our perceptron network could possibly lead one neuron to suddenly flip from 0 to 1, which could again affect the hidden layer’s behavior, and then affect the final outcome.

Non-linear just means that the output we get from the neuron, which is the dot product of some inputs x (x1, x2, …, xm) and weights w (w1, w2, …,wm) plus bias and then put into a sigmoid function, cannot be represented by a linear combination of the input x (x1, x2, …,xm).

This non-linear activation function, when used by each neuron in a multi-layer neural network, produces a new “representation” of the original data, and ultimately allows for non-linear decision boundary, such as XOR.

If our output value is on the lower flat area on the two corners, then it’s false or 0, since it’s not right to say the weather is both hot and cold or neither hot nor cold (ok, I guess the weather could be neither hot nor cold…you get what I mean though…right?).

You can memorize these takeaways since they’re facts, but I encourage you to google a bit on the internet and see if you can understand the concept better (it is natural that we take some time to understand these concepts).

From the XOR example above, you’ve seen that adding two hidden neurons in 1 hidden layer could reshape our problem into a different space, which magically created a way for us to classify XOR with a ridge.
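
Here is a hand-wired sketch of that idea; the step activation and all weights are chosen by hand for illustration.

```python
# Two hidden neurons reshape the inputs so a single output neuron can
# separate XOR.
def step(z):
    return float(z > 0)

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)         # OR-like hidden neuron
    h2 = step(x1 + x2 - 1.5)         # AND-like hidden neuron
    return step(h1 - h2 - 0.5)       # fires only for "OR but not AND"

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_net(a, b)))  # prints 0, 1, 1, 0
```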

Now, the computer can’t really “see” a digit like we humans do, but if we dissect the image into an array of 784 numbers like [0, 0, 180, 16, 230, …, 4, 77, 0, 0, 0], then we can feed this array into our neural network.
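
A tiny sketch of this flattening, using randomly generated pixel values.

```python
import numpy as np

# A 28x28 grayscale digit flattens into a length-784 vector for the
# input layer.
image = np.random.randint(0, 256, size=(28, 28))
x = image.reshape(784)               # e.g. [0, 0, 180, 16, 230, ...]
print(x.shape)                       # (784,)
```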

So if the neural network thinks the handwritten digit is a zero, then we should get an output array of [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]: the first output in this array, which senses the digit to be a zero, is “fired” to be 1 by our neural network, and the rest are 0.

If the neural network thinks the handwritten digit is a 5, then we should get [0, 0, 0, 0, 0, 1, 0, 0, 0, 0].
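
A sketch of constructing such one-hot target arrays; the helper name one_hot is hypothetical.

```python
import numpy as np

# The desired output for digit d is a one-hot array of length 10.
def one_hot(d):
    out = np.zeros(10)
    out[d] = 1.0
    return out

print(one_hot(0))   # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(one_hot(5))   # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```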

Remember we mentioned that neural networks become better by repetitively training themselves on data so that they can adjust the weights in each layer of the network to get the final results/actual output closer to the desired output?

For the sake of argument, let’s imagine the following case in Graph 14, which I borrow from Michael Nielsen’s online book: After training the neural network with rounds and rounds of labeled data in supervised learning, assume the first 4 hidden neurons learned to recognize the patterns above in the left side of Graph 14.

Then, if we feed the neural network an array of a handwritten digit zero, the network should correctly trigger the top 4 hidden neurons in the hidden layer while the other hidden neurons are silent, and then again trigger the first output neuron while the rest are silent.

If you train the neural network with a new set of randomized weights, it might produce the following network instead (compare Graph 15 with Graph 14), since the weights are randomized and we never know which one will learn which or what pattern.

Gradient descent, how neural networks learn | Chapter 2, deep learning

Subscribe for more (part 3 will be on backpropagation): Thanks to everybody supporting on Patreon

But what *is* a Neural Network? | Chapter 1, deep learning

Subscribe to stay notified about new videos: Support more videos like this on Patreon:

Deep Learning with Tensorflow - Tensors, Variables and Placeholders

Enroll in the course for free at: Deep Learning with TensorFlow Introduction. The majority of data ...

Lecture 4 | Introduction to Neural Networks

In Lecture 4 we progress from linear classifiers to fully-connected neural networks. We introduce the backpropagation algorithm for computing gradients and ...

How to Predict Stock Prices Easily - Intro to Deep Learning #7

We're going to predict the closing price of the S&P 500 using a special type of recurrent neural network called an LSTM network. I'll explain why we use ...

Machine Learning Research & Interpreting Neural Networks

Machine learning and neural networks change how computers and humans interact, but they can be complicated to understand. In this episode of Coffee with a ...

Computer evolves to generate baroque music!

I put the word "evolve" in there because you guys like "evolution" videos, but this computer is actually learning with gradient descent! All music in this video is ...

Lecture 6 | Training Neural Networks I

In Lecture 6 we discuss many practical issues for training modern neural networks. We discuss different activation functions, the importance of data ...

Convolutional Neural Networks - The Math of Intelligence (Week 4)

Convolutional Networks allow us to classify images, generate them, and can even be applied to other types of data. We're going to build one in numpy that can ...

Deep Visualization Toolbox

Code and more info: