AI News, Implementing a CNN for Text Classification in TensorFlow

Implementing a CNN for Text Classification in TensorFlow

The model presented in the paper (Kim, 2014, "Convolutional Neural Networks for Sentence Classification") achieves good classification performance across a range of text classification tasks (like Sentiment Analysis) and has since become a standard baseline for new text classification architectures.

Also, the Movie Review dataset doesn’t come with an official train/test split, so we simply use 10% of the data as a dev set.

I won’t go over the data pre-processing code in this post, but it is available on Github. The network we will build in this post looks roughly as follows: the first layer embeds words into low-dimensional vectors, and the next layer performs convolutions over the embedded word vectors using multiple filter sizes (for example, sliding over 3, 4 or 5 words at a time).

Next, we max-pool the result of the convolutional layer into a long feature vector, add dropout regularization, and classify the result using a softmax layer.

Because this is an educational post I decided to simplify the model from the original paper a little: most importantly, we will not use pre-trained word2vec vectors for the word embeddings, and we will not enforce L2 norm constraints on the weight vectors. It is relatively straightforward (a few dozen lines of code) to add these extensions back to the code here.

To allow various hyperparameter configurations we put our code into a TextCNN class, generating the model graph in the init function.

To instantiate the class we then pass a handful of arguments: the sequence length, the number of classes, the vocabulary size, the embedding dimensionality, the filter sizes, and the number of filters per filter size. We start by defining the input data that we pass to our network: tf.placeholder creates a placeholder variable that we feed to the network when we execute it at train or test time.
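For reference, here is a rough sketch of what that constructor and the input placeholders can look like in TensorFlow 1.x (the argument and attribute names are assumptions that match common usage, not necessarily the exact code on Github):

    import tensorflow as tf

    class TextCNN(object):
        """A CNN for text classification: embedding, convolution, max-pooling and softmax."""
        def __init__(self, sequence_length, num_classes, vocab_size,
                     embedding_size, filter_sizes, num_filters):
            # Placeholders for input, output and dropout keep probability
            self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
            self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
            self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

The second dimension of input_x is the (padded) sentence length, and the None in the first dimension lets the batch size take any value.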

TensorFlow’s convolutional conv2d operation expects a 4-dimensional tensor with dimensions corresponding to batch, height, width and channel.

The result of our embedding doesn’t contain the channel dimension, so we add it manually, leaving us with a layer of shape [None, sequence_length, embedding_size, 1].
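Continuing inside the constructor, a sketch of the embedding layer that produces this expanded tensor (variable names are illustrative):

    # Embedding layer: map word ids to dense vectors, then add a channel dimension
    with tf.name_scope("embedding"):
        W = tf.Variable(
            tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), name="W")
        # shape: [None, sequence_length, embedding_size]
        embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
        # shape: [None, sequence_length, embedding_size, 1]
        embedded_chars_expanded = tf.expand_dims(embedded_chars, -1)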

Because each convolution produces tensors of different shapes we need to iterate through them, create a layer for each of them, and then merge the results into one big feature vector.

'VALID' padding means that we slide the filter over our sentence without padding the edges, performing a narrow convolution that gives us an output of shape [batch_size, sequence_length - filter_size + 1, 1, num_filters].

Performing max-pooling over the output of a specific filter size leaves us with a tensor of shape [batch_size, 1, 1, num_filters].
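A sketch of that convolution and max-pooling loop, one pass per filter size (again TensorFlow 1.x style; the ReLU nonlinearity and the initializations are illustrative choices):

    pooled_outputs = []
    for filter_size in filter_sizes:
        with tf.name_scope("conv-maxpool-%s" % filter_size):
            # A filter covers filter_size words across the full embedding width
            filter_shape = [filter_size, embedding_size, 1, num_filters]
            W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
            b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
            conv = tf.nn.conv2d(embedded_chars_expanded, W,
                                strides=[1, 1, 1, 1], padding="VALID", name="conv")
            # Nonlinearity
            h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
            # Max-pool over the whole remaining length: [batch_size, 1, 1, num_filters]
            pooled = tf.nn.max_pool(h,
                                    ksize=[1, sequence_length - filter_size + 1, 1, 1],
                                    strides=[1, 1, 1, 1], padding="VALID", name="pool")
            pooled_outputs.append(pooled)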

Once we have all the pooled output tensors from each filter size we combine them into one long feature vector of shape [batch_size, num_filters_total].
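A sketch of that combination step, followed by the dropout layer mentioned below (the concat argument order shown is the TensorFlow 1.x one; older releases took the axis first):

    # Combine all pooled features into one flat vector
    num_filters_total = num_filters * len(filter_sizes)
    h_pool = tf.concat(pooled_outputs, 3)
    h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])

    # Dropout regularization (keep_prob is fed at run time: below 1 for training, 1.0 for evaluation)
    with tf.name_scope("dropout"):
        h_drop = tf.nn.dropout(h_pool_flat, self.dropout_keep_prob)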

Using the feature vector from max-pooling (with dropout applied) we can generate predictions by doing a matrix multiplication and picking the class with the highest score.
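A sketch of that output layer (initializations are illustrative):

    with tf.name_scope("output"):
        W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
        # Raw (unnormalized) scores and the predicted class per example
        self.scores = tf.nn.xw_plus_b(h_drop, W, b, name="scores")
        self.predictions = tf.argmax(self.scores, 1, name="predictions")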

Here, tf.nn.softmax_cross_entropy_with_logits is a convenience function that calculates the cross-entropy loss for each example, given our scores and the correct input labels.
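Averaging those per-example losses gives the loss we minimize, and an accuracy node is handy to track during training and evaluation. A sketch (the keyword arguments follow later TensorFlow 1.x releases; earlier ones took the arguments positionally):

    with tf.name_scope("loss"):
        losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
        self.loss = tf.reduce_mean(losses)

    with tf.name_scope("accuracy"):
        correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")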

The allow_soft_placement setting allows TensorFlow to fall back on a device with a certain operation implemented when the preferred device doesn’t exist.

When we instantiate our TextCNN models all the variables and operations defined will be placed into the default graph and session we’ve created above.
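Roughly, that setup looks as follows (the hyperparameter values are the defaults quoted at the end of the post; x_train and vocabulary are assumed to come from the pre-processing step):

    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(
            allow_soft_placement=True,    # fall back to another device if an op has no kernel on the preferred one
            log_device_placement=False)   # set True to log which device each op is placed on
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            cnn = TextCNN(
                sequence_length=x_train.shape[1],
                num_classes=2,
                vocab_size=len(vocabulary),
                embedding_size=128,
                filter_sizes=[3, 4, 5],
                num_filters=128)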

Checkpoints can be used to continue training at a later point, or to pick the best parameters setting using early stopping.
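A sketch of the checkpointing setup using tf.train.Saver (out_dir is assumed to be whatever output directory the run writes to):

    import os

    checkpoint_dir = os.path.join(out_dir, "checkpoints")
    checkpoint_prefix = os.path.join(checkpoint_dir, "model")
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)
    saver = tf.train.Saver(tf.global_variables())
    # Later, inside the training loop:
    #   path = saver.save(sess, checkpoint_prefix, global_step=current_step)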

Let’s now define a function for a single training step, evaluating the model on a batch of data and updating the model parameters.

We write a similar function to evaluate the loss and accuracy on an arbitrary data set, such as a validation set or the whole training set.
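Sketches of both functions follow; they assume the cnn object and session from above, plus a train_op built with an optimizer (Adam is used here for illustration) and a non-trainable global_step counter that TensorFlow increments on every update:

    global_step = tf.Variable(0, name="global_step", trainable=False)
    optimizer = tf.train.AdamOptimizer(1e-3)
    train_op = optimizer.minimize(cnn.loss, global_step=global_step)

    def train_step(x_batch, y_batch):
        """A single training step: forward pass, loss, and parameter update."""
        feed_dict = {cnn.input_x: x_batch,
                     cnn.input_y: y_batch,
                     cnn.dropout_keep_prob: 0.5}
        _, step, loss, accuracy = sess.run(
            [train_op, global_step, cnn.loss, cnn.accuracy], feed_dict)
        print("step {}, loss {:g}, acc {:g}".format(step, loss, accuracy))

    def dev_step(x_batch, y_batch):
        """Evaluate on held-out data; dropout is disabled (keep probability 1.0) and no update happens."""
        feed_dict = {cnn.input_x: x_batch,
                     cnn.input_y: y_batch,
                     cnn.dropout_keep_prob: 1.0}
        step, loss, accuracy = sess.run(
            [global_step, cnn.loss, cnn.accuracy], feed_dict)
        print("step {}, loss {:g}, acc {:g}".format(step, loss, accuracy))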

We iterate over batches of our data, call the train_step function for each batch, and occasionally evaluate and checkpoint our model. Here, batch_iter is a helper function I wrote to batch the data, and tf.train.global_step is a convenience function that returns the value of global_step.
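A sketch of that loop (batch_iter's exact signature, and the evaluate_every / checkpoint_every step counts, are assumptions here):

    batches = batch_iter(list(zip(x_train, y_train)), 64, 200)   # batch size 64, 200 epochs
    for batch in batches:
        x_batch, y_batch = zip(*batch)
        train_step(x_batch, y_batch)
        current_step = tf.train.global_step(sess, global_step)
        if current_step % evaluate_every == 0:
            print("\nEvaluation:")
            dev_step(x_dev, y_dev)
        if current_step % checkpoint_every == 0:
            path = saver.save(sess, checkpoint_prefix, global_step=current_step)
            print("Saved model checkpoint to {}\n".format(path))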

Our training script writes summaries to an output directory, and by pointing TensorBoard to that directory we can visualize the graph and the summaries we created.
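A sketch of the summary setup for the training metrics (the dev summaries work the same way; out_dir is again the run's output directory):

    loss_summary = tf.summary.scalar("loss", cnn.loss)
    acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)
    train_summary_op = tf.summary.merge([loss_summary, acc_summary])
    train_summary_writer = tf.summary.FileWriter(
        os.path.join(out_dir, "summaries", "train"), sess.graph)
    # Inside train_step, also fetch train_summary_op and write the result:
    #   train_summary_writer.add_summary(summaries, step)
    # Then point TensorBoard at the directory:  tensorboard --logdir <out_dir>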

Running the training procedure with default parameters (128-dimensional embeddings, filter sizes of 3, 4 and 5, a dropout keep probability of 0.5 and 128 filters per filter size) results in the following loss and accuracy plots (blue is training data, red is 10% dev data).


During training, the datastore randomly flips the training images along the vertical axis and randomly translates them up to four pixels horizontally and vertically.

numF is the number of convolutional filters in each layer, stride is the stride of the first convolutional layer of the unit, and tag is a character array to prepend to the layer names.

The layers in the convolutional units have names starting with 'SjUk', where j is the stage index and k is the index of the convolutional unit within that stage.

Because you specified the number of inputs to the addition layer to be two when you created the layer, the layer has two inputs with the names 'in1' and 'in2'.

When the layer activations in the convolutional units change size (that is, when they are downsampled spatially and upsampled in the channel dimension), the activations in the residual connections must also change size.

Change the activation sizes in the residual connections by using a 1-by-1 convolutional layer together with its batch normalization layer.

Change the activation size in the residual connection between the second and third stages by another 1-by-1 convolutional layer together with its batch normalization layer.
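The example itself is written in MATLAB, but the idea is language-agnostic. As a purely illustrative sketch of a downsampling residual unit in Python with tf.keras (the layer sizes and names here are assumptions, not the MATLAB code):

    import tensorflow as tf
    from tensorflow.keras import layers

    def downsampling_residual_unit(x, num_filters):
        # Main branch: 3x3 convolutions; the first uses stride 2, so the activations
        # shrink spatially while the channel count grows
        y = layers.Conv2D(num_filters, 3, strides=2, padding="same")(x)
        y = layers.BatchNormalization()(y)
        y = layers.ReLU()(y)
        y = layers.Conv2D(num_filters, 3, padding="same")(y)
        y = layers.BatchNormalization()(y)
        # Residual branch: a 1x1 convolution with matching stride (plus batch normalization)
        # so the skip connection matches the new spatial size and channel count
        shortcut = layers.Conv2D(num_filters, 1, strides=2)(x)
        shortcut = layers.BatchNormalization()(shortcut)
        # Addition layer with two inputs, followed by the nonlinearity
        return layers.ReLU()(layers.Add()([y, shortcut]))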

Create a residual network with nine standard convolutional units (three units per stage) and a width of 16.

Select a learning rate that is proportional to the mini-batch size and reduce the learning rate by a factor of 10 after 60 epochs.

Multi-Layer Neural Networks with Sigmoid Function — Deep Learning for Rookies (2)

Welcome back to my second post of the series Deep Learning for Rookies (DLFR), by yours truly, a rookie ;) Feel free to refer back to my first post here or my blog if you find it hard to follow.

You’ll be able to brag about your understanding soon ;) Last time, we introduced the field of Deep Learning and examined a simple neural network — a perceptron… or a dinosaur… ok, seriously, a single-layer perceptron.

After all, most problems in the real world are non-linear, and as individual humans, you and I are pretty darn good at making linear or binary decisions, like whether or not to study Deep Learning, without needing a perceptron.

Fast forward almost two decades to 1986, when Geoffrey Hinton, David Rumelhart, and Ronald Williams published the paper “Learning representations by back-propagating errors”, which introduced backpropagation. If you are completely new to DL, you should remember Geoffrey Hinton, who plays a pivotal role in the progress of DL.

Remember that we stressed the importance of designing a neural network so that it can learn from the difference between the desired output (what the fact is) and the actual output (what the network returns), and then send a signal back to the weights, asking them to adjust themselves?

Secondly, when we multiply each of the m features by a weight (w1, w2, …, wm) and sum them all together, this is a dot product: w1x1 + w2x2 + … + wmxm. So here is the takeaway for now: the procedure of how input values are forward propagated into the hidden layer, and then from the hidden layer to the output, is the same as in Graph 1.
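As a tiny illustration of that weighted sum and the sigmoid applied to it (the numbers are made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.2, 3.0])    # m = 3 input features
    w = np.array([0.8, 0.1, -0.4])    # one weight per feature
    b = 0.2                           # bias

    z = np.dot(w, x) + b              # the dot product w1*x1 + ... + wm*xm, plus the bias
    a = sigmoid(z)                    # the neuron's activation
    print(z, a)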

One thing to remember is: If the activation function is linear, then you can stack as many hidden layers in the neural network as you wish, and the final output is still a linear combination of the original input data.
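A quick way to convince yourself of this (an illustrative numpy check, not from the post): two stacked linear layers collapse into a single linear layer with combined weights.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)                                      # some input
    W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)        # "hidden" layer with linear activation
    W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)        # output layer

    two_layers = W2 @ (W1 @ x + b1) + b2
    one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)                  # one equivalent linear layer
    print(np.allclose(two_layers, one_layer))                   # True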

So basically, a small change in any weight in the input layer of our perceptron network could cause one neuron to suddenly flip from 0 to 1, which could again affect the hidden layer’s behavior and then affect the final outcome.

Non-linear just means that the output we get from the neuron, which is the dot product of some inputs x (x1, x2, …, xm) and weights w (w1, w2, …, wm) plus bias and then put into a sigmoid function, cannot be represented by a linear combination of the inputs x (x1, x2, …, xm).

This non-linear activation function, when used by each neuron in a multi-layer neural network, produces a new “representation” of the original data, and ultimately allows for non-linear decision boundary, such as XOR.

If our output value is on the lower flat area at the two corners, then it’s false or 0, since it’s not right to say the weather is both hot and cold or neither hot nor cold (ok, I guess the weather could be neither hot nor cold… you get what I mean though… right?).

You can memorize these takeaways since they’re facts, but I encourage you to google a bit on the internet and see if you can understand the concept better (it is natural that we take some time to understand these concepts).

From the XOR example above, you’ve seen that adding two hidden neurons in 1 hidden layer could reshape our problem into a different space, which magically created a way for us to classify XOR with a ridge.
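To make that concrete, here is a small hand-built network (the weights are hand-picked for illustration, not learned) that computes XOR with one hidden layer of two sigmoid neurons:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hidden layer: neuron 1 roughly computes "x1 OR x2", neuron 2 roughly computes "x1 AND x2"
    W1 = np.array([[20.0, 20.0],
                   [20.0, 20.0]])
    b1 = np.array([-10.0, -30.0])
    # Output neuron: fires when OR is on but AND is off, i.e. XOR
    W2 = np.array([20.0, -20.0])
    b2 = -10.0

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        h = sigmoid(W1 @ np.array(x) + b1)
        y = sigmoid(W2 @ h + b2)
        print(x, round(float(y)))     # prints 0, 1, 1, 0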

Now, the computer can’t really “see” a digit like we humans do, but if we dissect the image into an array of 784 numbers like [0, 0, 180, 16, 230, …, 4, 77, 0, 0, 0], then we can feed this array into our neural network.

So if the neural network thinks the handwritten digit is a zero, then we should get an output array of [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]: the first output in this array, which senses the digit to be a zero, is “fired” to 1 by our neural network, and the rest are 0.

If the neural network thinks the handwritten digit is a 5, then we should get [0, 0, 0, 0, 0, 1, 0, 0, 0, 0].
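For concreteness, here is how that 784-number input and the one-hot target could be produced (a numpy sketch with made-up pixel values):

    import numpy as np

    image = np.random.randint(0, 256, size=(28, 28))   # a fake 28x28 grayscale "digit"
    x = image.flatten()                                # 784 numbers fed to the input layer
    print(x.shape)                                     # (784,)

    label = 5                                          # the digit this image shows
    y = np.zeros(10)
    y[label] = 1                                       # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
    print(y)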

Remember we mentioned that neural networks become better by repetitively training themselves on data so that they can adjust the weights in each layer of the network to get the final results/actual output closer to the desired output?

For the sake of argument, let’s imagine the following case in Graph 14, which I borrow from Michael Nielsen’s online book. After training the neural network with rounds and rounds of labeled data in supervised learning, assume the first 4 hidden neurons learned to recognize the patterns shown on the left side of Graph 14.

Then, if we feed the neural network an array of a handwritten digit zero, the network should correctly trigger the top 4 hidden neurons in the hidden layer while the other hidden neurons are silent, and then again trigger the first output neuron while the rest are silent.

If you train the neural network with a new set of randomized weights, it might produce the following network instead (compare Graph 15 with Graph 14), since the weights are randomized and we never know which one will learn which or what pattern.

How to train neural Network in Matlab ??

This tutorial video teaches about training a neural network in Matlab. (Download the Matlab code here: …)

5 Key Metrics To Analyse Your Power Data

In association with Training Peaks. These are the key numbers you need to focus on when analysing your power data.

17.9: Sound Visualization: Graphing Amplitude - p5.js Sound Tutorial

In this "p5.js Sound Tutorial", I use the getLevel() function from the p5.js Sound Library to graph the amplitude (volume) over time. Support this channel on ...

Beginner Intro to Neural Networks 1: Data and Graphing

Hey everyone! This is the first in a series of videos teaching you everything you could possibly want to know about neural networks, from the math behind them ...

Working with Time Series Data in MATLAB


Prediction Artificial Neural Network using Matlab

Import Data and Analyze with MATLAB

Data are frequently available in text file format. This tutorial reviews how to import data, create trends and custom calculations, and then export the data in text file ...

Matplotlib Tutorial 16 - Live graphs

In this Matplotlib tutorial, we're going to cover how to create live updating graphs that can update their plots live as the data-source updates. You may want to ...

Partitioning data into training and validation datasets using R

Link to download data file: Includes example of data partition or data splitting with R

How SOM (Self Organizing Maps) algorithm works

In this video I describe how the self-organizing maps algorithm works and how the neurons converge in the attribute space to the data. It is important to state that I ...