Machine Learning Basics with Naive Bayes

After looking into the different algorithms associated with Machine Learning, I’ve found an abundance of great material showing you how to use particular algorithms in a specific language.

The algorithm then uses this combination of data item and outcome (the answer) to “learn” what sorts of things dictate a certain answer.

When provided with unlabelled data it has never seen before, this trained model can then predict the answer based on what it learned during training.

For example, given a set of emails and people that wrote them, Naive Bayes can be used to build a model to understand the writing styles of each email author.
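To make the idea concrete, here is a minimal from-scratch sketch of a Naive Bayes text classifier using only the Python standard library. It assumes the multinomial variant with add-one (Laplace) smoothing, which is one common choice rather than the only one, and the email texts and author names are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label) pairs. Returns the counts the model needs."""
    label_counts = Counter()            # how many examples each label has
    word_counts = defaultdict(Counter)  # per-label word frequencies
    vocab = set()                       # every word seen during training
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the given text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior for this label
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # add-one smoothing so unseen word/label pairs are not zero
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data, for illustration only.
emails = [
    ("please find the quarterly report attached", "alice"),
    ("the report is attached for your review", "alice"),
    ("lunch today anyone", "bob"),
    ("anyone up for lunch or coffee", "bob"),
]
model = train(emails)
guess = predict(model, "coffee and lunch later")
```

The "naive" part is the assumption that each word contributes independently to the score, which is why the per-word log-probabilities can simply be added together.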

I’ve taken the Kaggle Simpsons data set and used the script and character data to try to train a machine learning model, using Naive Bayes, to predict whether it was Homer or Bart that said a certain phrase.

To get the main bulk of the code that would help you vectorise the phrases and prepare them into training and test data sets, see the Udacity Intro to Machine Learning GitHub repo and take a look at their Naive Bayes examples.
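If you just want to see what those two steps involve, the following is a rough standard-library sketch, not the Udacity code. The helper names (`split_data`, `build_vocabulary`, `vectorise`), the simple word-count vectors and the placeholder phrases are all my own assumptions for illustration.

```python
import random
from collections import Counter


def split_data(texts, labels, test_fraction=0.1, seed=42):
    """Shuffle paired texts/labels and split them into training and test sets."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split repeatable
    n_test = max(1, int(len(pairs) * test_fraction))
    return pairs[n_test:], pairs[:n_test]


def build_vocabulary(texts):
    """Map every distinct word in the training texts to a column index."""
    words = sorted({w for t in texts for w in t.split()})
    return {w: i for i, w in enumerate(words)}


def vectorise(text, vocabulary):
    """Turn a phrase into a list of word counts, one slot per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(text.split()).items():
        if word in vocabulary:  # words outside the vocabulary are dropped
            vector[vocabulary[word]] = count
    return vector


# Placeholder phrases and ids, not taken from the data set.
texts = ["doh", "eat my shorts", "mmm donuts", "ay caramba", "why you little"]
labels = [2, 8, 2, 8, 2]
train_pairs, test_pairs = split_data(texts, labels, test_fraction=0.2)
vocabulary = build_vocabulary(t for t, _ in train_pairs)
train_vectors = [vectorise(t, vocabulary) for t, _ in train_pairs]
```

The vocabulary is built from the training phrases only, so the test set stays genuinely unseen when you measure accuracy.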

First, filter and split your Simpsons data (you can do this manually) to get a file in which every line contains a single id, either a Bart id (8) or a Homer id (2).

Make another file and put the normalised text for this filtered data on each line (make sure it’s in the same order as the ids, so row 1’s id matches row 1’s text, and so on).
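If you would rather not do the filtering by hand, a short script along these lines can produce both files in one pass. Treat it as a sketch: the column names `character_id` and `normalized_text` are assumptions about the CSV header, so check them against your copy of the data, and the in-memory sample stands in for the real script file.

```python
import csv
import io

# Assumed column names; check them against the header of your copy of the CSV.
ID_COLUMN = "character_id"
TEXT_COLUMN = "normalized_text"
KEEP_IDS = {"2", "8"}  # Homer and Bart


def filter_lines(csv_file, ids_out, text_out):
    """Write matching ids and their text to two files, one row per line, same order."""
    kept = 0
    for row in csv.DictReader(csv_file):
        text = (row.get(TEXT_COLUMN) or "").strip()
        # Skip other characters and rows with no spoken text.
        if row.get(ID_COLUMN) in KEEP_IDS and text:
            ids_out.write(row[ID_COLUMN] + "\n")
            text_out.write(text + "\n")
            kept += 1
    return kept


# Tiny in-memory stand-in for the real script file, for illustration only.
sample = io.StringIO(
    "character_id,normalized_text\n"
    "2,first homer line\n"
    "9,someone else speaking\n"
    "8,first bart line\n"
    "2,\n"
)
ids_out, text_out = io.StringIO(), io.StringIO()
kept = filter_lines(sample, ids_out, text_out)
```

With the real data you would pass file handles from `open()` instead of the `io.StringIO` objects. Because both files are written inside the same loop, the ids and the text cannot drift out of order.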

You can now add further Bart and Homer ids from the data set (as there are multiple ids for their different characters) and start tweaking parameters to see if you can improve the accuracy.
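One way to handle the extra ids is to collapse them into a single label per character before training, then compare accuracy between runs. In the sketch below, ids 9001 and 9002 are placeholders I made up, not real ids from the data set, and the predictions are invented so the accuracy calculation has something to work on.

```python
# Hypothetical extra ids; look up the real ones in the character file.
LABEL_FOR_ID = {
    "2": "homer",
    "8": "bart",
    "9001": "homer",  # placeholder id, not from the data set
    "9002": "bart",   # placeholder id, not from the data set
}


def to_labels(ids):
    """Collapse character ids into one label per character."""
    return [LABEL_FOR_ID[i] for i in ids]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


actual = to_labels(["2", "9002", "8", "9001"])
predicted = ["homer", "bart", "homer", "homer"]  # invented model output
score = accuracy(predicted, actual)  # 3 of the 4 predictions match
```

Recording this score after each parameter change makes it easy to tell whether a tweak actually helped.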

