AI News, Would You Survive the Titanic? A Guide to Machine Learning in Python
- On Wednesday, October 17, 2018
From deciding which movie you might want to watch next on Netflix to predicting stock market trends, machine learning has a profound impact on how data is understood in the modern era.
In just 20 minutes, you will learn how to use Python to apply different machine learning techniques, from decision trees to deep neural networks, to a sample dataset.
By examining factors such as class, sex, and age, we will experiment with different machine learning algorithms and build a program that can predict whether a given passenger would have survived this disaster.
I recommend using the “pip” Python package manager, which will allow you to simply run “pip3 install <packagename>” to install each of the dependencies. For actually writing and running the code, I recommend using IPython, which will allow you to run modular blocks of code and immediately view the output values and data visualizations, along with the Jupyter Notebook as a graphical interface.
3 = 3rd)
name - Name
sex - Sex
age - Age
sibsp - Number of Siblings/Spouses Aboard
parch - Number of Parents/Children Aboard
ticket - Ticket Number
fare - Passenger Fare
cabin - Cabin
embarked - Port of Embarkation (C = Cherbourg;
Social classes were heavily stratified in the early 20th century, and this was especially true on the Titanic, where the luxurious 1st class areas were completely off limits to the middle-class passengers of 2nd class, and especially to those who carried a 3rd class “economy price” ticket.
This could be done through an elaborate system of nested if-else statements with some sort of weighted scoring system, but such a program would be long, tedious to write, difficult to generalize, and would require extensive fine tuning.
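To make the drawback concrete, here is a minimal sketch of what such a hand-written scoring program might look like. The function names, thresholds, and weights are all hypothetical guesses chosen for illustration; none of them come from the original article or from the data.

```python
def survival_score(pclass, sex, age):
    """Hand-tuned score; every threshold and weight below is a made-up guess."""
    score = 0.0
    if sex == "female":
        score += 2.0
        if pclass == 3:
            # Assumed penalty for 3rd class women
            score -= 1.0
    else:
        if age is not None and age < 10:
            # Assumed bonus for young boys
            score += 1.5
        else:
            score -= 1.0
    if pclass == 1:
        score += 1.0
    elif pclass == 3:
        score -= 0.5
    return score


def predict_survived(pclass, sex, age, threshold=0.5):
    """Predict survival when the score clears an arbitrary threshold."""
    return survival_score(pclass, sex, age) > threshold


print(predict_survived(1, "female", 29))  # True
print(predict_survived(3, "male", 40))    # False
```

Every new attribute adds more branches and more weights to tune by hand, which is exactly the work a classification algorithm takes over.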
The classification algorithms will compare the attribute values of “X” to the corresponding values of “y” in order to detect patterns in how different attribute values tend to affect the survival of a passenger.
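As a rough illustration of what “X” and “y” contain, the sketch below builds both from a few passenger records using only plain Python. The rows are toy values invented for this example, not entries from the real dataset, and the `encode` helper is a hypothetical name.

```python
# Toy rows invented for illustration; they are not taken from the real dataset.
passengers = [
    {"pclass": 1, "sex": "female", "age": 30.0, "survived": 1},
    {"pclass": 3, "sex": "male", "age": 25.0, "survived": 0},
    {"pclass": 2, "sex": "male", "age": 4.0, "survived": 1},
]


def encode(row):
    """Turn one passenger record into a list of numeric attribute values."""
    return [row["pclass"], 1 if row["sex"] == "female" else 0, row["age"]]


# X holds one row of attribute values per passenger;
# y holds the matching survival label (1 = survived, 0 = did not).
X = [encode(row) for row in passengers]
y = [row["survived"] for row in passengers]

print(X)
print(y)
```

Row i of X and entry i of y always describe the same passenger, which is what lets a classifier relate attributes to outcomes.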
Interestingly, after splitting by class, the main deciding factor determining the survival of women is the ticket fare that they paid, while the deciding factor for men is their age (with children being much more likely to survive).
By passing this shuffle validator as a parameter to the “cross_val_score” function, we can score our classifier against each of the different splits, and compute the average accuracy and standard deviation from the results.
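The following is a minimal sketch of that scoring step, assuming scikit-learn is installed. It uses synthetic stand-in data from `make_classification` rather than the Titanic file, and the split count, test size, and tree depth are arbitrary choices, not the original article's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the snippet runs without the Titanic file.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# 20 random train/test splits, each holding out 20% of the rows.
shuffle_validator = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0)

# One accuracy score per split.
scores = cross_val_score(clf, X, y, cv=shuffle_validator)

print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))
```

Reporting the standard deviation alongside the mean shows how much the score depends on which rows happened to land in the test split.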
The “Random Forest” classification algorithm will create a multitude of (generally very poor) trees for the dataset using different random subsets of the input variables, and will return whichever prediction was returned by the most trees.
This helps to avoid “overfitting”, a problem that occurs when a model is so tightly fitted to arbitrary correlations in the training data that it performs poorly on test data.
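One way to see overfitting is to compare accuracy on the training rows with accuracy on held-out rows. The sketch below does this for a single unrestricted decision tree and for a random forest, assuming scikit-learn is installed. The data is synthetic with deliberately noisy labels, and all parameter values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with about 20% of labels randomly assigned (flip_y), so there is noise to overfit.
X, y = make_classification(
    n_samples=600, n_features=10, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A single tree with no depth limit can memorize the training rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A forest averages the votes of many trees built on random subsets.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

tree_train = tree.score(X_train, y_train)
tree_test = tree.score(X_test, y_test)
forest_test = forest.score(X_test, y_test)

print("single tree: train=%.3f test=%.3f" % (tree_train, tree_test))
print("random forest: test=%.3f" % forest_test)
```

A large gap between the training and test scores of the single tree is the signature of overfitting; whether the forest narrows that gap on a given dataset is something to check rather than assume.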
For instance, if the gradient boosting classifier predicts that a passenger will not survive, but the decision tree and random forest classifiers predict that they will live, the voting classifier will choose the latter.
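The majority rule in that example can be sketched in a few lines of plain Python. This is an illustration of the voting idea only, not scikit-learn's implementation, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# 0 = did not survive, 1 = survived, matching the example in the text.
votes = {"gradient_boosting": 0, "decision_tree": 1, "random_forest": 1}

print(majority_vote(list(votes.values())))  # 1
```

With an odd number of classifiers and two possible labels there can never be a tie, which is one reason to combine three models rather than two.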
This has been a very brief and non-technical overview of each technique, so I encourage you to learn more about the mathematical implementations of all of these algorithms to obtain a deeper understanding of their relative strengths and weaknesses.
For instance, in our Titanic data set, node connections transmitting the passenger sex and class will likely be weighted very heavily, since these are important for determining the survival of a passenger.
Each layer of nodes is able to aggregate and recombine the outputs from the previous layer, allowing the network to gradually piece together and make sense of unstructured data (such as an image).
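A minimal sketch of this layer-by-layer computation, in plain Python, might look like the following. The input encoding, the layer sizes, and every weight and bias are hypothetical values picked by hand; a real network would learn them during training.

```python
import math


def sigmoid(x):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """One layer: each node sums its weighted inputs, adds a bias, applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Hypothetical input: [is_female, class scaled to 0..1, age scaled to 0..1]
passenger = [1.0, 0.0, 0.3]

# Hand-picked weights; sex and class get larger magnitudes than age.
hidden_weights = [[2.0, -1.5, -0.2], [0.5, -2.0, 0.1]]
hidden_biases = [0.0, 0.0]
output_weights = [[1.5, 1.0]]
output_biases = [-1.0]

# The hidden layer recombines the raw inputs; the output layer recombines the hidden layer.
hidden = dense_layer(passenger, hidden_weights, hidden_biases)
output = dense_layer(hidden, output_weights, output_biases)

print("estimated survival score:", output[0])
```

Because each node in a layer depends only on the previous layer's outputs, all nodes in a layer can be computed independently of one another.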
Such networks can also be heavily optimized due to their modular nature, allowing the operations of each node layer to be parallelized en masse across multiple CPUs, and even GPUs.
We have barely begun to skim the surface of explaining neural networks. For a more in-depth explanation of the inner workings of DNNs, this is a good resource. This awesome tool allows you to visualize and modify an active deep neural network.
The major advantage of neural networks over traditional machine learning techniques is their ability to find patterns in unstructured data (such as images or natural language).
An emerging powerhouse in programming neural networks is an open source library from Google called TensorFlow. This library is the foundation for many of the most recent advances in machine learning, such as being used to train computer programs to create unique works of music and visual art.
The syntax for using TensorFlow is somewhat abstract, but there is a wrapper included within the TensorFlow package, called “skflow”, which allows us to build deep neural networks using the now familiar scikit-learn syntax.
Our defined model is very basic; for more advanced examples of how to work within this syntax, see the skflow documentation here. Despite the increased power and lengthier runtime of these neural network models, you will notice that the accuracy is still about the same as what we achieved using more traditional tree-based methods.
I still, however, think that running the passenger data of a 104-year-old shipwreck through a cutting-edge deep neural network is pretty cool.
The above code forms a test dataset of the first 20 listed passengers for each class, and trains a deep neural network against the remaining data.
Once the model is trained we can use it to predict the survival of passengers in the test dataset, and compare these to the known survival of each passenger using the original dataset.
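The hold-out and comparison steps described here can be sketched in plain Python as follows. This is not the original article's code: the helper names are hypothetical, “class” is assumed to mean the passenger class column, the rows are generated placeholders, and a trivial always-predict-0 rule stands in for the trained network.

```python
from collections import defaultdict


def split_first_n_per_class(rows, n=20, class_key="pclass"):
    """Hold out the first n listed rows of each class as test data; train on the rest."""
    seen = defaultdict(int)
    train, test = [], []
    for row in rows:
        if seen[row[class_key]] < n:
            test.append(row)
            seen[row[class_key]] += 1
        else:
            train.append(row)
    return train, test


def misclassified(test_rows, predictions, label_key="survived"):
    """Return the test rows whose known outcome differs from the prediction."""
    return [
        row for row, pred in zip(test_rows, predictions) if row[label_key] != pred
    ]


# Placeholder rows (not real passengers): 30 per class, alternating outcomes.
rows = [{"pclass": 1 + i % 3, "survived": i % 2} for i in range(90)]

train, test = split_first_n_per_class(rows, n=20)

# Stand-in for a trained model: predict "did not survive" for everyone.
predictions = [0 for _ in test]
wrong = misclassified(test, predictions)

print(len(train), len(test), len(wrong))  # 30 60 30
```

Keeping the misclassified rows, rather than only an accuracy figure, is what makes it possible to look up the individual passengers the model got wrong.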
The above table shows all of the passengers in our test dataset whose survival (or lack thereof) was incorrectly classified by the neural network model.
Once the ship sank, however, he was able to stay alive by swimming for 20 minutes in the frigid North Atlantic water before joining other survivors on a waterlogged collapsible boat and rowing through the night.
This principle will be especially important going forward, as machine learning is increasingly applied to human datasets by organizations such as insurance companies, big banks, and law enforcement agencies.
From here you can fine-tune the machine learning algorithms to achieve better accuracy on this dataset, design your own neural networks using TensorFlow, discover more fascinating stories of passengers whose survival does not match the model, and apply all of these techniques to any other dataset (check out this Game of Thrones dataset).