AI News, scikit-learn video #3: Machine learning first steps with the Iris dataset
Last week, we discussed the pros and cons of scikit-learn, showed how to install scikit-learn independently or as part of the Anaconda distribution of Python, walked through the IPython Notebook interface, and covered a few resources for learning Python if you don't already know the language.
You would ideally end up with the same result as shown in the video, with the features stored in a NumPy array called 'X' and the response stored in a NumPy array called 'y', each with the proper shape.
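As a minimal sketch of that end state, the snippet below loads the copy of the iris data bundled with scikit-learn and checks the shapes of the two arrays. Using `load_iris` here is an assumption about how the data is obtained; the video may load it differently.

```python
from sklearn.datasets import load_iris

# Load the bundled iris dataset (assumed data source for this sketch)
iris = load_iris()

X = iris.data    # features: one row per flower, one column per measurement
y = iris.target  # response: one integer class label per flower

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```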
Getting started in scikit-learn with the famous iris dataset
We'll explore the famous 'iris' dataset, learn some important machine learning terminology, and discuss the four key requirements for working with data in scikit-learn. This is the third video in the series: 'Introduction to machine learning with scikit-learn'.
Your First Machine Learning Project in Python Step-By-Step
Do you want to do machine learning using Python, but you’re having trouble getting started?
In this step-by-step tutorial you will: If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.
The best way to learn machine learning is by designing and completing small projects.
A machine learning project may not be linear, but it has a number of well-known steps: The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps.
You can fill in the gaps such as further data preparation and improving result tasks later, once you have more confidence.
The best small project to start with on a new tool is the classification of iris flowers (e.g.
The scipy installation page provides excellent instructions for installing the above libraries on several platforms, such as Linux, Mac OS X and Windows.
I recommend working directly in the interpreter, or writing your scripts and running them on the command line, rather than using big editors and IDEs.
If you do have network problems, you can download the iris.data file into your working directory and load it using the same method, changing URL to the local file name.
In this step we are going to take a look at the data a few different ways: Don’t worry, each look at the data is one command.
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
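A small sketch of the shape check follows. To avoid a network download it uses scikit-learn's bundled iris data as a pandas DataFrame, which is an assumption; the tutorial itself reads the iris.data file.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; .frame holds the features plus the target column
dataset = load_iris(as_frame=True).frame

# shape is (number of rows, number of columns)
print(dataset.shape)  # (150, 5)
```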
We are going to look at two types of plots: We start with some univariate plots, that is, plots of each individual variable.
This gives us a much clearer idea of the distribution of the input attributes: We can also create a histogram of each input variable to get an idea of the distribution.
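The histogram step might look like the sketch below. It assumes the bundled scikit-learn copy of the data and a non-interactive matplotlib backend so that it runs without a display; in an interactive session you would call `plt.show()` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so the sketch runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Feature columns only (four numeric input variables)
features = load_iris(as_frame=True).data

# One histogram per input variable
axes = features.hist(figsize=(8, 6))
plt.tight_layout()
```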
We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.
That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.
We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
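The 80/20 split can be sketched with `train_test_split`. The variable names and the `random_state` value are illustrative assumptions, and the data again comes from scikit-learn's bundled copy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 20% of the rows as a validation set; fix the seed for repeatability
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

print(X_train.shape, X_validation.shape)  # (120, 4) (30, 4)
```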
This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.
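To make the mechanics concrete, here is a sketch that only inspects the folds, using 50 placeholder rows rather than real data. The seed value is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)  # 50 placeholder rows, 2 columns

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

fold_sizes = []
tested_rows = []
for train_idx, test_idx in kfold.split(X):
    # Each iteration trains on 9 parts and tests on the remaining 1
    fold_sizes.append((len(train_idx), len(test_idx)))
    tested_rows.extend(test_idx.tolist())

print(fold_sizes)  # ten pairs of (45, 5)
```

Every row lands in the test part exactly once across the ten iterations.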
This is a ratio of the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g.
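The accuracy calculation can be written out by hand. The labels below are made up purely for illustration.

```python
# Illustrative true and predicted labels (not real results)
y_true = ["setosa", "versicolor", "virginica", "virginica", "versicolor"]
y_pred = ["setosa", "versicolor", "virginica", "versicolor", "versicolor"]

# Count predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))

# Correct predictions over total instances, as a percentage
accuracy_percent = 100 * correct / len(y_true)
print(accuracy_percent)  # 80.0
```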
We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.
We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits.
Running the example above, we get the following raw results: We can see that it looks like KNN has the largest estimated accuracy score.
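A comparison loop along these lines could look like the sketch below. The choice of three models, their settings and the seed are assumptions for illustration. For brevity it cross-validates on the full bundled dataset rather than on a training split, and the scores it prints are not the tutorial's results.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate models (an illustrative subset)
models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=7),
}

results = {}
for name, model in models.items():
    # Recreate the splitter with the same seed so every model sees identical folds
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[name] = scores
    print(f"{name}: {scores.mean():.3f} ({scores.std():.3f})")
```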
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model.
There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak.
We can run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.
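As a hedged sketch of that final check, the snippet below fits KNN on a training split and reports the three summaries on the held-back rows. The split seed and default KNN settings are assumptions, so the numbers will not necessarily match the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7
)

# Fit on the training rows only, then predict the unseen validation rows
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predictions = knn.predict(X_validation)

accuracy = accuracy_score(y_validation, predictions)
matrix = confusion_matrix(y_validation, predictions, labels=[0, 1, 2])

print(accuracy)
print(matrix)  # rows are true classes, columns are predicted classes
print(classification_report(y_validation, predictions))
```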
Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).
You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.
You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.
A Complete Tutorial to Learn Data Science with Python from Scratch
After working on SAS for more than 5 years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was ON!
But over the years, with strong community support, the language gained dedicated libraries for data analysis and predictive modeling.
Due to the lack of resources on Python for data science, I decided to create this tutorial to help others learn Python faster. In this tutorial, we will take bite-sized pieces of information about how to use Python for data analysis, chew on them until we are comfortable, and practice on our own.
The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a single library.
It provides a lot of good features for documenting while writing the code itself, and you can choose to run the code in blocks (rather than line by line). We will use the iPython environment for this complete tutorial.
The most commonly used construct is if-else, with following syntax: For instance, if we want to print whether the number N is even or odd: Now that you are familiar with Python fundamentals, let’s take a step further.
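The even/odd check mentioned above can be sketched as follows; the value of N is arbitrary.

```python
N = 7  # any integer

# A number is even when dividing by 2 leaves no remainder
if N % 2 == 0:
    result = "Even"
else:
    result = "Odd"

print(result)  # Odd
```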
What if you have to perform the following tasks: If you try to write the code from scratch, it’s going to be a nightmare and you won’t stay on Python for more than 2 days!
Following is a list of libraries you will need for any scientific computation and data analysis: Additional libraries you might need: Now that we are familiar with Python fundamentals and additional libraries, let’s take a deep dive into problem solving through Python.
We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.
The essential difference being that column names and row numbers are known as column and row index, in case of dataframes.
To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt: This opens up the iPython notebook in the pylab environment, which has a few useful libraries already imported.
You can check whether the environment has loaded correctly, by typing the following command (and getting the output as seen in the figure below):
The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output. (Read this article to refresh basic statistics to understand population distribution.) Here are a few inferences you can draw by looking at the output of the describe() function: Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e.
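A short sketch of the mean-versus-median comparison follows. The income figures are invented for illustration and are not taken from the competition data.

```python
import pandas as pd

# Made-up incomes with one very large value to create a right skew
df = pd.DataFrame({"ApplicantIncome": [2000, 3000, 3500, 4000, 20000]})

summary = df.describe()
mean_income = summary.loc["mean", "ApplicantIncome"]
median_income = summary.loc["50%", "ApplicantIncome"]  # the 50% row is the median

print(summary)
print(mean_income, median_income)  # 6500.0 3500.0
```

A mean well above the median, as here, suggests a long tail of large values.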
The frequency table can be printed with the following command: Similarly, we can look at the unique values of credit history.
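One way to print such a frequency table is `value_counts()`. The column name `Credit_History` and the values below are assumptions for illustration.

```python
import pandas as pd

# Illustrative values; the column name is assumed
df = pd.DataFrame({"Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]})

# Count how often each distinct value occurs
frequency = df["Credit_History"].value_counts()
print(frequency)

# The distinct values themselves
print(df["Credit_History"].unique())
```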
Now we will look at the steps required to generate a similar insight using Python. Please refer to this article for getting a hang of the different data manipulation techniques in Pandas.
If you have not realized already, we have just created two basic classification algorithms here, one based on credit history, while other on 2 categorical variables (including gender).
Next let’s explore ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques.
Let us look at missing values in all the variables because most of the models don’t work with missing data and even if they do, imputing them helps more often than not.
So, let us check the number of nulls / NaNs in the dataset. This command should tell us the number of missing values in each column, as isnull() returns True (counted as 1 when summed) if the value is null.
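A minimal sketch of the missing-value count, using a tiny made-up frame whose column names echo those mentioned in the tutorial:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with a few gaps
df = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 66.0, np.nan],
    "Self_Employed": ["No", "Yes", None, "No"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

# isnull() marks each missing cell as True; summing counts them per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```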
The simplest is replacement by the mean, which can be done with the following code: The other extreme would be to build a supervised learning model to predict loan amount on the basis of the other variables and then use the imputed loan amount along with other variables in the final model.
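Replacement by the mean can be sketched in one line with `fillna`. The loan amounts are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"LoanAmount": [100.0, np.nan, 200.0, np.nan, 150.0]})

# mean() ignores missing values, so this is the mean of 100, 200 and 150
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

print(df["LoanAmount"].tolist())  # [100.0, 150.0, 200.0, 150.0, 150.0]
```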
Since the purpose now is to bring out the steps in data munging, I’ll instead take an approach that lies somewhere in between these 2 extremes.
This can be done using the following code: Now, we will create a Pivot table, which provides us median values for all the groups of unique values of Self_Employed and Education features.
Next, we define a function, which returns the values of these cells and apply it to fill the missing values of loan amount: This should provide you a good way to impute missing values of loan amount.
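The pivot-table imputation described above might be sketched as follows. The rows are invented, the helper name `lookup_median` is hypothetical, and the sketch assumes every Self_Employed/Education combination has at least one known loan amount.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "LoanAmount": [100.0, 140.0, np.nan, 200.0, np.nan, 80.0, np.nan, 150.0],
})

# Median loan amount for each (Self_Employed, Education) group
table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

def lookup_median(row):
    # Hypothetical helper: return the group median for this row's cell
    return table.loc[row["Self_Employed"], row["Education"]]

# Fill only the rows where the loan amount is missing
missing = df["LoanAmount"].isnull()
df.loc[missing, "LoanAmount"] = df[missing].apply(lookup_median, axis=1)

print(df)
```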
So instead of treating them as outliers, let’s try a log transformation to nullify their effect: Looking at the histogram again:
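A sketch of the log transformation, with made-up loan amounts and an assumed new column name `LoanAmount_log`:

```python
import numpy as np
import pandas as pd

# Illustrative values with one extreme loan amount
df = pd.DataFrame({"LoanAmount": [100.0, 150.0, 700.0]})

# The natural log compresses large values far more than small ones
df["LoanAmount_log"] = np.log(df["LoanAmount"])

print(df)
```

On the raw scale the largest value is 7 times the smallest; after the transformation the gap between them is about 1.95 log units, so the extreme value pulls much less.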
For example, creating a column for LoanAmount/TotalIncome might make sense as it gives an idea of how well the applicant is suited to pay back his loan.
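Such a derived column could be built as below. The numbers are invented, and both the `CoapplicantIncome` column and the definition of `TotalIncome` as the sum of the two incomes are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "CoapplicantIncome": [0, 2000],  # assumed column
    "LoanAmount": [100, 200],
})

# Assumed definition: total household income is the sum of both incomes
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# Loan size relative to income; smaller ratios suggest easier repayment
df["LoanAmount_by_TotalIncome"] = df["LoanAmount"] / df["TotalIncome"]

print(df)
```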
Now that we have made the data useful for modeling, let’s look at the Python code to create a predictive model on our data set.
One way would be to take all the variables into the model but this might result in overfitting (don’t worry if you’re unaware of this terminology yet).
In simple words, taking all variables might result in the model understanding complex relations specific to the data and will not generalize well.
Accuracy : 80.945% Cross-Validation Score : 80.946% Accuracy : 80.945% Cross-Validation Score : 80.946% Generally we expect the accuracy to increase on adding variables.
Accuracy : 81.930% Cross-Validation Score : 76.656% Here the model based on categorical variables is unable to have an impact because Credit History is dominating over them.
Let’s try a few numerical variables: Accuracy : 92.345% Cross-Validation Score : 71.009% Here we observed that although the accuracy went up on adding variables, the cross-validation score went down, which is a sign of overfitting.
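The gap between training accuracy and cross-validation score can be reproduced on synthetic data. The sketch below is not the tutorial's model or dataset: it assumes an unpruned decision tree and noisy data generated with `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels randomly flipped (noise the tree can memorize)
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)

# An unpruned tree can fit the training rows almost perfectly
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)
train_accuracy = model.score(X, y)

# Cross-validation scores the same kind of model on rows it did not train on
cv_accuracy = cross_val_score(model, X, y, cv=5).mean()

print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Cross-validation accuracy: {cv_accuracy:.3f}")
```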
Also, we will modify the parameters of the random forest model a little bit: Accuracy : 82.899% Cross-Validation Score : 81.461% Notice that although the accuracy decreased, the cross-validation score improved, showing that the model is generalizing well.
You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model.
I am sure this not only gave you an idea about basic data analysis methods but it also showed you how to implement some of the more sophisticated techniques available today.
If you come across any difficulty while practicing Python, or you have any thoughts / suggestions / feedback on the post, please feel free to post them through comments below.
How to Load Data in Python with Scikit-Learn
Before you can build machine learning models, you need to load your data into memory.
These datasets are useful for getting a handle on a given machine learning algorithm or library feature before using it in your own work.
You learned a way of opening CSV files from the web using the urllib library and how you can read that data as a NumPy matrix for use in scikit-learn.
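A rough sketch of the CSV-to-NumPy step follows. To stay runnable offline it parses an in-memory string; in real use the file object might instead come from `urllib.request.urlopen(url)` with a URL of your own, which is not shown here. The helper name `load_csv_as_array` and the sample rows are assumptions.

```python
import io
import numpy as np

def load_csv_as_array(file_obj):
    # Hypothetical helper: parse comma-separated numeric rows into a 2-D array
    return np.loadtxt(file_obj, delimiter=",")

# Stand-in for a downloaded file: three numeric rows, last column is the class
sample = io.StringIO("5.1,3.5,1.4,0.2,0\n4.9,3.0,1.4,0.2,0\n6.3,3.3,6.0,2.5,2\n")

dataset = load_csv_as_array(sample)
X = dataset[:, :-1]  # every column except the last holds features
y = dataset[:, -1]   # the last column holds the response

print(dataset.shape)  # (3, 5)
```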
- On Sunday, March 24, 2019
Data science in Python: pandas, seaborn, scikit-learn
In this video, we'll cover the data science pipeline from data ingestion (with pandas) to data visualization (with seaborn) to machine learning (with scikit-learn). We'll learn how to train...
Logistic Regression Machine Learning Method Using Scikit Learn and Pandas Python - Tutorial 31
In this Python for Data Science Tutorial, You will learn about how to do Logistic regression, a Machine learning method, using Scikit learn and Pandas scipy in python using Jupyter notebook....
Logistic Regression using Python (Sklearn, NumPy, MNIST, Handwriting Recognition, Matplotlib)
Logistic Regression using Python (Sklearn, NumPy, MNIST, Handwriting Recognition, Matplotlib). This tutorial goes over logistic regression using sklearn on the digits and MNIST datasets including...
Scikit Learn - KMeans Clustering Analysis with the Iris Data Set
Logistic Regression Classifiers
Making Predictions with Data and Python : Logistic Regression | packtpub.com
This playlist/video has been uploaded for Marketing purposes and contains only selective videos. For the entire video course and code, visit [ Mention the types of..
Machine Learning K Means Clustering in SciKit Learn with Iris Data Part 3
This tutorial explains how to apply the k-means clustering algorithm to the standard built-in Iris dataset available in the scikit-learn library, using the KMeans class from the sklearn.cluster module.
Regression forecasting and predicting - Practical Machine Learning Tutorial with Python p.5
In this video, make sure you define the X's like so. I flipped the last two lines by mistake: X = np.array(df.drop(['label'],1)) X = preprocessing.scale(X) X_lately = X[-forecast_out:] X...
Python Exercise on Decision Tree and Linear Regression
This is the first Machine Learning with Python Exercise of the Introduction to Machine Learning MOOC on NPTEL. It teaches how to use linear models and decision trees of Scikit Learn...
Machine learning with Python and sklearn - Hierarchical Clustering (E-commerce dataset example)
In this Machine Learning & Python video tutorial I demonstrate Hierarchical Clustering method. Hierarchical Clustering is a part of Machine Learning and belongs to Clustering family: - Connectivi...