scikit-learn video #4: Model training and prediction with K-nearest neighbors

Welcome back to my series of video tutorials on effective machine learning with Python's scikit-learn library.

In the first three videos, we discussed what machine learning is and how it works, we set up Python for machine learning, and we explored the famous iris dataset.

First, you might visualize your training data on a coordinate plane, with the x and y coordinates representing two of the feature values and the color representing the response class.

KNN can predict the response class for a future observation by calculating the 'distance' to all training observations and assuming that the response class of nearby observations is likely to be similar.
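
To make the idea concrete, here is a minimal from-scratch sketch of that nearest-neighbour logic (written in R, the language used by the tutorial later in this roundup; the toy data is made up, and scikit-learn's KNeighborsClassifier implements the real thing):

# Toy training data: two feature columns and a class label per observation
train_x <- matrix(c(1, 1,
                    1, 2,
                    5, 5,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- c("red", "red", "blue", "blue")

# New observation whose class we want to predict
new_obs <- c(2, 1)

# Euclidean distance from the new observation to every training observation
dists <- sqrt(rowSums(sweep(train_x, 2, new_obs)^2))

# Take the k nearest neighbours and let them vote on the class
k <- 3
nearest <- order(dists)[1:k]
names(which.max(table(train_y[nearest])))   # "red"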

Training a machine learning model with scikit-learn

Now that we're familiar with the famous iris dataset, let's actually use a classification model in scikit-learn to predict the species of an iris!

Read more about the video here: http://blog.kaggle.com/2015/04/30/sci...

The IPython notebook shown in the video is available on GitHub: https://github.com/justmarkham/scikit...

== RESOURCES ==
Iris dataset in the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/dataset...
Nearest Neighbors user guide: http://scikit-learn.org/stable/module...
KNeighborsClassifier class documentation: http://scikit-learn.org/stable/module...
Logistic Regression user guide: http://scikit-learn.org/stable/module...
LogisticRegression class documentation: http://scikit-learn.org/stable/module...
Videos from An Introduction to Statistical Learning: http://www.dataschool.io/15-hours-of-...

Machine Learning in R for beginners

Now that you have loaded the Iris data set into RStudio, you should try to get a thorough understanding of what your data is about.

You see that there is a high correlation between the sepal length and the sepal width of the Setosa iris flowers, while the correlation is somewhat lower for the Virginica and Versicolor flowers: their data points are more spread out over the graph and don't form a tight cluster like the Setosa points do.

Of course, you probably need to test this hypothesis a bit further if you want to be really sure of it. You will also see that when you combine all three species, the correlation between petal length and petal width is a bit stronger than it is when you look at the species separately: the overall correlation is 0.96, while for Versicolor it is 0.79.
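
One way to check those figures yourself in R; the 0.96 and 0.79 values correspond to the correlation between petal length and petal width in the built-in iris data:

# Overall correlation between petal length and petal width
cor(iris$Petal.Length, iris$Petal.Width)             # ~0.96

# The same correlation within a single species
versicolor <- subset(iris, Species == "versicolor")
cor(versicolor$Petal.Length, versicolor$Petal.Width) # ~0.79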

After a general visualized overview of the data, you can also view the data set by entering its name, iris, at the console. However, as you will see from the result of this command, this really isn't the best way to inspect your data set thoroughly: the full data set takes up a lot of space in the console, which will impede you from forming a clear idea about your data.

On the other hand, if you want to check the percentage breakdown of the Species attribute, you can ask for a table of proportions with round(prop.table(table(iris$Species)) * 100, digits = 1). Note that round() rounds the values of its first argument, prop.table(table(iris$Species)) * 100, to the specified number of digits, which here is one digit after the decimal point.
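
Concretely, the counts and the rounded percentage table look like this:

# Absolute counts per species
table(iris$Species)

# Percentage per species, rounded to one digit after the decimal point
round(prop.table(table(iris$Species)) * 100, digits = 1)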

The summary() function will give you the minimum value, first quartile, median, mean, third quartile and maximum value of each numeric attribute in the iris data set.

For the class variable Species, the count per factor level is returned instead. You can also summarize a subset of the columns: as shown in the sketch below, the c() function is added to the original command, so that the column names petal width and sepal width are concatenated and a summary is returned for just these two columns of the iris data set.
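
A sketch of both commands; the two-column selection shown here is one common way to write it:

# Five-number summary plus mean for numeric columns, level counts for factors
summary(iris)

# Summary of just two columns, selected with c()
summary(iris[c("Petal.Width", "Sepal.Width")])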

To illustrate the KNN algorithm, this tutorial works with the package class. If you don't have this package yet, you can quickly and easily install it with a single line of code. And remember the nerd tip: if you're not sure whether you already have this package, you can run a quick check to find out; both steps are shown in the sketch below.
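
A minimal sketch of the install, check and load steps; the grepl()-based check is just one way to test for an installed package:

# Install the class package (only needed once)
install.packages("class")

# Nerd tip: check whether the package is already installed
any(grepl("class", rownames(installed.packages())))

# Load the package into the session
library(class)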

For example, if your dataset has just two attributes, X and Y, and X has values that range from 1 to 1000, while Y has values that only go from 1 to 100, then Y’s influence on the distance function will usually be overpowered by X’s influence.
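
A quick numeric illustration of that effect, with made-up values for X and Y:

# Two observations from the hypothetical data set: X in [1, 1000], Y in [1, 100]
a <- c(X = 500, Y = 10)
b <- c(X = 900, Y = 90)

# The Euclidean distance is dominated by the difference in X
sqrt(sum((a - b)^2))     # ~407.9 in total...
abs(a["X"] - b["X"])     # ...of which X alone accounts for 400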

The Iris data set doesn’t need to be normalized: the Sepal.Length attribute has values that go from 4.3 to 7.9 and Sepal.Width contains values from 2 to 4.4, while Petal.Length’s values range from 1 to 6.9 and Petal.Width goes from 0.1 to 2.5.

You can then use this normalization function in another command: lapply() applies it to each column and returns a list of the same length as the data set that you give in, and as.data.frame() puts the results of the normalization back into a data frame.

Tip: to more thoroughly illustrate the effect of normalization on the data set, compare the following result to the summary of the Iris data set that was given in step two.
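
Putting the normalization steps together; the helper name normalize() follows the tutorial's convention:

# Min-max normalization: rescales a numeric vector to [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# lapply() applies normalize() to each of the four numeric columns and
# returns a list; as.data.frame() turns that list back into a data frame
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))

# Every attribute now ranges from 0 to 1
summary(iris_norm)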

In practice, the division of your data set into a training set and a test set is disjoint: the most common splitting choice is to take 2/3 of your original data set as the training set, while the remaining 1/3 composes the test set.

Note that the replace argument is set to TRUE: you sample with replacement, which means that after a 1 or a 2 is assigned to a certain row, the vector c(1, 2) is reset to its original state before the next draw.

Remember that you want your training set to be 2/3 of your original data set: that is why you assign "1" with a probability of 0.67 and "2" with a probability of 0.33 to the 150 sample rows.

You can then use the sample that is stored in the variable ind to define your training and test sets. Note that, in addition to the 2/3 and 1/3 proportions specified above, you don't take all attributes into account when forming the training and test sets: only the four measurement columns are kept, and the Species labels are set aside.

You therefore need to store the class labels in factor vectors and divide them over the training and test sets, as shown in the sketch below. After all these preparation steps, you have made sure that all your known (training) data is stored.
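
In code, the sampling and splitting steps look like this; the variable names follow the tutorial's conventions, and the seed value is an arbitrary choice:

# Make the random split reproducible
set.seed(1234)

# Draw a 1 or a 2 for each of the 150 rows, with replacement
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))

# Rows labelled 1 form the training set, rows labelled 2 the test set;
# only the four measurement columns are kept
iris.training <- iris[ind == 1, 1:4]
iris.test     <- iris[ind == 2, 1:4]

# The Species labels are stored separately as factor vectors
iris.trainLabels <- iris[ind == 1, 5]
iris.testLabels  <- iris[ind == 2, 5]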

An easy way to do these two steps is by using the knn() function, which uses the Euclidean distance measure in order to find the k nearest neighbours to your new, unknown instance.

In the case of classification, the class that receives the most votes among the k nearest neighbours wins the battle, and the unknown instance receives the label of that winning class.

To build your classifier, you need to take the knn() function and simply add some arguments to it, just like in the example below: you store into iris_pred the result of the knn() function, which takes as arguments the training set, the test set, the train labels and the number of neighbours you want to find with this algorithm.
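
The call itself, assuming the training and test objects defined above and k = 3 as in the tutorial:

library(class)

# Predict the species of the test observations from their 3 nearest neighbours
iris_pred <- knn(train = iris.training, test = iris.test,
                 cl = iris.trainLabels, k = 3)
iris_pred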

For a more abstract view, you can just compare the results of iris_pred to the test labels that you defined earlier, as in the sketch below. You will see that the model makes reasonably accurate predictions, with the exception of one wrong classification in row 29, where "Versicolor" was predicted while the test label is "Virginica".
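
One simple way to line predictions and test labels up side by side:

# Put predicted and observed labels next to each other for inspection
data.frame(Predicted = iris_pred, Observed = iris.testLabels)

# Or cross-tabulate them: off-diagonal cells are misclassifications
table(Predicted = iris_pred, Observed = iris.testLabels)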

That’s where the caret package can come in handy: it’s short for “Classification and Regression Training” and offers everything you need to know to solve supervised machine learning problems: it provides a uniform interface to a ton of machine learning algorithms.

You just have to change the method argument, just like in the sketch below. Once you have trained your model, it's time to predict the labels of the test set that you have just made and evaluate how the model has done on your data. Additionally, you can try the same test as before, to examine the effect of preprocessing, such as scaling and centering, on your model.
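
A sketch of that caret workflow, assuming createDataPartition() for a stratified split, train() with method = "knn", and confusionMatrix() for evaluation:

library(caret)

# Stratified 75/25 split on the Species labels
index <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
iris.training <- iris[index, ]
iris.test <- iris[-index, ]

# Train a KNN model; changing method trains a different algorithm
model_knn <- train(iris.training[, 1:4], iris.training[, 5], method = "knn")

# Predict the test labels and evaluate the model
predictions <- predict(object = model_knn, iris.test[, 1:4])
confusionMatrix(predictions, iris.test[, 5])

# The same test as before, but with centering and scaling as preprocessing
model_knn_pp <- train(iris.training[, 1:4], iris.training[, 5],
                      method = "knn", preProcess = c("center", "scale"))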

K Nearest Neighbor (kNN) Algorithm | R Programming | Data Prediction Algorithm

In this video I've talked about how you can implement the kNN or k-Nearest Neighbor algorithm in R with the help of an example data set freely available in the UCI Machine Learning Repository.

Predicting Stock Prices - Learn Python for Data Science #4

In this video, we build an Apple Stock Prediction script in 40 lines of Python using the scikit-learn library and plot the graph using the matplotlib library. The challenge for this video...

Ensemble learners

This video is part of the Udacity course "Machine Learning for Trading". Watch the full course at

Model Fitness - Mean Square Error(Test & Train error)

In this video you will learn how to measure whether the regression model really fits your data well. You will also learn why to use test error to measure model fitness. For all our videos &...

Support Vector Machine (SVM) with R - Classification and Prediction Example

Includes an example with: a brief definition of what SVM is, an SVM classification model, an SVM classification plot, interpretation, tuning or hyperparameter optimization, and best model selection...

Machine Learning - Supervised Learning K Nearest Neighbors

Enroll in the course for free at: Machine Learning can be an incredibly beneficial tool to uncover hidden insights and predict..

K - Nearest Neighbors - KNN Fun and Easy Machine Learning

In pattern recognition, the KNN..

Difference between Classification and Regression - Georgia Tech - Machine Learning

Watch on Udacity: Check out the full Advanced Operating Systems course for free at:

Machine Learning Data science interview questions - How is KNN different from kmeans?

Machine Learning interview questions - How is the k-nearest neighbor algorithm different from the k-means clustering algorithm? K-Means: unsupervised clustering, no labels required. KNN: supervised...