AI News, Another Machine Learning Walk-Through and a Challenge

Another Machine Learning Walk-Through and a Challenge

Feature engineering is the process of creating new features — predictor variables — out of an existing dataset.

For datasets with multiple tables and relationships between the tables, we’ll probably want to use automated feature engineering, but because this problem has a relatively small number of columns and only one table, we can hand-build a few high-value features.

For example, since we know that the cost of a taxi ride is proportional to the distance, we’ll want to use the start and stop points to try and find the distance traveled.

This is still an approximation because it gives distance along a line drawn on the spherical surface of the Earth (I’m told the Earth is a sphere) connecting the two points, and clearly, taxis do not travel along straight lines.
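As a sketch of that feature, the great-circle (haversine) distance can be computed directly from the pickup and dropoff coordinates; the column names below are assumptions, not necessarily those used in the original notebook:

```python
import numpy as np

def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points on a sphere."""
    r = 6371  # mean Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Hypothetical column names; adjust to match the actual dataset.
# df["haversine_km"] = haversine_distance(
#     df["pickup_latitude"], df["pickup_longitude"],
#     df["dropoff_latitude"], df["dropoff_longitude"])
```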

While ensemble models and deep neural networks get all the attention, there’s no reason to use an overly complex model if a simple, interpretable model can achieve nearly the same performance.

The starting model, a linear regression trained on only three features (the absolute location differences and the passenger_count), achieved a validation root mean squared error (RMSE) of $5.32 and a mean absolute percentage error of 28.6%.

The benefit of a simple linear regression is that we can inspect the coefficients and find, for example, that an increase of one passenger raises the fare by $0.02 according to the model.

We also want to compare our model to a naive baseline that uses no machine learning, which in the case of regression can be guessing the mean value of the target on the training set.
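A minimal sketch of that comparison, using randomly generated stand-in data rather than the actual taxi dataset (the feature and variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Stand-in data shaped like the walk-through's three features
# (abs lat/lon differences, passenger_count); the real notebook loads the taxi data instead.
X_train = rng.random((1000, 3))
y_train = 3 + 10 * X_train[:, 0] + 12 * X_train[:, 1] + rng.normal(0, 2, 1000)
X_valid = rng.random((200, 3))
y_valid = 3 + 10 * X_valid[:, 0] + 12 * X_valid[:, 1] + rng.normal(0, 2, 200)

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Naive baseline: always predict the mean target observed in training.
baseline = np.full(len(y_valid), y_train.mean())
print("baseline RMSE:", rmse(y_valid, baseline))

# Simple, interpretable model: linear regression on the three features.
lr = LinearRegression().fit(X_train, y_train)
print("linear regression RMSE:", rmse(y_valid, lr.predict(X_valid)))
print("coefficients:", lr.coef_)
```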

Linear Regression

Linear regression is used to predict the value of an outcome variable Y based on one or more input predictor variables X.

The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that we can use this formula to estimate the value of the response Y when only the values of the predictors (Xs) are known.

The aim of this exercise is to build a simple regression model that we can use to predict Distance (dist) by establishing a statistically significant linear relationship with Speed (speed).

Typically, for each of the independent variables (predictors), a few plots are drawn to visualize their behavior: scatter plots, for instance, can help visualize any linear relationships between the dependent (response) variable and the independent (predictor) variables.

Generally, any data point that lies more than 1.5 times the interquartile range (1.5 * IQR) below the 25th percentile or above the 75th percentile is considered an outlier, where the IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.
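A small sketch of this rule, applied to a toy array:

```python
import numpy as np

def iqr_outlier_mask(x):
    """Flag points more than 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

x = np.array([2.1, 2.4, 2.2, 2.3, 2.5, 9.7])   # toy values; 9.7 is the obvious outlier
print(iqr_outlier_mask(x))                      # -> [False False False False False  True]
```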


A weak correlation (roughly between -0.2 and 0.2) probably suggests that much of the variation of the response variable (Y) is unexplained by the predictor (X), in which case we should probably look for better explanatory variables.

$$dist = -17.579 + 3.932 \times speed$$ Now the linear model is built and we have a formula that we can use to predict the dist value when a corresponding speed is known.
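The original exercise appears to fit this model in R with lm() (mentioned later in this digest); a rough Python sketch with ordinary least squares on a small stand-in speed/dist sample (the numbers are illustrative, not the real data) looks like this:

```python
import numpy as np

# Toy stand-in for the speed/dist data used by the exercise.
speed = np.array([4, 7, 9, 12, 15, 18, 20, 24])
dist  = np.array([2, 13, 10, 26, 38, 58, 64, 93])

# Ordinary least squares fit of dist = b0 + b1 * speed.
b1, b0 = np.polyfit(speed, dist, deg=1)
print(f"dist = {b0:.3f} + {b1:.3f} * speed")

# Predict the distance for a new speed value.
new_speed = 21
print("predicted dist:", b0 + b1 * new_speed)
```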

It is essential for the model to be statistically significant before we go ahead and use it to predict (or estimate) the dependent variable; otherwise, our confidence in the predicted values is reduced and they may be construed as an event of chance.

When the model coefficients and standard error are known, the formula for calculating the t-statistic is as follows (the p-value then follows from the t distribution): $$t\text{-Statistic} = \frac{\beta\text{-coefficient}}{\text{Std. Error}}$$ Remember, the actual information in the data is the total variation it contains.

$$ R^{2} = 1 - \frac{SSE}{SST}$$ where SSE is the sum of squared errors, $SSE = \sum_{i=1}^{n} \left( y_{i} - \hat{y_{i}} \right) ^{2}$, and SST is the total sum of squares, $SST = \sum_{i=1}^{n} \left( y_{i} - \bar{y} \right) ^{2}$.

As more predictor variables are added to a model, the R-squared of the bigger model will always be at least that of the smaller one. This is because all the variables in the original model are also present in the larger model, so their contribution to explaining the dependent variable is retained, and whatever new variable we add can only add (even if not significantly) to the variation that was already explained.

$$ R^{2}_{adj} = 1 - \frac{MSE}{MST}$$ where MSE is the mean squared error, given by $MSE = \frac{SSE}{\left( n-q \right)}$, and $MST = \frac{SST}{\left( n-1 \right)}$ is the mean squared total, with n the number of observations and q the number of coefficients in the model.

Therefore, by rearranging the numerators and denominators, the relationship between $R^{2}$ and $R^{2}_{adj}$ becomes: $$R^{2}_{adj} = 1 - \left( \frac{\left( 1 - R^{2}\right) \left(n-1\right)}{n-q}\right)$$ Both the standard error and the F-statistic are measures of goodness of fit.

$$Std.\ Error = \sqrt{MSE} = \sqrt{\frac{SSE}{n-q}}$$ $$F\text{-statistic} = \frac{MSR}{MSE}$$ where n is the number of observations, q is the number of coefficients and MSR is the mean square regression, calculated as $$MSR=\frac{\sum_{i=1}^{n}\left( \hat{y_{i}} - \bar{y}\right)^{2}}{q-1} = \frac{SST - SSE}{q - 1}$$ The Akaike information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978) are measures of the goodness of fit of an estimated statistical model and can also be used for model selection.
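A small sketch that computes these goodness-of-fit quantities directly from the formulas above (the numbers are toy values, purely illustrative):

```python
import numpy as np

def goodness_of_fit(y, y_hat, q):
    """R^2, adjusted R^2, residual standard error and F-statistic from the
    formulas above; q is the number of model coefficients (intercept included)."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)                        # sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)                     # total sum of squares
    r2 = 1 - sse / sst
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - q)
    std_error = np.sqrt(sse / (n - q))                    # sqrt(MSE)
    f_stat = ((sst - sse) / (q - 1)) / (sse / (n - q))    # MSR / MSE
    return r2, r2_adj, std_error, f_stat

# Toy actual vs. fitted values, just to exercise the function.
y = np.array([2.0, 13, 10, 26, 38, 58, 64, 93])
y_hat = np.array([1.5, 11, 14, 25, 40, 55, 66, 90])
print(goodness_of_fit(y, y_hat, q=2))   # q = 2: intercept plus one slope
```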

So the preferred practice is to split your dataset into an 80:20 sample (training:test), build the model on the 80% sample, and then use that model to predict the dependent variable on the test data.

A higher correlation accuracy implies that the actual and predicted values have similar directional movement, i.e. when the actual values increase the predicted values also increase, and vice versa.

Now let's calculate the Min Max accuracy and MAPE: $$MinMaxAccuracy = mean \left( \frac{min\left(actuals, predicteds\right)}{max\left(actuals, predicteds \right)} \right)$$ $$MeanAbsolutePercentageError \ (MAPE) = mean\left( \frac{abs\left(predicteds - actuals\right)}{actuals}\right)$$ Suppose the model predicts satisfactorily on the 20% split (test data); is that enough to believe that your model will perform equally well all the time?
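A compact sketch of the 80:20 split plus these two metrics, on synthetic data standing in for the real problem (the data-generating numbers are mine, not the tutorial's):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
speed = rng.uniform(10, 25, 50)
dist = -17.6 + 3.9 * speed + rng.normal(0, 4, 50)   # toy data echoing the fitted formula

# 80:20 train/test split, fit on the 80%, predict on the 20%.
X_train, X_test, y_train, y_test = train_test_split(
    speed.reshape(-1, 1), dist, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

min_max_accuracy = np.mean(np.minimum(y_test, pred) / np.maximum(y_test, pred))
mape = np.mean(np.abs(pred - y_test) / y_test)
print(f"MinMax accuracy: {min_max_accuracy:.3f}, MAPE: {mape:.3f}")
```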

30 Questions to test a data scientist on K-Nearest Neighbors (kNN) Algorithm

If you were to ask me for the 2 most intuitive algorithms in machine learning –

If you are new to machine learning, make sure you test your understanding of both of these algorithms.

Solution: A The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the testing phase, a test point is classified by assigning it the label that is most frequent among the k training samples nearest to that query point.
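A minimal k-NN classification sketch with scikit-learn (my choice of library, not the quiz's), illustrating that "training" only stores the samples and prediction takes the majority label of the k nearest neighbors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data; training a kNN model amounts to storing X_train and y_train.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5
knn.fit(X_train, y_train)                   # stores the training samples
# Each test point gets the majority label of its 5 nearest neighbors.
print("test accuracy:", knn.score(X_test, y_test))
```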

2) In the image below, which would be the best value for k, assuming that the algorithm you are using is k-Nearest Neighbors?

In this case the prediction can be based on the mean or the median of the k-most similar instances.

6) Which of the following machine learning algorithm can be used for imputing missing values of both categorical and continuous variables?

Solution: A The k-NN algorithm can be used for imputing missing values of both categorical and continuous variables.
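As an illustration, scikit-learn's KNNImputer fills missing numeric values with an average over the k nearest rows; for categorical variables the analogous idea is to take the most frequent category among the neighbors (the library choice is mine, not the quiz's):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Continuous example: each missing entry is filled using the k nearest rows
# (distances are computed on the features that are observed in both rows).
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
print(KNNImputer(n_neighbors=2).fit_transform(X))

# For categorical variables the same idea applies, but the fill value is the
# most frequent category among the k nearest neighbors rather than the mean
# (scikit-learn's KNNImputer itself only averages numeric values).
```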

1, 2 and 3 Solution: A Both Euclidean and Manhattan distances are used in the case of continuous variables, whereas Hamming distance is used in the case of categorical variables.

8 Solution: A sqrt( (1-2)^2 + (3-3)^2) = sqrt(1^2 + 0^2) = 1

8 Solution: A |1-2| + |3-3| = 1 + 0 = 1 (Manhattan distance sums the absolute coordinate differences; no square root is involved)
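A quick sketch of the three distances mentioned above: Euclidean and Manhattan on the same pair of points, and Hamming on a small categorical example.

```python
import numpy as np

a, b = np.array([1, 3]), np.array([2, 3])

euclidean = np.sqrt(np.sum((a - b) ** 2))       # sqrt((1-2)^2 + (3-3)^2) = 1
manhattan = np.sum(np.abs(a - b))               # |1-2| + |3-3| = 1
print(euclidean, manhattan)

# Hamming distance for categorical vectors: count of positions that differ.
u, v = ["red", "small", "round"], ["red", "large", "round"]
hamming = sum(x != y for x, y in zip(u, v))
print(hamming)                                  # 1
```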

Context: 11-12 Suppose you are given the following data, where x and y are the 2 input variables and Class is the dependent variable.

11) Suppose you want to predict the class of the new data point x=1 and y=1 using Euclidean distance in 3-NN.

Class C) Can't say Solution: B This point will be classified as the – class because there are 4 '–' class points and 3 '+' class points in the nearest circle.

13) Which of the following values of k in k-NN would minimize the leave-one-out cross-validation accuracy?

None of these Solution: A large k means a simpler model, and a simpler model is always considered to have high bias.

Neither left nor right is a Euclidean distance Solution: B The left is the graphical depiction of how Euclidean distance works, whereas the right one is of Manhattan distance.

21) Suppose you are given the following images (1 left, 2 middle and 3 right). Now your task is to find the value of k in k-NN in each image, where k1 is for the 1st, k2 is for the 2nd and k3 is for the 3rd figure.

22) Which of the following values of k in the following graph would give the least leave-one-out cross-validation accuracy?

5 Solution: B If you keep the value of k as 2, it gives the lowest cross validation accuracy.
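A hedged sketch of how leave-one-out cross-validation accuracy could be checked for several values of k, using scikit-learn on the iris data (my choice of data and library, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Leave-one-out CV accuracy for a few candidate values of k.
for k in (1, 2, 3, 5, 10):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                          cv=LeaveOneOut()).mean()
    print(f"k={k}: LOOCV accuracy = {acc:.3f}")
```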

23) A company has built a kNN classifier that gets 100% accuracy on training data.

Note: The model has been successfully deployed and no technical issues were found on the client side except for the model performance A)

None of these Solution: A An overfitted model seems to perform well on training data, but it is not generalized enough to give the same results on new data.

24) You are given the following 2 statements; find which of these options is/are true in the case of k-NN.

Suppose that, before getting the prediction, you want to calculate the time taken by k-NN to predict the class for the test data. Note:

More than 250 people participated in the skill test and the highest score obtained was 24.

I tried my best to make the solutions as comprehensive as possible, but if you have any questions / doubts please drop them in the comments below.

I would love to hear your feedback about the skill test. For more such skill tests, check out our current hackathons.

4 Regression

Profit, sales, mortgage rates, house values, square footage, temperature, or distance could all be predicted using regression techniques.

In addition to the value, the data might track the age of the house, square footage, number of rooms, taxes, school district, proximity to shopping centers, and so on.

In the model build (training) process, a regression algorithm estimates the value of the target as a function of the predictors for each case in the build data.

Regression modeling has many applications in trend analysis, business planning, marketing, financial forecasting, time series prediction, biomedical and drug response modeling, and environmental modeling.

Cluster analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

Popular notions of clusters include groups with small distances between cluster members, dense areas of the data space, intervals or particular statistical distributions.

The appropriate clustering algorithm and parameter settings (including parameters such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek βότρυς 'grape') and typological analysis.

The subtle differences are often in the use of the results: while in data mining, the resulting groups are the matter of interest, in automatic classification the resulting discriminative power is of interest.

At different distances, different clusters will form, which can be represented using a dendrogram, which explains where the common name 'hierarchical clustering' comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances.

Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion (since a cluster consists of multiple objects, there are multiple candidates to compute the distance) to use.

Popular choices are known as single-linkage clustering (the minimum of object distances), complete linkage clustering (the maximum of object distances) or UPGMA ('Unweighted Pair Group Method with Arithmetic Mean', also known as average linkage clustering).

Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).

They are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge (known as 'chaining phenomenon', in particular with single-linkage clustering).
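A minimal sketch of agglomerative hierarchical clustering with the linkage criteria named above, using SciPy (the library choice is an assumption, not from the article):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two loose blobs of 2-D points.
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Agglomerative clustering with different linkage criteria; the full merge
# hierarchy returned by linkage() is what a dendrogram would display.
for method in ("single", "complete", "average"):     # average = UPGMA
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])
```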

When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized.

Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means).
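A short k-means sketch with scikit-learn showing two of those variations, k-means++ initialization and keeping the best of multiple random restarts (the library is my choice for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in (0, 3, 6)])  # three blobs

# k-means with k-means++ initialization, keeping the best of 10 restarts.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print("within-cluster sum of squared distances:", km.inertia_)
print("cluster sizes:", np.bincount(km.labels_))
```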

In distribution-based clustering, for example with Gaussian mixture models, the data set is usually modeled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to better fit the data set.
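A minimal Gaussian-mixture sketch with scikit-learn, fitting a fixed number of components and reading off soft cluster memberships (the library choice is mine, purely illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

# Fit a fixed number of Gaussians; each point gets a soft assignment.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("component means:\n", gmm.means_)
print("soft memberships of the first point:", gmm.predict_proba(X[:1]))
```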

In density-based clustering such as DBSCAN, a cluster consists of all density-connected objects (which can form a cluster of an arbitrary shape, in contrast to many other methods) plus all objects that are within these objects' range.

Another interesting property of DBSCAN is that its complexity is fairly low – it requires a linear number of range queries on the database – and that it will discover essentially the same results (it is deterministic for core and noise points, but not for border points) in each run, therefore there is no need to run it multiple times.
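A small DBSCAN sketch with scikit-learn; eps plays the role of the range-query radius and min_samples the density threshold (the library choice is mine, not the article's):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(3, 0.2, (50, 2)),
               rng.uniform(-2, 5, (10, 2))])   # two dense blobs plus scattered noise

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", sorted(set(labels) - {-1}))
print("points labelled as noise:", int(np.sum(labels == -1)))
```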

On data sets with, for example, overlapping Gaussian distributions – a common use case in artificial data – the cluster borders produced by these algorithms will often look arbitrary, because the cluster density decreases continuously.

Besides that, the applicability of the mean-shift algorithm to multidimensional data is hindered by the unsmooth behaviour of the kernel density estimate, which results in over-fragmentation of cluster tails.[13]

With the recent need to process larger and larger data sets (also known as big data), the willingness to trade semantic meaning of the generated clusters for performance has been increasing.

This led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently, but the resulting 'clusters' are merely a rough pre-partitioning of the data set to then analyze the partitions with existing slower methods such as k-means clustering.

For high-dimensional data, many of the existing methods fail due to the curse of dimensionality, which renders particular distance functions problematic in high-dimensional spaces.

This led to new clustering algorithms for high-dimensional data that focus on subspace clustering (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and correlation clustering that also looks for arbitrary rotated ('correlated') subspace clusters that can be modeled by giving a correlation of their attributes.[18]

Message passing algorithms, a recent development in computer science and statistical physics, have also led to the creation of new types of clustering algorithms.[29]

Popular approaches involve 'internal' evaluation, where the clustering is summarized to a single quality score, 'external' evaluation, where the clustering is compared to an existing 'ground truth' classification, 'manual' evaluation by a human expert, and 'indirect' evaluation by evaluating the utility of the clustering in its intended application.[31]

One drawback of using internal criteria in cluster evaluation is that high scores on an internal measure do not necessarily result in effective information retrieval applications.[33]

Therefore, the internal evaluation measures are best suited to get some insight into situations where one algorithm performs better than another, but this shall not imply that one algorithm produces more valid results than another.[4]

More than a dozen internal evaluation measures exist, usually based on the intuition that items in the same cluster should be more similar than items in different clusters.[34]:115–121 For example, the following methods can be used to assess the quality of clustering algorithms based on an internal criterion:
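The list itself is not reproduced here, but the silhouette coefficient and the Davies-Bouldin index are two commonly used internal criteria; a minimal sketch with scikit-learn (an illustrative choice, not from the article):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Higher silhouette and lower Davies-Bouldin both indicate tighter, better-separated clusters.
print("silhouette:", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
```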

However, it has recently been discussed whether this is adequate for real data, or only on synthetic data sets with a factual ground truth, since classes can contain internal structure, the attributes present may not allow separation of clusters or the classes may contain anomalies.[36]

In the special scenario of constrained clustering, where meta information (such as class labels) is used already in the clustering process, the hold-out of information for evaluation purposes is non-trivial.[37]

In place of counting the number of times a class was correctly assigned to a single data point (known as true positives), such pair counting metrics assess whether each pair of data points that is truly in the same cluster is predicted to be in the same cluster.[30]
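The Rand index and its adjusted form are classic pair-counting metrics of this kind; a small sketch using scikit-learn's metrics (rand_score requires a reasonably recent scikit-learn release):

```python
from sklearn.metrics import rand_score, adjusted_rand_score

truth     = [0, 0, 0, 1, 1, 1]   # ground-truth classes
predicted = [0, 0, 1, 1, 1, 1]   # cluster labels from some algorithm

# Both scores compare how pairs of points are grouped, so the actual label
# values do not need to match between the two assignments.
print("Rand index:", rand_score(truth, predicted))
print("adjusted Rand index:", adjusted_rand_score(truth, predicted))
```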

Statistics with R (1) - Linear regression

In this video, I show how to use R to fit a linear regression model using the lm() command. I also introduce how to plot the regression line and the overall ...

The Best Way to Prepare a Dataset Easily

In this video, I go over the 3 steps you need to prepare a dataset to be fed into a machine learning model (selecting the data, processing it, and transforming it).

Linear Regression Machine Learning (tutorial)

I'll perform linear regression from scratch in Python using a method called 'Gradient Descent' to determine the relationship between student test scores ...

How To... Perform Simple Linear Regression by Hand

Learn how to make predictions using Simple Linear Regression. To do this you need to use the Linear Regression Function (y = a + bx) where "y" is the ...

An Introduction to Linear Regression Analysis

Tutorial introducing the idea of linear regression analysis and the least square method. Typically used in a statistics class. Playlist on Linear Regression ...

Excel - Simple Linear Regression

Simple Linear Regression using Microsoft Excel.

How to calculate linear regression using least square method

An example of how to calculate linear regression line using least squares. A step by step tutorial showing how to develop a linear regression equation. Use of ...

Checking Linear Regression Assumptions in R (R Tutorial 5.2)

How to check the validity of assumptions made when fitting a Linear Regression Model. In this video you will learn how to use residual plots to check the linearity ...

Handling Non-Numeric Data - Practical Machine Learning Tutorial with Python p.35

In this machine learning tutorial, we cover how to work with non-numerical data. This is useful with any form of machine learning, all of which require data to be in ...

Linear Regression Models using a Graphing calculator

This video gives step-by-step instructions on how you input data in a graphing calculator and then look at the calculator produced scatterplot, find the linear ...