# AI News, Ten Machine Learning Algorithms You Should Know to Become a Data Scientist ## Ten Machine Learning Algorithms You Should Know to Become a Data Scientist

That said, no one can deny the fact that as practicing Data Scientists, we will have to know basics of some common machine learning algorithms, which would help us engage with a new-domain problem we come across.

Covariance Matrix of data points is analyzed here to understand what dimensions(mostly)/ data points (sometimes) are more important (ie have high variance amongst themselves, but low covariance with others).

As is obvious, use this algorithm to fit simple curves / regression https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.polyfit.html https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/linear_regression.pdf Least Squares can get confused with outliers, spurious fields and noise in data.

As is obvious from the name, you can use this algorithm to create K clusters in dataset http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html https://www.youtube.com/watch?v=hDmNF9JG3lo https://www.datascience.com/blog/k-means-clustering Logistic Regression is constrained Linear Regression with a nonlinearity (sigmoid function is used mostly or you can use tanh too) application after weights are applied, hence restricting the outputs close to +/- classes (which is 1 and 0 in case of sigmoid).

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html https://www.youtube.com/watch?v=-la3q9d7AKQ SVMs are linear models like Linear/ Logistic Regression, the difference is that they have different margin-based loss function (The derivation of Support Vectors is one of the most beautiful mathematical results I have seen along with eigenvalue calculation).

FFNNs can be used to train a classifier or extract features as autoencoders http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html https://github.com/keras-team/keras/blob/master/examples/reuters_mlp_relu_vs_selu.py   http://www.deeplearningbook.org/contents/mlp.html http://www.deeplearningbook.org/contents/autoencoders.html http://www.deeplearningbook.org/contents/representation.html Almost any state of the art Vision based Machine Learning result in the world today has been achieved using Convolutional Neural Networks.

https://developer.nvidia.com/digits https://github.com/kuangliu/torchcv https://github.com/chainer/chainercv https://keras.io/applications/ http://cs231n.github.io/ https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/ RNNs model sequences by applying the same set of weights recursively on the aggregator state at a time t and input at a time t (Given a sequence has inputs at times 0..t..T, and have a hidden state at each time t which is output from t-1 step of RNN).

RNN (If here is a densely connected unit and a nonlinearity, nowadays f is generally LSTMs or GRUs ). LSTM unit which is used instead of a plain dense layer in a pure RNN.

Use RNNs for any sequence modelling task specially text classification, machine translation, language modelling https://github.com/tensorflow/models (Many cool NLP research papers from Google are here) https://github.com/wabyking/TextClassificationBenchmark http://opennmt.net/    http://cs224d.stanford.edu/ http://www.wildml.com/category/neural-networks/recurrent-neural-networks/ http://colah.github.io/posts/2015-08-Understanding-LSTMs/ CRFs are probably the most frequently used models from the family of Probabilitic Graphical Models (PGMs).

Before Neural Machine Translation systems came in CRFs were the state of the art and in many sequence tagging tasks with small datasets, they will still learn better than RNNs which require a larger amount of data to generalize.

The two common decision trees algorithms used nowadays are Random Forests (which build different classifiers on a random subset of attributes and combine them for output) and Boosting Trees (which train a cascade of trees one on top of others, correcting the mistakes of ones below them).

Decision Trees can be used to classify datapoints (and even regression) http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html http://xgboost.readthedocs.io/en/latest/ https://catboost.yandex/ http://xgboost.readthedocs.io/en/latest/model.html https://arxiv.org/abs/1511.05741 https://arxiv.org/abs/1407.7502 http://education.parrotprediction.teachable.com/p/practical-xgboost-in-python If you are still wondering how can any of the above methods solve tasks like defeating Go world champion like DeepMind did, they cannot.

To learn strategy to solve a multi-step problem like winning a game of chess or playing Atari console, we need to let an agent-free in the world and learn from the rewards/penalties it faces.

## The Logistic Regression Algorithm

Like many other machine learning techniques, it is borrowed from the field of statistics and despite its name, it is not an algorithm for regression problems, where you want to predict a continuous outcome.

simple example of a Logistic Regression problem would be an algorithm used for cancer detection that takes screening picture as an input and should tell if a patient has cancer (1) or not (0).

Logistic Regression measures the relationship between the dependent variable (our label, what we want to predict) and the one or more independent variables (our features), by estimating probabilities using it’s underlying logistic function.

The Sigmoid-Function is an S-shaped curve that can take any real-valued number and map it into a value between the range of 0 and 1, but never exactly at those limits.

Below you can see how the logistic function (sigmoid function) looks like: We want to maximize the likelihood that a random data point gets classified correctly, which is called Maximum Likelihood Estimation.

It is a widely used technique because it is very efficient, does not require too many computational resources, it’s highly interpretable, it doesn’t require input features to be scaled, it doesn’t require any tuning, it’s easy to regularize, and it outputs well-calibrated predicted probabilities.

Like linear regression, logistic regression does work better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other.

Therefore it is required that your data is linearly separable, like the data points in the image below: In other words: You should think about using logistic regression when your Y variable takes on only two values (e.g when you are facing a classification problem).

When you then want to classify an image, you just look at which classifier has the best decision score Here you train a binary classifier for every pair of digits.

This means training a classifier that can distinguish between 0s and 1s, one that can distinguish between 0s and 2s, one that can distinguish between 1s and 2s etc.

Algorithms like Support Vector Machine Classifiers don’t scale well at large datasets, which is why in this case using a binary classification algorithm like Logistic Regression with the OvO strategy would do better, because it is faster to train a lot of classifiers on a small dataset than training just one at a large dataset.

## Essentials of Machine Learning Algorithms (with Python and R Codes)

Note: This article was originally published on Aug 10, 2015 and updated on Sept 9th, 2017

Google&#8217;s self-driving cars and robots get a lot of press, but the company&#8217;s real future is in machine learning, the technology that enables computers to get smarter and more personal.

The idea behind creating this guide is to simplify the journey of aspiring data scientists and machine learning enthusiasts across the world.

How it works: This algorithm consist of a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables).

Using these set of variables, we generate a function that map inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data.

This machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions.

These algorithms can be applied to almost any data problem: It is used to estimate real values (cost of houses, number of calls, total sales etc.) based on continuous variable(s).

In this equation: These coefficients a and b are derived based on minimizing the sum of squared difference of distance between data points and regression line.

And, Multiple Linear Regression(as the name suggests) is characterized by multiple (more than 1) independent variables. While finding best fit line, you can fit a polynomial or curvilinear regression.

It is a classification not a regression algorithm. It is used to estimate discrete values ( Binary values like 0/1, yes/no, true/false ) based on given set of independent variable(s).

It chooses parameters that maximize the likelihood of observing the sample values rather than that minimize the sum of squared errors (like in ordinary regression).

source: statsexchange In the image above, you can see that population is classified into four different groups based on multiple attributes to identify &#8216;if they will play or not&#8217;.

In this algorithm, we plot each data item as a point in n-dimensional space (where n is number of features you have) with the value of each feature being the value of a particular coordinate.

For example, if we only had two features like Height and Hair length of an individual, we&#8217;d first plot these two variables in two dimensional space where each point has two co-ordinates (these co-ordinates are known as Support Vectors)

In the example shown above, the line which splits the data into two differently classified groups is the black line, since the two closest points are the farthest apart from the line.

It is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Step 1: Convert the data set to frequency table Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and probability of playing is 0.64.

Yes) * P(Yes) / P (Sunny) Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 = 0.64 Now, P (Yes |

However, it is more widely used in classification problems in the industry. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors.

Its procedure follows a simple and easy  way to classify a given data set through a certain number of  clusters (assume k clusters).

We know that as the number of cluster increases, this value keeps on decreasing but if you plot the result you may see that the sum of squared distance decreases sharply up to some value of k, and then much more slowly after that.

grown as follows: For more details on this algorithm, comparing with decision tree and tuning model parameters, I would suggest you to read these articles: Python R Code

For example: E-commerce companies are capturing more details about customer like their demographics, web crawling history, what they like or dislike, purchase history, feedback and many others to give them personalized attention more than your nearest grocery shopkeeper.

How&#8217;d you identify highly significant variable(s) out 1000 or 2000? In such cases, dimensionality reduction algorithm helps us along with various other algorithms like Decision Tree, Random Forest, PCA, Factor Analysis, Identify based on correlation matrix, missing value ratio and others.

GBM is a boosting algorithm used when we deal with plenty of data to make a prediction with high prediction power. Boosting is actually an ensemble of learning algorithms which combines the prediction of several base estimators in order to improve robustness over a single estimator.

The XGBoost has an immensely high predictive power which makes it the best choice for accuracy in events as it possesses both linear model and the tree learning algorithm, making the algorithm almost 10x faster than existing gradient booster techniques.

It is designed to be distributed and efficient with the following advantages: The framework is a fast and high-performance gradient boosting one based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Since the LightGBM is based on decision tree algorithms, it splits the tree leaf wise with the best fit whereas other boosting algorithms split the tree depth wise or level wise rather than leaf-wise.

So when growing on the same leaf in Light GBM, the leaf-wise algorithm can reduce more loss than the level-wise algorithm and hence results in much better accuracy which can rarely be achieved by any of the existing boosting algorithms.

Catboost can automatically deal with categorical variables without showing the type conversion error, which helps you to focus on tuning your model better rather than sorting out trivial errors.

My sole intention behind writing this article and providing the codes in R and Python is to get you started right away. If you are keen to master machine learning, start right away.

## Selecting the best Machine Learning algorithm for your regression problem

Beginning with the simple case, Single Variable Linear Regression is a technique used to model the relationship between a single input independent variable (feature variable) and an output dependent variable using a linear model i.e a line.

The more general case is Multi Variable Linear Regression where a model is created for the relationship between multiple independent input variables (feature variables) and an output dependent variable.

The input feature variables from the data are passed to these neurons as a multi-variable linear combination, where the values multiplied by each feature variable are known as weights.

Tree induction is the task of taking a set of training instances as input, deciding which attributes are best to split on, splitting the dataset, and recurring on the resulting split datasets until all training instances are categorized.

While building the tree, the goal is to split on the attributes which create the purest child nodes possible, which would keep to a minimum the number of splits that would need to be made in order to classify all instances in our dataset.

In practice, this is measured by comparing entropy, or the amount of information needed to classify a single instance of a current dataset partition, to the amount of information to classify a single instance if the current dataset partition were to be further partitioned on a given attribute.

## Supervised learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances.

This requires the learning algorithm to generalize from the training data to unseen situations in a 'reasonable' way (see inductive bias).

There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).

The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm. Generally, there is a tradeoff between bias and variance.

A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).

The second issue is the amount of training data available relative to the complexity of the 'true' function (classifier or regression function).

If the true function is simple, then an 'inflexible' learning algorithm with high bias and low variance will be able to learn it from a small amount of data.

But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data and using a 'flexible' learning algorithm with low bias and high variance.

If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features.

Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias.

In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function.

In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones.

This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.

fourth issue is the degree of noise in the desired output values (the supervisory target variables).

If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples.

You can overfit even when there are no measurement errors (stochastic noise) if the function you are trying to learn is too complex for your learning model.

In such a situation, the part of the target function that cannot be modeled 'corrupts' your training data - this phenomenon has been called deterministic noise.

In practice, there are several approaches to alleviate noise in the output values such as early stopping to prevent overfitting as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm.

There are several algorithms that identify noisy training examples and removing the suspected noisy training examples prior to training has decreased generalization error with statistical significance. Other factors to consider when choosing and applying a learning algorithm include the following: When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross validation).

Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.

x

1

y

1

x

N

y

N

x

i

y

i

R

arg

&#x2061;

max

y

|

For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model.

empirical risk minimization and structural risk minimization. Empirical risk minimization seeks the function that best fits the training data.

In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs,

x

i

y

i

R

&#x2265;

0

x

i

y

i

y

&#x005E;

y

i

y

&#x005E;

This can be estimated from the training data as In empirical risk minimization, the supervised learning algorithm seeks the function

|

y

&#x005E;

|

contains many candidate functions or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization.

The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.

&#x2211;

j

j

2

2

1

&#x2211;

j

|

j

0

j

The training methods described above are discriminative training methods, because they seek to find a function

i

i

i

## Regression vs. Classification Algorithms

We’ve done this before through the lens of whether the data used to train the algorithm should be labeled or not (see our posts onsupervised, unsupervised, or semi-supervised machine learning),but there are also inherent differences in these algorithms based on the format of their outputs.

If these are the questions you’re hoping to answer with machine learning in your business, consider algorithms like naive Bayes, decision trees, logistic regression, kernel approximation, and K-nearest neighbors.

Regression problems with time-ordered inputs are called time-series forecasting problems, like ARIMA forecasting, which allows data scientists to explain seasonal patterns in sales, evaluate the impact of new marketing campaigns, and more.

Though it’s often underrated because of its relative simplicity, it’s a versatile method that can be used to predict housing prices, likelihood of customers to churn, or the revenue a customer will generate.

Linear Regression - Machine Learning Fun and Easy

Linear Regression - Machine Learning Fun and Easy

Linear Regression Algorithm | Linear Regression in R | Data Science Training | Edureka

Data Science Training - ) This Edureka Linear Regression tutorial will help you understand all the basics of linear ..

Regression Features and Labels - Practical Machine Learning Tutorial with Python p.3

We'll be using the numpy module to convert data to numpy arrays, which is what Scikit-learn wants. We will talk more on preprocessing and cross_validation ...

Logistic Regression in R | Machine Learning Algorithms | Data Science Training | Edureka

Data Science Training - ) This Logistic Regression Tutorial shall give you a clear understanding as to how a Logistic ..

3.4: Linear Regression with Gradient Descent - Intelligence and Learning

In this video I continue my Machine Learning series and attempt to explain Linear Regression with Gradient Descent. My Video explaining the Mathematics of ...

Data Science & Machine Learning - Multiple Linear Regression Model - DIY- 11 -of-50

Data Science & Machine Learning - Multiple Linear Regression Model - DIY- 11 -of-50 Do it yourself Tutorial by Bharati DW Consultancy cell: +1-562-646-6746 ...

Logistic Regression Example using Titanic Data Set | Machine Learning

Logistic regression is the most popular classification technique in Machine Learning. It is simple yet very effective when it comes to classifying that has binary ...

35 Types of Regression Models used in Data Science

In this video you will learn 35 varieties of regression equations which includes but not limited to - Simple Linear Regression -Multiple Linear Regression -Logistic ...

Data Science & Machine Learning - Numeric Predictions using Regression Trees - DIY- 14 -of-50

Data Science & Machine Learning - Numeric Predictions using Regression Trees - DIY- 14 -of-50 Do it yourself Tutorial by Bharati DW Consultancy cell: ...

Data Science & Machine Learning - Linear Regression Model - DIY- 10(a) -of-50

Data Science & Machine Learning - Linear Regression Model - DIY- 10(a) -of-50 Do it yourself Tutorial by Bharati DW Consultancy cell: +1-562-646-6746 (Cell ...