# AI News

## What you need to know before you board the machine learning train

In my experience, there are a few misconceptions in machine learning that people usually fall for.

You shouldn’t run the machine learning algorithm on the same data set that was used to build it and then use that accuracy / performance as a guide to accuracy on future, unseen data.

(In ML parlance, the data set used for building the model is called the training data set; the data set used for determining accuracy is called the test data set.)

Even if you keep a test data set aside, another mistake you can make is to have too few data points for determining accuracy, or to fail to ensure that the test data set is representative of the entire data available to you.
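As a concrete sketch (pure Python, hypothetical helper name), holding out a shuffled, reasonably sized test set before any modeling might look like this:

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle and split so the test set is held out before any modeling.

    A larger, shuffled test set is more likely to be representative of
    the full data than a tiny or ordered slice.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

records = list(range(100))
train, test = train_test_split(records)
assert len(test) == 20                 # enough points to measure accuracy on
assert not set(train) & set(test)      # no leakage between the two sets
```

Accuracy measured on `test` is then an honest estimate, provided the split is representative.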

There are so many ways you can go wrong with accuracy that it’s worth flexing your mental muscles just to understand what an accuracy number implies and whether you can trust that number enough to release the machine-learning-enabled feature to users.

Imagine your task is predicting the next order date based on a customer’s various attributes (say office location, last order date, company size, revenue, products they have bought, etc.). For any machine learning algorithm, the potential inputs for predicting a given output can be many in number.

For example, in our hypothetical case, you could try several models for predicting the next order date from a customer. What if you find out that a simple linear relationship between two variables has an accuracy of 85%, while a complex deep learning algorithm (whose output you cannot interpret and whose implementation will take 8 extra weeks of dev effort) has an accuracy of 90%?

And if you have a data science team and they suggest using neural networks, first demand results on simpler, more interpretable, easy-to-implement models (such as linear regression, logistic regression, decision trees, or k-means clustering).
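As an illustration of how cheap such a baseline can be, here is a closed-form least-squares fit of a single predictor (the data below is made up for the example):

```python
def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Toy data: days since last order vs. days until next order (invented numbers)
xs = [10, 20, 30, 40]
ys = [21, 41, 61, 81]   # exactly y = 2x + 1
a, b = fit_line(xs, ys)
assert abs(a - 2) < 1e-9 and abs(b - 1) < 1e-9
```

If a two-line baseline like this gets close to the complex model's accuracy, the extra dev effort may not be worth it.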

Moreover, beyond the 0.5% tech-geek persona in your customer base, nobody cares whether your new feature is enabled by a machine learning algorithm or by a new bunch of enthusiastic interns.

You can of course mention it as a side note for technically savvy customers, but marketing a feature as “AI” or “machine learning” enabled deprives you of the opportunity to highlight the real customer value your algorithms create.

## 8 Proven Ways for Improving the “Accuracy” of a Machine Learning Model

But if you follow the ways shared below, you’d surely achieve high accuracy in your models (given that the data provided is sufficient to make predictions).

In this article, I’ve shared 8 proven ways that you can use to create a robust machine learning model.

The model development cycle goes through various stages, starting from data collection to model building.

This practice usually helps in building better features later on, which are not biased by the data available in the data set.

The unwanted presence of missing and outlier values in the training data often reduces the accuracy of a model or leads to a biased model.

It shows that, in the presence of missing values, the chance of playing cricket appears similar for females and males.

But if you look at the second table (after treating missing values based on the salutation “Miss” in the name), the relationship looks quite different.
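A minimal sketch of that kind of treatment, imputing a missing value from a group statistic rather than a global one (field names here are hypothetical):

```python
def impute_by_group(rows, group_key, value_key):
    """Fill missing values with the mean of the same group, not the global mean."""
    groups = {}
    for row in rows:
        if row[value_key] is not None:
            groups.setdefault(row[group_key], []).append(row[value_key])
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    for row in rows:
        if row[value_key] is None:
            row[value_key] = means[row[group_key]]
    return rows

rows = [
    {"salutation": "Miss", "plays_cricket": 1.0},
    {"salutation": "Miss", "plays_cricket": None},  # imputed from the "Miss" group
    {"salutation": "Mr",   "plays_cricket": 0.0},
]
impute_by_group(rows, "salutation", "plays_cricket")
assert rows[1]["plays_cricket"] == 1.0
```

Group-wise imputation preserves patterns (such as the one tied to the salutation) that a single global fill value would wash out.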

This step helps to extract more information from existing data. New information is extracted in terms of new features. These features may have a higher ability to explain the variance in the training data.

Feature selection is the process of finding the subset of attributes that best explains the relationship of the independent variables with the target variable.
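One simple, interpretable way to do this, sketched below, is to rank features by the absolute value of their correlation with the target (an illustration, not the only selection criterion):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rank_features(features, target):
    """Return feature names sorted by |correlation| with the target, strongest first."""
    return sorted(features, key=lambda name: -abs(pearson(features[name], target)))

features = {
    "useful": [1, 2, 3, 4, 5],   # perfectly tracks the target
    "noisy":  [3, 1, 4, 1, 5],   # weakly related
}
target = [2, 4, 6, 8, 10]
assert rank_features(features, target)[0] == "useful"
```

You would then keep the top-ranked features and check whether model accuracy holds up with the smaller subset.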

Hitting upon the right machine learning algorithm is the ideal way to achieve higher accuracy.

To tune these parameters, you must have a good understanding of their meaning and their individual impact on the model. You can repeat this process with a number of well-performing models.
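That tuning loop can be sketched as a plain grid search over a hypothetical parameter grid, scoring every combination and keeping the best:

```python
from itertools import product

def grid_search(train_fn, score_fn, grid):
    """Try every parameter combination and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        model = train_fn(**params)
        score = score_fn(model)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy example: the "model" is just its params; the score peaks at depth=3, lr=0.1
grid = {"depth": [1, 3, 5], "lr": [0.01, 0.1]}
train = lambda **p: p
score = lambda m: -abs(m["depth"] - 3) - abs(m["lr"] - 0.1)
best, _ = grid_search(train, score, grid)
assert best == {"depth": 3, "lr": 0.1}
```

In practice `score_fn` would be validation-set accuracy, and smarter strategies (random or Bayesian search) scale better than exhaustive grids.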

This technique simply combines the results of multiple weak models to produce better results. This can be achieved in many ways; to know more about these methods, you can refer to the article “Introduction to ensemble learning”.

Till here, we have seen methods that can improve the accuracy of a model. But it is not necessary that higher-accuracy models always perform better (on unseen data points). Sometimes, the improvement in a model’s accuracy can be due to over-fitting too.

To know more about this cross-validation method, you should refer to the article “Improve model performance using cross validation”.
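The idea behind cross validation can be sketched in a few lines; each point is tested exactly once, and the scores are averaged over folds (toy data and a toy majority-vote "model"):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds (shuffle first in practice)."""
    fold_size, folds = n // k, []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n
        folds.append(list(range(start, end)))
    return folds

def cross_val_score(data, labels, train_fn, k=5):
    """Average accuracy over k train/test splits; every point is tested once."""
    scores = []
    for test_idx in k_fold_indices(len(data), k):
        test_set = set(test_idx)
        train_x = [x for i, x in enumerate(data) if i not in test_set]
        train_y = [y for i, y in enumerate(labels) if i not in test_set]
        model = train_fn(train_x, train_y)
        correct = sum(model(data[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k

# Toy "classifier": always predict the majority label seen in training
data = list(range(10))
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
majority = lambda xs, ys: (lambda x: max(set(ys), key=ys.count))
acc = cross_val_score(data, labels, majority, k=5)
assert abs(acc - 0.7) < 1e-9
```

A cross-validated score is a much better guard against over-fitting than a single train/test split, because no one lucky split can inflate it.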

Once you get the data set, follow these proven ways and you’ll surely get a robust machine learning model. But these 8 steps can only help you after you’ve mastered each step individually.

## Supervised learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.[1] It infers a function from labeled training data consisting of a set of training examples.[2] In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances.

This requires the learning algorithm to generalize from the training data to unseen situations in a 'reasonable' way (see inductive bias).

There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).

The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.[4] Generally, there is a tradeoff between bias and variance.

A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).

The second issue is the amount of training data available relative to the complexity of the 'true' function (classifier or regression function).

If the true function is simple, then an 'inflexible' learning algorithm with high bias and low variance will be able to learn it from a small amount of data.

But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data and using a 'flexible' learning algorithm with low bias and high variance.

If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features.

Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias.

In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function.

In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones.

This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.

A fourth issue is the degree of noise in the desired output values (the supervisory target variables).

If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples.

You can overfit even when there are no measurement errors (stochastic noise) if the function you are trying to learn is too complex for your learning model.

In such a situation, the part of the target function that cannot be modeled 'corrupts' your training data - this phenomenon has been called deterministic noise.

In practice, there are several approaches to alleviate noise in the output values such as early stopping to prevent overfitting as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm.

There are several algorithms that identify noisy training examples, and removing the suspected noisy training examples prior to training has decreased generalization error with statistical significance.[5][6] Other factors also matter when choosing and applying a learning algorithm. When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross-validation).

Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.

Formally, given a set of $N$ training examples of the form $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, such that $x_i$ is the feature vector of the $i$-th example and $y_i$ is its label, a learning algorithm seeks a function $g: X \to Y$, where $X$ is the input space and $Y$ is the output space. In many cases $g$ takes the form of a scoring function $f: X \times Y \to \mathbb{R}$, so that $g$ returns the $y$ value giving the highest score: $g(x) = \arg\max_y f(x, y)$. When $f$ takes the form of a conditional probability model $f(x, y) = P(y \mid x)$, or a joint probability model $f(x, y) = P(x, y)$, the learned function can be interpreted probabilistically.
For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model.

There are two basic approaches to choosing $f$ or $g$: empirical risk minimization and structural risk minimization.[7] Empirical risk minimization seeks the function that best fits the training data; structural risk minimization additionally penalizes complexity.

In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs,

$(x_i, y_i)$. In order to measure how well a function fits the training data, a loss function $L: Y \times Y \to \mathbb{R}^{\geq 0}$ is defined. For training example $(x_i, y_i)$, the loss of predicting the value $\hat{y}$ is $L(y_i, \hat{y})$.
The risk $R(g)$ of a function $g$ is defined as the expected loss of $g$. This can be estimated from the training data as $R_{\mathrm{emp}}(g) = \frac{1}{N} \sum_i L(y_i, g(x_i))$. In empirical risk minimization, the supervised learning algorithm seeks the function $g$ that minimizes $R_{\mathrm{emp}}(g)$. When the hypothesis space contains many candidate functions or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization. Structural risk minimization seeks to prevent this by incorporating a regularization penalty into the optimization.

The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.

A wide variety of penalties can be employed. A popular choice is the squared Euclidean norm of the weights, also known as the $L_2$ norm, $\sum_j \beta_j^2$. Other norms include the $L_1$ norm, $\sum_j |\beta_j|$, and the $L_0$ “norm”, which is the number of non-zero $\beta_j$.
The training methods described above are discriminative training methods, because they seek a function $g$ that discriminates well between the different output values (see discriminative model). For the special case where $f(x, y) = P(x, y)$ is a joint probability distribution and the loss function is the negative log likelihood, a risk minimization algorithm is said to perform generative training.
## How to handle Imbalanced Classification Problems in machine learning?

If you have spent some time in machine learning and data science, you have definitely come across imbalanced class distributions.

This problem is predominant in scenarios where anomaly detection is crucial like electricity pilferage, fraudulent transactions in banks, identification of rare diseases, etc.

Finally, I reveal an approach with which you can create a balanced class distribution and apply an ensemble learning technique designed especially for this purpose.

Utility companies are increasingly turning towards advanced analytics and machine learning algorithms to identify consumption patterns that indicate theft.

For any imbalanced data set, if the event to be predicted belongs to the minority class and the event rate is less than 5%, it is usually referred to as a rare event.

Example: in a utilities fraud detection data set you have the following data:

- Total observations = 1,000
- Fraudulent observations = 20
- Non-fraudulent observations = 980
- Event rate = 2%

The main question faced during data analysis is how to train and evaluate a classifier on such skewed data.

For example, a classifier that achieves an accuracy of 98% with an event rate of 2% is not really accurate if it classifies all instances as the majority class.

Thus, to sum it up, while trying to resolve specific business challenges with imbalanced data sets, the classifiers produced by standard machine learning algorithms might not give accurate results.

Apart from fraudulent transactions, imbalanced datasets appear in many other common business problems, such as those listed earlier. In this article, we will illustrate various techniques to train a model to perform well against highly imbalanced datasets.

And accurately predict rare events using the following fraud detection dataset:

- Total observations = 1,000
- Fraudulent observations = 20
- Non-fraudulent observations = 980
- Event rate = 2%
- Fraud indicator = 0 for non-fraud instances, 1 for fraud

Dealing with imbalanced datasets entails strategies such as improving classification algorithms or balancing classes in the training data (data preprocessing) before providing the data as input to the machine learning algorithm.

Non-fraudulent observations after random under-sampling = 10% of 980 = 98. Total observations after combining them with the fraudulent observations = 20 + 98 = 118. Event rate for the new dataset after under-sampling = 20/118 ≈ 17%.

Non-fraudulent observations = 980. Fraudulent observations after replicating the minority class observations = 400. Total observations in the new data set after over-sampling = 1,380. Event rate for the new data set after over-sampling = 400/1380 ≈ 29%.
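The two resampling schemes above can be sketched directly in pure Python; the counts match the worked example in the text:

```python
import random

def random_under_sample(majority, minority, keep_fraction, seed=0):
    """Drop majority-class rows at random; keep the minority class whole."""
    rng = random.Random(seed)
    kept = rng.sample(majority, int(len(majority) * keep_fraction))
    return kept + minority

def random_over_sample(majority, minority, factor):
    """Replicate minority-class rows `factor` times."""
    return majority + minority * factor

majority = ["non-fraud"] * 980
minority = ["fraud"] * 20

under = random_under_sample(majority, minority, keep_fraction=0.10)
assert len(under) == 118            # 98 + 20, event rate ~17%

over = random_over_sample(majority, minority, factor=20)
assert len(over) == 1380            # 980 + 400, event rate ~29%
```

Under-sampling discards potentially useful majority-class information, while naive over-sampling duplicates minority rows exactly and so risks over-fitting; the synthetic methods below try to avoid both problems.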

A sample of 15 instances is taken from the minority class, and similar synthetic instances are generated 20 times over. After generation of the synthetic instances, the following data set is created: minority class (fraudulent observations) = 300; majority class (non-fraudulent observations) = 980; event rate = 300/1280 ≈ 23.4%.
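A bare-bones version of the SMOTE idea, interpolating between a minority point and one of its nearest minority-class neighbors (a sketch of the core step, not the full algorithm):

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create synthetic minority points by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest minority neighbors of a (excluding a itself), by squared distance
        neighbors = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((pa - pb) ** 2 for pa, pb in zip(a, p)),
        )[:k]
        b = rng.choice(neighbors)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(pa + t * (pb - pa) for pa, pb in zip(a, b)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.0)]
new_points = smote_sketch(minority, n_new=10)
assert len(new_points) == 10
# every synthetic point lies within the bounding box of the originals
assert all(0.9 <= x <= 1.2 and 0.8 <= y <= 1.1 for x, y in new_points)
```

Because the new points are interpolations rather than copies, the classifier sees plausible variation in the minority region instead of exact duplicates.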

The algorithm randomly selects a data point from the k nearest neighbors for safe samples, selects the nearest neighbor for border samples, and does nothing for latent-noise samples.

Figure 4: Approach to bagging methodology.

Total observations = 1,000; fraudulent observations = 20; non-fraudulent observations = 980; event rate = 2%. There are 10 bootstrapped samples chosen from the population with replacement.

Machine learning algorithms like logistic regression, neural networks, and decision trees are fitted to each bootstrapped sample of 200 observations.

The classifiers c1, c2, …, c10 are then aggregated to produce a compound classifier. This ensemble methodology produces a stronger compound classifier, since it combines the results of individual classifiers to come up with an improved one.

AdaBoost is the original boosting technique: it creates a highly accurate prediction rule by combining many weak and inaccurate rules. Each classifier is trained serially, with the goal of correctly classifying, in every round, the examples that were incorrectly classified in the previous round.

For a learned classifier to make strong predictions, it should satisfy three conditions; chief among them, each weak hypothesis must have accuracy slightly better than random guessing, i.e., an error rate below 50% for binary classification.

This is the fundamental assumption of the boosting algorithm, which can then produce a final hypothesis with a small error. After each round, the algorithm gives more focus to examples that are harder to classify. The amount of focus is measured by a weight, which is initially equal for all instances.
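The re-weighting step can be written down directly: misclassified examples are weighted up and the weights renormalized (the standard AdaBoost update, sketched with made-up numbers):

```python
import math

def adaboost_reweight(weights, correct, error):
    """One AdaBoost round: up-weight misclassified examples, then renormalize.

    `error` is the weighted error of the weak learner (must be < 0.5);
    `correct` flags which examples it classified correctly.
    """
    alpha = 0.5 * math.log((1 - error) / error)   # the learner's vote strength
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

weights = [0.25, 0.25, 0.25, 0.25]        # initially equal, as described above
correct = [True, True, True, False]        # the weak learner missed the last example
new_weights = adaboost_reweight(weights, correct, error=0.25)
assert new_weights[3] > new_weights[0]     # the hard example now gets more focus
assert abs(sum(new_weights) - 1.0) < 1e-12
```

The next weak learner is trained against these new weights, so it concentrates on exactly the examples the previous round got wrong.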

Figure 7: Approach to gradient boosting.

For example, in a training data set containing 1,000 observations, out of which 20 are labelled fraudulent, an initial base classifier is fitted.

A differentiable loss function is calculated based on the difference between the actual output and the predicted output of this step. The residual of the loss function becomes the target variable (F1) for the next iteration.

The rare event data set, after missing value removal, outlier treatment and dimension reduction, was then used for modeling.

Results: balancing the data set with SMOTE and training a gradient boosting algorithm on the balanced set significantly improves the predictive model, increasing its lift by around 20% and its precision/hit ratio by 3-4 times compared to normal analytical modeling techniques like logistic regression and decision trees.


## Which machine learning algorithm should I use?

This resource is designed primarily for beginner to intermediate data scientists or analysts who are interested in identifying and applying machine learning algorithms to address the problems of their interest.

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer varies depending on many factors, including the size and nature of the data and what you want to do with the answer. Even an experienced data scientist cannot tell which algorithm will perform best before trying different algorithms.

The machine learning algorithm cheat sheet helps you to choose from a variety of machine learning algorithms to find the appropriate algorithm for your specific problems.

Reinforcement learning analyzes and optimizes the behavior of an agent based on the feedback from the environment.  Machines try different scenarios to discover which actions yield the greatest reward, rather than being told which actions to take.

Once you obtain some results and become familiar with the data, you may spend more time using more sophisticated algorithms to strengthen your understanding of the data, hence further improving the results.

Even in this stage, the best algorithms might not be the methods that have achieved the highest reported accuracy, as an algorithm usually requires careful tuning and extensive training to obtain its best achievable performance.

Here we discuss the binary case, where the dependent variable $y$ only takes binary values, $\{y_i \in \{-1, 1\}\}_{i=1}^N$ (this can easily be extended to multi-class classification problems).

In logistic regression we use a different hypothesis class to try to predict the probability that a given example belongs to the '1' class versus the probability that it belongs to the '-1' class.
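That hypothesis class is the sigmoid of a linear score; a minimal sketch with made-up weights:

```python
import math

def sigmoid(z):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """P(y = 1 | x) under a logistic regression model with weights w and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

w, b = [2.0, -1.0], 0.5          # hypothetical learned parameters
p = predict_proba([1.0, 1.0], w, b)
assert 0.0 < p < 1.0
assert p > 0.5                   # positive score, so the predicted class is '1'
```

A positive linear score maps to a probability above 0.5 (class '1'), a negative score to below 0.5 (class '-1'); training chooses `w` and `b` to make these probabilities match the labels.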

A support vector machine (SVM) training algorithm finds the classifier represented by the normal vector $w$ and bias $b$ of the hyperplane.


The problem can be converted into a constrained optimization problem: When the classes are not linearly separable, a kernel trick can be used to map a non-linearly separable space into a higher dimension linearly separable space.

Decision trees, random forest and gradient boosting are all algorithms based on decision trees.  There are many variants of decision trees, but they all do the same thing – subdivide the feature space into regions with mostly the same label.

Neural networks flourished in the mid-1980s due to their parallel and distributed processing ability.  But research in this field was impeded by the ineffectiveness of the back-propagation training algorithm that is widely used to optimize the parameters of neural networks.

Support vector machines (SVM) and other simpler models, which can be easily trained by solving convex optimization problems, gradually replaced neural networks in machine learning.

In recent years, new and improved training techniques such as unsupervised pre-training and layer-wise greedy training have led to a resurgence of interest in neural networks.

A neural network consists of three parts: an input layer, hidden layers and an output layer. The training samples define the input and output layers.

We generally do not want to feed a large number of features directly into a machine learning algorithm since some features may be irrelevant or the “intrinsic” dimensionality may be smaller than the number of features.

The SVD is related to PCA in the sense that SVD of the centered data matrix (features versus samples) provides the dominant left singular vectors that define the same subspace as found by PCA.
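This relationship is easy to check numerically; assuming numpy, the leading principal component and the first right singular vector of a centered samples-by-features matrix span the same direction (the text's "left singular vectors" refers to the transposed, features-by-samples layout):

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 samples, 3 features, with most variance along one direction
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 2.0, 1.0]]) + 0.1 * rng.normal(size=(100, 3))

Xc = X - X.mean(axis=0)                 # center the data first

# PCA direction: leading eigenvector of the covariance matrix
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]                    # eigh sorts eigenvalues in ascending order

# SVD direction: first right singular vector of the centered matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
v1 = Vt[0]

# same subspace: the two unit vectors agree up to sign
assert abs(abs(pc1 @ v1) - 1.0) < 1e-8
```

Projecting `Xc` onto the first few such directions gives the lower-dimensional representation that can then be fed to a supervised learner.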

The takeaway message when trying to solve a new problem is to start with simple algorithms and move to more sophisticated ones only when needed. SAS Visual Data Mining and Machine Learning provides a good platform for beginners to learn machine learning and apply machine learning methods to their problems.

## Videos

- Machine Learning: Testing and Error Metrics. A friendly journey into the process of evaluating and improving machine learning models: training, testing, and evaluation metrics such as accuracy, precision, and recall.
- K Nearest Neighbor (kNN) Algorithm | R Programming | Data Prediction Algorithm. How to implement the kNN algorithm in R with the help of a freely available example data set.
- Overfitting 4: training, validation, testing. When building a learning algorithm, we need three disjoint sets of data: the training set, the validation set and the testing set.
- Train, Test, & Validation Sets explained. The different data sets used for training and testing an artificial neural network, including the training set, testing set and validation set.
- Logistic Regression in R | Machine Learning Algorithms | Data Science Training | Edureka. A tutorial giving a clear understanding of how logistic regression works.
- Hyperparameter Optimization - The Math of Intelligence #7. Hyperparameters are the magic numbers of machine learning; how to find them in a more intelligent way than trial and error.
- Weka Tutorial 35: Creating Training, Validation and Test Sets (Data Preprocessing). How to create training, test and cross-validation sets from a given dataset.
- Azure Machine Learning algorithm accuracy enhancement, tips, and tricks - THR3078R. Azure ML is the Microsoft cloud-based machine learning tool, providing a variety of algorithms for descriptive, predictive, and prescriptive analysis.
- Handling Class Imbalance Problem in R: Improving Predictive Model Performance.