
On Machine Learning

This post outlines some of the things I have been thinking about when applying machine learning to a given problem, along with the process we adopted for the classification problem at CB Insights.

My aim is not to focus on the algorithms, methods, or classifiers, but rather to offer a broader picture of how to approach a machine learning problem, and in the meantime to give a couple of pieces of bad advice.

Be warned that these may generalize better than your favorite classifier. (I will try not to overfit, but let me know in the comments if I do.) Most machine learning book chapters and articles focus on algorithms/classifiers and sometimes optimization methods.

From a theoretical perspective, they analyze the algorithms' theoretical bounds and sometimes the learning function itself along with different types of optimization.

The datasets used in papers, though, sometimes happen to be trivial and do not necessarily reflect real-world or in-the-wild dataset characteristics.

There is a significant amount of knowledge and experience one has to gain (sometimes just by experimentation) to bridge the gap between these two separate (yet not independent) areas and build a pipeline.

If I make an analogy with software programming: we put in algorithms and data as input, and the output becomes the program we intended to write, yet without explicitly writing it.

If we have data and labels for a class, we can train a classifier on that data along with features and feature selection, and then classify new samples with that classifier.
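
As a rough sketch of that flow (the toy articles, labels, and parameter choices here are hypothetical), a scikit-learn pipeline can chain representation, feature selection, and a classifier:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled news articles and their categories.
texts = ["Acme acquires Widgets Inc", "Acme hires new CFO"]
labels = ["partnership", "hr"]

pipeline = Pipeline([
    ("features", TfidfVectorizer()),       # represent text as TF-IDF vectors
    ("select", SelectKBest(chi2, k=5)),    # keep the most informative features
    ("clf", LogisticRegression()),         # train the classifier
])
pipeline.fit(texts, labels)
print(pipeline.predict(["Acme partners with Foo Corp"]))
```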

There will always be cases where you miss a couple of rules, or where some structures in text are hard to express as hard-coded rules (if one company joins another company, that article is most probably about a partnership rather than HR); a rule-based approach requires a considerable amount of development effort and also deep domain expertise.

A learning system can incorporate more data and make use of it without additional effort, whereas with rules, every new case you want to handle grows your code base further and further.

In the computer vision domain, even raw pixel values turn out not to be very discriminative, so computer vision researchers have come up with higher-level representations for images.
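
As one illustration (a minimal sketch using scikit-image's bundled sample image; the parameter values are arbitrary), histogram-of-oriented-gradients (HOG) features replace raw pixels with local gradient statistics:

```python
from skimage import data, color
from skimage.feature import hog

# Grayscale sample image shipped with scikit-image.
image = color.rgb2gray(data.astronaut())

# HOG summarizes local gradient orientations instead of raw pixel values.
features = hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
print(features.shape)
```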

Even without knowing a particular classifier's strengths and weaknesses, knowing the broad categories of classifiers is useful for making a good decision about which classifier to choose.

For example, a search engine needs to consider both precision and recall when evaluating its ranking, whereas a classifier in the medical domain may put more emphasis on Type I errors than on Type II errors, or vice versa.
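
To pin the terms down, with true positives (TP), false positives (FP, Type I errors), and false negatives (FN, Type II errors), precision and recall are defined as:

```latex
\mathrm{precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{recall} = \frac{TP}{TP + FN}
```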

(I will mention different cross-validation methods in the section on generalization in a bit.) When we have some input (text, image, video; discrete, continuous, or categorical variables) from which we want to learn some structure or train a classifier, the first thing that needs to be done is to represent the input in a way that the classifier or the algorithm can use.

Beyond the raw representation (individual pixel values for images, words in text), you may also want to build your own features, which could be higher level, or at the same level but more useful for learning.

Not only that: when you evaluate your classifiers, the ones that generalize well (perform well on the test dataset) often turn out to be the ones that use a better document representation, rather than winning on the differences between the methods themselves.

You try a bunch of great classifiers on your training dataset, but the results are far from satisfying; on various performance measures, they may even be dismal.

If you have domain knowledge, then you are in better shape, as you can reason about which types of features are more important and what needs to be done to improve classification accuracy.

If you do not know much about the domain, then you should probably spend some time on the misclassifications and try to figure out why the classifiers perform so poorly and what needs to be corrected in the representation.

We will deal with generalization in the evaluation by using cross-validation and by making sure we have a separate test set, rather than reusing the dataset on which we optimized the parameters.
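
A minimal sketch of that setup with scikit-learn (the data and classifier are placeholders): hold out a test set first, cross-validate on the remainder, and touch the test set only once at the end:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Hold out a test set that parameter tuning never sees.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X_train, y_train, cv=5)   # tune/compare models here
print(scores.mean())

clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                        # final, one-time estimate
```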

I said noise-free training samples at the beginning of this section, but for some algorithms (especially the ones that tend to overfit), some amount of noise may actually improve classification accuracy, for the reasons I explained above.

Machine learning

Machine learning is a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to 'learn' (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.[1]

These analytical models allow researchers, data scientists, engineers, and analysts to 'produce reliable, repeatable decisions and results' and uncover 'hidden insights' through learning from historical relationships and trends in the data.[12]

Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: 'A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.'[13]

Developmental learning, elaborated for robot learning, generates its own sequences (also called curriculum) of learning situations to cumulatively acquire repertoires of novel skills through autonomous self-exploration and social interaction with human teachers and using guidance mechanisms such as active learning, maturation, motor synergies, and imitation.

Work on symbolic/knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval.[17]:708–710

Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases).

Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in knowledge discovery and data mining (KDD) the key task is the discovery of previously unknown knowledge.

Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by other supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.

Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances (for example, in classification, one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples).
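
Schematically, training amounts to minimizing the average loss over the training examples (the empirical risk); for training pairs (x_i, y_i), hypothesis class F, and loss L:

```latex
\hat{f} \;=\; \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i,\, f(x_i)\bigr)
```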

The difference between the two fields arises from the goal of generalization: while optimization algorithms can minimize the loss on a training set, machine learning is concerned with minimizing the loss on unseen samples.[19]

The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences) and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.

An artificial neural network (ANN) learning algorithm, usually called 'neural network' (NN), is a learning algorithm that is vaguely inspired by biological neural networks.

They are usually used to model complex relationships between inputs and outputs, to find patterns in data, or to capture the statistical structure in an unknown joint probability distribution between observed variables.

Falling hardware prices and the development of GPUs for personal use in the last few years have contributed to the development of the concept of deep learning which consists of multiple hidden layers in an artificial neural network.

Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesized logic program that entails all positive and no negative examples.

Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.
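
A minimal two-category sketch with scikit-learn's SVC on toy data (the dataset and kernel choice are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

svm = SVC(kernel="rbf")        # builds a maximum-margin decision boundary
svm.fit(X, y)
print(svm.predict(X[:5]))      # assigns each example to one of the two categories
```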

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to some predesignated criterion or criteria, while observations drawn from different clusters are dissimilar.

Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated for example by internal compactness (similarity between members of the same cluster) and separation between different clusters.

A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph (DAG).
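
Concretely, the DAG encodes a factorization of the joint distribution in which each variable is conditioned only on its parents:

```latex
P(X_1, \ldots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \operatorname{parents}(X_i)\bigr)
```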

Representation learning algorithms often attempt to preserve the information in their input but transform it in a way that makes it useful, often as a pre-processing step before performing classification or predictions, allowing reconstruction of the inputs coming from the unknown data generating distribution, while not being necessarily faithful for configurations that are implausible under that distribution.

Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features.

A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection, and uses methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem.

In 2006, the online movie company Netflix held the first 'Netflix Prize' competition to find a program to better predict user preferences and improve the accuracy on its existing Cinematch movie recommendation algorithm by at least 10%.

Classification machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data into training and test sets (conventionally a 2/3 training set and 1/3 test set designation) and evaluates the performance of the trained model on the test set.

In comparison, the k-fold cross-validation method randomly splits the data into k subsets, where k − 1 of the subsets are used to train the model while the kth subset is used to test its predictive ability.
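
A brief sketch of both validation schemes in scikit-learn (the data and model are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# Holdout: conventional 2/3 train, 1/3 test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print(model.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold: each of the k subsets serves once as the test fold.
print(cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean())
```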

For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants against similarity to previous successful applicants.[50][51]

There is huge potential for machine learning in health care to provide professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these 'greed' biases are addressed.[53]

12 Useful Things to Know about Machine Learning

The general problem with high dimensions is that our intuitions, which come from a 3-dimensional world, often do not apply in high-dimensional ones.

If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor.
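
A quick numerical check of that claim (a throwaway NumPy sketch; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
for d in (2, 10, 100):
    X = rng.rand(100, d)                 # 100 points uniform in the unit hypercube
    # Distance to the nearest face: min over coordinates of min(x, 1 - x).
    face = np.minimum(X, 1 - X).min(axis=1)
    # Nearest-neighbor distance.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = D.min(axis=1)
    print(d, (face < nn).mean())         # fraction closer to a face than to a neighbor
```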

One of the major developments of recent decades has been the realization that in fact we can have guarantees on the results of induction, particularly if we’re willing to settle for probabilistic guarantees.

What it says is that, given a large enough training set, with high probability your learner will either return a hypothesis that generalizes well or be unable to find a consistent hypothesis.

It only tells us that, if the hypothesis space contains the true classifier, then the probability that the learner outputs a bad classifier decreases with training set size.

And, because of the bias-variance tradeoff discussed above, if learner A is better than learner B given infinite data, B is often better than A given finite data.

The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design.

But caveat emptor: learning is a complex phenomenon, and just because a learner has a theoretical justification and works in practice doesn’t mean the former is the reason for the latter.

Also, machine learning is not a one-shot process of building a dataset and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating.

Feature learning

The data label allows the system to compute an error term, the degree to which the system fails to produce the label, which can then be used as feedback to correct the learning process (reduce/minimize the error).

The dictionary elements and the weights may be found by minimizing the average representation error (over the input data), together with L1 regularization on the weights to enable sparsity (i.e., the representation of each data point has only a few nonzero weights).

Supervised dictionary learning has been applied to classification problems by jointly optimizing the dictionary elements, the weights for representing data points, and the parameters of the classifier based on the input data.

In particular, a minimization problem is formulated, where the objective function consists of the classification error, the representation error, an L1 regularization on the representing weights for each data point (to enable sparse representation of data), and an L2 regularization on the parameters of the classifier.
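
In schematic form (the symbols are generic, not from a specific paper): with data points x_i, labels y_i, dictionary D, per-point weights w_i, classifier parameters θ, classification loss ℓ, and trade-off constants λ and γ, the minimization is roughly:

```latex
\min_{D,\ \{w_i\},\ \theta} \;
\sum_{i=1}^{n} \Bigl[
  \ell\bigl(y_i,\, f_\theta(w_i)\bigr)
  + \lVert x_i - D w_i \rVert_2^2
  + \lambda \lVert w_i \rVert_1
\Bigr]
+ \gamma \lVert \theta \rVert_2^2
```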

Multilayer neural networks can be used to perform feature learning, since they learn a representation of their input at the hidden layer(s) which is subsequently used for classification or regression at the output layer.

When the feature learning is performed in an unsupervised way, it enables a form of semisupervised learning where features learned from an unlabeled dataset are then employed to improve performance in a supervised setting with labeled data.[7][8]

In particular, given a set of n vectors, k-means clustering groups them into k clusters (i.e., subsets) in such a way that each vector belongs to the cluster with the closest mean.

The simplest is to add k binary features to each sample, where each feature j has value one iff the jth centroid learned by k-means is the closest to the sample under consideration.[3]
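
A small sketch of that construction (the data and choice of k are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.randn(100, 8)                   # hypothetical unlabeled samples

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# One binary feature per centroid: 1 iff that centroid is the closest.
closest = kmeans.predict(X)
binary_features = np.eye(10)[closest]   # shape (100, 10), one-hot rows
X_augmented = np.hstack([X, binary_features])
```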

In a comparative evaluation of unsupervised feature learning methods, Coates, Lee and Ng found that k-means clustering with an appropriate transformation outperforms the more recently invented auto-encoders and RBMs on an image classification task.[3]

Given an unlabeled set of n input data vectors, PCA generates p (which is much smaller than the dimension of the input data) right singular vectors corresponding to the p largest singular values of the data matrix, where the kth row of the data matrix is the kth input data vector shifted by the sample mean of the input (i.e., subtracting the sample mean from the data vector).
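
The same computation in scikit-learn (the data and p are placeholders; PCA subtracts the sample mean internally):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 50)             # n = 200 input vectors of dimension 50

p = 5                              # much smaller than the input dimension
pca = PCA(n_components=p)          # centers the data before the decomposition
X_reduced = pca.fit_transform(X)   # projections onto the top-p singular vectors
print(X_reduced.shape)             # (200, 5)
```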

In the ith iteration, the projection of the data matrix onto the (i−1)th eigenvector is subtracted, and the ith singular vector is found as the right singular vector corresponding to the largest singular value of the residual data matrix.

The first step is for 'neighbor-preserving', where each input data point Xi is reconstructed as a weighted sum of K nearest neighbor data points, and the optimal weights are found by minimizing the average squared reconstruction error (i.e., difference between an input point and its reconstruction) under the constraint that the weights associated with each point sum up to one.

An example of unsupervised dictionary learning is sparse coding, which aims to learn basis functions (dictionary elements) for data representation from unlabeled input data.

An RBM can be represented by an undirected bipartite graph consisting of a group of binary hidden variables, a group of visible variables, and edges connecting the hidden and visible nodes.

An autoencoder consists of an encoder and a decoder, where the encoder takes raw data (e.g., an image) as input and produces a feature or representation as output, and the decoder takes the extracted feature from the encoder as input and reconstructs the original raw input data as output.

The parameters involved in the architecture were originally trained in a greedy layer-by-layer manner: after one layer of feature detectors is learned, its outputs are fed as the visible variables for training the next RBM.

How to handle Imbalanced Classification Problems in machine learning?

If you have spent some time in machine learning and data science, you have almost certainly come across imbalanced class distributions.

This problem is predominant in scenarios where anomaly detection is crucial like electricity pilferage, fraudulent transactions in banks, identification of rare diseases, etc.

Finally, I reveal an approach you can use to create a balanced class distribution and apply an ensemble learning technique designed especially for this purpose.

Utility companies are increasingly turning towards advanced analytics and machine learning algorithms to identify consumption patterns that indicate theft.

For any imbalanced data set, if the event to be predicted belongs to the minority class and the event rate is less than 5%, it is usually referred to as a rare event.

Example: in a utilities fraud detection data set you have the following data:
Total observations = 1,000
Fraudulent observations = 20
Non-fraudulent observations = 980
Event rate = 2%
The main question faced during data analysis is –

For example, a classifier that achieves 98% accuracy on a data set with a 2% event rate is not useful if it classifies all instances as the majority class.
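
A quick demonstration of this accuracy trap (the labels below mirror the running 2% event-rate example):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical fraud labels: 980 legitimate (0), 20 fraudulent (1).
y_true = np.array([0] * 980 + [1] * 20)
y_pred = np.zeros_like(y_true)          # a "classifier" that always predicts the majority class

print(accuracy_score(y_true, y_pred))   # 0.98 -- looks great
print(recall_score(y_true, y_pred))     # 0.0  -- catches zero fraud cases
```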

Thus, to sum it up, while trying to resolve specific business challenges with imbalanced data sets, the classifiers produced by standard machine learning algorithms might not give accurate results.

Apart from fraudulent transactions, imbalanced datasets arise in many other common business problems. In this article, we will illustrate the various techniques available to train a model to perform well against highly imbalanced datasets.

And accurately predict rare events using the following fraud detection dataset:
Total observations = 1,000
Fraudulent observations = 20
Non-fraudulent observations = 980
Event rate = 2%
Fraud indicator = 0 for non-fraud instances, 1 for fraud instances

Dealing with imbalanced datasets entails strategies such as improving classification algorithms or balancing classes in the training data (data preprocessing) before providing the data as input to the machine learning algorithm.

Non-fraudulent observations after random undersampling = 10% of 980 = 98
Total observations after combining them with the fraudulent observations = 20 + 98 = 118
Event rate for the new dataset after undersampling = 20/118 ≈ 17%

Non-fraudulent observations = 980
Fraudulent observations after replicating the minority class = 400
Total observations in the new dataset after oversampling = 1,380
Event rate for the new dataset after oversampling = 400/1380 ≈ 29%
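
A sketch of both resampling steps using scikit-learn's resample utility (the feature matrix is hypothetical; the counts mirror the running example):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)                 # hypothetical features
y = np.array([0] * 980 + [1] * 20)

X_majority, X_minority = X[y == 0], X[y == 1]

# Random undersampling: keep 10% of the majority class.
X_major_down = resample(X_majority, n_samples=98, replace=False, random_state=0)

# Random oversampling: replicate the minority class up to 400 instances.
X_minor_up = resample(X_minority, n_samples=400, replace=True, random_state=0)
```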

A sample of 15 instances is taken from the minority class, and similar synthetic instances are generated 20 times over. After generation of the synthetic instances, the following dataset is created: minority class (fraudulent observations) = 300; majority class (non-fraudulent observations) = 980; event rate = 300/1280 ≈ 23.4%.
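
A sketch with the imbalanced-learn library, assuming it is installed (the data is hypothetical; the target ratio mirrors the 300/980 example above):

```python
import numpy as np
from imblearn.over_sampling import SMOTE   # assumes imbalanced-learn is available

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = np.array([0] * 980 + [1] * 20)

# Oversample the minority class with SMOTE; sampling_strategy sets the
# desired minority/majority ratio after resampling (here ~300/980).
smote = SMOTE(sampling_strategy=300 / 980, k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(np.bincount(y_res))   # roughly [980, 300]
```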

The algorithm randomly selects a data point from the k nearest neighbors for security samples, selects the nearest neighbor for border samples, and does nothing for latent noise.

Figure 4: Approach to the bagging methodology
Total observations = 1,000; fraudulent observations = 20; non-fraudulent observations = 980; event rate = 2%
Ten bootstrapped samples are chosen from the population with replacement.

Machine learning algorithms such as logistic regression, neural networks, and decision trees are fitted to each bootstrapped sample of 200 observations.

The classifiers c1, c2, ..., c10 are then aggregated to produce a compound classifier. This ensemble methodology produces a stronger compound classifier because it combines the results of the individual classifiers into an improved one.
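
A sketch of this bagging setup in scikit-learn (the data is hypothetical; the default base estimator is a decision tree):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)                 # hypothetical features for the running example
y = np.array([0] * 980 + [1] * 20)

# Ten bootstrapped samples of 200 observations each, one base classifier
# per sample, aggregated by majority vote.
bagging = BaggingClassifier(n_estimators=10, max_samples=200,
                            bootstrap=True, random_state=0)
bagging.fit(X, y)
```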

AdaBoost is the original boosting technique, creating a highly accurate prediction rule by combining many weak and inaccurate rules. Each classifier is serially trained with the goal of correctly classifying, in every round, the examples that were incorrectly classified in the previous round.

For a learned classifier to make strong predictions it should satisfy three conditions, among them that each weak hypothesis has an accuracy slightly better than random guessing, i.e. better than 50% for binary classification.

This is the fundamental assumption of this boosting algorithm, which can then produce a final hypothesis with a small error. After each round, the algorithm gives more focus to examples that are harder to classify; the amount of focus is measured by a weight, which is initially equal for all instances.
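
A minimal AdaBoost sketch in scikit-learn (hypothetical data; the default weak learners are decision stumps):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = np.array([0] * 980 + [1] * 20)

# Each round reweights the training examples so that previously
# misclassified instances receive more focus in the next round.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X, y)
```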

Figure 7: Approach to gradient boosting. For example, in a training data set containing 1,000 observations, of which 20 are labelled fraudulent, an initial base classifier is fitted.

A differentiable loss function is calculated from the difference between the actual output and the predicted output of this step. The residual of the loss function becomes the target variable (F1) for the next iteration.
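
A corresponding gradient boosting sketch in scikit-learn (hypothetical data and parameters):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = np.array([0] * 980 + [1] * 20)

# Each stage fits a new tree to the residuals (the negative gradient of a
# differentiable loss) left by the previous stages.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbm.fit(X, y)
```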

The rare event data set was prepared with missing value removal, outlier treatment, and dimension reduction.

Results: this approach of balancing the data set with SMOTE and training a gradient boosting algorithm on the balanced set significantly improves the predictive model, increasing its lift by around 20% and its precision/hit ratio by 3 to 4 times compared with normal analytical modeling techniques such as logistic regression and decision trees.


8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset

In this post you will discover the tactics that you can use to deliver great results on machine learning datasets with imbalanced data.

I would write long lists of techniques to try and think about the best ways to get past this problem.

I finally took the advice of one of my students: Perhaps one of your upcoming blog posts could address the problem of training a model to perform against highly imbalanced data, and outline some techniques and expectations.

You feel very frustrated when you discover that your data has imbalanced classes and that all of the great results you thought you were getting turn out to be a lie.

The next wave of frustration hits when the books, articles and blog posts don’t seem to give you good advice about handling the imbalance in your data.

Most classification data sets do not have exactly equal number of instances in each class, but a small difference often does not matter.

It is the case where your accuracy measures tell the story that you have excellent accuracy (such as 90%), but the accuracy is only reflecting the underlying class distribution.

As you might have guessed, the reason we get 90% accuracy on imbalanced data (with 90% of the instances in Class-1) is that our models look at the data and cleverly decide that the best thing to do is to always predict "Class-1".

If you print out the rule in the final model you will see that it is very likely predicting one class regardless of the data it is asked to predict.

From that post, I recommend looking at performance measures that can give more insight into the accuracy of the model than traditional classification accuracy.

I would also advise you to take a look at the following: you can learn a lot more about using ROC curves to compare classification accuracy in our post "Assessing and Comparing Classifier Performance with ROC Curves".
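
A short sketch of a few such measures with scikit-learn's metrics module (the labels and scores below are made up):

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # per-class breakdown of errors
print(precision_score(y_true, y_pred))    # how many flagged positives are real
print(recall_score(y_true, y_pred))       # how many real positives are found
print(f1_score(y_true, y_pred))           # harmonic mean of the two
print(roc_auc_score(y_true, y_score))     # threshold-independent ranking quality
```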

This change is called sampling your dataset, and there are two main methods that you can use to even up the classes: adding copies of instances from the under-represented class (oversampling) and deleting instances from the over-represented class (undersampling). These approaches are often very easy to implement and fast to run.

In fact, I would advise you to always try both approaches on all of your imbalanced datasets, just to see if it gives you a boost in your preferred accuracy measures.

A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.

The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time, by a random amount within the difference to the neighboring instances.
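
A rough sketch of that perturbation idea in plain NumPy (the data, neighbor count, and helper function are hypothetical, not from the post):

```python
import numpy as np

rng = np.random.RandomState(0)

def synthesize(minority, n_new, k=5):
    """SMOTE-style sketch: interpolate between a minority sample and one
    of its k nearest minority neighbors, attribute by attribute."""
    out = []
    for _ in range(n_new):
        x = minority[rng.randint(len(minority))]
        d = np.linalg.norm(minority - x, axis=1)   # distances to all minority samples
        neighbors = np.argsort(d)[1:k + 1]         # skip the point itself
        nn = minority[rng.choice(neighbors)]
        gap = rng.rand(x.shape[0])                 # per-attribute random amount
        out.append(x + gap * (nn - x))
    return np.array(out)

X_minority = rng.randn(20, 5)                      # hypothetical minority rows
X_synth = synthesize(X_minority, n_new=280)
```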

The splitting rules that look at the class variable during the creation of the trees can force both classes to be addressed.

Penalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training.

Using penalization is desirable if you are locked into a specific algorithm and are unable to resample or you’re getting poor results.
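
One common way to apply such a penalty, sketched with scikit-learn's class_weight option (the data is hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(900, 4), rng.randn(100, 4) + 2])   # 90/10 imbalance
y = np.array([0] * 900 + [1] * 100)

# class_weight='balanced' raises the cost of mistakes on the minority
# class in inverse proportion to its frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```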

This might be a machine malfunction indicated through its vibrations, or malicious activity by a program indicated by its sequence of system calls.

This shift in thinking considers the minority class as the outlier class, which might help you think of new ways to separate and classify samples.

Both of these shifts take a more real-time stance to the classification problem that might give you some new ways of thinking about your problem and maybe some more techniques to try.

For inspiration, take a look at the very creative answers on Quora in response to the question “In classification, how do you handle an unbalanced training set?”

Running an ensemble of classifiers on these sets could produce a much better result than one classifier alone. These are just a few of the interesting and creative ideas you could try.

If you’d like to dive deeper into some of the academic literature on dealing with class imbalance, check out some of the links below.

Difference between Classification and Regression - Georgia Tech - Machine Learning


Decision Tree 1: how it works

Full lecture: A Decision Tree recursively splits training data into subsets based on the value of a single attribute. Each split corresponds to a ..

IAML2.22: Classification accuracy and imbalanced classes

6.3.1 Prioritizing What to Work On: Spam classification example

Week 6 (Machine Learning System Design) - Building a Spam Classifier - Prioritizing What to work on Machine ..

Machine Learning (Representing Data + WEKA)

Short video describing how data is represented for use in Machine Learning classifiers, regression, etc. Additionally, I introduce Weka, a software solution for ...

13. Classification

MIT 6.0002 Introduction to Computational Thinking and Data Science, Fall 2016 View the complete course: Instructor: John Guttag ..

How SVM (Support Vector Machine) algorithm works

In this video I explain how SVM (Support Vector Machine) algorithm works to classify a linearly separable binary data set. The original presentation is available ...

Learning Classifier Systems in a Nutshell

This video offers an accessible introduction to the basics of how Learning Classifier Systems (LCS), also known as Rule-Based Machine Learning (RBML), ...

Multi-Class Classifier (One-Vs-All)

Explains the One-Vs-All (Multi class classifier) with example. It also demonstrates the entire classification system by using dataset available at "UCI Machine ...

3.3.1 Logistic Regression - Multiclass Classification (One vs all)

Week 3 (Logistic Regression) - Multiclass Classification (One vs all) Machine Learning Coursera by Andrew Ng ..