AI News, On Machine Learning

On Machine Learning

This post outlines some of the things that I have been thinking about on how to apply machine learning to a given problem, along with the process that we adopted for the classification problem at CB Insights, but also gave me a good opportunity to

My aim is not to focus on the algorithms, methods or classifiers but rather to offer a broader picture of how to approach a machine learning problem, and in the meantime give a couple of pieces of bad advice.

and be warned that they may generalize better than your favorite classifier. (I will try not to overfit, but let me know in the comments if I do.) Most machine learning book chapters and articles focus on algorithms/classifiers and sometimes optimization methods.

From a theoretical perspective, they analyze the algorithms' theoretical bounds and sometimes the learning function itself along with different types of optimization.

The datasets in papers are sometimes trivial, though, and do not necessarily reflect the characteristics of real-world, in-the-wild datasets.

There is a significant amount of knowledge and experience one has to gain (sometimes just by experimentation) to bridge the gap between these two separate (yet not independent) areas in order to build a pipeline.

If I make an analogy with software programming: we provide algorithms and data as input, and the output becomes the program that we intended to write, without our explicitly writing it.

If we have data and labels for that class, we could train a classifier based on that data along with features, feature selection, and then classify the sample based on that classifier.

There will always be cases where you miss a couple of rules, or where some structures in text are hard to express in hard-coded rules (if one company joins another company, that article is most probably about a partnership rather than HR), and this approach requires quite an amount of development effort as well as considerable domain expertise.

It could incorporate more data and use that data without extra effort, whereas when you want to introduce new rules, you basically grow and grow your code base.

In the computer vision domain, even the pixel values are found to be not very discriminative, so computer vision researchers have come up with higher-level representations for images.

Rather than knowing the strengths and weaknesses of a particular classifier, even knowing the categories of classifiers is useful for making a good decision about which classifier to choose.

For example, a search engine needs to take both precision and recall into account to evaluate its ranking, whereas a classifier in the medical domain may put more emphasis on Type I errors than on Type II errors, or vice versa.
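
To make the trade-off concrete, here is a minimal sketch of how precision, recall, and the two error types can be counted for a binary classifier. The function names and the toy labels are assumptions for illustration, not something the post specifies.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type I errors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type II errors
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
print(confusion_counts(y_true, y_pred))   # (3, 1, 1, 3)
print(precision_recall(y_true, y_pred))   # (0.75, 0.75)
```

Which of the two error counts matters more depends on the application, which is exactly why the evaluation measure should be chosen before comparing classifiers.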

(I will mention different cross-validation methods in the generalization section in a bit.) When we have some input (text, image, video, or discrete, continuous or categorical variables) from which we want to learn some structure or train a classifier, the first thing that needs to be done is to represent the input in a way that the classifier or the algorithm can use.

individual pixel values for images, words in text), you may also want to build your own features, which could be higher level, or at the same level but useful for learning.

Not only that, but when you evaluate your classifiers, the ones that generalize well (perform well on the test dataset) turn out to be the ones that use a better document representation, rather than the ones that use a different method.

You tried a bunch of great classifiers on your training dataset, but the results are far from satisfying, and on some performance measures they may even be dismal.

If you have domain knowledge, then you are in better shape, as you can reason about what types of features would be more important and what needs to be done to improve the classification accuracy.

If you do not know much about the domain, then you should probably spend some time on the misclassifications and try to figure out why these classifiers perform so poorly and what needs to be corrected in the representation.

will deal with generalization in the evaluation by using cross-validation and make sure that we have a test set that is separate from the dataset on which we optimize the parameters.

said noise-free training samples in the beginning of this section, but for some algorithms (especially the ones that tend to overfit), some amount of noise may actually improve the classification accuracy, for the reasons that I explained above.
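
One way to picture this is the sketch below: a test set is put aside first, and k-fold cross-validation is then run only on the remaining development data. The function names, the 20% test fraction, and k = 5 are illustrative assumptions rather than the setup used in the post.

```python
import random


def split_dev_test(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a test set that is never tuned on."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each index is validated once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


dev, test = split_dev_test(100)
folds = list(k_fold(dev, k=5))
print(len(dev), len(test))                   # 80 20
print(len(folds[0][0]), len(folds[0][1]))    # 64 16
```

Parameters would be chosen by averaging a validation score over the folds, and the held-out test indices would be used only once, for the final estimate.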

Machine learning

progressively improve performance on a specific task) with data, without being explicitly programmed.[1] The name Machine learning was coined in 1959 by Arthur Samuel.[2] Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs.

Machine learning is sometimes conflated with data mining,[8] where the latter subfield focuses more on exploratory data analysis and is known as unsupervised learning.[5]:vii[9] Machine learning can also be unsupervised[10] and be used to learn and establish baseline behavioral profiles for various entities[11] and then used to find meaningful anomalies.

These analytical models allow researchers, data scientists, engineers, and analysts to 'produce reliable, repeatable decisions and results' and uncover 'hidden insights' through learning from historical relationships and trends in the data.[12] Effective machine learning is difficult because finding patterns is hard and often not enough training data are available;

Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: 'A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.'[15] This definition of the tasks in which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms.

Machine learning tasks are typically classified into two broad categories, depending on whether there is a learning 'signal' or 'feedback' available to a learning system: Another categorization of machine learning tasks arises when one considers the desired output of a machine-learned system:[5]:3 Among other categories of machine learning problems, learning to learn learns its own inductive bias based on previous experience.

Developmental learning, elaborated for robot learning, generates its own sequences (also called curriculum) of learning situations to cumulatively acquire repertoires of novel skills through autonomous self-exploration and social interaction with human teachers and using guidance mechanisms such as active learning, maturation, motor synergies, and imitation.

Probabilistic systems were plagued by theoretical and practical problems of data acquisition and representation.[19]:488 By 1980, expert systems had come to dominate AI, and statistics was out of favor.[20] Work on symbolic/knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval.[19]:708–710;

Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases).

Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in knowledge discovery and data mining (KDD) the key task is the discovery of previously unknown knowledge.

Jordan, the ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.[22] He also suggested the term data science as a placeholder to call the overall field.[22] Leo Breiman distinguished two statistical modelling paradigms: data model and algorithmic model,[23] wherein 'algorithmic model' means more or less the machine learning algorithms like Random forest.

Multilinear subspace learning algorithms aim to learn low-dimensional representations directly from tensor representations for multidimensional data, without reshaping them into (high-dimensional) vectors.[29] Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features.

In machine learning, genetic algorithms found some uses in the 1980s and 1990s.[33][34] Conversely, machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.[35] Rule-based machine learning is a general term for any machine learning method that identifies, learns, or evolves 'rules' to store, manipulate or apply knowledge.

They seek to identify a set of context-dependent rules that collectively store and apply knowledge in a piecewise manner in order to make predictions.[37] Applications for machine learning include: In 2006, the online movie company Netflix held the first 'Netflix Prize' competition to find a program to better predict user preferences and improve the accuracy on its existing Cinematch movie recommendation algorithm by at least 10%.

A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[43] Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ('everything is a recommendation') and they changed their recommendation engine accordingly.[44] In 2010 The Wall Street Journal wrote about the firm Rebellion Research and their use of machine learning to predict the financial crisis.[45]

In 2012, co-founder of Sun Microsystems Vinod Khosla predicted that 80% of medical doctors' jobs would be lost in the next two decades to automated machine learning medical diagnostic software.[46] In 2014, it was reported that a machine learning algorithm had been applied in art history to study fine art paintings, and that it may have revealed previously unrecognized influences between artists.[47] Classification machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data into a training and a test set (conventionally 2/3 training set and 1/3 test set) and evaluates the performance of the trained model on the test set.
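
As a rough illustration of the holdout method, the sketch below shuffles a toy dataset, keeps 2/3 for training and 1/3 for testing, and scores a deliberately simple threshold classifier on the test part. The data, the threshold rule, and all function names are assumptions made for the example.

```python
import random


def holdout_split(samples, train_fraction=2 / 3, seed=0):
    """Shuffle, then cut into (train, test); 2/3 vs 1/3 by default."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, labelled):
    return sum(1 for x, y in labelled if predict(x) == y) / len(labelled)


# Toy data (an assumption): one numeric feature, label 1 when x > 5.
data = [(x, int(x > 5)) for x in range(12)]
train, test = holdout_split(data)

# "Training": a threshold halfway between the class means of the training set.
class0 = [x for x, y in train if y == 0]
class1 = [x for x, y in train if y == 1]
threshold = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

test_accuracy = accuracy(lambda x: int(x > threshold), test)
print(len(train), len(test), test_accuracy)
```

The important point is only that `threshold` is fitted on `train` and the reported score comes from `test`, which the fitting step never saw.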

Systems which are trained on datasets collected with biases may exhibit these biases upon use (algorithmic bias), thus digitizing cultural prejudices.[49] For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants against similarity to previous successful applicants.[50][51] Responsible collection of data and documentation of algorithmic rules used by a system thus is a critical part of machine learning.

How to handle Imbalanced Classification Problems in machine learning?

If you have spent some time in machine learning and data science, you have almost certainly come across imbalanced class distributions.

This problem is predominant in scenarios where anomaly detection is crucial, such as electricity pilferage, fraudulent transactions in banks, identification of rare diseases, etc.

Finally, I reveal an approach with which you can create a balanced class distribution and apply an ensemble learning technique designed especially for this purpose.

Utility companies are increasingly turning towards advanced analytics and machine learning algorithms to identify consumption patterns that indicate theft.

For any imbalanced data set, if the event to be predicted belongs to the minority class and the event rate is less than 5%, it is usually referred to as a rare event.

Example: In a utilities fraud detection data set you have the following data:
Total Observations = 1000
Fraudulent Observations = 20
Non-Fraudulent Observations = 980
Event Rate = 2%
The main question faced during data analysis is –

For example: A classifier which achieves an accuracy of 98% with an event rate of 2% is not accurate if it classifies all instances as the majority class.
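
The point can be checked with a few lines of Python. This is a sketch using the counts from the example (20 fraudulent out of 1000) and a hypothetical classifier that always predicts the majority class.

```python
# 1 = fraud (minority class), 0 = non-fraud (majority class).
labels = [1] * 20 + [0] * 980

# A "classifier" that labels every instance as the majority class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
frauds_caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = frauds_caught / labels.count(1)

print(accuracy)  # 0.98
print(recall)    # 0.0
```

Accuracy looks excellent while not a single fraudulent observation is detected, which is why accuracy alone is a poor measure on imbalanced data.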

Thus, to sum it up, while trying to resolve specific business challenges with imbalanced data sets, the classifiers produced by standard machine learning algorithms might not give accurate results.

Apart from fraudulent transactions, other examples of a common business problem with imbalanced datasets are:

In this article, we will illustrate the various techniques to train a model to perform well against highly imbalanced datasets and accurately predict rare events, using the following fraud detection dataset:
Total Observations = 1000
Fraudulent Observations = 20
Non-Fraudulent Observations = 980
Event Rate = 2%
Fraud Indicator = 0 for Non-Fraud Instances
Fraud Indicator = 1 for Fraud

Dealing with imbalanced datasets entails strategies such as improving classification algorithms or balancing classes in the training data (data preprocessing) before providing the data as input to the machine learning algorithm.

Non-Fraudulent Observations after random undersampling = 10% of 980 = 98
Total Observations after combining them with Fraudulent Observations = 20 + 98 = 118
Event Rate for the new dataset after undersampling = 20/118 = 17%
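
The arithmetic above can be reproduced with a small sketch of random undersampling. The record layout and the function name are assumptions for illustration; only the counts come from the example.

```python
import random


def random_undersample(majority, minority, keep_fraction, seed=0):
    """Keep a random fraction of the majority class (without replacement)
    and all of the minority class."""
    rng = random.Random(seed)
    n_keep = round(len(majority) * keep_fraction)
    return rng.sample(majority, n_keep) + list(minority)


# Placeholder records: (class name, row id).
majority = [("non-fraud", i) for i in range(980)]
minority = [("fraud", i) for i in range(20)]

balanced = random_undersample(majority, minority, keep_fraction=0.10)
event_rate = len(minority) / len(balanced)

print(len(balanced))          # 118
print(round(event_rate, 2))   # 0.17
```

Note that 882 majority observations are discarded, so any useful information they carried is lost to the model.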

Non-Fraudulent Observations = 980
Fraudulent Observations after replicating the minority class observations = 400
Total Observations in the new data set after oversampling = 1380
Event Rate for the new data set after oversampling = 400/1380 = 29%
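
A minimal sketch of this replication step follows, assuming each of the 20 minority observations is simply copied 20 times (which is what yields 400). The function name and record layout are hypothetical.

```python
def replicate_minority(majority, minority, times):
    """Oversample by repeating every minority observation `times` times."""
    return list(majority) + list(minority) * times


# Placeholder records: (class name, row id).
majority = [("non-fraud", i) for i in range(980)]
minority = [("fraud", i) for i in range(20)]

oversampled = replicate_minority(majority, minority, times=20)
n_fraud = sum(1 for name, _ in oversampled if name == "fraud")

print(len(oversampled))                      # 1380
print(round(n_fraud / len(oversampled), 2))  # 0.29
```

No information is lost this way, but because the added rows are exact copies, a model can overfit to them more easily.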

A sample of 15 instances is taken from the minority class and similar synthetic instances are generated 20 times. After generation of the synthetic instances, the following data set is created:
Minority Class (Fraudulent Observations) = 300
Majority Class (Non-Fraudulent Observations) = 980
Event Rate = 300/1280 = 23.4%
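
The sketch below shows the core idea of synthetic oversampling in the SMOTE style: each synthetic point is placed at a random position on the line segment between a minority instance and one of its nearest minority neighbours. It is a simplified illustration, not the reference algorithm; the two-dimensional random features, `k = 5`, and the function signature are assumptions.

```python
import math
import random


def smote(minority, n_base, n_per_base, k=5, seed=0):
    """Generate n_per_base synthetic points for each of n_base sampled
    minority instances by interpolating towards a near minority neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for base in rng.sample(minority, n_base):
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        for _ in range(n_per_base):
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority class: 20 points with two features in [0, 1).
data_rng = random.Random(1)
minority = [(data_rng.random(), data_rng.random()) for _ in range(20)]

synthetic = smote(minority, n_base=15, n_per_base=20)
event_rate = len(synthetic) / (len(synthetic) + 980)

print(len(synthetic))  # 300
print(event_rate)      # 0.234375
```

Because the new points are interpolated rather than copied, they are similar to the existing minority instances without being exact duplicates.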

The algorithm randomly selects a data point from the k nearest neighbors for the security sample, selects the nearest neighbor from the border samples and does nothing for latent noise.

Figure 4: Approach to Bagging Methodology

Total Observations = 1000
Fraudulent Observations = 20
Non-Fraudulent Observations = 980
Event Rate = 2%

There are 10 bootstrapped samples chosen from the population with replacement.

Machine learning algorithms such as logistic regression, neural networks, or decision trees are fitted to each bootstrapped sample of 200 observations.

The classifiers c1, c2…c10 are then aggregated to produce a compound classifier. This ensemble methodology produces a stronger compound classifier since it combines the results of individual classifiers to come up with an improved one.
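
The procedure can be sketched as follows: draw 10 bootstrap samples of 200 observations with replacement, fit one classifier per sample, and combine them by majority vote. The weak learner here is a hypothetical one-feature threshold rule standing in for logistic regression or a decision tree, and the toy data is balanced (an assumption) so that every bootstrap sample contains both classes.

```python
import random


def fit_stump(sample):
    """Hypothetical weak learner: threshold halfway between the class means.
    Assumes both classes are present and class 1 has the larger mean."""
    means = []
    for label in (0, 1):
        values = [x for x, y in sample if y == label]
        means.append(sum(values) / len(values))
    threshold = sum(means) / 2
    return lambda x: int(x > threshold)


def bagging(data, n_models=10, sample_size=200, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: drawn from the population with replacement.
        bootstrap = [rng.choice(data) for _ in range(sample_size)]
        models.append(fit_stump(bootstrap))
    return models


def vote(models, x):
    """Aggregate c1..c10 by majority vote (ties go to class 0)."""
    return int(2 * sum(m(x) for m in models) > len(models))


# Toy population of 1000 observations with one numeric feature.
data_rng = random.Random(42)
data = [(data_rng.gauss(0.0, 1.0), 0) for _ in range(500)]
data += [(data_rng.gauss(3.0, 1.0), 1) for _ in range(500)]

models = bagging(data)
accuracy = sum(vote(models, x) == y for x, y in data) / len(data)
print(len(models), accuracy)
```

On a genuinely imbalanced population, a plain bootstrap sample may contain very few minority observations, which is why bagging is usually combined with one of the resampling steps described earlier.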

AdaBoost was the first original boosting technique; it creates a highly accurate prediction rule by combining many weak and inaccurate rules. Each classifier is serially trained with the goal of correctly classifying, in every round, the examples that were incorrectly classified in the previous round.

For a learned classifier to make strong predictions it should follow the following three conditions: Each of the weak hypotheses has an accuracy slightly better than random guessing i.e.

This is the fundamental assumption of this boosting algorithm, which can produce a final hypothesis with a small error. After each round, it gives more focus to examples that are harder to classify. The quantity of focus is measured by a weight, which initially is equal for all instances.
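
The weighting scheme can be sketched with one-feature decision stumps as the weak hypotheses. This is a compact illustration of the standard AdaBoost update for labels +1 and -1, run on a small made-up dataset; the helper names are assumptions.

```python
import math


def stump_predict(x, threshold, polarity):
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # every instance starts with equal weight
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Made-up data that no single stump can classify correctly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs] == ys)  # True
```

Each individual stump misclassifies at least three of the ten points, yet the weighted vote of three of them classifies the whole training set correctly.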

Figure 7: Approach to Gradient Boosting

For example: In a training data set containing 1000 observations out of which 20 are labelled fraudulent an initial base classifier.

A differentiable loss function is calculated based on the difference between the actual output and the predicted output of this step.  The residual of the loss function is the target variable (F1) for the next iteration.
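
A minimal sketch of this loop is given below, assuming squared-error loss (for which the negative gradient is simply the residual, actual minus predicted) and regression stumps as base learners. The learning rate, the number of rounds, and the toy data are assumptions for illustration.

```python
def fit_stump(xs, targets):
    """Regression stump: the single split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # initial prediction: the mean label
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(rounds):
        # The residuals become the target variable for the next learner.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: one feature, label 1 for the larger values.
xs = list(range(10))
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

mean_label = sum(ys) / len(ys)
model = gradient_boost(xs, ys)
mse_before = mse(lambda x: mean_label, xs, ys)
mse_after = mse(model, xs, ys)
print(mse_before, mse_after)
```

With other differentiable losses the target for each round is the negative gradient of that loss rather than the plain residual, but the loop has the same shape.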

The data structure of the rare event data set is shown below, after missing value removal, outlier treatment and dimension reduction.

Results: This approach of balancing the data set with SMOTE and training a gradient boosting algorithm on the balanced set significantly improves the accuracy of the predictive model, increasing its lift by around 20% and its precision/hit ratio by 3-4 times compared to normal analytical modeling techniques like logistic regression and decision trees.


Feature learning

In machine learning, feature learning or representation learning[1] is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data.

For example, a supervised dictionary learning technique[6] applied dictionary learning on classification problems by jointly optimizing the dictionary elements, weights for representing data points, and parameters of the classifier based on the input data.

In particular, a minimization problem is formulated, where the objective function consists of the classification error, the representation error, an L1 regularization on the representing weights for each data point (to enable sparse representation of data), and an L2 regularization on the parameters of the classifier.

The simplest is to add k binary features to each sample, where each feature j has value one iff the jth centroid learned by k-means is the closest to the sample under consideration.[3] It is also possible to use the distances to the clusters as features, perhaps after transforming them through a radial basis function (a technique that has been used to train RBF networks[9]).
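
Both feature constructions can be sketched in a few lines, assuming the centroids have already been learned by k-means. The centroid values, `gamma`, and the function names below are made up for illustration.

```python
import math


def one_hot_nearest(sample, centroids):
    """k binary features: 1 for the closest centroid, 0 for the others."""
    distances = [math.dist(sample, c) for c in centroids]
    nearest = distances.index(min(distances))
    return [1 if j == nearest else 0 for j in range(len(centroids))]


def rbf_features(sample, centroids, gamma=1.0):
    """Distances to the centroids passed through a radial basis function."""
    return [math.exp(-gamma * math.dist(sample, c) ** 2) for c in centroids]


# Hypothetical centroids, as if produced by k-means with k = 3.
centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
sample = (4.0, 4.5)

print(one_hot_nearest(sample, centroids))  # [0, 1, 0]
print(rbf_features(sample, centroids, gamma=0.1))
```

The binary encoding keeps only the identity of the nearest cluster, while the RBF version also preserves how close the sample is to every cluster.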

Coates and Ng note that certain variants of k-means behave similarly to sparse coding algorithms.[10] In a comparative evaluation of unsupervised feature learning methods, Coates, Lee and Ng found that k-means clustering with an appropriate transformation outperforms the more recently invented auto-encoders and RBMs on an image classification task.[3] K-means also improves performance in the domain of NLP, specifically for named-entity recognition;[11] there, it competes with Brown clustering, as well as with distributed word representations (also known as neural word embeddings).[8] Principal component analysis (PCA) is often used for dimension reduction.

Given an unlabeled set of n input data vectors, PCA generates p (which is much smaller than the dimension of the input data) right singular vectors corresponding to the p largest singular values of the data matrix, where the kth row of the data matrix is the kth input data vector shifted by the sample mean of the input (i.e., subtracting the sample mean from the data vector).
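
In symbols, the construction described above can be written as follows. The notation (X for the centered data matrix, d for the input dimension, W for the matrix of retained singular vectors, z_k for the learned features) is introduced here for illustration and is not taken from the source.

```latex
% Centered data matrix: row k is the k-th input vector minus the sample mean.
\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k ,
\qquad
X = \begin{bmatrix} (x_1 - \bar{x})^{\top} \\ \vdots \\ (x_n - \bar{x})^{\top} \end{bmatrix}
\in \mathbb{R}^{n \times d}

% Singular value decomposition with singular values in decreasing order.
X = U \Sigma V^{\top} ,
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0

% Keep the p right singular vectors belonging to the p largest singular values.
W = \begin{bmatrix} v_1 & v_2 & \cdots & v_p \end{bmatrix} \in \mathbb{R}^{d \times p} ,
\qquad p \ll d

% The p-dimensional representation of the k-th input.
z_k = W^{\top} (x_k - \bar{x}) \in \mathbb{R}^{p}
```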

The first step is for 'neighbor-preserving', where each input data point Xi is reconstructed as a weighted sum of K nearest neighbor data points, and the optimal weights are found by minimizing the average squared reconstruction error (i.e., difference between an input point and its reconstruction) under the constraint that the weights associated with each point sum up to one.

The reconstruction weights obtained in the first step capture the 'intrinsic geometric properties' of a neighborhood in the input data.[13] It is assumed that original data lie on a smooth lower-dimensional manifold, and the 'intrinsic geometric properties' captured by the weights of the original data are also expected to be on the manifold.

proposed algorithm K-SVD for learning a dictionary of elements that enables sparse representation.[16] The hierarchical architecture of the biological neural system inspires deep learning architectures for feature learning by stacking multiple layers of learning nodes.[17] These architectures are often designed based on the assumption of distributed representation: observed data is generated by the interactions of many different factors on multiple levels.

Restricted Boltzmann machines (RBMs) are often used as a building block for multilayer learning architectures.[3][18] An RBM can be represented by an undirected bipartite graph consisting of a group of binary hidden variables, a group of visible variables, and edges connecting the hidden and visible nodes.

An example is provided by Hinton and Salakhutdinov[18] where the encoder uses raw data (e.g., image) as input and produces feature or representation as output and the decoder uses the extracted feature from the encoder as input and reconstructs the original input raw data as output.

Statistical classification

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

In statistics, where classification is often done with logistic regression or a similar procedure, the properties of observations are termed explanatory variables (or independent variables, regressors, etc.), and the categories to be predicted are known as outcomes, which are considered to be possible values of the dependent variable.

However, such an algorithm has numerous advantages over non-probabilistic classifiers: Early work on statistical classification was undertaken by Fisher,[2][3] in the context of two-group problems, leading to Fisher's linear discriminant function as the rule for assigning a group to a new observation.[4] This early work assumed that data-values within each of the two groups had a multivariate normal distribution.

The extension of this same context to more than two groups has also been considered with a restriction imposed that the classification rule should be linear.[4][5] Later work for the multivariate normal distribution allowed the classifier to be nonlinear:[6] several classification rules can be derived based on slightly different adjustments of the Mahalanobis distance, with a new observation being assigned to the group whose centre has the lowest adjusted distance from the observation.

Unlike frequentist procedures, Bayesian classification procedures provide a natural way of taking into account any available information about the relative sizes of the sub-populations associated with the different groups within the overall population.[7] Bayesian procedures tend to be computationally expensive and, in the days before Markov chain Monte Carlo computations were developed, approximations for Bayesian clustering rules were devised.[8] Some Bayesian procedures involve the calculation of group membership probabilities: these can be viewed as providing a more informative outcome of a data analysis than a simple attribution of a single group-label to each new observation.

In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes.[9] Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers.
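
One common way to combine binary classifiers is the one-vs-rest scheme, sketched below: one binary classifier is trained per class ("this class" against "everything else"), and a new object is assigned to the class whose classifier gives the highest score. The perceptron used as the binary learner, the toy data, and the function names are assumptions for illustration.

```python
def train_perceptron(samples, labels, epochs=20):
    """Binary perceptron for labels +1/-1; returns (weights, bias)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def train_one_vs_rest(samples, labels):
    """Train one binary classifier per class: that class vs. all the others."""
    return {
        c: train_perceptron(samples, [1 if y == c else -1 for y in labels])
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Pick the class whose binary classifier scores x highest."""
    def score(c):
        w, b = models[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)


# Toy three-class data with three features.
samples = [(5, 0, 0), (4, 1, 0), (0, 5, 0), (1, 4, 0), (0, 0, 5), (0, 1, 4)]
labels = ["a", "a", "b", "b", "c", "c"]

models = train_one_vs_rest(samples, labels)
print(predict(models, (3, 1, 0)))  # a
print(predict(models, (0, 1, 3)))  # c
```

An alternative is one-vs-one, which trains a binary classifier for every pair of classes and lets them vote; it needs more classifiers but each one sees a smaller, more balanced problem.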

This type of score function is known as a linear predictor function and has the following general form: score(Xi, k) = βk · Xi, where Xi is the feature vector for instance i, βk is the vector of weights corresponding to category k, and score(Xi, k) is the score associated with assigning instance i to category k.

Decision Tree 1: how it works

Full lecture: A Decision Tree recursively splits training data into subsets based on the value of a single attribute. Each split corresponds to a node in the. Splitting..

Machine Learning: Multiclass Classification

How to turn binary classifiers into multiclass classifiers.

Learning Classifier Systems in a Nutshell

This video offers an accessible introduction to the basics of how Learning Classifier Systems (LCS), also known as Rule-Based Machine Learning (RBML), operate to learn patterns and make predictions...

Difference between Classification and Regression - Georgia Tech - Machine Learning

Watch on Udacity: Check out the full Advanced Operating Systems course for free at:

Ajinkya More | Resampling techniques and other strategies

PyData SF 2016 Ajinkya More | Resampling techniques and other strategies for handling highly unbalanced datasets in classification Many real world machine learning problems need to deal with...

Machine Learning - Supervised Learning Classification

Enroll in the course for free at: Machine Learning can be an incredibly beneficial tool to uncover hidden insights and predict..

13. Classification

MIT 6.0002 Introduction to Computational Thinking and Data Science, Fall 2016 View the complete course: Instructor: John Guttag Prof. Guttag introduces supervised..

1. Classification

Video from Coursera - Stanford University - Course: Machine Learning:



Writing Our First Classifier - Machine Learning Recipes #5

Welcome back! It's time to write our first classifier. This is a milestone if you're new to machine learning. We'll start with our code from episode #4 and comment out the classifier we imported....