AI News, Learning to learn, or the advent of augmented data scientists

Learning to learn, or the advent of augmented data scientists

The job of a data scientist involves finding patterns in data, often in order to automate or augment human decision-making.

The concept of automatic machine learning (autoML) is compelling, and we at MFG Labs pay close attention to the development of this field, because the way we work and design our processes might be disrupted by theoretical or practical breakthroughs in this area.

Our intent is by no means to comprehensively survey the field of automatic machine learning, but rather to showcase a couple of specific topics that resonated particularly with our current interests.

The typical data scientist workflow, considered from defining the problem at hand to debugging a live production system, is in our experience very intricate and certainly not linear.

Rich Caruana (Microsoft Research) formulates the following pipeline, which feels very familiar to us. Again, the workflow is usually not followed linearly: for example, data quality problems are often brought to light during feature engineering or model tuning, which implies going back to the data collection and cleaning steps.

Some of the necessary steps outlined above look well out of reach of automation right now: problem definition, data collection, metric selection, deployment, debugging… These tasks involve a kind of general intelligence found only in humans so far.

So far, the available tools for outlier detection either assume a given joint distribution (like fitting an elliptic envelope), which is bound to fail on high-dimensional, complex datasets, or are actually novelty-detection algorithms (like the one-class support vector machine), which tend to overfit to the outliers during training.
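As a minimal sketch of the two families mentioned above (the library choice and the toy data are ours, not the article's):

```python
# Contrast a distribution-assuming detector with a novelty detector,
# using scikit-learn implementations on made-up 2D data.
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([X_inliers, X_outliers])

# Assumes an elliptic (Gaussian-like) joint distribution; struggles on complex, high-dimensional data.
elliptic = EllipticEnvelope(contamination=0.05).fit(X)

# Novelty detector fitted on data that already contains the outliers, so it tends to adapt to them.
ocsvm = OneClassSVM(kernel="rbf", gamma="auto", nu=0.05).fit(X)

print("EllipticEnvelope flags", int((elliptic.predict(X) == -1).sum()), "points as outliers")
print("OneClassSVM flags", int((ocsvm.predict(X) == -1).sum()), "points as outliers")
```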

To overcome these issues, research advancing the state of the art in high-dimensional density estimation (a very challenging research area in itself), applied in the practical context of data cleaning, would yield significant benefits for data scientists.

For example when you run a k-NN (k nearest neighbors) classification algorithm, the hyperparameter is k, the number of neighbors considered for classifying a new data point.

In this case, hyperparameter optimization is straightforward since there is only a single parameter to optimize: run cross-validation over many values of k and choose the value that maximizes generalization performance.
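As a minimal illustration of this single-parameter search (scikit-learn and the Iris dataset are used here purely as stand-ins):

```python
# Cross-validate k for a k-NN classifier and keep the best value.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 31):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean cross-validated accuracy estimates the generalization performance for this k.
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, CV accuracy = {scores[best_k]:.3f}")
```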

Thus most data scientists in industry either use sub-optimal models, for example by keeping the default parameters suggested by the specific implementation they run, or optimize each parameter independently, which drastically reduces the hyperparameter space but ignores parameter interactions.

To do that, Bayesian optimization chooses to evaluate the function (which involves running the full machine learning algorithm) on a set of hyperparameters for which the result is the most uncertain (exploration) and which is the most likely to yield higher values (exploitation).

The uncertainty factor implies modeling the generalization performance as a stochastic process (in practice a Gaussian process is a good choice because of the available closed form formulas for marginal and conditional probabilities), choosing a reasonable prior and updating the posterior via Bayes’ formula each time a new set of hyperparameters is evaluated.

The reason Bayesian optimization vastly outperforms classic grid search for a given computation time is that it carefully chooses the next set of hyperparameters to evaluate at each iteration.

Since the bottleneck in hyperparameter optimization is the function evaluation time (each evaluation means running a full machine learning algorithm), this approach saves a lot of computation by avoiding the evaluation of hyperparameters that are unlikely to be optimal.
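The sketch below shows the idea on a one-dimensional toy objective (a stand-in for "run the full learning algorithm and return its cross-validated score"); the Gaussian process, kernel and acquisition-function choices are ours, not prescriptions from the article:

```python
# Bayesian optimization in 1D: model the expensive objective with a Gaussian process
# and pick the next evaluation point by expected improvement, which trades off
# exploration (high posterior uncertainty) against exploitation (high predicted value).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Hypothetical cheap stand-in for an expensive model evaluation.
    return -(x - 2.0) ** 2 + np.sin(5 * x)

def expected_improvement(candidates, gp, best_y):
    mu, sigma = gp.predict(candidates.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

bounds = (-2.0, 6.0)
rng = np.random.RandomState(0)
X = rng.uniform(*bounds, size=3)            # a few initial evaluations
y = np.array([objective(x) for x in X])

# Small alpha adds jitter for numerical stability of the GP fit.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
for _ in range(15):
    gp.fit(X.reshape(-1, 1), y)             # update the posterior with all evaluations so far
    candidates = np.linspace(*bounds, 500)
    ei = expected_improvement(candidates, gp, y.max())
    x_next = candidates[np.argmax(ei)]      # most promising point under the acquisition function
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print(f"best x = {X[np.argmax(y)]:.3f}, best value = {y.max():.3f}")
```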

The main idea is to formulate the problem of algorithm selection as a recommender system problem, a theme we frequently encounter at MFG Labs: a set of users can evaluate or “like” products, and the objective is to predict the unknown ratings of all users for all products from existing ratings.
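A toy sketch of that formulation, with a made-up performance matrix (problem instances as "users", algorithms as "products") and a simple low-rank imputation standing in for a real recommender:

```python
# Predict missing algorithm-performance scores with an iterative rank-2 SVD imputation.
import numpy as np

rng = np.random.RandomState(0)
true_scores = rng.rand(8, 5)                     # 8 problems x 5 algorithms (hypothetical)
observed = rng.rand(8, 5) < 0.6                  # only ~60% of the runs were actually performed

col_means = (true_scores * observed).sum(axis=0) / observed.sum(axis=0)
filled = np.where(observed, true_scores, col_means)  # initialize missing entries

for _ in range(50):
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    low_rank = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
    filled = np.where(observed, true_scores, low_rank)   # keep observed entries fixed

# For each problem, recommend the algorithm with the highest predicted score.
print("recommended algorithm per problem:", filled.argmax(axis=1))
```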

The experimental results of this strategy are promising: on a portfolio of about 30 algorithms and 200 problem instances, the algorithm recommended by the recommender system consistently outperforms the best overall algorithm.

For example, gradient descent, a very common technique for calibrating the weights of machine learning models, implements very specific instructions to minimize a smooth loss function.
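For concreteness, a minimal sketch of those instructions on a made-up one-dimensional linear regression (fixed learning rate, mean squared error):

```python
# Plain gradient descent on a smooth loss: mean squared error of a linear model.
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(100)
y = 3.0 * x + 0.5 + 0.05 * rng.randn(100)     # ground truth: w = 3.0, b = 0.5

w, b = 0.0, 0.0
lr = 0.1
for step in range(2000):
    error = w * x + b - y
    grad_w = 2.0 * np.mean(error * x)         # dL/dw of the MSE loss
    grad_b = 2.0 * np.mean(error)             # dL/db
    w -= lr * grad_w                          # step downhill along the gradient
    b -= lr * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")
```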

Jürgen Schmidhuber, who presented some of these approaches at ICML, went even further with Meta-Genetic Programming: the meta-program modifying the main program can itself be subject to mutations, and these mutations can themselves be modified… in a potentially infinite recursive loop.

In the Gödel machine, a theoretically optimal self-referential learner designed by Schmidhuber, the starter program basically consists of a set of axioms describing its own hardware and software (including a theorem prover), and a utility function supplied by the user.

Even the utility function could be modified, if the theorem prover manages to deduce that a rewrite would be beneficial, for example if simplifying the utility function would save computing power but still preserve the properties of the original utility.

Hyperparameter optimization

Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data.[1] The objective function takes a tuple of hyperparameters and returns the associated loss.[1] Cross-validation is often used to estimate this generalization performance.[2]

A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set[3] or evaluation on a held-out validation set.[4] Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.

Both parameters are continuous, so to perform grid search one selects a finite set of 'reasonable' values for each. Grid search then trains an SVM with each pair (C, γ) in the Cartesian product of these two sets and evaluates their performance on a held-out validation set (or by internal cross-validation on the training set, in which case multiple SVMs are trained per pair).
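A minimal version of that SVM example, with scikit-learn and an illustrative, manually chosen discretization of C and γ:

```python
# Exhaustive grid search over C and gamma with internal 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "C": [0.1, 1, 10, 100],        # illustrative finite sets of 'reasonable' values
    "gamma": [0.01, 0.1, 1.0],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```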

In hyperparameter optimization, evolutionary optimization uses evolutionary algorithms to search the space of hyperparameters for a given algorithm.[7] Evolutionary hyperparameter optimization follows a process inspired by the biological concept of evolution. It has been used in hyperparameter optimization for statistical machine learning algorithms,[7] automated machine learning,[15][16] deep neural network architecture search,[17][18] and training of the weights in deep neural networks.[19]
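A minimal sketch of the evolutionary loop (not tied to any of the cited systems): a population of candidate (C, γ) pairs for an SVM is scored by cross-validation, the best half survives, and survivors are mutated to form the next generation.

```python
# Toy evolutionary hyperparameter search for an RBF SVM.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)

def fitness(log_params):
    C, gamma = 10.0 ** log_params              # work in log-space so mutations scale multiplicatively
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

population = rng.uniform(low=-3, high=3, size=(10, 2))   # random initial log10(C), log10(gamma)

for generation in range(10):
    scores = np.array([fitness(p) for p in population])
    survivors = population[np.argsort(scores)[-5:]]                       # selection
    children = survivors + rng.normal(scale=0.3, size=survivors.shape)    # mutation
    population = np.vstack([survivors, children])

best = population[np.argmax([fitness(p) for p in population])]
print("best C, gamma:", 10.0 ** best)
```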

40 Questions to test a data scientist on Machine Learning [Solution: SkillPower – Machine Learning, DataFest 2017]

A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 2 and 3 F) 1, 2 and 3 Solution: (A) In SGD, each iteration uses a batch that generally contains a random sample of the data, whereas in GD each iteration uses all of the training observations.

5) Which of the following hyperparameter(s), when increased, may cause random forest to overfit the data? A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 2 and 3 F) 1, 2 and 3 Solution: (B) Usually, increasing the depth of the trees will cause overfitting.

and you want to develop a machine learning algorithm which predicts the number of views on the articles. Your analysis is based on features like the author's name, the number of articles written by the same author on Analytics Vidhya in the past, and a few other features.

A) Only 1 B) Only 2 C) Only 3 D) 1 and 3 E) 2 and 3 F) 1 and 2 Solution: (A) The number of views of an article is a continuous target variable, so this falls under regression.

[0,0,0,1,1,1,1,1] What is the entropy of the target variable? A) -(5/8 log(5/8) + 3/8 log(3/8)) B) 5/8 log(5/8) + 3/8 log(3/8) C) 3/8 log(5/8) + 5/8 log(3/8) D) 5/8 log(3/8) – 3/8 log(5/8) Solution: (A) With 5 ones and 3 zeros, the entropy is -(5/8 log(5/8) + 3/8 log(3/8)).
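A quick numerical check of that formula (base-2 logarithms assumed):

```python
# Entropy of the target [0,0,0,1,1,1,1,1]: 3 zeros and 5 ones.
import math

p0, p1 = 3 / 8, 5 / 8
entropy = -(p0 * math.log2(p0) + p1 * math.log2(p1))
print(round(entropy, 4))   # 0.9544 bits
```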

What challenges may you face if you have applied OHE (one-hot encoding) on a categorical variable of the train dataset? A) All categories of the categorical variable are not present in the test dataset.

A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 1 and 3 F) 2 and 3 Solution: (E) In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a “false positive”), while a type II error is incorrectly retaining a false null hypothesis (a “false negative”).

A) 1 and 2 B) 1 and 3 C) 2 and 3 D) 1,2 and 3 Solution: (D) Stemming is a rudimentary rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s” etc) from a word.

Solution: (D) In Image 1, the features have a high positive correlation, whereas in Image 2 they have a high negative correlation, so in both images the pairs of features are examples of multicollinear features.

Which of the following action(s) would you perform next? A) Only 1 B) Only 2 C) Only 3 D) Either 1 or 3 E) Either 2 or 3 Solution: (E) You cannot remove both features, because removing both would lose all of the information; you should either remove only one feature or use a regularization algorithm like L1 or L2.

A) Only 1 is correct B) Only 2 is correct C) Either 1 or 2 D) None of these Solution: (A) After adding a feature to the feature space, whether that feature is important or not, the R-squared always increases.

21) In ensemble learning, you aggregate the predictions of weak learners, so that an ensemble of these models gives a better prediction than the individual models.

A) 1 and 2 B) 2 and 3 C) 1 and 3 D) 1,2 and 3 Solution: (D) Larger k value means less bias towards overestimating the true expected error (as training folds will be closer to the total dataset) and higher running time (as you are getting closer to the limit case: Leave-One-Out CV).

You want to tune the max_depth of a GBM (a tree-based model) by selecting it from 10 different depth values (all greater than 2), using 5-fold cross-validation.

The time taken by the algorithm to train on 4 folds (with max_depth 2) is 10 seconds, and prediction on the remaining fold takes 2 seconds.

23) Which of the following options is true for the overall execution time of 5-fold cross-validation with 10 different values of “max_depth”? A) Less than 100 seconds B) 100 – 600 seconds C) None of the above D) More than or equal to 600 seconds E) Can’t estimate

Solution: (D) Each fold at depth 2 takes 10 seconds for training and 2 seconds for testing, so one 5-fold cross-validation at depth 2 takes 5 × 12 = 60 seconds, and the 10 depth values would need at least 600 seconds. Since training and testing a model at a depth greater than 2 takes longer than at depth 2, the overall time will be greater than 600 seconds.

A) Transform data to zero mean B) Transform data to zero median C) Not possible D) None of these Solution: (A) When the data has a zero mean vector, PCA gives the same projections as SVD; otherwise, you have to center the data first before taking the SVD.
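A quick numpy check of that equivalence on made-up data (the sign of each component is arbitrary, so absolute values are compared):

```python
# On mean-centered data, PCA projections coincide (up to sign) with SVD projections.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
X_centered = X - X.mean(axis=0)

pca_proj = PCA(n_components=2).fit_transform(X_centered)

U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_proj = U[:, :2] * s[:2]        # projections onto the top-2 right singular vectors

print(np.allclose(np.abs(pca_proj), np.abs(svd_proj)))   # True
```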

The black box outputs the nearest neighbor of q1 (say ti) and its corresponding class label ci. You can think of this black-box algorithm as being the same as 1-NN (1-nearest neighbor).

A) 1 and 3 B) 2 and 3 C) 1 and 4 D) 2 and 4 Solution: (B) From Image 1 to Image 4, the correlation is decreasing (in absolute value).

A) 0 B) 0.4 C) 0.8 D) 1 Solution: (C) In Leave-One-Out cross-validation, we select (n − 1) observations for training and 1 observation for validation.

So if you repeat this procedure for all points, you will get correct classifications for all the positive-class points shown in the figure above, but the negative-class points will be misclassified.

A) First w2 becomes zero and then w1 becomes zero B) First w1 becomes zero and then w2 becomes zero C) Both become zero at the same time D) Both cannot be zero even after a very large value of C Solution: (B) By looking at the image, we see that classification can be performed efficiently using x2 alone, so as the regularization increases, w1 (the weight on x1) is driven to zero first.

Note: All other hyperparameters are the same and other factors are not affected. A) Only 1 B) Only 2 C) Both 1 and 2 D) None of the above Solution: (A) If you fit a decision tree of depth 4 to such data, it will more likely underfit the data.

A) 1 and 2 B) 2 and 3 C) 1 and 3 D) 1, 2 and 3 E) Can’t say Solution: (E) For all three parameters, it is not necessarily true that increasing their value will improve performance.

Context 38-39: Imagine you have a 28 × 28 image and you run a 3 × 3 convolution on it with an input depth of 3 and an output depth of 8.

A) 28 width, 28 height and 8 depth B) 13 width, 13 height and 8 depth C) 28 width, 13 height and 8 depth D) 13 width, 28 height and 8 depth Solution: (A) The formula for calculating the output size is output size = (N − F + 2P)/S + 1, where N is the input size, F the filter size, P the zero-padding and S the stride; the stated answer of 28 × 28 × 8 corresponds to stride 1 with padding 1 (“same” padding), and the depth of 8 comes from the 8 output filters.

A) 28 width, 28 height and 8 depth B) 13 width, 13 height and 8 depth C) 28 width, 13 height and 8 depth D) 13 width, 28 height and 8 depth Solution: (B) Same formula as above; 13 × 13 × 8 corresponds to no padding with a stride of 2: (28 − 3)/2 + 1 = 13 (rounded down).
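The padding and stride settings are not shown in the excerpt above; the values below (padding 1 and stride 1 for question 38, no padding and stride 2 for question 39) are assumptions consistent with the stated answers.

```python
# Output-size formula for a convolution, including the padding term.
def conv_output_size(n, f, s=1, p=0):
    """n: input size, f: filter size, s: stride, p: zero-padding on each side."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(28, 3, s=1, p=1))   # 28  (question 38; depth 8 comes from the 8 filters)
print(conv_output_size(28, 3, s=2, p=0))   # 13  (question 39)
```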

In that case, which of the following options best explains the C values for the images below (1, 2, 3 from left to right, so the C values are C1 for image 1, C2 for image 2 and C3 for image 3) in the case of an RBF kernel?

Hyperparameter Optimization - The Math of Intelligence #7

Hyperparameters are the magic numbers of machine learning. We're going to learn how to find them in a more intelligent way than just trial-and-error. We'll go ...

Hyperopt: A Python library for optimizing machine learning algorithms; SciPy 2013

Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms Authors: Bergstra, James, University of Waterloo; Yamins, Dan, ...

Machine Learning DataScience - What is the difference between a model parameter and a hyperparameter?

Difference between Hyperparameters & Parameters in Machine Learning

Hyperparameters are values decided outside of the model training process, whereas parameters are learned during model training. Hyperparameter tuning is ...

Dimensionality Reduction - The Math of Intelligence #5

Most of the datasets you'll find will have more than 3 dimensions. How are you supposed to understand and visualize n-dimensional data? Enter dimensionality ...

The Evolution of Gradient Descent

Which optimizer should we use to train our neural network? Tensorflow gives us lots of options, and there are way too many acronyms. We'll go over how the ...

Hyperparameter Tuning and Cross Validation to Decision Tree classifier (Machine learning by Python)

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. The traditional ...

Decision Tree Parameters - Intro to Machine Learning

This video is part of an online course, Intro to Machine Learning. Check out the course here: This course was designed ..

Deep Neural Network Hyperparameter Optimization with Genetic Algorithms

2017 Rice Data Science Conference: "Deep Neural Network Hyperparameter Optimization with Genetic Algorithms" Speakers: Jacob Balma, Cray, Inc.; Aaron ...

Lecture 10.3 — Advice For Applying Machine Learning | Model Selection And Train Validation Test Sets
