AI News, Machine Learning FAQ

Machine Learning FAQ

Now, what we really care about in machine learning is to build a model that generalizes well to unseen data, that is, we want to build a model that has a high accuracy on the whole distribution of data.

(Typically, we use cross-validation techniques and a separate, independent test set to estimate the generalization performance.) Now, overfitting occurs if there’s an alternative model m’ from the algorithm’s hypothesis space where the training accuracy is better and the generalization performance is worse compared to model m. In that case we say that m’ overfits the training data.
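The gap between training accuracy and held-out accuracy can be sketched as follows. This is only an illustration under assumptions that are not in the text above: the synthetic dataset, the choice of decision trees, and the names `shallow` and `deep` are all made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption): flip_y randomly mislabels a share of the samples.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Two models from the same hypothesis space: one constrained, one fully grown.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

# Store (training accuracy, test accuracy) for each model.
scores = {}
for name, model in [("shallow", shallow), ("deep", deep)]:
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f} test={scores[name][1]:.3f}")
```

The fully grown tree memorizes the training set, so its training accuracy is perfect; whether its test accuracy is lower than the shallow tree's is what the held-out estimate is there to reveal.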

Train/Test Split and Cross Validation in Python

I’ll explain what that is: when we’re using a statistical model (like linear regression, for example), we usually fit the model on a training set in order to make predictions on data it wasn’t trained on (general data).

As mentioned, in statistics and machine learning we usually split our data into two subsets: training data and testing data (and sometimes into three: train, validate and test), and fit our model on the train data, in order to make predictions on the test data.

We don’t want either overfitting or underfitting to happen, because they hurt the predictive power of our model: we might be using a model that has lower accuracy and/or does not generalize (meaning you can’t generalize your predictions to other data).

Let’s see what under- and overfitting actually mean: Overfitting means that the model we trained has trained “too well” and is now, well, fit too closely to the training dataset.

Basically, when this happens, the model learns or describes the “noise” in the training data instead of the actual relationships between variables in the data.

In contrast to overfitting, when a model is underfitted, it means that the model does not fit the training data and therefore misses the trends in the data.

The training set contains a known output and the model learns on this data in order to be generalized to other data later on.

Let’s load in the diabetes dataset, turn it into a data frame and define the columns’ names: Now we can use the train_test_split function in order to make the split.

Now we’ll fit the model on the training data: As you can see, we’re fitting the model on the training data and trying to predict the test data.

Here is a summary of what I did: I’ve loaded in the data, split it into training and testing sets, fitted a regression model to the training data, made predictions based on this data and tested the predictions on the test data.
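The steps summarized above can be sketched roughly as follows. The original code is not reproduced in this excerpt, so the details here are assumptions: the 80/20 split, the `random_state`, and the use of the dataset's built-in feature names as column names.

```python
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

# Load the diabetes dataset and turn it into a data frame with named columns.
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)

# Fit a linear regression on the training data only.
model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# Predict the held-out rows and score the predictions against the true values.
predictions = model.predict(X_test)
print("R^2 on the test set:", model.score(X_test, y_test))
```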

What if one subset of our data has only people from a certain state, employees with a certain income level but not other income levels, only women or only people at a certain age?

Here is a very simple example from the Sklearn documentation for K-Folds: And let’s see the result — the folds: As you can see, the function split the original data into different subsets of the data.
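The K-Folds example referred to above is not included in this excerpt. A minimal sketch in the same spirit, using a four-row toy array chosen for illustration, might look like this:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: four samples with two features each (illustrative values).
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

# Two folds, no shuffling: each half serves once as the test set.
kf = KFold(n_splits=2)
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]

for train_idx, test_idx in folds:
    print("TRAIN:", train_idx, "TEST:", test_idx)
# TRAIN: [2, 3] TEST: [0, 1]
# TRAIN: [0, 1] TEST: [2, 3]
```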

Because we would get a large number of training sets (equal to the number of samples), this method is very computationally expensive and should be used only on small datasets.
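To see why the number of training sets equals the number of samples, here is a plain-Python sketch of the leave-one-out idea. The helper `leave_one_out` is hypothetical and written only for illustration; Sklearn provides its own `LeaveOneOut` class in `sklearn.model_selection`.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, test_index) pairs, one pair per sample."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, i

# With 4 samples we get 4 splits, each training on the other 3 samples.
splits = list(leave_one_out(4))
for train, test in splits:
    print("TRAIN:", train, "TEST:", [test])
```

Every split requires fitting a fresh model, so the cost grows linearly with the number of samples.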

40 Questions to test a data scientist on Machine Learning [Solution: SkillPower – Machine Learning, DataFest 2017]

A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 2 and 3 F) 1,2 and 3 Solution: (A) In SGD, each iteration uses a batch that generally contains a random sample of the data, whereas in GD each iteration uses all of the training observations.

5) Which of the following hyperparameter(s), when increased, may cause random forest to overfit the data?  A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 2 and 3 F) 1,2 and 3 Solution: (B) Usually, increasing the depth of the trees will cause overfitting.
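The difference between the two update rules can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the toy data, the one-parameter model y = w * x, the learning rate and the step counts.

```python
import random

# Noiseless toy data generated from y = 2x (assumption for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
data = list(zip(xs, ys))

def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over the given batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def gradient_descent(steps=200, lr=0.01):
    w = 0.0
    for _ in range(steps):
        w -= lr * gradient(w, data)  # GD: every iteration uses all observations
    return w

def stochastic_gradient_descent(steps=500, lr=0.01, batch_size=1, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # SGD: a random sample per iteration
        w -= lr * gradient(w, batch)
    return w

w_gd = gradient_descent()
w_sgd = stochastic_gradient_descent()
print(w_gd, w_sgd)  # both end up close to 2
```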

and you want to develop a machine learning algorithm which predicts the number of views on the articles.  Your analysis is based on features like author name, number of articles written by the same author on Analytics Vidhya in past and a few other features.

A) Only 1 B) Only 2 C) Only 3 D) 1 and 3 E) 2 and 3 F) 1 and 2 Solution: (A) The number of views of an article is a continuous target variable, which makes this a regression problem.

[0,0,0,1,1,1,1,1] What is the entropy of the target variable?  A) -(5/8 log(5/8) + 3/8 log(3/8)) B) 5/8 log(5/8) + 3/8 log(3/8) C) 3/8 log(5/8) + 5/8 log(3/8) D) 5/8 log(3/8) –
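The entropy of this target can be checked numerically. The sketch below applies the standard Shannon entropy definition, which has the same form as option A; the base-2 logarithm is an assumption, since the options do not state a base.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

target = [0, 0, 0, 1, 1, 1, 1, 1]
h = entropy(target)

# Same value written out as in option A: -(5/8 log(5/8) + 3/8 log(3/8))
option_a = -(5 / 8 * math.log2(5 / 8) + 3 / 8 * math.log2(3 / 8))
print(round(h, 4))  # 0.9544
```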

What challenges may you face if you have applied OHE on a categorical variable of the train dataset?  A) All categories of the categorical variable are not present in the test dataset.

A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 1 and 3 F) 2 and 3 Solution: (E) In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a “false positive”), while a type II error is incorrectly retaining a false null hypothesis (a “false negative”).

A) 1 and 2 B) 1 and 3 C) 2 and 3 D) 1,2 and 3 Solution: (D) Stemming is a rudimentary rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s” etc) from a word.
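A rudimentary suffix-stripping stemmer of the kind described can be sketched in a few lines. The function `naive_stem`, the suffix order and the minimum stem length are assumptions for illustration; real stemmers such as the Porter stemmer apply many more rules.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ly", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["playing", "quickly", "boxes", "cats", "is"]]
print(stems)  # ['play', 'quick', 'box', 'cat', 'is']
```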

1 Solution: (D) In Image 1 the features have a high positive correlation, whereas in Image 2 they have a high negative correlation, so in both images the pair of features is an example of multicollinear features.

Which of the following action(s) would you perform next?  A) Only 1 B) Only 2 C) Only 3 D) Either 1 or 3 E) Either 2 or 3 Solution: (E) You cannot remove both features, because after removing both you will lose all of the information; you should either remove only one feature or use a regularization algorithm like L1 or L2.

A) Only 1 is correct B) Only 2 is correct C) Either 1 or 2 D) None of these Solution: (A) After adding a feature to the feature space, whether that feature is important or unimportant, R-squared increases or stays the same; it never decreases.

21) In ensemble learning, you aggregate the predictions of weak learners, so that an ensemble of these models gives a better prediction than the individual models.

A) 1 and 2 B) 2 and 3 C) 1 and 3 D) 1,2 and 3 Solution: (D) Larger k value means less bias towards overestimating the true expected error (as training folds will be closer to the total dataset) and higher running time (as you are getting closer to the limit case: Leave-One-Out CV).

for GBM by selecting it from 10 different depth values (values are greater than 2) for tree based model using 5-fold cross validation.

The time taken by the algorithm for training (on a model with max_depth 2) on 4 folds is 10 seconds, and for prediction on the remaining fold it is 2 seconds.

23) Which of the following options is true for the overall execution time for 5-fold cross validation with 10 different values of “max_depth”?  A) Less than 100 seconds B) 100 –

600 seconds D) More than or equal to 600 seconds C) None of the above D) Can’t estimate Solution: (D) Each iteration for depth “2” in 5-fold cross validation will take 10 seconds for training and 2 seconds for testing.

But training and testing a model with depth greater than 2 will take more time than with depth “2”, so the overall time would be greater than 600 seconds (5 folds × 10 depth values × 12 seconds is already 600 seconds at depth 2).
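The arithmetic behind the 600-second figure can be written out as a small sketch. The helper name `cv_lower_bound_seconds` is hypothetical; it treats the depth-2 timings as a lower bound for every depth value.

```python
def cv_lower_bound_seconds(n_folds, n_param_values, train_seconds, predict_seconds):
    """Lower bound on total time: every fold of every setting costs at least this much."""
    return n_folds * n_param_values * (train_seconds + predict_seconds)

# 5 folds, 10 max_depth values, 10 s training + 2 s prediction per fold at depth 2.
total = cv_lower_bound_seconds(5, 10, 10, 2)
print(total)  # 600
```

Since all ten depth values are greater than 2, each fold takes longer than 12 seconds in practice, which is why the answer is "more than or equal to 600 seconds".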

A) Transform data to zero mean B) Transform data to zero median C) Not possible D) None of these Solution: (A) When the data has a zero mean vector, PCA will have the same projections as SVD; otherwise you have to centre the data first before taking the SVD.

The black box outputs the nearest neighbor of q1 (say ti) and its corresponding class label ci.  You can also think of this black box algorithm as being the same as 1-NN (1-nearest neighbor).

A) 1 and 3 B) 2 and 3 C) 1 and 4 D) 2 and 4 Solution: (B) From image 1 to 4 the correlation (in absolute value) is decreasing.

A) 0 B) 0.4 C) 0.8 D) 1 Solution: (C) In Leave-One-Out cross validation, we select (n-1) observations for training and 1 observation for validation.

So if you repeat this procedure for all points, you will get the correct classification for all positive-class points given in the above figure, but the negative class will be misclassified.

A) First w2 becomes zero and then w1 becomes zero B) First w1 becomes zero and then w2 becomes zero C) Both become zero at the same time D) Both cannot be zero even after a very large value of C Solution: (B) By looking at the image, we see that even using just x2, we can efficiently perform classification.

Note: All other hyperparameters are the same and other factors are not affected.  A) Only 1 B) Only 2 C) Both 1 and 2 D) None of the above Solution: (A) If you fit a decision tree of depth 4 on such data, it will be more likely to underfit the data.

A) 1 and 2 B) 2 and 3 C) 1 and 3 D) 1, 2 and 3 E) Can’t say Solution: (E) For all three options A, B and C, increasing the value of the parameter does not necessarily increase the performance.

Context 38-39 Imagine, you have a 28 * 28 image and you run a 3 * 3 convolution neural network on it with the input depth of 3 and output depth of 8.

A) 28 width, 28 height and 8 depth B) 13 width, 13 height and 8 depth C) 28 width, 13 height and 8 depth D) 13 width, 28 height and 8 depth Solution: (A) The formula for calculating output size is output size = (N – F)/S + 1 where, N is input size, F is filter size and S is stride.

A)  28 width, 28 height and 8 depth B) 13 width, 13 height and 8 depth C) 28 width, 13 height and 8 depth D) 13 width, 28 height and 8 depth Solution: (B) Same as above
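The output-size formula can be checked with a short sketch. The stride and padding settings for the two questions are not stated in this excerpt, so the values below are assumptions: stride 1 with one pixel of padding reproduces answer A, and stride 2 without padding reproduces answer B. The padding term extends the formula quoted above, which omits it.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial output size of a convolution: floor((N + 2P - F) / S) + 1."""
    return (n + 2 * padding - f) // stride + 1

# Assumed settings (not given in the excerpt).
same_padding = conv_output_size(28, 3, stride=1, padding=1)  # 28
strided = conv_output_size(28, 3, stride=2, padding=0)       # 13
no_padding = conv_output_size(28, 3, stride=1, padding=0)    # 26

print(same_padding, strided, no_padding)
```

The output depth is set by the number of filters, so it is 8 in both cases regardless of stride or padding.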

In that case, which of the following option best explains the C values for the images below (1,2,3 left to right, so C values are C1 for image1, C2 for image2 and C3 for image3 ) in case of rbf kernel.
