AI News, QA – Testing Features of Machine Learning Models

QA – Testing Features of Machine Learning Models

In this post, you will learn about the different types of test cases you could come up with for testing features of data science/machine learning models.

The following diagram represents the key aspects that need to be tested when doing quality control checks/testing of features of machine learning models.

The post then walks through the different aspects that need to be tested for each of the features and lays out a test plan for testing features of machine learning models; a sketch of what such tests might look like in code follows below.
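The checklist itself is not reproduced in this excerpt, but to make the idea concrete, here is a minimal sketch of what automated feature tests might look like, using pytest conventions on a hypothetical pandas DataFrame (the column names and ranges are illustrative assumptions, not from the original post):

```python
import pandas as pd

# Hypothetical feature table used only for illustration
features = pd.DataFrame({
    "age": [25, 40, 31],
    "income": [55000.0, 72000.0, 61000.0],
})

def test_no_missing_values():
    # Every feature value should be present
    assert not features.isnull().values.any()

def test_age_within_valid_range():
    # Domain rule: ages must fall in a plausible interval
    assert features["age"].between(0, 120).all()

def test_income_is_numeric():
    # Type check on a numeric feature
    assert pd.api.types.is_numeric_dtype(features["income"])
```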

Dependent and independent variables

In mathematical modeling, statistical modeling and experimental sciences, the values of dependent variables depend on the values of independent variables.

The dependent variables represent the output or outcome whose variation is being studied.

The independent variables, also known in a statistical context as regressors, represent inputs or causes, i.e., potential reasons for variation or, in the experimental setting, the variable controlled by the experimenter.

Models and experiments test or determine the effects that the independent variables have on the dependent variables.

Sometimes, independent variables may be included for other reasons, such as for their potential confounding effect, without a wish to test their effect directly.

In mathematics, a function is a rule for taking an input (in the simplest case, a number or set of numbers) and providing an output (which may also be a number).[2]

A symbol that stands for an arbitrary input is called an independent variable, while a symbol that stands for an arbitrary output is called a dependent variable.[3]

In this situation, a symbol representing an element of X may be called an independent variable and a symbol representing an element of Y may be called a dependent variable, such as when X is a manifold and the symbol x represents an arbitrary point in the manifold.[6]

In data mining tools (for multivariate statistics and machine learning), the dependent variable is assigned a role as target variable (or in some tools as label attribute), while an independent variable may be assigned a role as regular variable.[10]

Known values for the target variable are provided for the training data set and test data set, but should be predicted for other data.
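As a hedged illustration of these roles in code (the dataset and column names below are made up), a target variable and regular variables might be separated like this in scikit-learn:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up dataset: "price" is assigned the role of target variable (label),
# while "area" and "rooms" are regular (independent) variables.
data = pd.DataFrame({
    "area":  [50, 80, 120, 65],
    "rooms": [2, 3, 4, 2],
    "price": [150, 220, 330, 180],
})

X = data[["area", "rooms"]]   # independent variables / features
y = data["price"]             # dependent variable / target

model = LinearRegression().fit(X, y)   # target known for training data
new = pd.DataFrame({"area": [100], "rooms": [3]})
print(model.predict(new))              # target predicted for other data
```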

In mathematical modeling, the dependent variable is studied to see if and how much it varies as the independent variables vary. In the simple linear model y_i = a + b·x_i + e_i, the term e_i is known as the 'error' and contains the variability of the dependent variable not explained by the independent variable.
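A small numeric sketch of this model (the data points are invented) shows how the residuals fall out of a least-squares fit:

```python
import numpy as np

# Invented observations for the model y_i = a + b*x_i + e_i
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

b, a = np.polyfit(x, y, 1)   # slope b and intercept a by least squares
e = y - (a + b * x)          # residuals: variability not explained by x
print("a=%.3f b=%.3f" % (a, b), "residuals:", e)
```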

Depending on the context, an independent variable is sometimes called a 'predictor variable', regressor, covariate, 'controlled variable', 'manipulated variable', 'explanatory variable', exposure variable (see reliability theory), 'risk factor' (see medical statistics), 'feature' (in machine learning and pattern recognition) or 'input variable'.[11][12]

Depending on the context, a dependent variable is sometimes called a 'response variable', 'regressand', 'criterion', 'predicted variable', 'measured variable', 'explained variable', 'experimental variable', 'responding variable', 'outcome variable', 'output variable' or 'label'.[12]

'Explanatory variable' is preferred by some authors over 'independent variable' when the quantities treated as independent variables may not be statistically independent or independently manipulable by the researcher.[18][19]

If the independent variable is referred to as an 'explanatory variable' then the term 'response variable' is preferred by some authors for the dependent variable.[12][18][19]

'Explained variable' is preferred by some authors over 'dependent variable' when the quantities treated as 'dependent variables' may not be statistically dependent.[20]

If the dependent variable is referred to as an 'explained variable' then the term 'predictor variable' is preferred by some authors for the independent variable.[20]

Variables may also be referred to by their form: continuous, binary/dichotomous, nominal categorical, and ordinal categorical, among others.

In one example, the dependent variable (and variable of most interest) was the annual mean sea level at a given location, for which a series of yearly values was available.

Use was made of a covariate consisting of yearly values of annual mean atmospheric pressure at sea level.

The results showed that inclusion of the covariate allowed improved estimates of the trend against time to be obtained, compared to analyses which omitted the covariate.

An extraneous variable may be thought to alter the dependent or independent variables, but may not actually be the focus of the experiment.

Extraneous variables, if included in a regression analysis as independent variables, may aid a researcher with accurate response parameter estimation, prediction, and goodness of fit, but are not of substantive interest to the hypothesis under examination.

For example, in a study examining the effect of post-secondary education on lifetime earnings, some extraneous variables might be gender, ethnicity, social class, genetics, intelligence, age, and so forth.

If such a variable is excluded from the regression and has a non-zero covariance with one or more of the independent variables of interest, its omission will bias the regression's result for the effect of that independent variable of interest.

In these situations, design changes and/or statistically controlling for the variable become necessary.

Depending on the context, this unexplained variation is known as the 'residual', 'side effect', 'error', 'unexplained share', 'residual variable', or 'tolerance'.

Introduction to Machine Learning with Python by Sarah Guido, Andreas C. Müller

Feature engineering is often an important place to use expert knowledge for a particular application. For example, when predicting flight prices, flights are usually more expensive during peak vacation months and around holidays. While the dates of some holidays (like Christmas) are fixed, and their effect can therefore be learned from the date, others might depend on the phases of the moon (like Hanukkah and Easter) or be set by authorities (like school holidays).

We’ll now look at one particular case of using expert knowledge—though in this case it might be more rightfully called “common sense.” The task is to predict, for a given time and day, how many people will rent a bike in front of a given location. A visualization of the rental frequencies for the whole month (Figure 4-12) shows the main trends for each day: looking at the data, we can clearly distinguish day and night for each 24-hour interval.

Because we want to learn from the past and predict for the future, when doing a split into a training and a test set we want to use all the data up to a certain date as the training set and everything after that date as the test set. Here we use the first 184 data points, corresponding to the first 23 days, as the training set, and the remaining 64 data points, corresponding to the remaining 8 days, as the test set. The task is: given a date and time, predict the number of rentals in the following three hours (three in this case, according to our DataFrame).

Let's start by using the single POSIX time feature as our data representation. We first define a function to split the data into training and test sets, build the model, and visualize the result. We saw earlier that random forests require very little preprocessing of the data, which makes this seem like a good model to start with.

We use the POSIX time feature X and pass a random forest regressor to our eval_on_features function. The predictions on the test set are poor, because a tree-based model cannot extrapolate beyond the range of the training data: it simply predicts the target value of the closest point in the training set—which is the last time it observed any data.
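The book's code cells are not reproduced in this excerpt; the following is a minimal sketch of what eval_on_features and the POSIX-time experiment might look like, assuming citibike is the pandas Series of rental counts loaded via the book's mglearn companion package (the plotting portion of the book's helper is omitted for brevity):

```python
import numpy as np
import mglearn  # companion package for the book
from sklearn.ensemble import RandomForestRegressor

# The citibike rental series, as loaded in the book's example
citibike = mglearn.datasets.load_citibike()

n_train = 184  # first 23 days train, remaining 8 days (64 points) test

def eval_on_features(features, target, regressor):
    # Split chronologically: learn from the past, predict the future
    X_train, X_test = features[:n_train], features[n_train:]
    y_train, y_test = target[:n_train], target[n_train:]
    regressor.fit(X_train, y_train)
    print("Test-set R^2: {:.2f}".format(regressor.score(X_test, y_test)))

# POSIX time (seconds since the epoch) as the only feature
X = citibike.index.astype("int64").values.reshape(-1, 1) // 10**9
eval_on_features(X, citibike.values,
                 RandomForestRegressor(n_estimators=100, random_state=0))
```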

As Figure 4-14 shows, with the hour of the day as the feature, the predictions now have the same pattern for each day of the week. The R² is already much better, but the predictions clearly miss the weekly pattern.

Now let’s also add the day of the week (see Figure 4-15). Now we have a model that captures the periodic behavior by considering the day of the week and time of day.
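Continuing the sketch above, under the same assumptions, the hour-of-day and day-of-week representations might be built like this:

```python
import numpy as np

# Derive hour of day and day of week from the datetime index
X_hour = citibike.index.hour.values.reshape(-1, 1)
eval_on_features(X_hour, citibike.values,
                 RandomForestRegressor(n_estimators=100, random_state=0))

X_hour_week = np.hstack([citibike.index.dayofweek.values.reshape(-1, 1),
                         X_hour])
eval_on_features(X_hour_week, citibike.values,
                 RandomForestRegressor(n_estimators=100, random_state=0))
```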

This data representation doesn't need a model as complex as a random forest, so let's try a simpler model, LinearRegression (see Figure 4-16). LinearRegression works much worse, and the periodic pattern looks odd. The reason is that day of week and time of day were encoded as integers, which a linear model interprets as continuous quantities; one-hot-encoding them as categorical variables fixes this.

With interaction features on the one-hot-encoded variables, the model can learn one coefficient for each combination of day and time of day (see Figure 4-18). This transformation finally yields a model that performs similarly well to the random forest.
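A sketch of that pipeline, continuing the assumptions above (the exact encoder parameters here are best-effort, not the book's verbatim cells):

```python
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import Ridge

# One-hot-encode the integer day/hour features so a linear model treats
# them as categorical rather than continuous
enc = OneHotEncoder(sparse_output=False)  # older scikit-learn: sparse=False
X_hour_week_onehot = enc.fit_transform(X_hour_week)

# Interaction features: one coefficient per (day, hour) combination
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_hour_week_onehot_poly = poly.fit_transform(X_hour_week_onehot)

eval_on_features(X_hour_week_onehot_poly, citibike.values, Ridge())
```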

First, we create feature names for the hour and day features. Then we name all the interaction features extracted by PolynomialFeatures and keep the features with nonzero coefficients. Now we can visualize the coefficients learned by the linear model, as seen in Figure 4-19.
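A sketch of the naming-and-plotting step under the same assumptions (get_feature_names_out is the modern scikit-learn spelling; older versions use get_feature_names):

```python
import matplotlib.pyplot as plt

# Human-readable names for the one-hot day and hour features
hour = ["%02d:00" % i for i in range(0, 24, 3)]
day = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
features = day + hour

# Name the interaction features, then keep only nonzero coefficients
features_poly = poly.get_feature_names_out(features)
lr = Ridge().fit(X_hour_week_onehot_poly, citibike.values)
features_nonzero = np.array(features_poly)[lr.coef_ != 0]
coef_nonzero = lr.coef_[lr.coef_ != 0]

plt.plot(coef_nonzero, "o")
plt.xticks(np.arange(len(coef_nonzero)), features_nonzero, rotation=90)
plt.xlabel("Feature name")
plt.ylabel("Coefficient magnitude")
```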

40 Questions to test a data scientist on Machine Learning [Solution: SkillPower – Machine Learning, DataFest 2017]

A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 2 and 3 F) 1, 2 and 3 Solution: (A) In SGD, each iteration uses a batch that generally contains a random sample of the data, whereas in GD each iteration uses all of the training observations.

5) Which of the following hyperparameter(s), when increased, may cause a random forest to overfit the data? A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 2 and 3 F) 1, 2 and 3 Solution: (B) Usually, increasing the depth of the trees will cause overfitting.

and you want to develop a machine learning algorithm that predicts the number of views on the articles. Your analysis is based on features like author name, the number of articles written by the same author on Analytics Vidhya in the past, and a few other features.

A) Only 1 B) Only 2 C) Only 3 D) 1 and 3 E) 2 and 3 F) 1 and 2 Solution: (A) The number of views of an article is a continuous target variable, so this falls under a regression problem.

[0,0,0,1,1,1,1,1] What is the entropy of the target variable? A) -(5/8 log(5/8) + 3/8 log(3/8)) B) 5/8 log(5/8) + 3/8 log(3/8) C) 3/8 log(5/8) + 5/8 log(3/8)) D) 5/8 log(3/8) – 3/8 log(5/8) Solution: (A) Entropy is defined as -Σ p·log(p) over the classes, which here gives -(5/8 log(5/8) + 3/8 log(3/8)).
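The arithmetic can be checked directly (using base-2 logs):

```python
import math

# Target [0,0,0,1,1,1,1,1]: p(0) = 3/8, p(1) = 5/8
probs = [3/8, 5/8]
entropy = -sum(p * math.log2(p) for p in probs)
print(entropy)  # ~0.954, i.e. -(5/8 log(5/8) + 3/8 log(3/8))
```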

What challenges may you face if you have applied OHE (one-hot encoding) on a categorical variable of the train dataset? A) All categories of the categorical variable are not present in the test dataset.
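For example, scikit-learn's OneHotEncoder raises an error at transform time for categories it never saw during fit, unless handle_unknown="ignore" is set (toy data below):

```python
from sklearn.preprocessing import OneHotEncoder

train = [["red"], ["green"], ["blue"]]
test = [["red"], ["purple"]]  # "purple" was never seen during training

# handle_unknown="ignore" encodes the unseen category as all zeros
# instead of raising an error at transform time
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
print(enc.transform(test).toarray())
```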

A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 1 and 3 F) 2 and 3 Solution: (E) In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a “false positive”), while a type II error is incorrectly retaining a false null hypothesis (a “false negative”).

A) 1 and 2 B) 1 and 3 C) 2 and 3 D) 1, 2 and 3 Solution: (D) Stemming is a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.
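For instance, with NLTK's rule-based Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "happily", "boxes", "cats"]:
    print(word, "->", stemmer.stem(word))
# e.g. running -> run, boxes -> box, cats -> cat
```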

Solution: (D) In Image 1, the features have a high positive correlation, whereas in Image 2 they have a high negative correlation; in both images the pairs of features are examples of multicollinear features.

Which of the following action(s) would you perform next? A) Only 1 B) Only 2 C) Only 3 D) Either 1 or 3 E) Either 2 or 3 Solution: (E) You cannot remove both features, because then you would lose all of the information they carry; you should either remove only one feature or use a regularization algorithm such as L1 or L2.

A) Only 1 is correct B) Only 2 is correct C) Either 1 or 2 D) None of these Solution: (A) After adding a feature to the feature space, whether that feature is important or not, R-squared always increases (it never decreases).

21) In ensemble learning, you aggregate the predictions of weak learners, so that an ensemble of these models gives a better prediction than the individual models.

A) 1 and 2 B) 2 and 3 C) 1 and 3 D) 1,2 and 3 Solution: (D) Larger k value means less bias towards overestimating the true expected error (as training folds will be closer to the total dataset) and higher running time (as you are getting closer to the limit case: Leave-One-Out CV).

Context: Suppose you want to select the best "max_depth" hyperparameter for a GBM, choosing from 10 different depth values (all greater than 2) for a tree-based model using 5-fold cross validation.

The time taken by the algorithm to train on 4 folds (with max_depth 2) is 10 seconds, and prediction on the remaining 1 fold takes 2 seconds.

23) Which of the following options is true for the overall execution time of 5-fold cross validation with 10 different values of "max_depth"? A) Less than 100 seconds B) 100 – 600 seconds C) More than or equal to 600 seconds D) None of the above E) Can't estimate

Solution: (C) More than or equal to 600 seconds. Each iteration for depth 2 in 5-fold cross validation takes 10 seconds of training and 2 seconds of testing, so one depth value costs 5 × (10 + 2) = 60 seconds, and all 10 depth values cost 600 seconds at depth 2. But training and testing a model at a depth greater than 2 takes more time than at depth 2, so the overall time will be greater than 600 seconds.
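The lower-bound arithmetic in a couple of lines of Python:

```python
folds, train_s, test_s, depth_values = 5, 10, 2, 10
print(folds * (train_s + test_s) * depth_values)  # 600 seconds at depth 2,
# a lower bound, because deeper trees take longer to train and evaluate
```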

A) Transform the data to zero mean B) Transform the data to zero median C) Not possible D) None of these Solution: (A) When the data has a zero mean vector, PCA will have the same projections as SVD; otherwise you have to center the data first before taking the SVD.
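A quick sketch (on random data) showing that PCA's components match the right singular vectors of the mean-centered data, up to sign:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 3)

Xc = X - X.mean(axis=0)                 # center each column to zero mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

pca = PCA(n_components=2).fit(X)        # PCA centers internally
print(np.allclose(np.abs(pca.components_), np.abs(Vt[:2])))  # True
```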

The black box outputs the nearest neighbor of q1 (say ti) and its corresponding class label ci. You can think of this black-box algorithm as the same as 1-NN (1-nearest neighbor).

A) 1 and 3 B) 2 and 3 C) 1 and 4 D) 2 and 4 Solution: (B) From image 1 to image 4, the correlation is decreasing (in absolute value).

A) 0 B) 0.4 C) 0.8 D) 1 Solution: (C) In Leave-One-Out cross validation, we select (n-1) observations for training and 1 observation for validation.

So if you repeat this procedure for all points, you will classify all the positive-class points in the figure correctly, but the negative-class points will be misclassified, giving an accuracy of 0.8.

A) First w2 becomes zero and then w1 becomes zero B) First w1 becomes zero and then w2 becomes zero C) Both become zero at the same time D) Neither can be zero even for a very large value of C Solution: (B) Looking at the image, we see that classification can be performed efficiently using x2 alone, so L1 regularization will drive w1 to zero first.

Note: All other hyperparameters are the same and other factors are not affected. A) Only 1 B) Only 2 C) Both 1 and 2 D) None of the above Solution: (A) Fitting a decision tree of depth 4 to such data means it is more likely to underfit.

A) 1 and 2 B) 2 and 3 C) 1 and 3 D) 1, 2 and 3 E) Can't say Solution: (E) For all three options, increasing the value of the parameter does not necessarily increase performance.

Context 38-39: Imagine you have a 28 * 28 image and you run a 3 * 3 convolution on it with an input depth of 3 and an output depth of 8.

A) 28 width, 28 height and 8 depth B) 13 width, 13 height and 8 depth C) 28 width, 13 height and 8 depth D) 13 width, 28 height and 8 depth Solution: (A) The formula for calculating the output size is output size = (N – F + 2P)/S + 1, where N is the input size, F is the filter size, S is the stride, and P is the padding. With a 3 * 3 filter, stride 1, and padding 1 ("same" padding), the output is (28 – 3 + 2)/1 + 1 = 28, so the output is 28 width, 28 height, and 8 depth.

A) 28 width, 28 height and 8 depth B) 13 width, 13 height and 8 depth C) 28 width, 13 height and 8 depth D) 13 width, 28 height and 8 depth Solution: (B) Same formula as above; with the stride used in this question, the output works out to 13 width, 13 height, and 8 depth.
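A small helper makes the formula concrete. The stride/padding values for question 39 are not given in this excerpt, so the stride-2, no-padding combination below is an assumption chosen to reproduce answer (B):

```python
def conv_output_size(n, f, s, p=0):
    # output size = (N - F + 2P) / S + 1, floored
    return (n - f + 2 * p) // s + 1

print(conv_output_size(28, 3, s=1, p=1))  # 28 -> 28 x 28 x 8 (Q38, "same" padding)
print(conv_output_size(28, 3, s=2, p=0))  # 13 -> 13 x 13 x 8 (Q39, assumed stride 2)
```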

In that case, which of the following options best explains the C values for the images below (1, 2, 3 from left to right, so the C values are C1 for image 1, C2 for image 2, and C3 for image 3) in the case of an RBF kernel?

Decision Tree (CART) - Machine Learning Fun and Easy


Machine Learning - Dimensionality Reduction - Feature Extraction & Selection

Enroll in the course for free at: Machine Learning can be an incredibly beneficial tool to ..

Decision Tree 1: how it works

Full lecture: A Decision Tree recursively splits training data into subsets based on the value of a single attribute. Each split corresponds to a ..

Lecture 03 -The Linear Model I

The Linear Model I - Linear classification and linear regression. Extending linear models through nonlinear transforms. Lecture 3 of 18 of Caltech's Machine ...

Machine Learning Research & Interpreting Neural Networks

Machine learning and neural networks change how computers and humans interact, but they can be complicated to understand. In this episode of Coffee with a ...

Visualizing a Decision Tree - Machine Learning Recipes #2

Last episode, we treated our Decision Tree as a blackbox. In this episode, we'll build one on a real dataset, add code to visualize it, and practice reading it - so ...

Supervised Machine Learning

Dr. Daniela Witten from the University of Washington presents a lecture titled "Supervised Machine Learning." View Slides ...

Lecture 02 - Is Learning Feasible?

Is Learning Feasible? - Can we generalize from a limited sample to the entire space? Relationship between in-sample and out-of-sample. Lecture 2 of 18 of ...

Lecture 14 - Support Vector Machines

Support Vector Machines - One of the most successful learning algorithms; getting a complex model at the price of a simple one. Lecture 14 of 18 of Caltech's ...

TensorFlow in 5 Minutes (tutorial)

This video is all about building a handwritten digit image classifier in Python in under 40 lines of code (not including spaces and comments). We'll use the ...