AI News: What are some common machine learning interview questions?

What are some common machine learning interview questions?

Machine learning interview questions are an integral part of the data science interview and the path to becoming a data scientist, machine learning engineer or data engineer.

Eckovation created a free guide to data science interviews so we know exactly how they can trip candidates up!

To help with that, here is a curated list of key questions that you could see in a machine learning interview.

After reading through this piece, you should be well prepared for the machine learning questions in any job interview.

Note: a key to answering these questions is to have a concrete, practical understanding of ML and related statistical concepts.


41 Essential Machine Learning Interview Questions (with answers)

We’ve traditionally seen machine learning interview questions pop up in several categories.

One category of questions has to do with your general interest in machine learning: you’ll be asked about what’s going on in the industry and how you keep up with the latest machine learning trends.

Finally, there are company or industry-specific questions that test your ability to take your general machine learning knowledge and turn it into actionable points to drive the bottom line forward.

We’ve divided this guide to machine learning interview questions into the categories we mentioned above so that you can more easily get to the information you need.

High bias can lead to the model underfitting your data, making it hard for the model to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.

The bias-variance decomposition essentially decomposes the learning error from any algorithm by adding the bias, the variance and a bit of irreducible error due to noise in the underlying dataset.
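
For squared-error loss this is usually written (a standard form, not spelled out in the original) as: Total expected error = Bias² + Variance + Irreducible error.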

For example, in order to do classification (a supervised learning task), you’ll need to first label the data you’ll use to train the model to classify data into your labeled groups.

K-means clustering requires only a set of unlabeled points and a chosen number of clusters k: the algorithm takes the unlabeled points and gradually learns how to cluster them into groups by assigning each point to the nearest cluster center and recomputing each center as the mean of its assigned points.
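
As a rough illustration (a minimal sketch on synthetic blobs; the data and parameters are made up, not from the guide), here is k-means run on unlabeled points with scikit-learn:

```python
# Minimal k-means sketch: only unlabeled points and a chosen number of clusters are needed.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # true labels are ignored
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignments learned without any labels
```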

The ROC curve is often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fall-out, or the probability it will trigger a false alarm (false positives).

More reading: Precision and recall (Wikipedia) Recall is also known as the true positive rate: the number of positives your model claims compared to the actual number of positives there are throughout the data.

Precision is also known as the positive predictive value, and it is a measure of the amount of accurate positives your model claims compared to the number of positives it actually claims.

It can be easier to think of recall and precision in the context of a case where you’ve predicted that there were 10 apples and 5 oranges in a case of 10 apples. You’d have perfect recall (all 10 real apples were predicted), but only 66.7% precision, because of the 15 items you claimed, only 10 (the apples) are correct.
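
Working that example through in code (the counts below are just the hypothetical apples/oranges numbers from the example above):

```python
# Hypothetical counts from the apples/oranges example
tp = 10  # apples predicted that really are apples
fp = 5   # oranges predicted, but the case contains no oranges
fn = 0   # apples we failed to predict

precision = tp / (tp + fp)  # 10 / 15 ≈ 0.667
recall = tp / (tp + fn)     # 10 / 10 = 1.0
print(precision, recall)
```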

Mathematically, Bayes’ theorem gives the posterior probability as P(A|B) = P(B|A)·P(A) / P(B): the true positive rate of the condition times the prior probability of the condition, divided by the overall probability of a positive result (true positives from people with the condition plus false positives from the rest of the population).

Say you had a 60% chance of actually having the flu after a flu test, but out of people who had the flu, the test will be false 50% of the time, and the overall population only has a 5% chance of having the flu.
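
One plausible way to work the numbers, assuming the loosely worded scenario means a test sensitivity of 0.6, a false-positive rate of 0.5, and a prevalence of 5% (treat these readings as assumptions):

```python
# Bayes' theorem worked on the flu example, under the assumed reading of the numbers
sensitivity = 0.6          # P(test positive | flu)
false_positive_rate = 0.5  # P(test positive | no flu)
prevalence = 0.05          # P(flu)

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_flu_given_positive = sensitivity * prevalence / p_positive
print(p_flu_given_positive)  # ≈ 0.059, far lower than the intuitive 60%
```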

(Quora) Despite its practical applications, especially in text mining, Naive Bayes is considered “Naive” because it makes an assumption that is virtually impossible to see in real-life data: the conditional probability is calculated as the pure product of the individual probabilities of components.
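
In symbols, the naive assumption is P(x₁, …, xₙ | y) = P(x₁ | y) · P(x₂ | y) · … · P(xₙ | y): the features are treated as conditionally independent given the class, which real data almost never satisfies exactly.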

A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a baby.

More reading: Deep learning (Wikipedia) Deep learning is a subset of machine learning that is concerned with neural networks: how to use backpropagation and certain principles from neuroscience to more accurately model large sets of unlabelled or semi-structured data.

More reading: Using k-fold cross-validation for time-series model selection (CrossValidated) Instead of using standard k-fold cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data: it is inherently ordered chronologically, so you should use something like forward chaining, where each fold trains on past data and validates on the data that follows it.
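
A minimal sketch of forward chaining, using scikit-learn’s TimeSeriesSplit on made-up, chronologically ordered data (the data and split count are assumptions for illustration):

```python
# Each fold trains only on observations that precede the test window.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # hypothetical, chronologically ordered observations
y = np.arange(20)

for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train:", train_idx, "test:", test_idx)
```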

More reading: Pruning (decision trees) Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model.
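
One concrete way to prune in practice is cost-complexity (post-)pruning; the sketch below uses scikit-learn’s ccp_alpha parameter on a built-in dataset, with the alpha value chosen arbitrarily for illustration:

```python
# Rough sketch of post-pruning: larger ccp_alpha removes more weak branches.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)
print(full.score(X_te, y_te), pruned.score(X_te, y_te))  # pruned tree often generalizes better
```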

For example, if you wanted to detect fraud in a massive dataset with a sample of millions, a more “accurate” model would most likely predict no fraud at all if only a tiny minority of cases were fraud.

More reading: Regression vs Classification (Math StackExchange) Classification produces discrete values and maps a dataset to strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points.

You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (ex: If you wanted to know whether a name was male or female rather than just how correlated they were with male and female names.) Q21- Name an example where ensemble techniques might be useful.

They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data).  You could list some examples of ensemble methods, from bagging to boosting to a “bucket of models” method and demonstrate how they could increase predictive power.
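
A small sketch of that idea on synthetic data (dataset and hyperparameters are made up): a bagged ensemble of decision trees versus a single tree.

```python
# Compare cross-validated accuracy of one tree vs a bagged ensemble of trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(tree, n_estimators=50, random_state=0)

print(cross_val_score(tree, X, y, cv=5).mean())
print(cross_val_score(bagged, X, y, cv=5).mean())  # typically higher and less variable
```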

(Quora) This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations.

There are three main methods to avoid overfitting: 1- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data. 2- Use cross-validation techniques such as k-fold cross-validation. 3- Use regularization techniques (such as LASSO) that penalize model parameters likely to cause overfitting.

More reading: How to Evaluate Machine Learning Algorithms (Machine Learning Mastery) You would first split the dataset into training and test sets, or perhaps use cross-validation techniques to further segment the dataset into composite sets of training and test sets within the data.

More reading: Kernel method (Wikipedia) The kernel trick involves kernel functions that enable algorithms to operate in higher-dimensional spaces without explicitly calculating the coordinates of points within that space: instead, kernel functions compute the inner products between the images of all pairs of data in a feature space.

This gives them the very useful ability to work with higher-dimensional coordinates implicitly, at a computational cost far lower than explicitly calculating those coordinates. Many algorithms can be expressed in terms of inner products.
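
A tiny numeric sketch of that point: the degree-2 polynomial kernel (x·z + 1)² equals the inner product of an explicit degree-2 feature map, but the kernel never constructs those higher-dimensional coordinates (the vectors below are arbitrary examples).

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map whose inner product reproduces (x·z + 1)**2
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))   # explicit coordinates in the higher-dimensional space
print((np.dot(x, z) + 1) ** 2)  # kernel trick: same inner product, no explicit coordinates
```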

More reading: Writing pseudocode for parallel programming (Stack Overflow) This kind of question demonstrates your ability to think in parallelism and how you could handle concurrency in programming implementations dealing with big data.

For example, if you were interviewing for music-streaming startup Spotify, you could remark that your skills at developing a better recommendation model would increase user retention, which would then increase revenue in the long run.

The startup metrics Slideshare linked above will help you understand exactly what performance indicators are important for startups and tech companies as they think about revenue and growth.

Your interviewer is trying to gauge if you’d be a valuable member of their team and whether you grasp the nuances of why certain things are set the way they are in the company’s data process based on company- or industry-specific conditions.

This overview of deep learning in Nature by the scions of deep learning themselves (from Hinton to Bengio to LeCun) can be a good reference paper and an overview of what’s happening in deep learning.

More reading: Mastering the game of Go with deep neural networks and tree search (Nature) AlphaGo beating Lee Sedol, one of the best human players at Go, in a best-of-five series was a truly seminal event in the history of machine learning and deep learning.

The Nature paper above describes how this was accomplished with “Monte-Carlo tree search with deep neural networks that have been trained by supervised learning, from human expert games, and by reinforcement learning from games of self-play.” Want more?  Brush up your skills with our free machine learning course.

Supervised and Unsupervised Machine Learning Algorithms

What is supervised machine learning and how does it relate to unsupervised machine learning?

Supervised learning is where you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (x) you can predict the output variables (Y) for that data.

It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.

Some common types of problems built on top of classification and regression include recommendation and time series prediction respectively.

Some popular examples of unsupervised learning algorithms are k-means for clustering problems and the Apriori algorithm for association rule learning problems. Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems.

You can also use supervised learning techniques to make best guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data and use the model to make predictions on new unseen data.
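
A rough sketch of that “best guess” loop on synthetic data (the dataset, the 100 labeled rows, and the choice of logistic regression are all assumptions for illustration):

```python
# Train on the labeled portion, pseudo-label the unlabeled portion, then retrain on everything.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:100] = True                       # pretend only 100 rows have labels

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
pseudo = model.predict(X[~labeled])        # best-guess labels for the rest

X_all = np.vstack([X[labeled], X[~labeled]])
y_all = np.concatenate([y[labeled], pseudo])
final_model = LogisticRegression(max_iter=1000).fit(X_all, y_all)
```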

40 Interview Questions asked at Startups in Machine Learning / Data Science

Machine learning and data science are being seen as the drivers of the next industrial revolution happening in the world today.

This also means that there are numerous exciting startups looking for data scientists.  What could be a better start for your aspiring career!

You might also find some really difficult technical questions on your way. The set of questions asked depends on what the startup does.

If you can answer and understand these questions, rest assured, you will give a tough fight in your job interview.

(You are free to make practical assumptions.) Answer: Processing high dimensional data on a limited-memory machine is a strenuous task; your interviewer will be fully aware of that.

Not to forget, that is the motive of doing PCA: we aim to select fewer components (than features) which can explain the maximum variance in the data set.

Rotation doesn’t change the relative location of the components; it only changes the actual coordinates of the points.

If we don’t rotate the components, the effect of PCA will diminish and we’ll have to select more components to explain the variance in the data set.
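
In practice, one common way to decide how many components to keep is to look at the cumulative explained variance; here is a minimal sketch on made-up data (the 95% cutoff is an arbitrary choice):

```python
# Choose the number of PCA components by cumulative explained variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                # hypothetical high dimensional data
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cumulative >= 0.95) + 1)      # components needed to explain ~95% of the variance
```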

Answer: This question has enough hints for you to start thinking! Since the data is spread around the median, let’s assume it follows a normal distribution.

We know that in a normal distribution ~68% of the data lies within 1 standard deviation of the mean (which equals the mode and median), which leaves ~32% of the data unaffected.

In an imbalanced data set, accuracy should not be used as a measure of performance, because 96% (as given) might just mean the majority class is being predicted correctly, while our class of interest is the minority class (4%): the people who actually got diagnosed with cancer.

Hence, in order to evaluate model performance, we should use sensitivity (true positive rate), specificity (true negative rate), and the F measure to determine the class-wise performance of the classifier.
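
A minimal sketch of those class-wise metrics, using hypothetical labels that mirror the 96% / 4% split described above and a model that always predicts the majority class:

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [0] * 96 + [1] * 4   # 96% negative, 4% positive (the cancer class)
y_pred = [0] * 100            # a model that always predicts the majority class

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # 0.0: the model misses every cancer case
specificity = tn / (tn + fp)  # 1.0
print(sensitivity, specificity, f1_score(y_true, y_pred, zero_division=0))
```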

On the other hand, a decision tree algorithm is known to work best to detect non-linear interactions.

The reason the decision tree failed to provide robust predictions is that it couldn’t map the linear relationship as well as a regression model did.

Therefore, we learned that a linear regression model can provide robust predictions, given that the data set satisfies its linearity assumptions.

You are assigned a new project which involves helping a food delivery company save more money.

A machine learning problem consists of three things: a pattern exists, the pattern cannot be pinned down mathematically, and you have data on it. Always look for these three factors to decide if machine learning is the right tool to solve a particular problem.

Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.

For example, say you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component will exhibit twice the variance it would exhibit with uncorrelated variables.
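
A quick synthetic check of that claim (the data below is made up: two nearly identical variables plus one independent one):

```python
# With 2 of 3 standardized variables strongly correlated, the first component
# absorbs roughly twice the variance it would with uncorrelated variables.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.05, size=1000)  # strongly correlated with a
c = rng.normal(size=1000)                  # independent

X = StandardScaler().fit_transform(np.column_stack([a, b, c]))
print(PCA().fit(X).explained_variance_ratio_)  # first component ≈ 2/3 of the variance
```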

Answer: As we know, ensemble learners are based on the idea of combining weak learners to create strong learners.

For example, if model 1 has classified User1122 as 1, there is a high chance models 2 and 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners are built on the premise of combining weak, uncorrelated models to obtain better predictions.

The k-means algorithm partitions a data set into clusters such that each cluster formed is homogeneous and the points within each cluster are close to each other.

The kNN algorithm tries to classify an unlabeled observation based on its k (a number you choose) nearest labeled neighbors.
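
A minimal kNN sketch on synthetic labeled data (dataset and k are arbitrary choices), to contrast with k-means, which needs no labels:

```python
# kNN is supervised: predictions are driven by the labels of the k nearest neighbors.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]), y[:5])
```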

The formula is R² = 1 – ∑(y – y´)² / ∑(y – ymean)², where y´ is the predicted value and ymean is the mean of y. In the absence of an intercept term, the denominator is computed as ∑y² instead of ∑(y – ymean)²; with this larger denominator, the ratio ∑(y – y´)² / ∑y² becomes smaller than it should be, resulting in a spuriously higher R².
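
A small illustration of that effect, assuming statsmodels is used (statsmodels reports an uncentred R², with ∑y² in the denominator, when no constant is included; the data below is synthetic):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(10, 2, 200)
y = 5 + 0.3 * x + rng.normal(0, 1, 200)

with_const = sm.OLS(y, sm.add_constant(x)).fit()
no_const = sm.OLS(y, x).fit()
print(with_const.rsquared)  # centred R², uses ∑(y - ymean)² in the denominator
print(no_const.rsquared)    # uncentred R², larger denominator ∑y² makes it look inflated
```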

In addition, we can calculate the VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity.
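
A short sketch of computing VIF with statsmodels on synthetic predictors (the data and the near-collinear pair are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.drop("const"))  # x1 and x2 should blow past 10; x3 stays near 1
```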

Answer: You can quote ISLR’s authors Hastie and Tibshirani, who assert that in the presence of a few variables with medium / large sized effects, lasso regression should be used, while in the presence of many variables with small / medium sized effects, ridge regression works better.

Conceptually, we can say that lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model.
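
A small synthetic sketch of that difference (data and alpha values are arbitrary): lasso drives some coefficients exactly to zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("lasso zeros:", np.sum(lasso.coef_ == 0))  # typically several exact zeros → variable selection
print("ridge zeros:", np.sum(ridge.coef_ == 0))  # typically none → only shrinkage
```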

No, we can’t conclude that a decrease in the number of pirates caused climate change, because there might be other factors (lurking or confounding variables) influencing this phenomenon.

Therefore, there might be a correlation between global average temperature and the number of pirates, but based on this information we can’t say that pirates died out because of the rise in global average temperature.

For example, if we calculate the covariance of salary ($) and age (years), we’ll get a covariance whose magnitude can’t be meaningfully compared with others because the two variables are on unequal scales.
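
A tiny illustration with made-up numbers: the covariance depends on the units of salary and age, while the correlation is scale-free.

```python
import numpy as np

salary = np.array([40_000, 55_000, 65_000, 90_000, 120_000], dtype=float)
age = np.array([25, 30, 35, 45, 52], dtype=float)

print(np.cov(salary, age)[0, 1])       # huge number, driven by the salary scale
print(np.corrcoef(salary, age)[0, 1])  # scale-free value between -1 and 1
```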

In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher, such that they can be corrected in the succeeding round.

This sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached.
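
A minimal sketch of that sequential reweighting idea using AdaBoost over decision stumps (the dataset and hyperparameters are arbitrary choices, and AdaBoost is just one example of a boosting algorithm):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)                 # a weak learner
boosted = AdaBoostClassifier(stump, n_estimators=100, random_state=0).fit(X, y)
print(boosted.score(X, y))
```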

In simple words, the tree algorithm finds the best possible feature that can divide the data set into the purest possible child nodes.
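
Two standard purity measures (not defined in the original) are Gini impurity = 1 − ∑ pᵢ² and entropy = −∑ pᵢ log₂ pᵢ, where pᵢ is the proportion of class i in a node; the chosen split is the one that most reduces the impurity of the resulting child nodes.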

A training error of 0.00 means the classifier has mimicked the training data patterns so closely that it has memorized details which are not present in unseen data.

Hence, when this classifier is run on an unseen sample, it can’t find those patterns and returns predictions with a higher error.

Answer: In such high dimensional data sets, we can’t use classical regression techniques, since their assumptions tend to fail.

When the number of predictors exceeds the number of observations (p > n), we can no longer calculate a unique least squares coefficient estimate: the variances become infinite, so OLS cannot be used at all.

To combat this situation, we can use penalized regression methods like lasso, LARS, and ridge, which can shrink the coefficients to reduce variance.

(Hint: Think SVM) Answer: In the case of linearly separable data, the convex hulls represent the outer boundaries of the two groups of data points.

With one-hot encoding, the dimensionality (i.e., the number of features) in a data set increases because it creates a new variable for each level present in the categorical variables.

In label encoding, the levels of a categorical variable get encoded as integers (for a binary variable, simply 0 and 1), so no new variable is created.
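
A short sketch of the two encodings on a toy column (the column and values are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Pune", "Delhi"]})

one_hot = pd.get_dummies(df["city"])               # adds one new column per level
labels = LabelEncoder().fit_transform(df["city"])  # a single column of integer codes
print(one_hot.shape[1], labels)
```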

In a time series problem, k-fold cross-validation can be troublesome because there might be some pattern in year 4 or 5 which is not present in year 3.
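
One common alternative, often called forward chaining, trains only on the past; for example: fold 1: train [year 1], test [year 2]; fold 2: train [years 1-2], test [year 3]; fold 3: train [years 1-3], test [year 4]; and so on.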

Collaborative filtering methods exploit the behavior of other users and items in terms of transaction history, ratings, selection and purchase information.

In the context of confusion matrix, we can say Type I error occurs when we classify a value as positive (1) when it is actually negative (0).

In contrast to simple random sampling, stratified sampling helps maintain the distribution of the target variable in the resulting samples as well.

We will consider adjusted R² as opposed to R² to evaluate model fit because R² increases as we add more variables, irrespective of any improvement in prediction accuracy.
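
For reference, the standard formula (not stated in the original) is adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of predictors, so adding an uninformative variable can actually lower it.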

For example, a gene mutation data set might result in a lower adjusted R² and still provide fairly good predictions, as compared to stock market data, where a lower adjusted R² implies that the model is not good.

Example: think of a chessboard; the movement made by a rook is measured by Manhattan distance because of its strictly vertical and horizontal moves.

If the business requirement is to build a model which can be deployed, then we’ll use regression or a decision tree model (easy to interpret and explain) instead of black box algorithms like SVM, GBM etc.

A high bias error means we have an under-performing model which keeps missing important trends. Variance, on the other hand, quantifies how much predictions made on the same observation differ from each other.

A high variance model will over-fit on your training population and perform badly on any observation beyond training.

Answer: OLS and maximum likelihood are the methods used by linear regression and logistic regression, respectively, to approximate the unknown parameter (coefficient) values.

In simple words, ordinary least squares (OLS) is a method used in linear regression which approximates the parameters so as to minimize the distance between actual and predicted values. Maximum likelihood helps in choosing the parameter values that are most likely to have produced the observed data.
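
As a rough symbolic sketch (not from the original): OLS picks the coefficients β that minimize ∑(yᵢ − xᵢᵀβ)², while maximum likelihood picks the parameters θ that maximize ∏ P(yᵢ | xᵢ; θ); for linear regression with Gaussian errors the two yield the same coefficients.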

These questions are meant to give you wide exposure to the types of questions asked at startups in machine learning.

Is Data More Important Than Algorithms In AI?

Whether data or algorithms are more important has been debated at length by experts (and non-experts) in the last few years, and the short version is that it depends on many details and nuances that take some time to understand.

I answered a pretty similar question some time ago in this Quora post: In machine learning, is more data always better than better algorithms?

That said, and as I mentioned in my answer to What does the market think AI (artificial Intelligence) means as compared to ML (machine learning)?, most people might not care much about the difference between ML and AI and will use them interchangeably.

So, I think it would be good to address this question from the particular viewpoint of recent advances in Deep Learning: In modern Deep Learning approaches is data more important than algorithms?

Without going into many details, deep learning algorithms have many parameters that need to be tuned and therefore need a lot of data in order to come up with somewhat generalizable models.

As a matter of fact, some have explained that there is a direct relation between the appearance of large public datasets like Imagenet and recent research advances.

That said, if you are trying to push the state of the art and come up with very concrete applications, yes, you will need to have internal data that you can leverage to train your cool new deep learning approach.

How to Prepare Data for Machine Learning and A.I.

In this video, Alina discusses how to prepare data for Machine Learning and AI. Artificial Intelligence is only as powerful as the quality of the data collection, so it's ...

Machine Learning Algorithms | Machine Learning Tutorial | Data Science Training | Edureka

Data Science Training - This Machine Learning Algorithms Tutorial shall teach you what machine learning is, and the ..

Predicting the Winning Team with Machine Learning

Can we predict the outcome of a football game given a dataset of past games? That's the question that we'll answer in this episode by using the scikit-learn ...

Lecture 16: Dynamic Neural Networks for Question Answering

Lecture 16 addresses the question "Can all NLP tasks be seen as question answering problems?". Key phrases: Coreference Resolution, Dynamic Memory ...

Representation, Modeling and Computation: Opportunities and Challenges of Modern Datasets

Machine learning from modern datasets presents novel opportunities and challenges. Larger and more diverse datasets enable us to answer more complex ...

Intro to Algorithms: Crash Course Computer Science #13

Algorithms are the sets of steps necessary to complete computation - they are at the heart of what our devices actually do. And this isn't a new concept. Since the ...

Linear Regression Algorithm | Linear Regression in Python | Machine Learning Algorithm | Edureka

Machine Learning Training with Python: This Linear Regression Algorithm video is designed in a way that you learn about the ..

Data Science Demo - Customer Churn Analysis

This introduction to Data Science provides a demonstration of analyzing customer data to predict churn using the R programming language. MetaScale walks ...

Euclidean Distance - Practical Machine Learning Tutorial with Python p.15

In the previous tutorial, we covered how to use the K Nearest Neighbors algorithm via Scikit-Learn to achieve 95% accuracy in predicting benign vs malignant ...

Google Coding Interview Question and Answer #1: First Recurring Character

Find the first recurring character in the given string! A variation of this problem: find the first NON-recurring character. This variation problem and many others are ...