AI News, 41 Essential Machine Learning Interview Questions (with answers)
41 Essential Machine Learning Interview Questions (with answers)
We’ve traditionally seen machine learning interview questions pop up in several categories.
The third has to do with your general interest in machine learning: you’ll be asked about what’s going on in the industry and how you keep up with the latest machine learning trends.
Finally, there are company or industry-specific questions that test your ability to take your general machine learning knowledge and turn it into actionable points to drive the bottom line forward.
We’ve divided this guide to machine learning interview questions into the categories we mentioned above so that you can more easily get to the information you need when it comes to machine learning interview questions.
This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.
The bias-variance decomposition essentially decomposes the learning error from any algorithm by adding the bias, the variance and a bit of irreducible error due to noise in the underlying dataset.
For example, in order to do classification (a supervised learning task), you’ll need to first label the data you’ll use to train the model to classify data into your labeled groups.
K-means clustering requires only a set of unlabeled points and a threshold: the algorithm will take unlabeled points and gradually learn how to cluster them into groups by computing the mean of the distance between different points.
It’s often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fall-out or the probability it will trigger a false alarm (false positives).
More reading: Precision and recall (Wikipedia) Recall is also known as the true positive rate: the amount of positives your model claims compared to the actual number of positives there are throughout the data.
Precision is also known as the positive predictive value, and it is a measure of the amount of accurate positives your model claims compared to the number of positives it actually claims.
It can be easier to think of recall and precision in the context of a case where you’ve predicted that there were 10 apples and 5 oranges in a case of 10 apples.
Mathematically, it’s expressed as the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of a condition.
Say you had a 60% chance of actually having the flu after a flu test, but out of people who had the flu, the test will be false 50% of the time, and the overall population only has a 5% chance of having the flu.
(Quora) Despite its practical applications, especially in text mining, Naive Bayes is considered “Naive” because it makes an assumption that is virtually impossible to see in real-life data: the conditional probability is calculated as the pure product of the individual probabilities of components.
clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a baby.
More reading: Deep learning (Wikipedia) Deep learning is a subset of machine learning that is concerned with neural networks: how to use backpropagation and certain principles from neuroscience to more accurately model large sets of unlabelled or semi-structured data.
More reading: Using k-fold cross-validation for time-series model selection (CrossValidated) Instead of using standard k-folds cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data —
More reading: Pruning (decision trees) Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model.
For example, if you wanted to detect fraud in a massive dataset with a sample of millions, a more accurate model would most likely predict no fraud at all if only a vast minority of cases were fraud.
More reading: Regression vs Classification (Math StackExchange) Classification produces discrete values and dataset to strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points.
You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (ex: If you wanted to know whether a name was male or female rather than just how correlated they were with male and female names.) Q21- Name an example where ensemble techniques might be useful.
They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data). You could list some examples of ensemble methods, from bagging to boosting to a “bucket of models” method and demonstrate how they could increase predictive power.
(Quora) This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations.
There are three main methods to avoid overfitting: 1- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
More reading: How to Evaluate Machine Learning Algorithms (Machine Learning Mastery) You would first split the dataset into training and test sets, or perhaps use cross-validation techniques to further segment the dataset into composite sets of training and test sets within the data.
More reading: Kernel method (Wikipedia) The Kernel trick involves kernel functions that can enable in higher-dimension spaces without explicitly calculating the coordinates of points within that dimension: instead, kernel functions compute the inner products between the images of all pairs of data in a feature space.
This allows them the very useful attribute of calculating the coordinates of higher dimensions while being computationally cheaper than the explicit calculation of said coordinates. Many algorithms can be expressed in terms of inner products.
More reading: Writing pseudocode for parallel programming (Stack Overflow) This kind of question demonstrates your ability to think in parallelism and how you could handle concurrency in programming implementations dealing with big data.
For example, if you were interviewing for music-streaming startup Spotify, you could remark that your skills at developing a better recommendation model would increase user retention, which would then increase revenue in the long run.
The startup metrics Slideshare linked above will help you understand exactly what performance indicators are important for startups and tech companies as they think about revenue and growth.
Your interviewer is trying to gauge if you’d be a valuable member of their team and whether you grasp the nuances of why certain things are set the way they are in the company’s data process based on company- or industry-specific conditions.
This overview of deep learning in Nature by the scions of deep learning themselves (from Hinton to Bengio to LeCun) can be a good reference paper and an overview of what’s happening in deep learning —
More reading: Mastering the game of Go with deep neural networks and tree search (Nature) AlphaGo beating Lee Sidol, the best human player at Go, in a best-of-five series was a truly seminal event in the history of machine learning and deep learning.
The Nature paper above describes how this was accomplished with “Monte-Carlo tree search with deep neural networks that have been trained by supervised learning, from human expert games, and by reinforcement learning from games of self-play.” Want more? Brush up your skills with our free machine learning course.
Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to 'learn' (e.g., progressively improve performance on a specific task) with data, without being explicitly programmed.
These analytical models allow researchers, data scientists, engineers, and analysts to 'produce reliable, repeatable decisions and results' and uncover 'hidden insights' through learning from historical relationships and trends in the data.
Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: 'A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.'
Developmental learning, elaborated for robot learning, generates its own sequences (also called curriculum) of learning situations to cumulatively acquire repertoires of novel skills through autonomous self-exploration and social interaction with human teachers and using guidance mechanisms such as active learning, maturation, motor synergies, and imitation.
Work on symbolic/knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval.:708–710;
Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases).
Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in knowledge discovery and data mining (KDD) the key task is the discovery of previously unknown knowledge.
Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by other supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.
Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances (for example, in classification, one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples).
The difference between the two fields arises from the goal of generalization: while optimization algorithms can minimize the loss on a training set, machine learning is concerned with minimizing the loss on unseen samples.
The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences) and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.
An artificial neural network (ANN) learning algorithm, usually called 'neural network' (NN), is a learning algorithm that is vaguely inspired by biological neural networks.
They are usually used to model complex relationships between inputs and outputs, to find patterns in data, or to capture the statistical structure in an unknown joint probability distribution between observed variables.
Falling hardware prices and the development of GPUs for personal use in the last few years have contributed to the development of the concept of deep learning which consists of multiple hidden layers in an artificial neural network.
Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesized logic program that entails all positive and no negative examples.
Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.
Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to some predesignated criterion or criteria, while observations drawn from different clusters are dissimilar.
Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated for example by internal compactness (similarity between members of the same cluster) and separation between different clusters.
Bayesian network, belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph (DAG).
Representation learning algorithms often attempt to preserve the information in their input but transform it in a way that makes it useful, often as a pre-processing step before performing classification or predictions, allowing reconstruction of the inputs coming from the unknown data generating distribution, while not being necessarily faithful for configurations that are implausible under that distribution.
Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features.
genetic algorithm (GA) is a search heuristic that mimics the process of natural selection, and uses methods such as mutation and crossover to generate new genotype in the hope of finding good solutions to a given problem.
In 2006, the online movie company Netflix held the first 'Netflix Prize' competition to find a program to better predict user preferences and improve the accuracy on its existing Cinematch movie recommendation algorithm by at least 10%.
Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ('everything is a recommendation') and they changed their recommendation engine accordingly.
Reasons for this are numerous: lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks and algorithms, wrong tools and people, lack of resources, and evaluation problems.
Classification machine learning models can be validated by accuracy estimation techniques like the Holdout method, which splits the data in a training and test set (conventionally 2/3 training set and 1/3 test set designation) and evaluates the performance of the training model on the test set.
In comparison, the N-fold-cross-validation method randomly splits the data in k subsets where the k-1 instances of the data are used to train the model while the kth instance is used to test the predictive ability of the training model.
For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants against similarity to previous successful applicants.
There is huge potential for machine learning in health care to provide professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these 'greed' biases are addressed.
Data, Learning and Modeling
There are key concepts in machine learning that lay the foundation for understanding the field.
You will also learn the concepts and terms used to describe learning and modeling from data that will provide a valuable intuition for your journey through the field of machine learning.
You can have strings, dates, times, and more complex types, but typically they are reduced to real or categorical values when working with traditional machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning methods we typically need a few datasets for different purposes.
Generalization: Generalization is required because the model that is prepared by a machine learning algorithm needs to make predictions or decisions based on specific data instances that were not seen during training.
Under-Learning: When a model has not learned enough structure from the database because the learning process was terminated early, this is called under-learning.
Online learning requires methods that are robust to noisy data but can produce models that are more in tune with the current state of the domain.
Of all the possible models that exist for a problem, a given algorithm and algorithm configuration on the chosen training dataset will provide a finally selected model.
Biases are introduced by the generalizations made in the model including the configuration of the model and the selection of the algorithm to generate the model.
A machine learning method can create a model with a low or a high bias and tactics can be used to reduce the bias of a highly biased model.
A tactic to reduce the variance of a model is to run it multiple times on a dataset with different initial conditions and take the average accuracy as the models performance.
A low bias model will have a high variance and will need to be trained for a long time or many times to get a usable model.
A high bias model will have a low variance and will train quickly, but suffer poor and limited performance.
How to choose algorithms for Azure Machine Learning Studio
The answer to the question "What machine learning algorithm should I use?"
It depends on how the math of the algorithm was translated into instructions for the computer you are using.
Even the most experienced data scientists can't tell which algorithm will perform best before trying them.
The Microsoft Azure Machine Learning Algorithm Cheat Sheet helps you choose the right machine learning algorithm for your predictive analytics solutions from the Azure Machine Learning Studio library of algorithms. This
This cheat sheet has a very specific audience in mind: a beginning data scientist with undergraduate-level machine learning, trying to choose an algorithm to start with in Azure Machine Learning Studio.
That means that it makes some generalizations and oversimplifications, but it points you in a safe direction.
As Azure Machine Learning grows to encompass a more complete set of available methods, we'll add them.
These recommendations are compiled feedback and tips from many data scientists and machine learning experts.
We didn't agree on everything, but I've tried to harmonize our opinions into a rough consensus.
data scientists I talked with said that the only sure way to find
Supervised learning algorithms make predictions based on a set of examples.
For instance, historical stock prices can be used to hazard guesses
company's financial data, the type of industry, the presence of disruptive
it uses that pattern to make predictions for unlabeled testing data—tomorrow's
Supervised learning is a popular and useful type of machine learning.
In unsupervised learning, data points have no labels associated with them.
grouping it into clusters or finding different ways of looking at complex
In reinforcement learning, the algorithm gets to choose an action in response
signal a short time later, indicating how good the decision was. Based
where the set of sensor readings at one point in time is a data
The number of minutes or hours necessary to train a model varies a great deal
time is limited it can drive the choice of algorithm, especially when
regression algorithms assume that data trends follow a straight line.
These assumptions aren't bad for some problems, but on others they bring
Non-linear class boundary - relying on a linear classification algorithm
Data with a nonlinear trend - using a linear regression method would generate
much larger errors than necessary Despite their dangers, linear algorithms are very popular as a first line
Parameters are the knobs a data scientist gets to turn when setting up an
as error tolerance or number of iterations, or options between variants
to make sure you've spanned the parameter space, the time required to
train a model increases exponentially with the number of parameters.
For certain types of data, the number of features can be very large compared
algorithms, making training time unfeasibly long.
Some learning algorithms make particular assumptions about the structure of
- shows excellent accuracy, fast training times, and the use of linearity ○
- shows good accuracy and moderate training times As mentioned previously, linear regression fits
curve instead of a straight line makes it a natural fit for dividing
logistic regression to two-class data with just one feature - the class
boundary is the point at which the logistic curve is just as close to both classes Decision forests (regression, two-class, and multiclass), decision
all based on decision trees, a foundational machine learning concept.
decision tree subdivides a feature space into regions of roughly uniform
values Because a feature space can be subdivided into arbitrarily small regions,
it's easy to imagine dividing it finely enough to have one data point
a large set of trees are constructed with special mathematical care
memory at the expense of a slightly longer training time.
Boosted decision trees avoid overfitting by limiting how many times they can
a variation of decision trees for the special case where you want to know
that input features are passed forward (never backward) through a sequence
a long time to train, particularly for large data sets with lots of features.
typical support vector machine class boundary maximizes the margin separating
Any new data points that fall far outside that boundary
PCA-based anomaly detection - the vast majority of the data falls into
data set is grouped into five clusters using K-means There is also an ensemble one-v-all multiclass classifier, which
Machine Learning Algorithms: Which One to Choose for Your Problem
First of all, you should distinguish 4 types of Machine Learning tasks: Supervised learning is the task of inferring a function from labeled training data.
By fitting to the labeled training set, we want to find the most optimal model parameters to predict unknown labels on other objects (test set).
The method allows us to significantly improve accuracy, because we can use unlabeled data in the train set with a small amount of labeled data.
RL is an area of machine learning concerned with how software agents ought to take actions in some environment to maximize some notion of cumulative reward.
Now that we have some intuition about types of machine learning tasks, let’s explore the most popular algorithms with their applications in real life.
Your goal is to find the most optimal weights w1,…wn and bias for these features according to some loss function, for example, MSE or MAE for a regression problem.
In the case of MSE there is a mathematical equation from the least squares method: In practice, it’s easier to optimize it with gradient descent, that is much more computationally efficient.
Despite the simplicity of this algorithm, it works pretty well when you have thousands of features, for example, bag of words or n-gramms in text analysis.
More complex algorithms suffer from overfitting many features and not huge datasets, while linear regression provides decent quality.
Since this algorithm calculates the probability of belonging to each class, you should take into account how much the probability differs from 0 or 1 and average it over all objects as we did with linear regression.
If y equals 0, then the first addend under sum equals 0 and the second is the less the closer our predicted y_pred to 0 according to the properties of the logarithm.
It takes linear combination of features and applies non-linear function (sigmoid) to it, so it’s a very very small instance of neural network!
In regression trees we minimize the sum of a squared error between the predictive variable of the target values of the points that fall in that region and the one we assign to it.
Secondly, the result depends on the points randomly chosen at the beginning and the algorithm doesn’t guarantee that we’ll achieve the global minimum of the functional.
You have no chance to remember all the information, but you want to maximize information that you can remember in the time available, for example, learning first the theorems that occur in many exam tickets and so on.
hope that I could explain to you common perceptions of the most used machine learning algorithms and give intuition on how to choose one for your specific problem.
What to do with “small” data?
Many technology companies now have teams of smart data-scientists, versed in big-data infrastructure tools and machine learning algorithms, but every now and then, a data set with very few data points turns up and none of these algorithms seem to be working properly anymore.
Most data science, relevance, and machine learning activities in technology companies have been focused around “Big Data” and scenarios with huge data sets.
And most data scientists and machine learning practitioners have gained experience is such situations, have grown accustomed to the appropriate algorithms, and gained good intuitions about the usual trade-offs (bias-variance, flexibility-stability, hand-crafted features vs.
the set of all linear models with 3 non-zero weights, the set of decision trees with depth <= 4, the set of histograms with 10 equally-spaced bins).
If you try too many different techniques, and use a hold-out set to compare between them, be aware of the statistical power of the results you are getting, and be aware that the performance you are getting on this set is not a good estimator for out of sample performance.
7- Do use Regularization Regularization is an almost-magical solution that constraints model fitting and reduces the effective degrees of freedom without reducing the actual number of parameters in the model.
L1 regularization produces models with fewer non-zero parameters, effectively performing implicit feature selection, which could be desirable for explainability of performance in production, while L2 regularization produces models with more conservative (closer to zero) parameters and is effectively similar to having strong zero-centered priors for the parameters (in the Bayesian world).
8- Do use Model Averaging Model averaging has similar effects to regularization is that it reduces variance and enhances generalization, but it is a generic technique that can be used with any type of models or even with heterogeneous sets of models.
9- Try Bayesian Modeling and Model Averaging Again, not a favorite technique of mine, but Bayesian inference may be well suited for dealing with smaller data sets, especially if you can use domain expertise to construct sensible priors.
For regression analysis this usually takes the form of predicting a range of values that is calibrated to cover the true value 95% of the time or in the case of classification it could be just a matter of producing class probabilities.
- On Wednesday, January 16, 2019
The Best Way to Prepare a Dataset Easily
In this video, I go over the 3 steps you need to prepare a dataset to be fed into a machine learning model. (selecting the data, processing it, and transforming it).
Natalie Hockham: Machine learning with imbalanced data sets
Classification algorithms tend to perform poorly when data is skewed towards one class, as is often the case when tackling real-world problems such as fraud ...
Training a machine learning model with scikit-learn
Now that we're familiar with the famous iris dataset, let's actually use a classification model in scikit-learn to predict the species of an iris! We'll learn how the ...
Comparing machine learning models in scikit-learn
We've learned how to train different machine learning models and make predictions, but how do we actually choose which model is "best"? We'll cover the ...
Prepare your dataset for machine learning (Coding TensorFlow)
Deploying Python Machine Learning Models in Production | SciPy 2015 | Krishna Sridhar
Hierarchical Clustering - Fun and Easy Machine Learning
Hierarchical Clustering - Fun and Easy Machine Learning with Examples
Machine Learning on Molecular Data || Michael Craig
At GTN we are combining ideas from quantum physics and chemistry with machine learning to aid the process of discovering new medicines. In this talk I will ...
Machine Learning 010 Importing the Dataset
The Best Way to Visualize a Dataset Easily
In this video, we'll visualize a dataset of body metrics collected by giving people a fitness tracking device. We'll go over the steps necessary to preprocess the ...