Frequently Asked Data Science Interview Questions

Here’s a list of frequently asked Data Science interview questions, covering a wide range of topics you might be asked about.

The answers to these questions depend on the candidate’s hands-on experience and the datasets they have worked on.

How do you optimize a web crawler to run much faster, extract better information, and summarize data to produce cleaner databases?

In terms of access speed (assuming both fit within RAM), is it better to have 100 small hash tables or one big hash table in memory?

Have you used any of the following: time series models, cross-correlations with time lags, correlograms, spectral analysis, signal processing and filtering techniques?

How would you perform clustering on one million unique keywords, assuming you have 10 million data points, each consisting of two keywords and a metric measuring how similar the two keywords are?

40 Interview Questions asked at Startups in Machine Learning / Data Science

Machine learning and data science are increasingly seen as the drivers of the next industrial revolution.

This also means that there are numerous exciting startups looking for data scientists. What better start could there be for your aspiring career!

You might also find some really difficult technical questions on your way. The set of questions asked depends on what the startup does.

If you can answer and understand these questions, rest assured, you will put up a tough fight in your job interview.

(You are free to make practical assumptions.) Answer: Processing high-dimensional data on a machine with limited memory is a strenuous task; your interviewer would be fully aware of that.

Don’t forget, that’s the motive of doing PCA: we aim to select fewer components (than features) that can explain the maximum variance in the data set.

By doing rotation, the relative locations of the components don’t change; only the actual coordinates of the points change.

If we don’t rotate the components, the effect of PCA will diminish and we’ll have to select a larger number of components to explain the variance in the data set.
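As an illustration (my own sketch, not part of the original answer), scikit-learn’s PCA exposes the explained variance ratio, from which one can pick the smallest number of components covering a chosen threshold; the toy data and the 95% cutoff below are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                   # toy data: 500 samples, 20 features
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)   # make two features correlated

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.95)) + 1
print(f"components needed for 95% of the variance: {n_components}")
```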

Answer: This question has enough hints to get you started! Since the data is spread around the median, let’s assume it’s a normal distribution.

We know that in a normal distribution ~68% of the data lies within 1 standard deviation of the mean (or median, or mode, since they coincide), which leaves ~32% of the data unaffected.
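A quick way to verify the ~68% figure (a check I’m adding, using scipy’s standard normal CDF):

```python
from scipy.stats import norm

within_1sd = norm.cdf(1) - norm.cdf(-1)
print(f"{within_1sd:.4f}")  # ~0.6827 within one standard deviation, ~32% outside
```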

In an imbalanced data set, accuracy should not be used as a measure of performance, because 96% (as given) might just mean the majority class is being predicted correctly, while our class of interest is the minority class (4%): the people who actually were diagnosed with cancer.

Hence, to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine the class-wise performance of the classifier.
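A hedged sketch of the point, with a hypothetical 96/4 split and a degenerate majority-class predictor, using scikit-learn’s metrics:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([0] * 96 + [1] * 4)   # hypothetical 96/4 class imbalance
y_pred = np.zeros(100, dtype=int)       # a "model" that always predicts the majority class

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)            # True Positive Rate: 0.0 here
specificity = tn / (tn + fp)            # True Negative Rate: 1.0 here
print(sensitivity, specificity, f1_score(y_true, y_pred, zero_division=0))
# accuracy would be 96%, yet sensitivity and F1 are 0: the minority class is never caught
```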

On the other hand, a decision tree algorithm is known to work best for detecting non-linear interactions.

The reason the decision tree failed to provide robust predictions is that it couldn’t map the linear relationship as well as a regression model did.

Therefore, we learned that a linear regression model can provide robust predictions given a data set that satisfies its linearity assumptions.

You are assigned a new project which involves helping a food delivery company save more money.

A machine learning problem consists of three things; always look for these three factors to decide whether machine learning is the right tool to solve a particular problem.

Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.

For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component will exhibit twice the variance it would exhibit with uncorrelated variables.
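A toy demonstration of this inflation (my own sketch; the data and correlation strength are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
uncorr = np.column_stack([a, rng.normal(size=1000), rng.normal(size=1000)])
corr = np.column_stack([a, a + 0.05 * rng.normal(size=1000), rng.normal(size=1000)])

for name, X in [("uncorrelated", uncorr), ("two correlated", corr)]:
    ratio = PCA(n_components=1).fit(X).explained_variance_ratio_[0]
    print(f"{name}: first component explains {ratio:.0%} of the variance")
# with two of the three variables correlated, PC1 explains roughly twice as much
```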

Answer: As we know, ensemble learners are based on the idea of combining weak learners to create strong learners.

For example: if model 1 has classified User1122 as 1, there is a high chance models 2 and 3 would have done the same, even if its actual value is 0.

The k-means algorithm partitions a data set into clusters such that each cluster formed is homogeneous and the points within it are close to each other.

The kNN algorithm tries to classify an unlabeled observation based on its k (which can be any number) surrounding neighbors.
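A minimal side-by-side sketch, assuming scikit-learn; the four points and their labels are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])

# k-means: unsupervised, no labels needed; it finds the grouping itself
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# kNN: supervised, needs labels; it classifies a new point by its neighbors
y = np.array([0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(cluster_ids, knn.predict([[4.8, 5.2]]))
```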

In the absence of the intercept term, the model can make no such evaluation: R² is then computed as 1 - ∑(y - ŷ)² / ∑y² rather than 1 - ∑(y - ŷ)² / ∑(y - ȳ)², and with the larger denominator the ratio becomes smaller than it should be, resulting in a higher R².
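A small illustration I’m adding (not from the original answer), assuming statsmodels, which switches to the uncentered total sum of squares when the constant is omitted:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=100)
y = 5 + 0.2 * x + rng.normal(size=100)   # large mean, weak slope

with_const = sm.OLS(y, sm.add_constant(x)).fit()
no_const = sm.OLS(y, x).fit()            # statsmodels uses uncentered R² here
print(with_const.rsquared, no_const.rsquared)  # the no-intercept R² comes out inflated
```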

In addition, we can calculate the VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity.
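A sketch of the VIF check, assuming statsmodels; the data and the near-collinear column are fabricated for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = df["x1"] + 0.1 * rng.normal(size=200)   # nearly collinear with x1
df["x3"] = rng.normal(size=200)

X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 1))
# x1 and x2 should show VIF well above 10; x3 stays near 1
```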

Answer: You can quote ISLR’s authors Hastie and Tibshirani, who assert that in the presence of a few variables with medium/large effect sizes, lasso regression should be used.

Conceptually, we can say that lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model.
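A short illustration of that contrast (my own sketch; the data, alphas, and signal structure are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only 2 true signals

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))  # several exact zeros
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # typically none
```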

No, we can’t conclude that the decrease in the number of pirates caused climate change, because there might be other (lurking or confounding) variables influencing this phenomenon.

Therefore, there might be a correlation between global average temperature and the number of pirates, but based on this information we can’t say that pirates died out because of the rise in global average temperature.

For example: if we calculate the covariance of salary ($) and age (years), we’ll get a number whose magnitude depends on the units involved, so covariances computed on unequal scales can’t be compared.
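A quick numeric illustration (hypothetical ages and salaries):

```python
import numpy as np

age = np.array([25, 30, 35, 40, 45], dtype=float)                          # years
salary = np.array([40_000, 50_000, 65_000, 72_000, 90_000], dtype=float)   # dollars

print(np.cov(age, salary)[0, 1])          # scale-dependent covariance
print(np.cov(age, salary / 1000)[0, 1])   # same data in $k: covariance changes
print(np.corrcoef(age, salary)[0, 1])     # correlation: scale-free, unchanged
```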

In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher so that they can be corrected in the succeeding round.

This sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached.
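A schematic of the AdaBoost-style weight update described above (a sketch, not the full algorithm; the toy labels and predictions are made up):

```python
import numpy as np

# labels in {-1, +1}; two of five round-1 predictions are wrong
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])

w = np.full(len(y_true), 1 / len(y_true))   # uniform starting weights
err = np.sum(w[y_true != y_pred])           # weighted error rate (0.4 here)
alpha = 0.5 * np.log((1 - err) / err)       # weight of this weak learner
w *= np.exp(-alpha * y_true * y_pred)       # up-weight the mistakes
w /= w.sum()                                # renormalize
print(w)                                    # misclassified points now weigh more
```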

In simple words, the tree algorithm finds the best possible feature that can divide the data set into the purest possible child nodes.

A training error of 0.00 means the classifier has mimicked the training data patterns to such an extent that those exact patterns are not available in unseen data.

Hence, when this classifier was run on an unseen sample, it couldn’t find those patterns and returned predictions with higher error.

Answer: In such high-dimensional data sets, we can’t use classical regression techniques, since their assumptions tend to fail.

When the number of features p exceeds the number of observations n, we can no longer calculate a unique least squares coefficient estimate: the variances become infinite, so OLS cannot be used at all.

To combat this situation, we can use penalized regression methods like lasso, LARS, and ridge, which shrink the coefficients to reduce variance.
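A sketch of the p > n situation with lasso, assuming scikit-learn; the dimensions and alpha are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 30, 50                              # more features than observations
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.1 * rng.normal(size=n)     # a single true signal

coef = Lasso(alpha=0.05).fit(X, y).coef_
print("nonzero coefficients:", int(np.sum(coef != 0)))  # a small, stable subset
```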

(Hint: Think SVM.) Answer: In the case of linearly separable data, the convex hulls represent the outer boundaries of the two groups of data points.

With one-hot encoding, the dimensionality (i.e., the number of features) of a data set increases because it creates a new variable for each level present in the categorical variables.

With label encoding, the levels of a categorical variable get encoded as integers (0, 1, 2, …), so no new variable is created.
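A minimal illustration with pandas (the example series is made up); get_dummies performs one-hot encoding, while factorize is one way to label-encode:

```python
import pandas as pd

s = pd.Series(["red", "green", "blue", "green"])

print(pd.get_dummies(s))            # one-hot: three new columns, one per level
codes, levels = pd.factorize(s)     # label encoding: one column of integer codes
print(codes, list(levels))          # [0 1 2 1] ['red', 'green', 'blue']
```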

In a time series problem, k-fold cross-validation can be troublesome because there might be a pattern in year 4 or 5 that is not present in year 3.
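One common remedy (an addition of mine, not stated in the original answer) is forward chaining, e.g. scikit-learn’s TimeSeriesSplit, which always trains on the past and validates on the future:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)    # ten time-ordered observations
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)  # train always precedes test
```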

They exploit the behavior of other users and items in terms of transaction history, ratings, selection, and purchase information.

In the context of confusion matrix, we can say Type I error occurs when we classify a value as positive (1) when it is actually negative (0).

By contrast, stratified sampling helps maintain the distribution of the target variable in the resulting samples as well.

We consider adjusted R² rather than R² to evaluate model fit, because R² increases as we add more variables irrespective of any improvement in prediction accuracy.

For example: a gene mutation data set might yield a low adjusted R² and still provide fairly good predictions, whereas in stock market data a low adjusted R² implies the model is not good.

Example: think of a chessboard; the movement made by a rook is measured by Manhattan distance because it moves only along vertical and horizontal lines.

If the business requirement is to build a model which can be deployed, then we’ll use regression or a decision tree model (easy to interpret and explain) instead of black box algorithms like SVM, GBM etc.

A high bias error means we have an under-performing model that keeps missing important trends. Variance, on the other hand, quantifies how predictions made on the same observation differ from each other.

A high variance model will over-fit on your training population and perform badly on any observation beyond training.

Answer: OLS and maximum likelihood are the methods used by the respective regression models to approximate the unknown parameter (coefficient) values.

In simple words, ordinary least squares (OLS) is a method used in linear regression that approximates the parameters by minimizing the distance between actual and predicted values. Maximum likelihood helps in choosing the parameter values that maximize the likelihood of producing the observed data.
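A minimal sketch of the least-squares estimate via numpy (my own illustration; for Gaussian noise the maximum-likelihood estimate coincides with this least-squares solution):

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + one feature
beta_true = np.array([2.0, -1.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes ||y - X·beta||²
print(beta_hat)                                    # close to [2.0, -1.5]
```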

These questions are meant to give you wide exposure to the types of questions asked at startups in machine learning.

40 Questions to test a Data Scientist on Clustering Techniques (Skill test Solution)

The idea of creating machines which learn by themselves has been driving humans for decades now.

1, 2, 3 and 4 Solution: (E) Generally, movie recommendation systems cluster users into a finite number of similar groups based on their previous activities and profile.

In some scenarios, this can also be approached as a classification problem for assigning the most appropriate movie class to the user of a specific group of users.

Also, a movie recommendation system can be viewed as a reinforcement learning problem where it learns by its previous recommendations and improves the future recommendations.

1, 2, 3 and 4 Solution: (E) Sentiment analysis at the fundamental level is the task of classifying the sentiments represented in an image, text or speech into a set of defined sentiment classes like happy, sad, excited, positive, negative, etc.

It can also be viewed as a regression problem for assigning a sentiment score of say 1 to 10 for a corresponding image, text or speech.

Another way of looking at sentiment analysis is from a reinforcement learning perspective, where the algorithm constantly learns from the accuracy of past sentiment analyses to improve future performance.

Which of the following is the most appropriate strategy for data cleaning before performing clustering analysis, given a less than desirable number of data points? Options: A.

None of these Solution: (A) When the K-Means algorithm has reached the local or global minima, it will not alter the assignment of data points to clusters for two successive iterations.

K-medoids clustering algorithm Solution: (A) Out of all the options, K-Means clustering algorithm is most sensitive to outliers as it uses the mean of cluster data points to find the cluster center.

All of the above Solution: (F) Creating an input feature for cluster ids as ordinal variable or creating an input feature for cluster centroids as a continuous variable might not convey any relevant information to the regression model for multidimensional data.

For single-dimension clustering, however, they do convey information. For example: to cluster people into two groups based on their hair length, storing the cluster ID as an ordinal variable and cluster centroids as continuous variables will convey meaningful information.

4 Solution: (B) Since the number of vertical lines intersecting the red horizontal line at y=2 in the dendrogram is 2, two clusters will be formed.

1, 2, 3 and 4 Solution: (D) K-Means clustering algorithm fails to give good results when the data contains outliers, the density spread of data points across the data space is different and the data points follow non-convex shapes.

None of them Solution: (A) Clustering analysis is not negatively affected by heteroscedasticity, but the results are negatively impacted by multicollinearity of the features/variables used in clustering, since a correlated feature/variable carries extra weight in the distance calculation.

As another example, the distance between clusters {3, 6} and {2, 5} is given by dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = min(0.1483, 0.2540, 0.2843, 0.3921) = 0.1483.

Solution: (B) For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined to be the maximum of the distance between any two points in the two different clusters.

This is because the dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0.1513, 0.2216) = 0.2216, which is smaller than dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = max(0.1483, 0.2540, 0.2843, 0.3921) = 0.3921 and dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) = max(0.2218, 0.2347) = 0.2347.
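A generic sketch, assuming scipy (toy data, not the quiz’s distance matrix), showing the two linkage criteria side by side:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(7)
X = rng.normal(size=(6, 2))
print(linkage(X, method="single"))     # MIN version: merge on smallest pairwise distance
print(linkage(X, method="complete"))   # MAX version: merge on largest pairwise distance
```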

D.  Solution: (C) For the group average version of hierarchical clustering, the proximity of two clusters is defined to be the average of the pairwise proximities between all pairs of points in the different clusters.

dist({3, 6, 4}, {1}) = (0.2218 + 0.3688 + 0.2347)/(3 ∗ 1) = 0.2751.

dist({3, 6, 4}, {2, 5}) = (0.1483 + 0.2843 + 0.2540 + 0.3921 + 0.2042 + 0.2932)/(6∗1) = 0.2637.

Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), these two clusters are merged at the fourth stage.

All of the above Solution: (C) All of the mentioned techniques are valid for treating missing values before clustering analysis but only imputation with EM algorithm is iterative in its functioning.

Note: Soft assignment can be considered as the probability of being assigned to each cluster (say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1). Which of the following algorithm(s) allows soft assignments?

After the first iteration, clusters C1, C2, C3 have the following observations: C1: {(2,2), (4,4), (6,6)} C2: {(0,4), (4,0)} C3: {(5,5), (9,9)} What will be the cluster centroids if you want to proceed to the second iteration?

None of these Solution: (A)
Centroid of C1 = ((2+4+6)/3, (2+4+6)/3) = (4, 4)
Centroid of C2 = ((0+4)/2, (4+0)/2) = (2, 2)
Centroid of C3 = ((5+9)/2, (5+9)/2) = (7, 7)
Hence, C1: (4,4), C2: (2,2), C3: (7,7)
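The same centroid arithmetic in numpy, for checking by hand (an illustration I’m adding):

```python
import numpy as np

clusters = {
    "C1": np.array([[2, 2], [4, 4], [6, 6]]),
    "C2": np.array([[0, 4], [4, 0]]),
    "C3": np.array([[5, 5], [9, 9]]),
}
for name, pts in clusters.items():
    print(name, pts.mean(axis=0))   # C1: [4. 4.], C2: [2. 2.], C3: [7. 7.]
```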

After the first iteration, clusters C1, C2, C3 have the following observations: C1: {(2,2), (4,4), (6,6)} C2: {(0,4), (4,0)} C3: {(5,5), (9,9)} What will be the Manhattan distance of observation (9, 9) from cluster centroid C1? Since C1’s centroid is (4, 4), the distance is |9 - 4| + |9 - 4| = 10.
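A one-line check, assuming scipy:

```python
from scipy.spatial.distance import cityblock

print(cityblock([9, 9], [4, 4]))   # |9 - 4| + |9 - 4| = 10
```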

The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data.
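A sketch of the elbow heuristic, assuming scikit-learn; the three synthetic blobs are an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])  # 3 blobs

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # the drop flattens ("elbows") after k = 3
```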

In this plot, the optimal number of clusters of grid cells in the study area should be 2, where the value of the average silhouette coefficient is highest.

In addition, the value of the average silhouette coefficient at k = 6 is also very high, just lower than at k = 2.
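A companion sketch for the silhouette criterion (my own illustration on synthetic two-cluster data, assuming scikit-learn):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=2, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # highest score marks the best k
```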

The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster’s randomly assigned points.

None of the above Solution: (B) All of the above statements are true except the fifth; instead, K-Means is a special case of the EM algorithm in which only the centroids of the cluster distributions are calculated at each iteration.

None of the above Solution: (A) The lowest and highest possible values of the F score are 0 and 1, with 1 representing that every data point is assigned to the correct cluster and 0 representing that the precision and/or recall of the clustering analysis is 0.

6 Solution: (D) Here:
True Positives, TP = 1200
True Negatives, TN = 600 + 1600 = 2200
False Positives, FP = 1000 + 200 = 1200
False Negatives, FN = 400 + 400 = 800
Therefore, Precision = TP / (TP + FP) = 0.5 and Recall = TP / (TP + FN) = 0.6.
Hence, F1 = 2 * (Precision * Recall) / (Precision + Recall) = 0.545 ≈ 0.5
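Verifying the arithmetic above in plain Python:

```python
tp, tn, fp, fn = 1200, 2200, 1200, 800

precision = tp / (tp + fp)   # 0.5
recall = tp / (tp + fn)      # 0.6
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))          # 0.545, i.e. roughly 0.5
```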

Samantha Zeitlin | Clustering Data Science Interviews: Seven Related but Distinct Categories

PyData SF 2016 Matching candidates with openings: defining features across several sets of Data Scientist selection criteria, using both qualitative and ...

K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm | Data Science | Edureka

This Edureka k-means clustering algorithm tutorial video is part of their Data Science blog series.

Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoop Tutorial | Edureka

This Hadoop tutorial covers Hadoop interview questions and answers, from the Hadoop Interview blog series.

Machine Learning in R - Classification, Regression and Clustering Problems

Learn the basics of Machine Learning with R. Start our Machine Learning Course for free.

Qualitative analysis of interview data: A step-by-step guide

The content applies to qualitative data analysis in general; the steps are also described in writing.

GOTO 2016 • Cluster Management at Google with Borg • John Wilkes

This presentation was recorded at GOTO Berlin 2016. John Wilkes is a Principal Software Engineer at Google; his talk covers cluster management.

Analytics Case Study: Predicting Probability of Churn in a Telecom Firm| Data Science

In this video you will learn how to predict churn probability by building a logistic regression model. This is a data science case study for beginners.

Machine Learning Algorithms | Machine Learning Tutorial | Data Science Training | Edureka

This Machine Learning Algorithms tutorial teaches you what machine learning is and covers the main algorithms.

Cluster Analysis in SAS using PROC CLUSTER | Data Science

In this video you will learn how to perform cluster analysis using PROC CLUSTER in SAS. Cluster analysis is an unsupervised learning technique used for many purposes.