Hi all, I am bored of hearing the typical data science questions — bias/variance trade-off, handling imbalanced datasets/outliers, and so on. Can you all suggest difficult ML questions apart from the standard ones? I'd love to hear some new and hard questions. #datascience #machinelearning #machinelearningengineer #amazon #lyft #microsoft #interview #apple #faang
— Implement the KNN algorithm without using any inbuilt library. — Google
— Write code for gradient descent and a custom error/loss function. — Siemens
— Anomaly detection for key metrics on a near-real-time data stream, pushing alerts to Slack/Outlook. — Visa
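For the first question, a minimal no-library sketch is enough to show you understand the algorithm: compute distances, sort, and take a majority vote. (The data and `k` below are made up for illustration.)

```python
# KNN classifier with no ML library: squared Euclidean distance
# plus a majority vote over the k nearest training points.
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in dists[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Two well-separated toy clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (0.5, 0.5)))  # "a"
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # "b"
```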
Could you please explain the third one?
https://zindi.africa/blog/introduction-to-anomaly-detection-using-machine-learning-with-a-case-study https://towardsdatascience.com/how-to-use-machine-learning-for-anomaly-detection-and-condition-monitoring-6742f82900d7 Push alerts to Slack/Outlook — reporting on those anomalies so that the business can quickly take steps to revert changes, mitigate risk, and avert big losses. DM for more details.
I was once asked at FAANG: before production, your training, validation, and test accuracy are all above 90%. Once in production, the model starts to behave weirdly. How will you identify what's happening, and how will you correct it?
This is a great question. Online models almost never perform as well as offline ones, which is why experimentation is important. Some of the causes: (1) real-time signals may be empty, malformed, or contain unexpected values more often than the training/val/test data did — you should anticipate this during the feature engineering step; (2) the underlying distribution of the features could have changed — you should always set up monitoring on the distributions of all features and check for shifts; (3) upstream features produced by data engineering could have changed.
The imbalanced data question is actually a pretty good one. People will mostly answer oversampling, undersampling, etc., but there are a few other clever approaches. You can cluster the majority class and replace the examples with the cluster centroids. You can also learn k separate models — each trained on n/k samples of the majority class plus all samples of the minority class — and ensemble the k models at the end. Not many people answer this way.
@thuol Did you mean SMOTE or ENN schemes? I like these answers from candidates too.
No, these are simpler approaches than SMOTE or ENN. But yeah, that's why this question is a good one. There are several interesting approaches, yet most people will give the typical oversampling/undersampling answer.
By the way, does anyone know a good answer for how to classify with noisy labels or many incorrect labels? I don't have a very good answer for this one.
Short answer: error analysis, then redistribute your efforts. Andrew Ng's ML handbook has a great practical explanation of when to focus on correcting labels and when not to — a little bit of noise is okay. Read those chapters.
You mean Machine Learning Yearning?
Does boosting work for linear models?
That's a good one!
👌👌 I don't have an intuitive answer for it. Can anyone please share their views on this?
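One way to build the intuition empirically (a toy sketch, not a proof): a sum of linear models is itself linear, so boosting linear base learners adds no representational power. Below, stage 1 of least-squares boosting is an ordinary OLS fit; stage 2 fits the residuals — and learns essentially nothing, because OLS residuals are uncorrelated with the inputs.

```python
import math
import random

def ols_1d(xs, ys):
    """Ordinary least squares for y ≈ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

random.seed(0)
xs = [random.uniform(-5, 5) for _ in range(200)]
ys = [2.0 * x - 1.0 + random.gauss(0, 0.5) for x in xs]

a1, b1 = ols_1d(xs, ys)                              # boosting stage 1
resid = [y - (a1 * x + b1) for x, y in zip(xs, ys)]  # what stage 1 missed
a2, b2 = ols_1d(xs, resid)                           # boosting stage 2
print(round(a1, 2), round(b1, 2))  # close to the true 2.0 and -1.0
print(abs(a2) < 1e-9, abs(b2) < 1e-9)  # stage 2 has nothing left to add
```

With tree base learners, each stage can model new nonlinear structure in the residuals; with linear base learners the residuals are already orthogonal to the features, so boosting terminates in one effective step.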
What's the difference between xgboost and GBDT?
XGBoost is an implementation of GBDT — the underlying concept is the same. On top of that, XGBoost adds engineering improvements such as parallelized tree construction and sparsity handling, which makes it practical for large, complex models.
One correction: XGBoost does support L1 and L2 regularization on the leaf weights (the `reg_alpha` and `reg_lambda` parameters), built into its objective — that's something classic GBDT implementations typically lack.
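For reference, the regularized objective XGBoost minimizes at each round (notation follows the XGBoost paper: $T$ is the number of leaves of tree $f$, $w$ its leaf weights, and $\alpha$, $\lambda$ correspond to `reg_alpha` and `reg_lambda`):

```latex
\mathrm{Obj} = \sum_{i} \ell\bigl(y_i, \hat{y}_i\bigr) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1
```

The $\Omega$ term is the explicit complexity penalty that plain GBDT formulations leave out.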
What is a confidence interval?
The interval during which there is high confidence ;) Serious answer: a 95% confidence interval is a range computed from the sample such that, if you repeated the sampling procedure many times, about 95% of the intervals so constructed would contain the true parameter. It's a statement about the procedure, not a 95% probability that this particular interval contains the parameter.
Linear regression assumptions — and what happens if each assumption is broken? Derive the loss function of multiclass logistic regression (cross-entropy) using MLE (maximum likelihood estimation).
1. Linear relationship between inputs and target. 2. Errors are normally distributed. 3. Errors are statistically independent and don't influence each other.
4. No perfect multicollinearity in the input data (including the intercept). 5. Homoscedasticity of the errors (errors have the same variance), and the expected value of the errors is zero.
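For the second part of the question, a sketch of the derivation: with softmax outputs and one-hot labels $y_{ik}$, maximizing the likelihood is the same as minimizing cross-entropy.

```latex
p_k(x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^{K} e^{w_j^\top x}}
\qquad\text{(softmax)}
\\[4pt]
L(W) = \prod_{i=1}^{n} \prod_{k=1}^{K} p_k(x_i)^{\,y_{ik}}
\qquad\text{(likelihood of i.i.d. samples)}
\\[4pt]
-\log L(W) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log p_k(x_i)
```

The last line is exactly the multiclass cross-entropy loss, so MLE under the softmax model and cross-entropy minimization coincide.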
Xgboost vs gbm
I used to ask this question to my candidates. It's simple but shows how much they know about basic algorithms: what would happen if you try to fit logistic regression to a perfectly linearly separable binary classification dataset? What about SVM? I was also once asked to prove mathematically that PCA is a special case of an autoencoder. But I think asking for math proofs is not the right way to interview. I couldn't answer, of course, and the interviewer showed off by writing out all the optimization equations.
For question one, are you saying the data is linearly separable, or separable in general?
Linearly separable, sorry. Not a very difficult question, but it checks understanding.
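A toy demonstration of the answer for logistic regression (the data and learning rate below are made up): on perfectly linearly separable data, unregularized logistic regression has no finite optimum — scaling the weights up always increases the likelihood, so gradient descent drives them toward infinity. An SVM, by contrast, has a unique finite margin-maximizing solution.

```python
import math

# 1-D separable data: class 0 strictly left of the origin, class 1 right.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]

def train(steps, lr=0.5):
    """Gradient descent on the logistic negative log-likelihood (no bias)."""
    w = 0.0
    for _ in range(steps):
        grad = sum((1 / (1 + math.exp(-w * x)) - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    return w

# The weight never converges — it keeps growing with more iterations.
print(train(100), train(1000), train(10000))
```

Each extra iteration still shrinks the loss a little, so `w` increases without bound (roughly logarithmically in the step count). In practice this is why separable data makes the solver "fail to converge" unless you add regularization, which restores a finite optimum.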