Hi all, I am bored of hearing the typical data science questions — bias/variance trade-off, handling imbalanced datasets/outliers, and so on. Can you all suggest difficult ML questions apart from the standard ones? I'd love to hear some new and hard questions. #datascience #machinelearning #machinelearningengineer #amazon #lyft #microsoft #interview #apple #faang
— Implement the KNN algorithm without using any inbuilt library. — Google
— Write code for gradient descent and a custom error/loss function. — Siemens
— Anomaly detection for key metrics on a near-real-time data stream, pushing alerts to Slack/Outlook. — Visa
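For the first question, a minimal no-library sketch is enough to show you understand the algorithm: compute distances, sort, and take a majority vote. (The data and `k` below are made up for illustration.)

```python
# KNN classifier with no ML library: squared Euclidean distance
# plus a majority vote over the k nearest training points.
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in dists[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Two well-separated toy clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (0.5, 0.5)))  # "a"
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # "b"
```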
Could you please explain the third one?
https://zindi.africa/blog/introduction-to-anomaly-detection-using-machine-learning-with-a-case-study https://towardsdatascience.com/how-to-use-machine-learning-for-anomaly-detection-and-condition-monitoring-6742f82900d7 Push alerts to Slack/Outlook — reporting on those anomalies so that the business can quickly take steps to revert changes, mitigate risk, and avert big losses. DM for more details.
I was once asked at FAANG: before production, your training, validation, and test accuracy are all above 90%. Once in production, the model starts to behave weirdly. How will you identify what's happening, and how will you correct it?
This is a great question. Online models almost never perform as well as offline ones, which is why experimentation is important. Some of the causes: (1) real-time signals may be empty, malformed, or contain unexpected values more often than the training/val/test data did — you should anticipate this during the feature engineering step; (2) the underlying distribution of the features could have changed — you should always set up monitoring on the distributions of all features and check for shifts; (3) upstream features produced by data engineering could have changed.
The imbalanced data question is actually a pretty good one. People will mostly answer oversampling, undersampling, etc., but there are a few other clever approaches. You can cluster the majority class and replace the examples with the cluster centroids. You can also learn k separate models — each trained on n/k samples of the majority class plus all samples of the minority class — and ensemble the k models at the end. Not many people answer this way.
@thuol Did you mean SMOTE or ENN schemes? I like these answers from candidates too.
No, these are simpler approaches than SMOTE or ENN. But yeah, that's why this question is a good one. There are several interesting approaches, yet most people will give the typical oversampling/undersampling answer.
By the way, does anyone know a good answer for how to classify with noisy labels or many incorrect labels? I don't have a very good answer for this one.
Short answer: error analysis, then redistribute your efforts. Andrew Ng's ML handbook has a great practical explanation of when to focus on correcting labels and when not to — a little bit of noise is okay. Read those chapters.
You mean Machine Learning Yearning?
Does boosting work for linear models?
That's a good one!
👌👌 I don't have an intuitive answer for it. Can anyone please share their views on this?
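One way to build the intuition empirically (a toy sketch, not a proof): a sum of linear models is itself linear, so boosting linear base learners adds no representational power. Below, stage 1 of least-squares boosting is an ordinary OLS fit; stage 2 fits the residuals — and learns essentially nothing, because OLS residuals are uncorrelated with the inputs.

```python
import math
import random

def ols_1d(xs, ys):
    """Ordinary least squares for y ≈ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

random.seed(0)
xs = [random.uniform(-5, 5) for _ in range(200)]
ys = [2.0 * x - 1.0 + random.gauss(0, 0.5) for x in xs]

a1, b1 = ols_1d(xs, ys)                              # boosting stage 1
resid = [y - (a1 * x + b1) for x, y in zip(xs, ys)]  # what stage 1 missed
a2, b2 = ols_1d(xs, resid)                           # boosting stage 2
print(round(a1, 2), round(b1, 2))  # close to the true 2.0 and -1.0
print(abs(a2) < 1e-9, abs(b2) < 1e-9)  # stage 2 has nothing left to add
```

With tree base learners, each stage can model new nonlinear structure in the residuals; with linear base learners the residuals are already orthogonal to the features, so boosting terminates in one effective step.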
What's the difference between xgboost and GBDT?
XGBoost is an implementation of GBDT — the underlying concept is the same. On top of that, XGBoost adds engineering improvements such as parallelized tree construction and sparsity handling, which makes it practical for large, complex models.
One correction: XGBoost does support L1 and L2 regularization on the leaf weights (the `reg_alpha` and `reg_lambda` parameters), built into its objective — that's something classic GBDT implementations typically lack.
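For reference, the regularized objective XGBoost minimizes at each round (notation follows the XGBoost paper: $T$ is the number of leaves of tree $f$, $w$ its leaf weights, and $\alpha$, $\lambda$ correspond to `reg_alpha` and `reg_lambda`):

```latex
\mathrm{Obj} = \sum_{i} \ell\bigl(y_i, \hat{y}_i\bigr) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1
```

The $\Omega$ term is the explicit complexity penalty that plain GBDT formulations leave out.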
What is a confidence interval?
The interval during which there is high confidence ;) Serious answer: a 95% confidence interval is a range computed from the sample such that, if you repeated the sampling procedure many times, about 95% of the intervals so constructed would contain the true parameter. It's a statement about the procedure, not a 95% probability that this particular interval contains the parameter.
Linear regression assumptions — and what happens if each assumption is broken? Derive the loss function of multiclass logistic regression (cross-entropy) using MLE (maximum likelihood estimation).
1. Linear relationship between inputs and target. 2. Errors are normally distributed. 3. Errors are statistically independent and don't influence each other.
4. No perfect multicollinearity in the input data (including the intercept). 5. Homoscedasticity of the errors (errors have the same variance), and the expected value of the errors is zero.
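For the second part of the question, a sketch of the derivation: with softmax outputs and one-hot labels $y_{ik}$, maximizing the likelihood is the same as minimizing cross-entropy.

```latex
p_k(x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^{K} e^{w_j^\top x}}
\qquad\text{(softmax)}
\\[4pt]
L(W) = \prod_{i=1}^{n} \prod_{k=1}^{K} p_k(x_i)^{\,y_{ik}}
\qquad\text{(likelihood of i.i.d. samples)}
\\[4pt]
-\log L(W) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log p_k(x_i)
```

The last line is exactly the multiclass cross-entropy loss, so MLE under the softmax model and cross-entropy minimization coincide.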
Xgboost vs gbm
I used to ask this question to my candidates. It's simple but shows how much they know about basic algorithms: what would happen if you try to fit logistic regression to a perfectly linearly separable binary classification dataset? What about SVM? I was also once asked to prove mathematically that PCA is a special case of an autoencoder. But I think asking for math proofs is not the right way to interview. I couldn't answer, of course, and the interviewer showed off by writing out all the optimization equations.
For question one, are you saying the data is linearly separable, or separable in general?
Linearly separable, sorry. Not a very difficult question, but it checks understanding.
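A toy demonstration of the answer for logistic regression (the data and learning rate below are made up): on perfectly linearly separable data, unregularized logistic regression has no finite optimum — scaling the weights up always increases the likelihood, so gradient descent drives them toward infinity. An SVM, by contrast, has a unique finite margin-maximizing solution.

```python
import math

# 1-D separable data: class 0 strictly left of the origin, class 1 right.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]

def train(steps, lr=0.5):
    """Gradient descent on the logistic negative log-likelihood (no bias)."""
    w = 0.0
    for _ in range(steps):
        grad = sum((1 / (1 + math.exp(-w * x)) - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    return w

# The weight never converges — it keeps growing with more iterations.
print(train(100), train(1000), train(10000))
```

Each extra iteration still shrinks the loss a little, so `w` increases without bound (roughly logarithmically in the step count). In practice this is why separable data makes the solver "fail to converge" unless you add regularization, which restores a finite optimum.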