Hi team, looking to create a mini take-home for a data engineering role I'm hiring for. This is my thought for the take-home:
1. Have the candidate perform joins across three or so flat files to answer basic questions. Add some data duplication problems. Add some wrong value types in the columns. (If you have decent exposure to Spark/SQL, it shouldn't take more than 20-30 minutes.)
2. During the interview, ask them how they work with messy data, how they would check data quality, etc.
I don't want it to be a burden for the person, but at the same time I want to make sure they're qualified. What do you all think? FYI, I work at a small startup that is trying to expand its data engineering team. (I'm the only one who does that here today.)
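To make the idea concrete, here is a rough sketch of what such an exercise and one acceptable answer might look like. Everything in it is made up for illustration: the three inline "files", the column names, and the `read_rows` and `revenue_per_customer` helpers are assumptions, and plain Python stands in for Spark/SQL so it runs anywhere.

```python
import csv
import io

# Hypothetical flat files; names, columns, and values are invented for illustration.
CUSTOMERS_CSV = """customer_id,name
1,Ada
2,Grace
2,Grace
"""
ORDERS_CSV = """order_id,customer_id,product_id,quantity
100,1,A,2
101,2,B,1
101,2,B,1
102,1,B,three
"""
PRODUCTS_CSV = """product_id,unit_price
A,10.0
B,4.5
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, dropping exact duplicate rows."""
    seen, rows = set(), []
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows


def revenue_per_customer():
    """Join orders to customers and products; return (totals, rejected order ids)."""
    customers = {r["customer_id"]: r["name"] for r in read_rows(CUSTOMERS_CSV)}
    prices = {r["product_id"]: float(r["unit_price"]) for r in read_rows(PRODUCTS_CSV)}
    totals, rejected = {}, []
    for order in read_rows(ORDERS_CSV):
        try:
            qty = int(order["quantity"])
        except ValueError:
            # Wrong value type in the column: set the row aside instead of guessing.
            rejected.append(order["order_id"])
            continue
        name = customers[order["customer_id"]]
        totals[name] = totals.get(name, 0.0) + qty * prices[order["product_id"]]
    return totals, rejected
```

The interesting part for the interview is less the join itself and more what the candidate does with the duplicate rows and the unparseable quantity, and whether they tell you about it.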
Test their pandas skills
Specifically looking for their big data skills. Pandas may be a good proxy to work with in this case.
I used to give a CSV file and ask them to extract certain information live
Please don't do it
Why not? Much better than LC imo
Yeah don't do take homes. Just cuz one system is shit, doesn't mean you should create a worse one. There are very very few companies I would suffer a takehome to be considered and I doubt this is one of them
Big data is largely conceptual and domain specific. If you're using Spark/SQL, most of the concepts are the same as single-machine concepts. Testing someone's ability to do a join really has no direct big data implications. It doesn't matter if their execution layer is SQLite or Hive or Spark SQL in that case, unless you are on an infrastructure team. Tuning Spark jobs is a whole separate beast, but not likely something that's easy to replicate in an interview setting.
Either you need both 1) someone with enough experience to hit the ground running and 2) a very specific skillset, in which case just ask them to do a slice of the actual work. Or you just need solid fundamentals, in which case LC is a fine assessment, supplemented with some domain-specific high-level questions.
People love to shit on leetcode and think their particular team is too good to ask leetcode questions, but truth be told, it's sufficient for a lot of roles (most jobs aren't that specific/complex). If the alternative is literal homework, then is that really an improvement?
I'm going with number 2. The typical workflow involves:
1. Being able to perform ETL efficiently
2. Being able to fine-tune Spark jobs so that we are cost efficient.
I don't mind limitations in skill 2 since that can be learned with experience, but I really don't want someone fresh out of an analyst role who has never worked with databases before but feels confident saying they're an expert because they took a foundational data science course in pandas on Coursera. (Nothing wrong with that in other cases, but I'm specifically hiring for someone who can alleviate some of my workload.)
This sounds very similar to what we hire/hired for on my team. The question I'd then ask is: what exactly would be missing (job-wise) in a strong analyst vs a data engineer? Someone who can produce experimental datasets and do exploratory work (analyst or data science) can sometimes be a stronger fit than a true data engineer.
The angle I come at it from is: big data *reasoning* skills vs pipelining skills.
For the former interview loop, I usually just ask a LC array problem (the go-to lately has been count distinct values in each window of size k), which can then be extended to "what if this doesn't fit in memory" => becomes a big data intuition problem (can get into stream processing, etc). A subset of array problems can be solved efficiently using divide and conquer, which is essentially mapreduce, which is/was a foundational component of a lot of big data tech.
For pipelining skills, I have something similar to what you outlined: a hypothetical pipelining scenario. Present some abstract problem/data => talk about data cleaning/data quality => talk about the pipeline, maybe a data warehousing component. We can talk a bit about technologies to use, different processing models, etc. And then I usually still ask an easier coding problem (LC) just to see if they can actually write code.
You can ask a lot of what you outlined as high-level/discussion questions, rather than doing a take-home (I'd be wary of cheating anyway). Imo if a candidate can outline a good, reasonable approach and can also write general code (screen via LC), I have reasonable confidence that they could bridge the two and figure out the pipeline.
Edit for clarity: the above outlines 2 separate interviews I do. I usually pick one based on what the job requirements are: data engineering generalist for #1, pipelining specific for #2. Really, my problem is that even with multiple rounds you can only collect so much data, so you have to carefully define what you need the most signal on.
LC often provides the best "bang for buck" since there's little boilerplate, whereas a story problem will eat up a large chunk of time. A take-home is essentially just extending the interview into the candidate's own personal time in order to collect a bit more signal, but imo that isn't fair to the candidate, and it's easy for the candidate to just Google or ask a friend anyway.
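For reference, the "count distinct values in each window of size k" problem mentioned above can be sketched roughly like this. It is one common in-memory approach (a sliding window with a running counter), not necessarily the answer any particular interviewer expects; the function name `distinct_per_window` is just an illustrative choice.

```python
from collections import Counter


def distinct_per_window(values, k):
    """Return the number of distinct values in each contiguous window of size k."""
    if k <= 0 or k > len(values):
        return []
    counts = Counter(values[:k])      # occurrences inside the current window
    result = [len(counts)]
    for i in range(k, len(values)):
        leaving = values[i - k]       # element sliding out of the window
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]       # keep len(counts) equal to the distinct count
        counts[values[i]] += 1        # element sliding into the window
        result.append(len(counts))
    return result


print(distinct_per_window([1, 2, 1, 3, 4, 2, 3], 4))  # [3, 4, 4, 3]
```

Each element is added and removed once, so the work is linear in the input size and the counter never holds more than k keys. That makes it a natural lead-in to the "what if this doesn't fit in memory" follow-up, since only the current window has to be kept around.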
Gonna go against the grain here: if it will take the candidate less than an hour and you explicitly aren't doing LC in the interview, I would have no problem with doing the take-home. The take-homes I hate are the ones that are estimated to take 5+ hours to do
I typically enjoy take homes as I tend to forget everything under interview stress. However I'll say that companies tend to grossly underestimate the time these assignments take.
Very good points. The way I plan to calibrate this is by working with a couple of team members who aren’t familiar with the datasets to run through it and see how long it takes them.
Test their ability to clean data and organize it into a database so you can pull the data with an API. Do it on a small dataset. Maybe cut out the API part, or just test data cleaning and pandas skills
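A tiny version of the "clean it and put it in a database" idea could look something like the sketch below. The raw records, the `people` table, and the `load_clean` helper are all hypothetical, and SQLite from the standard library stands in for whatever database you actually use.

```python
import sqlite3

# Hypothetical raw records with stray whitespace, inconsistent case,
# a duplicate, and a non-numeric score.
RAW = [
    ("  Ada ", "42"),
    ("Grace", "n/a"),
    ("ada", "42"),
    ("Linus", "17"),
]


def load_clean(conn, raw):
    """Normalize names, coerce scores to integers (or NULL), and skip duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT PRIMARY KEY, score INTEGER)"
    )
    for name, score in raw:
        name = name.strip().title()
        score = int(score) if score.strip().isdigit() else None
        # The primary key plus INSERT OR IGNORE drops repeated names.
        conn.execute(
            "INSERT OR IGNORE INTO people (name, score) VALUES (?, ?)",
            (name, score),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_clean(conn, RAW)
rows = conn.execute("SELECT name, score FROM people ORDER BY name").fetchall()
```

Keeping the first record for a repeated name and storing unparseable scores as NULL are arbitrary choices here; asking the candidate to justify their own choices is probably where most of the signal is.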
I hate take homes
Not sure what’s a better way to gauge the candidate? I don’t want to do LC.