How to approach SRE Troubleshooting interview?

New
SGqY57
Jan 23, 2021 13 Comments

I interviewed at FB for SRE Manager and was rejected; I was told my troubleshooting interview was weak.

The interview began with the general question: "You've started seeing random complaints online that one of your web services is slow. This service is running on 50,000 machines with no metrics, tooling, or automation. What do you do?"

I asked some clarifying questions:
Q: When did we start seeing these complaints?
A: A few weeks ago.
Q: What type of service is this? What are the user expectations?
A: That doesn't matter.
Q: What details are included in the complaints? Times, user actions, etc.
A: The complaints are general and don't include details.
Q: Can we look at code that was pushed around the time the complaints started?
A: During that time, several hundred commits were pushed.

Despite having years of experience in the SRE field, I was a bit at a loss for how to approach this effectively. I eventually came to a solution, but I think the interviewer accepted it mainly to conclude the interview.

I'm now preparing for a similar interview with Google (again, SRE management) and am feeling unsteady on this portion.

Does anyone have recommendations on how I should better approach this? Should I ask fewer questions and make more assertions (e.g. "I would install a telemetry daemon on the systems and observe performance metrics.")?

Any other thoughts on how to best perform in this type of interview?

#sreinterview #sre #google

TOP 13 Comments
  • Rackspace
    clownGuy
    You asked questions, but what did you answer for troubleshooting?
    Jan 23, 2021 4
    • New
      SGqY57
      OP
      Thank you, wish I knew that going in. The “50,000 systems” part made me assume I had to work beyond single-system investigation.
      Jan 23, 2021
    • PayPal
      !TheCoach
      You just have to pick a few machines from the 50k and look for similar patterns in their issues.
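
A minimal sketch of that sampling idea, assuming you can collect some per-host number by hand (for example, by timing a test request on each sampled machine). The helper names `sample_hosts` and `find_outliers` and the 2x-median threshold are made up for illustration; nothing here comes from the interview itself.

```python
import random
import statistics


def sample_hosts(hosts, k, seed=None):
    """Pick up to k hosts uniformly at random, without replacement."""
    rng = random.Random(seed)  # a seed makes the pick reproducible
    return rng.sample(hosts, min(k, len(hosts)))


def find_outliers(latency_ms_by_host, factor=2.0):
    """Return the hosts whose latency exceeds factor x the sample median.

    latency_ms_by_host maps host name -> measured latency in ms.
    The threshold is an arbitrary starting point, not a recommendation.
    """
    median = statistics.median(latency_ms_by_host.values())
    return sorted(
        host
        for host, ms in latency_ms_by_host.items()
        if ms > factor * median
    )
```

For instance, `sample_hosts(fleet, 20)` would pick 20 machines to inspect by hand, and `find_outliers` would then point at the ones that stand out from the rest of the sample. If nothing stands out, that is also a signal: the problem is probably not host-specific.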
      Jan 24, 2021
  • PayPal
    !TheCoach
    The question mentions “random complaints”; I would have considered that a hint. Generally, such problems can be isolated to a specific zone and a specific piece of infrastructure, including the network, LB, SDN, CDN, or even databases or caches. Other good questions to ask: is the slowness perceived on reads or on writes, and is the latency perceived from a specific location?
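
One way to sketch that zone-isolation step, assuming you have managed to gather a handful of (zone, latency) measurements by hand. The function names, the zone labels you pass in, and the 1.5x threshold are hypothetical; this is an illustration, not a prescribed method.

```python
import statistics
from collections import defaultdict


def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def slow_zones(samples, factor=1.5):
    """Return {zone: p95} for zones that are slow relative to the others.

    samples is an iterable of (zone, latency_ms) pairs. A zone is
    flagged when its p95 exceeds factor x the median of all zones' p95.
    """
    by_zone = defaultdict(list)
    for zone, ms in samples:
        by_zone[zone].append(ms)
    p95_by_zone = {zone: p95(vals) for zone, vals in by_zone.items()}
    baseline = statistics.median(p95_by_zone.values())
    return {z: v for z, v in p95_by_zone.items() if v > factor * baseline}
```

The same grouping works for any other dimension you suspect (read vs. write path, client location, load balancer), as long as each sample carries that label.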

    Oftentimes, in modern infrastructure, there are issues like a “noisy neighbor” that may unintentionally degrade a specific service too.

    Installing anything into a production system when a problem is ongoing isn’t a good idea.
    Jan 23, 2021 1
    • New
      SGqY57
      OP
      Thank you. How would you have approached the “randomness” of the complaints? When I asked for details or clarification, the interviewer said “Users just complain that it’s slow with no other information.”, so it seemed like a dead end.
      Jan 24, 2021
  • OP, was there a coding interview for this role? Is this a PE mgr role?
    Jan 23, 2021 4
    • New
      SGqY57
      OP
      LC medium, but you also need to know some system libraries and the /proc file system. You work with stdin/stdout, get OS/system info, then do algo work on top. No real data structure work.
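
To give a feel for that style of exercise, here is a rough sketch that parses `/proc/meminfo`-style text and computes a simple figure on top. It is not the actual interview question (the post doesn't give one); the function names are invented for illustration, and it assumes a Linux kernel recent enough to report `MemAvailable`.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Lines look like 'MemTotal:  16384256 kB'; a few fields (such as
    HugePages_Total) carry no unit. Only the number is kept.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip blank or malformed lines
        info[name.strip()] = int(parts[0])
    return info


def used_percent(info):
    """Percent of memory in use, based on MemTotal and MemAvailable."""
    total = info["MemTotal"]
    return 100.0 * (total - info["MemAvailable"]) / total
```

To wire it to stdin/stdout, you could call `print(used_percent(parse_meminfo(sys.stdin.read())))` after `import sys` and run the script with `< /proc/meminfo`.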
      Jan 24, 2021
    • Thanks. I attended a PE onsite last year, but for IC. Wondering how it was for mgr.
      Jan 24, 2021
  • Amazon
    xdgaio
    Ok bud.
    Let's break this down. No sane system will have 50k servers in a single location. They are distributed across geos, which narrows the problem down to a specific geo.
    Then again, there are definitely CDNs in such a setup.
    That narrows it down further to a handful of servers, and then you can start your general troubleshooting.
    Feb 19, 2021 0