Facebook, Google, Amazon | SRE methodology debug question

Nov 27, 2019 6 Comments

Situation:
For folks who work on SRE-type teams at Facebook, Google, Twitter, Amazon, or Netflix and get engaged
when services go down or users report performance problems -

1. Do you use system-wide graphs that show you the health of the overall service?
If yes, what metrics do these graphs show? (A rough sketch of what I have in mind follows after these two questions.)
2. Once you find that one of the graphs looks bad, how do you go further into debugging,
eventually narrowing the issue down to a new version of software, a hardware problem, hot spots, etc.?
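
For concreteness, here is a minimal sketch of the kind of top-level signals I imagine such graphs plotting - error rate and tail latency - computed from request records. The field names and data structures are made up for illustration, not any company's actual monitoring system.

    # Hedged sketch: computing two common service-health signals (error rate,
    # p99 latency) from request records. Field names are hypothetical.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Request:
        status: int        # HTTP status code
        latency_ms: float  # end-to-end latency in milliseconds
        region: str        # serving region, e.g. "us-east-1"

    def error_rate(requests: List[Request]) -> float:
        """Fraction of requests that returned a 5xx."""
        if not requests:
            return 0.0
        return sum(1 for r in requests if r.status >= 500) / len(requests)

    def p99_latency_ms(requests: List[Request]) -> float:
        """99th-percentile latency in milliseconds."""
        if not requests:
            return 0.0
        latencies = sorted(r.latency_ms for r in requests)
        idx = min(len(latencies) - 1, int(0.99 * len(latencies)))
        return latencies[idx]

    # A dashboard would plot these per minute and per region; a sudden jump in
    # either one is what would trigger the drill-down in question 2.
    window = [Request(200, 35.0, "us-east-1"), Request(503, 900.0, "us-east-1")]
    print(error_rate(window), p99_latency_ms(window))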

If there is a talk or blog post that covers this, please let me know.

No, I am not trying to get someone to do my job; I am just curious about this, hence asking here.

TC: ~$200K, YOE: 11 years.

TOP 6 Comments
  • Amazon OMwD10
    When things are in alarm or when we notice there are issues, log dives are a pretty common next step. Depending on the scale of the event, there may or may not be a conference call where many people are involved. We start with the impact, then dive super deep to hypothesize and determine the root cause. Amazon has world-class infrastructure for logging and monitoring, so it’s actually great for doing the detective work when complex issues happen. I will miss this the most when leaving. We don’t have SREs but support engineers who help manage high-severity events. Most SDEs are expected to be on an on-call rotation, so they are trained to handle things when they are in alarm.
    Nov 27, 2019 1
    • Dell LUHV04
      OP
      Thank you

      This is helpful insight
      Nov 27, 2019
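
The log dives mentioned in the Amazon comment above could look something like the sketch below. This is purely illustrative, not Amazon's actual tooling: it groups 5xx responses by host to separate a fleet-wide regression from a single bad host or hot spot. The log format is an assumption.

    # Illustrative log dive. Assumed line format:
    # "<timestamp> <host> <method> <path> <status> <latency_ms>"
    # A skewed per-host count points at a hot spot or bad host; a uniform one
    # points at a service-wide problem.
    import collections
    import re

    LINE_RE = re.compile(r"^(\S+) (\S+) (\S+) (\S+) (\d{3}) (\d+)$")

    def errors_by_host(log_path: str) -> collections.Counter:
        counts = collections.Counter()
        with open(log_path) as f:
            for line in f:
                m = LINE_RE.match(line.strip())
                if m and int(m.group(5)) >= 500:
                    counts[m.group(2)] += 1
        return counts

    if __name__ == "__main__":
        for host, n in errors_by_host("service.log").most_common(10):
            print(host, n)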
  • Google mooncalf1
    Yes, you can look at things like users being served 500s, RPC errors, or latency. Assuming you're working with a distributed system, the first thing to determine is where exactly things are going funky. Is it a global problem or localized? If a specific component is alerting, is that the real problem or is it further down the stack? Next you might try to see if it lines up with a binary or config push; we have graphs for these also. Bad pushes account for many issues, and rolling back to mitigate is common. If it's a local issue that can't be root-caused and/or fixed quickly, it might make sense to drain your service from that region if you have capacity elsewhere. The imperative is always to stop the bleeding first; deeper analysis can be done after that.
    Nov 28, 2019 2
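
The "does it line up with a binary or config push?" check from the Google comment above could be sketched roughly like this. Timestamps, thresholds, and the release list are made up for illustration, not any real deploy system's API.

    # Sketch: find when the error-rate graph went bad and list pushes that
    # finished shortly before it. If a push is implicated, rolling back is
    # usually the fastest mitigation; root-causing can wait.
    from datetime import datetime, timedelta
    from typing import List, Optional, Tuple

    def first_bad_minute(series: List[Tuple[datetime, float]],
                         threshold: float) -> Optional[datetime]:
        """First timestamp where the error rate crossed the threshold."""
        for ts, value in series:
            if value > threshold:
                return ts
        return None

    def suspect_pushes(pushes: List[Tuple[str, datetime]], bad_at: datetime,
                       window: timedelta = timedelta(minutes=30)) -> List[str]:
        """Releases that completed within `window` before the regression."""
        return [name for name, ts in pushes if bad_at - window <= ts <= bad_at]

    # Hypothetical data: error rate jumps at 10:10, a binary push landed at 09:55.
    error_series = [(datetime(2019, 11, 27, 10, m), 0.001) for m in range(0, 10)] + \
                   [(datetime(2019, 11, 27, 10, m), 0.080) for m in range(10, 15)]
    pushes = [("frontend binary v123", datetime(2019, 11, 27, 9, 55)),
              ("ranking config 42", datetime(2019, 11, 27, 8, 0))]
    bad_at = first_bad_minute(error_series, threshold=0.01)
    if bad_at:
        print("regression started at", bad_at, "suspect pushes:",
              suspect_pushes(pushes, bad_at))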
  • Cisco 🌚💡R
    Great post 👍
    Dec 3, 2019 0
