Situation: For folks who work on SRE-type teams at Facebook, Google, Twitter, Amazon, or Netflix and get engaged when services go down or performance problems are reported: 1. Do you use system-wide graphs that show you the health of the overall service? If yes, what metrics do these graphs show? 2. Once you find out that one of the graphs looks bad, how do you dig further into debugging, eventually narrowing the issue down to a new version of software being installed, hardware problems, hot-spots, etc.? If there is a talk or blog post that covers this, please let me know. No, I am not trying to get someone to do my job; I'm curious about this, hence asking here. TC ~ $200K, YOE - 11 years.
Yes, you can look at things like users being served 500s, RPC errors, or latency. Assuming you're working with a distributed system, the first thing to determine is where exactly things are going funky. Is it a global problem or a localized one? If a specific component is alerting, is that the real problem, or is it further down the stack? Next you might check whether the issue lines up with a binary or config push; we have graphs for these also. Bad pushes account for many issues, and rolling back to mitigate is common. If it's a local issue that can't be root-caused and/or fixed quickly, it might make sense to drain your service from that region if you have capacity elsewhere. The imperative is always to stop the bleeding first; deeper analysis can be done after that.
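The "does the alert line up with a push?" step above can be sketched in a few lines. This is a minimal toy example, not anyone's production tooling; the function name, the timestamps, and the push descriptions are all hypothetical, and it assumes you can export alert and rollout timestamps from wherever your graphs come from:

```python
from datetime import datetime, timedelta

def pushes_near_alert(alert_time, pushes, window_minutes=30):
    """Return binary/config pushes that landed shortly before an alert.

    pushes: list of (timestamp, description) tuples.
    """
    window = timedelta(minutes=window_minutes)
    return [(ts, desc) for ts, desc in pushes
            if alert_time - window <= ts <= alert_time]

# Hypothetical data: an error-rate alert and two recent rollouts.
alert = datetime(2020, 5, 1, 14, 20)
pushes = [
    (datetime(2020, 5, 1, 13, 0), "config push: cache TTLs"),
    (datetime(2020, 5, 1, 14, 5), "binary push: frontend v2.3.1"),
]

# Only the frontend binary push falls inside the 30-minute window,
# which makes it the prime rollback candidate.
suspects = pushes_near_alert(alert, pushes)
```

In practice the monitoring system draws push markers directly on the dashboards so the eye does this correlation, but the idea is the same: shrink the time window until one change stands out.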
This chapter from the Google SRE book may interest you: https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
Thank you, I will read this this evening.
Great post 👍
When things are in alarm, or when we notice there are issues, log dives are a pretty common next step. Depending on the scale of the event, there may or may not be a conference call involving many people. So we start with the impact, then dive super deep to hypothesize and determine the root cause. Amazon has world-class infrastructure for logging and monitoring, so it's actually great for doing detective work when complex issues happen. I will miss this the most when leaving. We don't have SREs, but we have support engineers who help manage high-severity events. Most SDEs are supposed to be on an on-call rotation, so they are trained to handle things when they're in alarm.
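The log-dive step described above usually starts by narrowing to the alarm window and counting error signatures to see what dominates. A minimal sketch, assuming simple "timestamp level message" log lines (the log contents and service names here are made up for illustration):

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines in "timestamp level signature service" form.
LOGS = [
    "2020-05-01T14:18:02 ERROR DynamoThrottled order-service",
    "2020-05-01T14:19:45 ERROR DynamoThrottled order-service",
    "2020-05-01T14:19:50 INFO request-ok order-service",
    "2020-05-01T14:21:10 ERROR TimeoutUpstream payment-service",
]

def top_error_signatures(logs, start, end):
    """Count error signatures inside the alarm window to guide the deep dive."""
    counts = Counter()
    for line in logs:
        ts_str, level, signature, *_ = line.split()
        ts = datetime.fromisoformat(ts_str)
        if level == "ERROR" and start <= ts <= end:
            counts[signature] += 1
    return counts.most_common()

window = (datetime(2020, 5, 1, 14, 15), datetime(2020, 5, 1, 14, 25))
print(top_error_signatures(LOGS, *window))
```

Real log-analytics tooling does this aggregation server-side at far larger scale, but a frequency count over the impact window is the same first move: find the dominant error, then hypothesize backward from it.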
Thank you. This is a helpful insight.