Facebook, Google, Amazon | SRE methodology debug question
For folks who work on SRE-type teams at Facebook, Google, Twitter, Amazon, or Netflix and get engaged
when services go down or report performance problems:
1. Do you use system-wide graphs that show the health of the overall service?
If yes, which metrics do these graphs show?
2. Once you find that one of the graphs looks bad, how do you debug further,
eventually narrowing the issue down to a new software version, a hardware problem, a hot spot, etc.?
If there is a talk or blog post that covers this, please let me know.
No, I am not trying to get someone to do my job. I am just curious about this, hence asking here.
TC: ~$200K, YOE: 11 years.