I interviewed at FB for SRE Manager and was rejected; I was told my troubleshooting interview was weak.
The interview began with the general question: "You've started seeing random complaints online that one of your web services is slow. This service is running on 50,000 machines with no metrics, tooling, or automation. What do you do?"
I asked some clarifying questions:
Q: When did we start seeing these complaints?
A: A few weeks ago.
Q: What type of service is this? What are the user expectations?
A: That doesn't matter.
Q: What details are included in the complaints? Times, user actions, etc.
A: The complaints are general and don't include details.
Q: Can we look at code that was pushed around the time the complaints started?
A: During that time, several hundred commits were pushed.
Despite having years of experience in the SRE field, I was a bit at a loss for how to approach this effectively. I eventually came to a solution, but I think the interviewer accepted it mainly to wrap up the interview.
I'm now preparing for a similar interview with Google (again, SRE management) and am feeling unsteady on this portion.
Does anyone have recommendations on how I should better approach this? Should I ask fewer questions and make more assertions (e.g. "I would install a telemetry daemon on the systems and observe performance metrics.")?
Any other thoughts on how to best perform in this type of interview?
#sreinterview #sre #google
comments
In modern infrastructure, issues such as a “noisy neighbor” can often unintentionally degrade a specific service too.
Installing anything into a production system when a problem is ongoing isn’t a good idea.
Let's break this down. No sane system will have 50k servers in a single location; they are distributed across geos. That narrows the problem down to a specific geo.
Then again, there are definitely CDNs in such a setup.
That again narrows it down to a handful of servers, and then you can start your general troubleshooting.
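To make the "narrow it down by geo, then look at a handful of servers" idea concrete, here is a rough Python sketch. It assumes you can get a list of (hostname, region) pairs from somewhere and that you have some way to time a request against a single host. The function names, the `probe` callable, and the 2x threshold are all made up for illustration and are not from any real tool.

```python
import random
import statistics
from collections import defaultdict


def sample_hosts_by_region(hosts, per_region, seed=0):
    """Pick a few hosts from each region.

    hosts: iterable of (hostname, region) pairs (hypothetical inventory).
    per_region: how many hosts to sample per region (should be >= 1).
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    by_region = defaultdict(list)
    for name, region in hosts:
        by_region[region].append(name)
    return {
        region: rng.sample(names, min(per_region, len(names)))
        for region, names in by_region.items()
    }


def slow_regions(samples, probe, factor=2.0):
    """Return regions whose median latency stands out.

    probe: hypothetical callable, hostname -> latency in seconds.
    A region is flagged when its median latency is more than
    `factor` times the median of all regional medians.
    """
    medians = {
        region: statistics.median(probe(host) for host in names)
        for region, names in samples.items()
    }
    baseline = statistics.median(medians.values())
    return sorted(r for r, m in medians.items() if m > factor * baseline)
```

In an interview you would probably just describe this out loud: sample a few machines per geo, compare them against each other, and only then dig into the handful of hosts in whichever geo looks off. Probing from outside also avoids installing anything on the production machines while the problem is ongoing.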