Tech Industry Oct 4, 2021
Mozilla TOYV61

If you’re the SRE that posted status updates on the FB outage

I hope you made it through today without any consequences. Looks like your Reddit account and posts are deleted. /r/sysadmin misses you. Pouring one out for my homies 🍻

Microsoft zhwud2 Oct 4, 2021

Zuck got him

Google iksjdo Oct 4, 2021

Tldr?

LinkedIn gahhhhhhhh Oct 4, 2021

An FB sysadmin was posting live updates to Reddit with insider information about the outage. They suddenly deleted their account.

HYFN aspava Oct 4, 2021

??

Nutanix ntnx@160$ Oct 4, 2021

More context here

Google 100zeros Oct 4, 2021

Get Zucced lmao 👽

Google @sundar Oct 4, 2021

So is that why no FB people are responding to comments about this topic? There are a couple other threads that I found on here and crickets.

New griddle Oct 4, 2021

Maybe they should avail themselves of the Dodd-Frank whistleblower provisions and report the root cause to the SEC.

AMD nipser Oct 4, 2021

Zuck, don't hesitate - don't be evil, be our commander, even if from Princeville, we need you on the side of light. Choose light (maybe skip burning the house like Larry)

ByteDance tidf88 Oct 4, 2021

Ngl the issue described by the SRE seems pretty believable. How would one prevent this disaster? Have some reserve static IPs for DNS configuration?

Mozilla TOYV61 OP Oct 4, 2021

Without knowing a lot of specifics, I can only speculate. Assuming the description of a router config change breaking BGP announcements is an accurate RCA, one common strategy is to have multiple ASNs for the org and to announce each ASN's routes from separate, independent router clusters. For DNS specifically, you can have multiple NS records, each served by an anycast IP on a different ASN. For an org the size of FB, I’d expect them to have at least 4 ASNs with anycast entries to support public DNS. The goal is that a single router misconfiguration breaks at most the routes announced for one ASN. In this outage it appears they stopped announcing all FB routes over BGP, which of course included DNS.
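To make the blast-radius argument concrete, here is a rough Python sketch. Everything in it is hypothetical: the hostnames, the cluster names, the ASNs (taken from the documentation range 64496-64511) and the prefixes (from 2001:db8::/32) are placeholders, not FB's real setup. It only models which nameservers stay reachable when one router cluster stops announcing its routes; it does not speak BGP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameServer:
    hostname: str        # hypothetical NS record target
    anycast_prefix: str  # documentation prefix, not a real allocation
    asn: int             # documentation ASN, not a real allocation
    router_cluster: str  # hypothetical cluster that announces the prefix


def reachable_nameservers(nameservers, failed_clusters):
    """Return the nameservers whose announcing cluster is still healthy."""
    return [ns for ns in nameservers if ns.router_cluster not in failed_clusters]


# Layout A: one ASN per NS record, each announced from its own cluster.
independent = [
    NameServer("a.ns.example.com", "2001:db8:a::/48", 64496, "cluster-a"),
    NameServer("b.ns.example.com", "2001:db8:b::/48", 64497, "cluster-b"),
    NameServer("c.ns.example.com", "2001:db8:c::/48", 64498, "cluster-c"),
    NameServer("d.ns.example.com", "2001:db8:d::/48", 64499, "cluster-d"),
]

# Layout B: four NS records, but all routes announced from one shared cluster.
shared = [
    NameServer(ns.hostname, ns.anycast_prefix, 64496, "core")
    for ns in independent
]

# A bad config push takes out exactly one cluster in each layout.
survivors_independent = reachable_nameservers(independent, {"cluster-a"})
survivors_shared = reachable_nameservers(shared, {"core"})

print(len(survivors_independent))  # 3 nameservers still answer
print(len(survivors_shared))       # 0 nameservers still answer
```

With independent clusters, resolvers can still fall back to the three remaining NS records; with a shared announcement path, one bad change removes all of them at once, which matches what the outage looked like from the outside.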

Mozilla TOYV61 OP Oct 4, 2021

I would be surprised if they fully recover in less than 48 hours.

Mozilla TOYV61 OP Oct 4, 2021

And it’s back up. How quickly they recovered considering the circular dependencies inherent in the system is impressive and honestly speaks to the quality of the SRE team. I’m also certain they will take immediate action to mitigate the possibility of such an outage occurring in the future. I hope the one SRE willing to be public survived unscathed.