FB Production Engineer system design question
Dec 29, 2021
6 Comments
E4 @Facebook
Distribute a 1GB ISO file to tens of thousands of servers for a firmware upgrade
How would you do it? I froze and muttered something about BitTorrent and was walked out (it was the last panel, but still).
YOE 10
TC 170
comments
-All gather
Make one server the root and scatter the file into N pieces across the servers. Then have all the servers perform an AllGather operation. Discuss the work and communication time. Look up MPI.
The time will depend on the link topology. If there is a lot of contention (e.g., a linear network), communication will slow down. Communication will be better with a 2D torus or hypercube topology.
IIRC, you can come up with a lower bound on the work and communication time. Depending on which algorithm you use for scatter/AllGather, you can achieve optimal W (work) and T (time).
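To make the scatter/AllGather idea concrete, here is a minimal single-process sketch. It is a simulation, not MPI code: the function names `scatter` and `ring_allgather` are made up for illustration, and the ring variant is just one possible AllGather algorithm, picked because it is easy to follow. It counts the communication rounds and the bytes each node receives, which are the quantities you would reason about when discussing T.

```python
def scatter(data: bytes, n_nodes: int) -> list[bytes]:
    """Split data into n_nodes contiguous pieces (trailing pieces may be shorter or empty)."""
    piece = -(-len(data) // n_nodes)  # ceiling division
    return [data[i * piece:(i + 1) * piece] for i in range(n_nodes)]


def ring_allgather(pieces: list[bytes]) -> tuple[list[bytes], int, int]:
    """Simulate a ring AllGather where node i starts with pieces[i].

    Returns (assembled data per node, number of rounds, max bytes received by any node).
    """
    p = len(pieces)
    # have[node][idx] holds piece idx once the node has it, otherwise None.
    have = [[pieces[i] if i == node else None for i in range(p)] for node in range(p)]
    received = [0] * p
    for step in range(p - 1):
        # In round `step`, node r forwards piece (r - step) mod p to its right neighbor.
        # Node r already holds that piece: it is its own (step 0) or arrived last round.
        for r in range(p):
            idx = (r - step) % p
            dst = (r + 1) % p
            have[dst][idx] = have[r][idx]
            received[dst] += len(have[r][idx])
    assembled = [b"".join(node_pieces) for node_pieces in have]
    return assembled, p - 1, max(received)


# Small demo: 1 KiB of data spread over 8 simulated nodes.
data = bytes(range(256)) * 4
result, rounds, max_rx = ring_allgather(scatter(data, 8))
assert all(node_data == data for node_data in result)
```

With P nodes the ring takes P - 1 rounds, and every node receives roughly (P - 1)/P of the file, which is the least it can receive given that it starts with only 1/P of it. Tree or hypercube style algorithms would trade this for fewer rounds, which matters when per-message latency dominates.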
1. Ask clarifying questions:
a) Are the servers in the same data center/region?
b) Is there enough disk space on the servers?
c) Are there any timing requirements (as fast as possible, rolling updates, slow background updates)?
d) Are there any traffic costs (like an expensive server in the middle of the ocean or on the Moon)?
e) What types of connections are available? Bidirectional, pull vs. push, multicast?
f) How do you validate the integrity of the file? How trusted are the connections?
2. Based on the answers, propose various implementations with different trade-offs.
3. Discuss alternative options based on the timing of your project (MVP vs. features for later releases).
4. If it's a senior position, you need to decide how to split the project across multiple teams, or at least across team members.
5. Discuss the metrics you will collect and optimize while running the service.
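For the integrity question in (f), one common approach is to ship a small manifest of hashes alongside the file so that each piece can be checked as it arrives and the whole file can be checked at the end. The sketch below assumes SHA-256 and a 4 MiB piece size; both choices, the manifest layout, and the function names are illustrative assumptions, not anything the interview specified. It hashes in memory for brevity, whereas a real 1GB ISO would be read in a streaming fashion.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical piece size (4 MiB)


def build_manifest(data: bytes, chunk_size: int = CHUNK_SIZE) -> dict:
    """Compute per-chunk and whole-file SHA-256 digests for the payload."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return {
        "size": len(data),
        "chunk_size": chunk_size,
        "chunk_sha256": [hashlib.sha256(c).hexdigest() for c in chunks],
        "file_sha256": hashlib.sha256(data).hexdigest(),
    }


def verify_chunk(manifest: dict, index: int, chunk: bytes) -> bool:
    """Check one received piece against the manifest before accepting or forwarding it."""
    return hashlib.sha256(chunk).hexdigest() == manifest["chunk_sha256"][index]


def verify_file(manifest: dict, data: bytes) -> bool:
    """Final check on the reassembled file before starting the upgrade."""
    return (
        len(data) == manifest["size"]
        and hashlib.sha256(data).hexdigest() == manifest["file_sha256"]
    )
```

Per-piece hashes let a server reject a corrupted piece immediately instead of discovering the problem after downloading the whole file. The manifest itself still has to come from a trusted source (for example, over an authenticated channel), otherwise the hashes prove nothing.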