FB Production Engineer system design question

NVIDIA
LC > TC

Dec 29, 2021 6 Comments

E4 @Facebook

Distribute a 1GB ISO file to tens of thousands of servers for a firmware upgrade

How would you do it? I froze, muttered something about BitTorrent, and was walked out (it was the last panel, but still).

YOE 10
TC 170

TOP 6 Comments
  • Box
    mumuo
    -Scatter
    -All gather

    Have one server be the root. Scatter the file into N pieces, one per server, then have all the servers perform an AllGather so that each one ends up with the whole file. Discuss the work and communication time. Look up MPI.

    The time will depend on the link topology. If there is a lot of contention (linear network), communication will slow down. Communication will be better with a 2D torus or hypercube topology.

    IIRC, you can come up with a lower bound on work (W) and communication time (T). Depending on which algorithm you use for Scatter/AllGather, you can achieve optimal W and T.
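    To make the Scatter/AllGather idea concrete, here is a minimal in-memory sketch. It is a simulation only, not real MPI: the function names `scatter` and `ring_allgather` are made up for illustration, and the ring variant is just one of several AllGather algorithms, chosen here as an assumption.

```python
def scatter(data: bytes, n: int) -> list[bytes]:
    """Split data into n nearly equal pieces (piece i goes to server i)."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]


def ring_allgather(pieces: list[bytes]) -> tuple[list[bytes], int]:
    """Simulate a ring AllGather; return each server's file and the step count."""
    n = len(pieces)
    # have[s] maps piece index -> bytes held by server s.
    # After the scatter, each server holds only its own piece.
    have = [{s: pieces[s]} for s in range(n)]
    steps = 0
    for step in range(n - 1):
        # Server s forwards the piece it received most recently,
        # index (s - step) mod n, to its right-hand neighbour.
        sends = [
            ((s + 1) % n, (s - step) % n, have[s][(s - step) % n])
            for s in range(n)
        ]
        for dest, idx, chunk in sends:
            have[dest][idx] = chunk
        steps += 1
    files = [b"".join(h[i] for i in range(n)) for h in have]
    return files, steps


if __name__ == "__main__":
    image = bytes(range(256)) * 4          # stand-in for the ISO
    files, steps = ring_allgather(scatter(image, 8))
    print(all(f == image for f in files), steps)
```

    In this sketch each server sends n - 1 pieces of roughly size/n bytes, so per-server traffic stays just under one file's worth no matter how many servers there are. The cost is n - 1 sequential steps, which is why the topology and algorithm choice matter.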
    Dec 29, 2021 0
  • 1. Ask clarifying questions.
    a) Are the servers in the same data center/region?
    b) Is there enough disk space on the servers?
    c) Are there any timing requirements (as fast as possible, rolling updates, slow background updates)?
    d) Are there any traffic costs (like an expensive server in the middle of the ocean or on the Moon)?
    e) What types of connections are available? Bidirectional, pull vs. push, multicast?
    f) How do you validate the integrity of the file? How trusted are the connections?

    2. Based on the answers, propose various implementations with different trade-offs.

    3. Discuss alternative options based on the timing of your project (MVP vs. features for later releases).

    4. If it’s a senior position, you need to decide how to split the project across multiple teams, or at least team members.

    5. Discuss the metrics you will collect and optimize while running the service.
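    For question (f) above, one common approach is to compare a SHA-256 digest of the downloaded file against a known-good value. The sketch below assumes the expected digest reaches the server over a trusted channel; the function names are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so a 1 GB ISO is never held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(path: str, expected_hex: str) -> bool:
    """Return True only if the file's digest matches the expected one."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

    A plain hash only catches corruption. If the connections are not trusted, the expected digest itself needs to be signed or delivered out of band.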
    Dec 29, 2021 0
  • NetApp
    topspin32
    Read about PXE boot as well. It's old tech but should work well for 1 GB files.
    Dec 29, 2021 0
  • GSPANN / Eng
    can0nical
    Off the top of my head: one leader hosting the ISO per datacenter/subnet, and servers get a firmware URL specific to their location. Rolling updates to avoid link saturation, yadda yadda. 1 GB is not that much over a local network.
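    A rough sketch of that plan: group servers by datacenter, point each group at its local mirror, and release hosts in fixed-size batches. The URL scheme, the `example.internal` domain, and the function names are all hypothetical, and the batch size is something you would tune against the link capacity.

```python
from collections import defaultdict
from typing import Iterator


def mirror_url(datacenter: str, image: str = "firmware.iso") -> str:
    """Build the location-specific firmware URL (hypothetical naming scheme)."""
    return f"http://mirror.{datacenter}.example.internal/{image}"


def rolling_batches(
    servers: dict[str, str], batch_size: int
) -> Iterator[tuple[str, list[str]]]:
    """Yield (url, hostnames) batches of at most batch_size hosts.

    servers maps hostname -> datacenter. Datacenters are visited in
    sorted order here; a real rollout could just as well run them in parallel.
    """
    by_dc: dict[str, list[str]] = defaultdict(list)
    for host, dc in servers.items():
        by_dc[dc].append(host)
    for dc, hosts in sorted(by_dc.items()):
        for i in range(0, len(hosts), batch_size):
            yield mirror_url(dc), hosts[i:i + batch_size]


if __name__ == "__main__":
    fleet = {"a1": "dc1", "a2": "dc1", "a3": "dc1", "b1": "dc2"}
    for url, hosts in rolling_batches(fleet, batch_size=2):
        print(url, hosts)
```

    Capping the batch size bounds how many hosts pull from one leader at the same time, which is the link-saturation concern mentioned above.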
    Dec 29, 2021 0
  • P2P should be a good solution for this!
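    A back-of-envelope comparison shows why P2P helps. The numbers here are assumptions, not from the question: 30,000 servers, 10 Gbps links, and a simplified model where every server that already has the file sends the whole file to one new server per round.

```python
def single_source_seconds(n_servers: int, file_gb: float, link_gbps: float) -> float:
    """One origin pushes a full copy to every server over a single link."""
    return n_servers * file_gb * 8 / link_gbps


def doubling_rounds(n_servers: int) -> int:
    """Rounds until the seed plus n_servers all hold the file.

    Holders double each round, so we need 2**r >= n_servers + 1,
    which for n_servers >= 1 equals n_servers.bit_length().
    """
    return n_servers.bit_length()


def doubling_seconds(n_servers: int, file_gb: float, link_gbps: float) -> float:
    """Each round costs one full file transfer over one link."""
    return doubling_rounds(n_servers) * file_gb * 8 / link_gbps


if __name__ == "__main__":
    n, size_gb, gbps = 30_000, 1, 10  # assumed fleet size and link speed
    print(single_source_seconds(n, size_gb, gbps))  # 24000.0 seconds
    print(doubling_rounds(n))                       # 15 rounds
    print(doubling_seconds(n, size_gb, gbps))       # 12.0 seconds
```

    The model ignores disk speed, oversubscribed uplinks, and the piece-level pipelining that real P2P protocols use, so treat the gap (hours vs. seconds) as an order-of-magnitude argument only.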
    Dec 30, 2021 0