Divide work amongst instances - system design

Honeywell wellHoney
May 15 10 Comments

In system design interviews, I’ve been asked this question a couple times and the interviewer doesn’t seem to quite like my answers. How would you answer this question?

A database has units of work to be done every <interval>. The same unit of work will need to be done again and again. A service will filter through the database looking for any work that needs to be done. It’ll grab the entries and do the work.

But the load of work is too much for one instance. You load balance it and make it highly available. Now you have multiple instances all trying to grab for the same work, but you need to prevent the work from being done multiple times at once. Especially consider a fresh start when the first service starts up, then a second, then a third. How to divide work amongst them?

I suggested adding a lock to the database during a grab. Lock, Grab any data that needs to be done, release lock. Now each instance will grab its own set of work to be done sequentially.

Is there a better alternative that I’m missing?

comments

Want to comment? LOG IN or SIGN UP
TOP 10 Comments
  • Long story short, interviewer is looking for this keyword: consistent hashing. Google it.
    May 15 1
    • Amazon / Eng о
      Or distributed mutual exclusion. Or idempotency.
      May 15
  • Accenture / Eng __xjwg
    you can use a hash function on the key to spread the rows to multiple workers

    read the "manual parallelism" on this URL https://docs.oracle.com/en/database/oracle/oracle-database/18/adfns/rwp.html#GUID-413B3E68-5590-4337-956B-B6CFC11FC2DD
    May 15 0
  • Cisco soumikb
    Taking a cue from the real world clouds, I would try dividing this problem in 2 different parts. First is on what basis you divide the workload, the second, how do you choose available instances to assign the chunks received in the previous step. The first step would almost surely be a hash function as mentioned in the other comment. To add to it, you can discuss the parameters for hashing - ideally a combination of them with relative weights would be best. You can also add linked list approach in case if collisions to make your answer robust. For choosing the instances / group criteria, typically least busy CPU or least number of requests-per-unit-of-time are used but you can throw other options like memory usage. Also, icing on the cake would be to discuss few latest advances on resource distribution & load balancing algorithms like Ant Colony, Flock of Birds or Honey bee algo.
    May 15 2
    • Honeywell wellHoney
      OP
      Ok that helps. I’m pretty good on how to divide the work, using hashing and such. And which instance should be chosen via load balancing intelligently. But who actually assigns the work/hash to the instances that are up. And when a new instance comes up or one goes down, who/how is redistributing the work/hashes to the up instances.
      May 15
    • Cisco gutki
      The load balancer does it - the backend controller in LB would have the algorithm implemented and in turn applied via typically a pub-sub mechanism. The health check module in the load balancer would make sure unhealthy instances are kept out of consideration while new or changed healthy instances are added to candidates list.
      May 15
  • Google HegYGql26u
    Your answer is bad because you introduce a bottleneck that will not scale. There are a few ways to slice this cake. But the most straightforward solution is to use consistent hashing modulo number of instance replicas to partition the database rows, and have the replicas pull work from the database rows. Good follow-up question to consider is what happens if the distribution of processing time per unit work is highly heterogenous. (E.g. the work is to transcode videos whose size range from 1mb to 2gb)
    May 16 1
    • Another good follow up question is what happens if a node crashes? Think of it as - how to make sure the trafic remains evenly split when nodes go up and down? Answer: create virtual nodes.
      May 16
  • Indeed / Eng t6
    One option is to use consistent hashing to split the rows in db for each instance, that way if a node is down other nodes can work on those rows that belong to failed node. Zookeeper can hold the consistent shard ranges to co ordinate
    May 15 1
    • Honeywell wellHoney
      OP
      Do you have more information on how Zookeeper does this? I understand you can use a hash and split up the data. But who does the assignment of
      Instance1 = hashes 1-10

      Then once another instance spins up the hashes are redistributed
      Instance1 = hashes1-5
      Instance2 = hashes6-10

      Or alternatively each instance can have an ID, but when a node goes down how specifically does zookeeper inform the remaining nodes they have more slack to pick up?
      May 15