Tech Industry · Nov 3, 2018
Chase ghosted!

Please critique my system design for 'Top K most searched terms'

The problem with system design preparation is that, unlike LeetCode algorithm questions, there is no 'solution' or discussion board. So if any experienced interviewers are familiar with this class of system design question ('Return the top K most searched terms in real time', 'Top K most listened-to songs in real time', etc.), would you kindly review what is a very ROUGH sketch of my design? 'Real time' does not have to be strictly real time; a few minutes of delayed data is tolerable.

1) [Collecting user inputs] Each term submitted by a user hits (from the browser) one of many application servers (each can be a simple REST web service), each on a different node, via a load balancer.

2) [Application logic] Each of these web services maintains in memory a hashmap as well as a min-heap: it keeps counts of {Term: Frequency} in the hashmap and updates the min-heap to track the top K (same as the LeetCode question for the top K most frequent elements). This in-memory structure can use the native data structures of whatever language the web service is running in (Java, Python, JS, etc.), or the data types of a KV in-memory cache such as Redis.

3) [Persistence] So now you have multiple nodes, each populating a hashmap and a min-heap. Every 10 seconds, the server goes through each key (search term) in the in-memory hashmap and, for each search term, computes Hash(SearchTerm) % N, where N is the number of database clusters running (more on this later). In other words, I'm sharding by search term when saving to the DB. It then saves that particular {Term: Frequency} into TWO tables of the DB; the two tables are basically database versions of the hashmap and the heap. Of course we can choose a traditional RDBMS like MySQL and create those two tables to mimic the hashmap and heap, or we can use a KV store such as Redis, with its own file-based persistence, to represent those two structures more naturally.
In other words, each application server flushes the full contents of its hashmap and heap into the DB, where different items of its hashmap may go to different DBs due to sharding. The sharding alleviates heavy writes to a single DB instance, and the every-10-second flush alleviates heavy writes from hot search terms that would otherwise require constant updates to the same key/value. Of course, by 'DB' I mean a cluster of DBs in master-slave replication: we have N such clusters (N master-slave clusters, each on different DB nodes), and within each cluster all writes go to the master. If any application server goes down and loses its cache, then on startup it can seed its cache by reading from the DB. Worst case, it loses 10 seconds' worth of data if it crashes between two flushes.

4) [Aggregation] Then we run a separate application server called the 'Aggregator', which every 5 seconds reads from the slave of each shard, in particular the 'heap' table (it does not need to care about the hashmap table). It thus gets N top K's, creates a global min-heap of size K, merges the contents of all N top K's into this global heap, and there you have your global top K, which it writes into a separate DB. So we have an updated top K every 5 seconds.

5) [Notes] I haven't touched on space estimation, nor the details of my consistency/availability choices (PA or PC). I also realize there are more sophisticated top-K algorithms besides hashmap + heap, such as lossy counting or count-min sketch, but I think for a system design question the hashmap + heap algorithm (same as LeetCode's top K most frequent elements) is sufficient, and the interviewer is more interested in the system design part. I'm also not expanding the scope of the question to 'top K in the past minute, hour, day, etc.'; that calls for a more enhanced design that I will tackle in a different version.
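To make the flush in step 3 concrete, here's a rough Python sketch of just the grouping-by-shard part (the shard count and helper names are made up; a real flush would write each batch to that shard's master):

```python
import hashlib
from collections import defaultdict

N_SHARDS = 4  # number of DB clusters; made-up for illustration

def shard_of(term, n=N_SHARDS):
    # Python's built-in hash() is salted per process, so use a real
    # digest so every app server agrees on the shard for a term.
    return int(hashlib.md5(term.encode()).hexdigest(), 16) % n

def flush(counts):
    """Group the in-memory {term: frequency} map into per-shard batches."""
    batches = defaultdict(list)
    for term, freq in counts.items():
        batches[shard_of(term)].append((term, freq))
    return dict(batches)

batches = flush({"jobs": 9, "blind": 7, "tc": 5})
print(batches)
```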
At this point, is there anything major that's wrong or missing from the above that I need to revise or add? What details would you want to hear about, and what areas would you typically ask the candidate to zoom in on at this point? Thank you for reading!


Google hvnner Nov 3, 2018

Step 4 will produce incorrect results. An entry in the global top K does not necessarily have to be in any of the local top K's: an entry could have been (K+1)-th in every shard and still make it into the global top K. Also, generally start out with a simple high-level solution and approach, then go ahead with optimizations. Your solution trades off correctness, freshness, etc., all of which could be valid, but you don't want to start with that. Also, you need to talk about read-side issues, since writes could be an order of magnitude fewer than reads. Also, simplify and eliminate components as much as possible, or justify the complexity: there's no need for a heap table. Can we asynchronously compute top K's from application logs and simplify the applications themselves?

Chase ghosted! OP Nov 3, 2018

Thanks. Can you give an example of an entry in the global top K that is not in any local top K? Also, if I eliminate the heap table, will my aggregator server read from each application server's in-memory heap to build the global top K?

Google hvnner Nov 3, 2018

Shard 1 - (a:4, b:3), c:2
Shard 2 - (x:6, y:5), c:4
Global top K - (x:6, y:5)
However, ideally it should be x:6, c:6. Try to eliminate the heap entirely and start with that.
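To make the counterexample concrete, a quick Python sketch (Counter here is just standing in for the shard tables):

```python
from collections import Counter

# Per-shard counts from the example above: 'c' misses both local
# top-2 lists even though it has the joint-highest global total.
shard1 = Counter({"a": 4, "b": 3, "c": 2})
shard2 = Counter({"x": 6, "y": 5, "c": 4})
k = 2

# Merging only the local top-k lists (what the Aggregator does):
local_tops = (Counter(dict(shard1.most_common(k)))
              + Counter(dict(shard2.most_common(k))))
print(local_tops.most_common(k))  # [('x', 6), ('y', 5)] -- 'c' is invisible

# Merging the full counts instead:
truth = shard1 + shard2
print(sorted(t for t, _ in truth.most_common(k)))  # ['c', 'x'] -- both at 6
```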

Amazon awssdm Nov 3, 2018

I don't get a sense of how this would work. What's the query that your aggregator uses? Also, when you say things like 'MySQL, or of course a KV store like Redis', it looks bad, because they have vastly different capabilities as applied to your problem. It looks like you're hand-waving through them without understanding or thinking it through.

Chase ghosted! OP Nov 3, 2018

Thanks for pointing out the hand-waving. I often struggle with how detailed I need to be, since time is limited and I don't want to risk getting bogged down in things the interviewer isn't interested in.

Snapchat qkSQ81 Nov 3, 2018

One other impractical design choice is sharding the key by taking a modulo on the number of DB instances. This means you will have to repartition all the keys if you scale the DBs up or down. Read about consistent hashing. Also, in general, read this book: https://dataintensive.net/. Very highly recommended; a must-read for anyone who has not taken a distributed systems course in college or attended Principal Engineer talks at major companies like Amazon, Facebook, or Google.
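A toy sketch of the repartitioning problem (node names and vnode count are made up; real implementations add replication, weighted nodes, etc.):

```python
import bisect
import hashlib

def h(s):
    # Stable hash onto a big integer ring (built-in hash() is salted).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

keys = [f"term{i}" for i in range(10000)]

# Modulo sharding: going from 4 to 5 DBs remaps most keys.
moved_mod = sum(h(k) % 4 != h(k) % 5 for k in keys)

# Toy consistent-hash ring with virtual nodes:
def make_ring(nodes, vnodes=100):
    ring = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
    points = [p for p, _ in ring]
    owners = [n for _, n in ring]
    return points, owners

def lookup(points, owners, key):
    # The first ring point clockwise of the key's hash owns it.
    i = bisect.bisect(points, h(key)) % len(points)
    return owners[i]

p4, o4 = make_ring(["db0", "db1", "db2", "db3"])
p5, o5 = make_ring(["db0", "db1", "db2", "db3", "db4"])
moved_ch = sum(lookup(p4, o4, k) != lookup(p5, o5, k) for k in keys)

print(f"modulo: {moved_mod / len(keys):.0%} of keys moved")      # ~80%
print(f"consistent hashing: {moved_ch / len(keys):.0%} moved")   # ~20%
```

With consistent hashing, adding one node only steals roughly 1/N of the key space instead of reshuffling nearly everything.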

Chase ghosted! OP Nov 3, 2018

Yes, good point about consistent hashing. Doesn't that mean all other hashing techniques are no good, since there will always be a need to rebalance the hash? Should I always choose consistent hashing, or will the interviewer otherwise think I have a knowledge gap in sharding design?

Snapchat qkSQ81 Nov 3, 2018

There are broadly two kinds of partitioning used in practice: hash-based (consistent hashing is assumed) and key-range-based. You can choose either when talking about system design, as long as you also talk about their tradeoffs. Read the book for more details.


Google Bazelblaze Nov 3, 2018

Look into data sketches; these will reduce your memory/storage requirements. The key here is that you want the sketches to be mergeable, since you are aggregating. https://datasketches.github.io/docs/FrequentItems/FrequentItemsOverview.html Another term to google is 'heavy hitters sketch'.
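For a feel of how these work, here's a toy Misra-Gries ("frequent items") sketch in Python. It's mergeable, which is exactly what your aggregation step needs. Illustrative only; use the library above in practice:

```python
from collections import Counter

class MisraGries:
    """Toy frequent-items sketch: at most k counters, each count an
    underestimate of the true frequency by at most n/(k+1)."""

    def __init__(self, k):
        self.k = k
        self.counters = Counter()

    def add(self, item):
        if item in self.counters or len(self.counters) < self.k:
            self.counters[item] += 1
        else:
            # No free counter: decrement everyone, drop zeros.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def merge(self, other):
        merged = self.counters + other.counters
        if len(merged) > self.k:
            # Shrink back to k counters: subtract the (k+1)-th largest
            # count from everything and drop what falls to zero or below.
            cut = sorted(merged.values(), reverse=True)[self.k]
            merged = Counter({x: n - cut for x, n in merged.items() if n > cut})
        out = MisraGries(self.k)
        out.counters = merged
        return out

s1, s2 = MisraGries(2), MisraGries(2)
for t in ["a"] * 5 + ["b"] * 3 + ["c"]:
    s1.add(t)
for t in ["a"] * 4 + ["d"] * 2:
    s2.add(t)
print(s1.merge(s2).counters)  # Counter({'a': 6}); true count of 'a' is 9
```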

Google UUKg61 Nov 4, 2018

The key to successful system design interviews is being able to defend your answers, and even more so making your assumptions very, very clear to the interviewer. Remember that there is no one right solution for such questions.

1) Why REST and not something more efficient like gRPC or websockets? Why directly from the browser? Should it come from the main service internally? Is this service behind a firewall for internal consumption, or available to external users? Is the mode of ingestion the same as the mode of consumption? The point is: the fewer the assumptions, the better.

2) Tradeoffs of storing in memory vs. Redis. How would you manage consistency between the map and the heap? It's easier for an in-memory structure but harder for distributed caches; it's a consistency vs. performance tradeoff. When do you select one large node vs. multiple small nodes, and does the search-term vocabulary make a difference? How do you avoid hotspotting, e.g., 80% of users searching a single term within a short time? What about the read/write ratio? If it's mostly writes with few reads, does keeping the data sorted make sense?

3) Why do this asynchronously? Can we put a queue/buffer between the app and the DB? The speed of processing is then limited by back pressure from DB writes instead of risking the loss of 10 seconds of data. While flushing, can we consider writing the count and value together so that they can be persisted as an atomic transaction? The persistence model can be very different from the in-memory structures; there's no need to store them in two different tables. Multi-DB is a nice idea, but use consistent hashing instead of modulo-N hashing.

4) Unnecessary complication. If you already have the in-memory representation, what is the use case for this? The use case could be recovery from a node failure, but creating a min-heap every 5 seconds is doing more work than needed. Why not have the global heap/map updated directly? That would have no extra memory implications beyond what it already consumes.

I would stay away from 5-second, 10-second kinds of timelines. What is the guarantee that the task finishes within that window? If it doesn't, what is the protection against delays or jobs running in parallel? As I said earlier, there are no right answers for such questions :) Just make your decisions explicit and make sure to give alternatives. Let the interviewer pick the path they want to take. And you are right about staying a bit away from advanced DS/algo during system design.

Chase ghosted! OP Nov 4, 2018

Thank you for the detailed comments. I may reach out to you in private for some follow-up, if you don't mind.

Google UUKg61 Nov 4, 2018

Sure. Not a problem.

Commvault 🖕racism Nov 4, 2018

Shouldn’t you be using a max heap and not a min heap?

Chase ghosted! OP Nov 4, 2018

The min-heap is right; see the LeetCode question 'Top K Frequent Elements'. You keep a min-heap capped at size K, so the root is the smallest of the current top K and a new candidate only has to beat the root.
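A quick sketch of both approaches on toy data (frequencies are made up), showing why the size-K min-heap is the one that scales:

```python
import heapq

freqs = {"jobs": 9, "blind": 7, "tc": 5, "offer": 3, "wlb": 1}
k = 2

# Option 1: max-heap over everything, pop k times. Correct, but it
# holds all n items in memory.
max_heap = [(-f, t) for t, f in freqs.items()]
heapq.heapify(max_heap)
via_max = [heapq.heappop(max_heap)[1] for _ in range(k)]

# Option 2: min-heap capped at size k. The root is the weakest of the
# current top k, so each new term only has to beat the root:
# O(n log k) time and only k items in memory.
min_heap = []
for t, f in freqs.items():
    if len(min_heap) < k:
        heapq.heappush(min_heap, (f, t))
    elif f > min_heap[0][0]:
        heapq.heapreplace(min_heap, (f, t))
via_min = [t for _, t in sorted(min_heap, reverse=True)]

print(via_max, via_min)  # ['jobs', 'blind'] ['jobs', 'blind']
```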

Nvidia Saaam Nov 4, 2018

Would a trie be a better data structure than a hash map? It would be more memory-efficient.

Wish dfAH63 Feb 1, 2023

I found the same thing missing in all solutions for this particular question: is the top K over a sliding window, or reset at a fixed window boundary? If it's a sliding window (e.g., top K in the past hour, refreshed every 5 mins), then some TTL indexing is needed so we don't count stale data.
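For example, a rough sketch of the bucketed approach (5-minute buckets over a 1-hour window; sizes and names are illustrative), where stale buckets falling off the left edge play the TTL role:

```python
import time
from collections import Counter, deque

class WindowedCounts:
    """Sliding-window counts: one Counter per fixed-size time bucket;
    top K sums the buckets covering the window, expiring stale ones."""

    def __init__(self, bucket_secs=300, window_secs=3600):
        self.bucket_secs = bucket_secs
        self.n_buckets = window_secs // bucket_secs
        self.buckets = deque()  # (bucket_id, Counter), oldest first

    def _bucket_id(self, ts):
        return int(ts) // self.bucket_secs

    def record(self, term, ts=None):
        bid = self._bucket_id(ts if ts is not None else time.time())
        if not self.buckets or self.buckets[-1][0] != bid:
            self.buckets.append((bid, Counter()))
        self.buckets[-1][1][term] += 1

    def top_k(self, k, ts=None):
        now = self._bucket_id(ts if ts is not None else time.time())
        # Expire buckets older than the window (the TTL part).
        while self.buckets and self.buckets[0][0] <= now - self.n_buckets:
            self.buckets.popleft()
        total = Counter()
        for _, c in self.buckets:
            total += c
        return total.most_common(k)

w = WindowedCounts()
w.record("old", ts=0)       # falls outside the window queried below
w.record("hot", ts=4000)
w.record("hot", ts=4100)
w.record("warm", ts=4200)
print(w.top_k(2, ts=4300))  # [('hot', 2), ('warm', 1)]
```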