Say we design our metadata structures for Instagram as follows:
1) Photos table (KV store or relational): key is PhotoId, and value is a list of [UserId, PhotoPath, UserLocation].
2) UserPhoto table: key is UserId, and value is a list of PhotoIds the user has uploaded.
3) Follows table: key is UserId, and value is a list of UserIds this user follows.
Grokking System Design says to shard by PhotoId instead of UserId, and concludes it's better that way. But don't you need to shard each of those 3 tables? How exactly do you have a choice between sharding by PhotoId or UserId? If you are storing the UserPhoto table, you would need to shard it by UserId, and the Follows table needs to be sharded by UserId too. How can you shard only by PhotoId without sharding by UserId separately? Do we not have to shard all three tables? What's the general sharding / hashing strategy in this case?
So the conclusion is: shard the photos by photo ID with a time component, and shard the relationship data and user data as well. It can work.
Try sharting
The Instagram example doesn't really go deep into detail, but let me help you connect the dots. First, why do we need sharding? We want to store more data than one server can hold. That's data partitioning. Second, what's our data access pattern? In the case of photos, we want all photos for a given user, so we want to keep them together. That's data locality (I may be misusing the term here, but the idea is to send one request to fetch them all). Based on the access pattern and scalability requirements, you can derive your sharding approach.
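To make the locality point concrete, here is a minimal sketch (not the course's implementation) of hashing a user ID to pick a shard, so that all of one user's photo IDs land on the same shard and can be fetched with one request. The shard count, the in-memory dicts standing in for servers, and the names `shard_for`, `add_photo` and `photos_of` are all assumptions made up for illustration.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each "shard" is modeled as an in-memory dict: user_id -> list of photo_ids.
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]


def add_photo(user_id, photo_id):
    """Store the photo ID on the shard that owns this user."""
    shards[shard_for(user_id)][user_id].append(photo_id)


def photos_of(user_id):
    """Fetch all of a user's photo IDs by asking exactly one shard."""
    return list(shards[shard_for(user_id)].get(user_id, []))


add_photo(42, 1001)
add_photo(42, 1002)
add_photo(7, 2001)
print(photos_of(42))
```

The trade-off of keying on the user is that a very active user makes one shard hot, which is the usual argument for considering the photo ID as the key instead.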
But following your thought process, wouldn't we want to shard by user ID instead of by photo ID?
Thanks. What you said makes sense for the photo metadata table. But there's another table that is keyed by user ID and keeps a list of photo IDs for each user. Shouldn't we shard that table too, and if we do, don't we need to shard it by user ID?
You should walk through your use case. For an Instagram-like system, the home feed, which shows recent photos from the people a user follows, is what gets accessed the most, so the primary sharding should be based on that; maybe a time component in the photo ID is the first layer. The follow relationships can then be sharded too. Since this kind of information does not change frequently, we can treat it, and the user information as well, as nearly static, and use some mechanism to sync the master data with the static sharded copies.
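One way to read "time in the photo ID" is sketched below, under assumptions of my own: the ID packs an epoch timestamp into the high bits and a per-second sequence number into the low bits, so IDs sort by upload time and can still be spread across shards with a modulo. The bit width, the shard count, and the names `make_photo_id`, `timestamp_of` and `shard_for_photo` are hypothetical.

```python
SEQ_BITS = 20                 # assumed number of low bits for the sequence number
SEQ_MASK = (1 << SEQ_BITS) - 1
NUM_SHARDS = 4                # assumed shard count, for illustration only


def make_photo_id(epoch_seconds, sequence):
    """Pack the upload time into the high bits and a sequence into the low bits."""
    return (epoch_seconds << SEQ_BITS) | (sequence & SEQ_MASK)


def timestamp_of(photo_id):
    """Recover the upload time from the high bits of the ID."""
    return photo_id >> SEQ_BITS


def shard_for_photo(photo_id, num_shards=NUM_SHARDS):
    """Pick the shard from the photo ID alone, without knowing the user."""
    return photo_id % num_shards


older = make_photo_id(1_700_000_000, 1)
newer = make_photo_id(1_700_000_060, 2)

# Sorting by ID descending gives newest-first order with no separate time index.
feed = sorted([older, newer], reverse=True)
print([timestamp_of(pid) for pid in feed])
```

With IDs like this, a feed query can sort or range-filter on the ID itself, while the user-to-photo and follow lists are still looked up by user ID in their own tables.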