Say we design our metadata structures for Instagram as follows:
1) Photos table (KV store or relational): the key is PhotoId, and the value is [UserId, PhotoPath, UserLocation].
2) UserPhoto table: the key is UserId, and the value is the list of PhotoIds the user has uploaded.
3) Follows table: the key is UserId, and the value is the list of UserIds this user follows.
The Grokking system design course says to shard by PhotoId instead of UserId, and concludes it's better that way. But don't you need to shard each of those 3 tables? How exactly do you have a choice between sharding by PhotoId or UserId? If you are storing the UserPhoto table, you would need to shard it by UserId, and the Follows table needs to be sharded by UserId too. How can you shard only by PhotoId without sharding by UserId separately? Do we not have to shard all three tables? What's the general sharding / hashing strategy in this case?
So the conclusion is: shard the photos by photo ID with a time component, and share the relationship data and user data across the shards. It can work.
Try sharting
The Instagram example doesn't go into much detail, but let me help you connect the dots. First, why do we need sharding? We want to store more data than one server can hold. That's data partitioning. Second, what's our data access pattern? In the case of photos, we want all photos for a given user, so we want to keep them together. That's data locality (I may be misusing the term here, but the idea is to send one request to fetch them all). From the access pattern and the scalability requirements you can derive your sharding approach.
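To make the locality point concrete, here is a minimal sketch in Python. It assumes a hash of the user ID picks the shard, and it uses in-memory dicts as stand-ins for real servers. `NUM_SHARDS`, `shard_for`, `add_photo`, and `photos_for_user` are hypothetical names for illustration, not anything from the course.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index with a stable hash (same key, same shard)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# One dict per shard, standing in for one server: user_id -> list of photo records.
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]

def add_photo(user_id: str, photo_id: str, path: str) -> None:
    """Store the photo record on the shard chosen by the owner's user ID."""
    shards[shard_for(user_id)][user_id].append({"photo_id": photo_id, "path": path})

def photos_for_user(user_id: str) -> list:
    """Fetch all of a user's photos with a lookup on exactly one shard."""
    return list(shards[shard_for(user_id)].get(user_id, []))

add_photo("alice", "p1", "/photos/alice/1.jpg")
add_photo("alice", "p2", "/photos/alice/2.jpg")
print(photos_for_user("alice"))
```

Because the shard is derived from the user ID, every photo of one user lands on the same shard, which is the "one request to fetch them all" property. The trade-off is that a very active user makes that one shard hot.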
But following your thought process, wouldn't we want to shard by user ID instead of by photo ID?
Thanks. What you said makes sense for the photo metadata table. But there's another table that is keyed by user ID and keeps a list of photo IDs for each user. Shouldn't we shard that table too? And if we do, don't we need to shard it by user ID?
You should walk through your use case. For an Instagram-like system, the home feed, which shows recent photos from everyone the user follows, is accessed the most, so the first sharding decision should be based on that: maybe the time component of the photo ID is the first layer. The follow relationships can then be shared across shards. Since this kind of information doesn't change frequently, we can treat it, along with the user information, as nearly static, and use some mechanism to keep the static shared copies in sync with the master data.
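One possible reading of "time as the first layer of the photo ID", sketched in Python. It assumes the upload time sits in the high bits of the ID and a per-process sequence number in the low bits. `SEQ_BITS`, `NUM_SHARDS`, and all function names are hypothetical, and a real system would need a sequence generator that is safe across machines.

```python
import itertools

SEQ_BITS = 10    # hypothetical: low bits reserved for a sequence number
NUM_SHARDS = 4   # hypothetical shard count
_counter = itertools.count()

def make_photo_id(epoch_seconds: int) -> int:
    """Build an ID with the upload time in the high bits and a sequence in the low bits."""
    seq = next(_counter) % (1 << SEQ_BITS)
    return (epoch_seconds << SEQ_BITS) | seq

def upload_time(photo_id: int) -> int:
    """Recover the upload time (epoch seconds) from the ID."""
    return photo_id >> SEQ_BITS

def shard_for_photo(photo_id: int) -> int:
    """Pick a shard from the photo ID alone, with no user ID involved."""
    return photo_id % NUM_SHARDS

def newest_first(photo_ids):
    """Sorting by ID also sorts by upload time, since time occupies the high bits."""
    return sorted(photo_ids, reverse=True)

a = make_photo_id(1_700_000_000)
b = make_photo_id(1_700_000_500)
print(newest_first([a, b]) == [b, a], shard_for_photo(a), shard_for_photo(b))
```

With IDs like these, the feed can merge the latest photos from several shards by ID order without a separate timestamp index, while uploads still spread across shards.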