Tech Industry Jul 16, 2019

Sharding/Hashing strategies in a social network

Say we design our metadata structures for Instagram as follows:
1) Photos table (KV store or relational): the key is PhotoId, and the value is a list of [UserId, PhotoPath, UserLocation].
2) UserPhoto table: the key is UserId, and the value is the list of PhotoIds the user has uploaded.
3) Follows table: the key is UserId, and the value is the list of UserIds this user follows.

The Grokking system design course says to shard by PhotoId instead of UserId, and it concludes that this works better. But don't you need to shard each of those three tables? How exactly do you have a choice between sharding by PhotoId and sharding by UserId? If you are storing the UserPhoto table, you would need to shard it by UserId, and the Follows table needs to be sharded by UserId too. How can you shard only by PhotoId without also sharding by UserId separately? Don't we have to shard all three tables? What is the general sharding/hashing strategy in this case?
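To make the question concrete, here is a minimal Python sketch of the three tables, each partitioned by its own key. The hash-mod-N scheme, the shard count, the `shard_for` and `partition` helpers, and the sample rows are all assumptions for illustration; none of them come from the course.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index with a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a digest is used to keep the mapping repeatable.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The three tables from the question, as plain dicts (sample rows are made up).
photos = {
    "p1": ("u1", "/photos/p1.jpg", "NYC"),
    "p2": ("u1", "/photos/p2.jpg", "NYC"),
    "p3": ("u2", "/photos/p3.jpg", "SF"),
}
user_photos = {"u1": ["p1", "p2"], "u2": ["p3"]}
follows = {"u1": ["u2"], "u2": []}


def partition(table: dict, num_shards: int = NUM_SHARDS) -> list:
    """Split a table into per-shard dicts, using the table's own key."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in table.items():
        shards[shard_for(key, num_shards)][key] = value
    return shards


photo_shards = partition(photos)            # keyed by PhotoId
user_photo_shards = partition(user_photos)  # keyed by UserId
follow_shards = partition(follows)          # keyed by UserId
```

In this sketch the Photos table is split by PhotoId while the other two are split by UserId, which is the situation the question is asking about.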

vTbL18 Jul 16, 2019

You should walk through your use case. For an Instagram-like system, the home feed, which shows recent photos from the people a user follows, is accessed the most, so the first level of sharding should be based on that. Maybe the time component of the PhotoId is the first layer. The follow relationships can then be shared: since this kind of information does not change frequently, we can treat it, and even the user information, as static, and use some mechanism to sync the master data with the static shared copies.

vTbL18 Jul 16, 2019

So the conclusion is to shard photos by PhotoId using a time range, and to share the relationship data and user data across the shards. That can work.
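One way to read "PhotoId with a time range" is an ID that embeds the upload time, so the shard can be derived from the time window. The sketch below assumes that reading; the bit layout, the one-day window, and the names `make_photo_id`, `timestamp_of`, and `shard_for_photo` are hypothetical, not something the commenter specified.

```python
SEQ_BITS = 20                    # assumed: low bits hold a per-second sequence number
WINDOW_SECONDS = 24 * 60 * 60    # assumed: one-day time windows
NUM_SHARDS = 4                   # assumed shard count


def make_photo_id(epoch_seconds: int, sequence: int) -> int:
    """Pack the upload time into the high bits and a sequence into the low bits."""
    if not 0 <= sequence < (1 << SEQ_BITS):
        raise ValueError("sequence out of range")
    return (epoch_seconds << SEQ_BITS) | sequence


def timestamp_of(photo_id: int) -> int:
    """Recover the upload time (epoch seconds) from a photo ID."""
    return photo_id >> SEQ_BITS


def shard_for_photo(photo_id: int) -> int:
    """Pick a shard from the time window the photo was uploaded in."""
    window = timestamp_of(photo_id) // WINDOW_SECONDS
    return window % NUM_SHARDS


day_start = 18093 * WINDOW_SECONDS           # start of an arbitrary one-day window
a = make_photo_id(day_start, 0)
b = make_photo_id(day_start + 3600, 5)       # same window, one hour later
c = make_photo_id(day_start + 86400, 0)      # next window
```

With this layout, IDs sort by upload time and photos from the same window land on the same shard, which suits "recent photos" reads. The trade-off is that all writes for the current window go to a single shard.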

Facebook madhuridix Jul 16, 2019

Try sharting

Snapchat bFCN68 Jul 16, 2019

The Instagram example doesn't really go into detail, but let me help you connect the dots. First, why do we need sharding? We want to store more data than one server can hold. That's data partitioning. Second, what's our data access pattern? In the case of photos, we want all photos for a given user, so we want to keep them together. That's data locality (I may be misusing the term here, but the idea is to send one request to fetch them all). Based on the access pattern and scalability requirements, you can derive your sharding approach.
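The locality point can be sketched by counting how many shards a "fetch all photos for this user" request has to touch under each shard key. This is only an illustration: the hash-mod-N scheme, the shard count, and the helper names `stable_shard` and `shards_touched` are assumptions.

```python
import hashlib

NUM_SHARDS = 8  # assumed shard count


def stable_shard(key: str) -> int:
    """Stable hash-mod-N shard assignment (an assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shards_touched(user_id: str, photo_ids: list, shard_by: str) -> set:
    """Return the set of shards needed to read every photo of one user."""
    if shard_by == "user":
        # All of the user's photo rows live on the user's shard: one request.
        return {stable_shard(user_id)}
    if shard_by == "photo":
        # Each photo row lives wherever its own ID hashes: possibly many requests.
        return {stable_shard(pid) for pid in photo_ids}
    raise ValueError("shard_by must be 'user' or 'photo'")


photo_ids = [f"photo-{i}" for i in range(50)]
by_user = shards_touched("user-42", photo_ids, "user")
by_photo = shards_touched("user-42", photo_ids, "photo")
```

Sharding by user always yields exactly one shard for this query, while sharding by photo can spread the same read over up to `NUM_SHARDS` shards.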

eBay bluechi Jul 16, 2019

But following your thought process, wouldn't we want to shard by user ID instead of by photo ID?

Chase piv OP Jul 17, 2019

Thanks. What you said makes sense for the photo metadata table. But there's another table that is keyed by user ID and keeps a list of photo IDs for each user. Shouldn't we shard that table too? And if we do, we need to shard it by user ID, don't we?