Tech Industry Jul 18, 2019

Is it common to have multiple tables inside each 'logical shard'?

When reading https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c I came across this quote: "Each ‘logical’ shard is a Postgres schema in our system, and each sharded table (for example, likes on our photos) exists inside each schema."

Setting aside Postgres specifics for a minute and considering this abstractly: is it saying that each 'logical shard' contains multiple tables (photos, users, etc.)? So if they have 100 logical shards, each shard would contain a table called Photos and another table called Users (just an example, there could be more tables)? And within each logical shard, its Photos table and Users table would contain a subset of all photos and all users?

So if a user snaps a photo and wants to upload it, the system first finds the shard ID by hash(userId) % N (N being the number of logical shards), goes to that shard, and then updates its Photos table and Users table respectively, except that it attaches the shard ID to the photoID when updating the Photos table.

So when the Grokking system interview case study says 'we can shard by UserId or PhotoId', they mean: we can apply a hash function to either a given userId or photoId to find the logical shard, and inside the logical shard we will still have to update both the Users table and the Photos table? And whichever ID (userId or photoId) we decide to shard by, we'll have to find a way to embed the shard ID in the other ID if we ever want to look things up by that other ID.

Am I more or less correct? Anything fundamental that I misunderstood? Thanks
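
A minimal sketch of the routing described in the question, to make "multiple tables inside each logical shard" concrete. The table names, the `shardNNNN` schema naming, and the helper functions are all hypothetical; the modulus of 2000 follows the article's `user_id % 2000` example, and a plain modulo stands in for the `hash(userId) % N` step.

```python
NUM_LOGICAL_SHARDS = 2000  # assumed N, following the article's user_id % 2000 example

# Hypothetical table names: every logical shard (schema) holds the same set of tables.
SHARDED_TABLES = ("users", "photos", "likes")


def logical_shard_for(user_id: int) -> int:
    """Pick the logical shard from the user ID alone (plain modulo, no hashing)."""
    return user_id % NUM_LOGICAL_SHARDS


def qualified_table(shard_id: int, table: str) -> str:
    """Build a schema-qualified table name such as 'shard1341.photos'."""
    if table not in SHARDED_TABLES:
        raise ValueError(f"unknown sharded table: {table}")
    return f"shard{shard_id}.{table}"


def tables_touched_by_upload(user_id: int) -> list:
    """A photo upload writes only to tables inside the uploader's logical shard."""
    shard_id = logical_shard_for(user_id)
    return [qualified_table(shard_id, t) for t in ("users", "photos")]


print(tables_touched_by_upload(31341))
```

Under these assumptions, each shard's `users` and `photos` tables hold only the rows for the users that map to that shard, which is the "subset of all Photos and all Users" reading in the question.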

Sharding & IDs at Instagram
Instagram Engineering
Uber uber101 Jul 18, 2019

It's listed in a lot of places for simplicity, but it's bad practice to shard your data based on an attribute value like userID or photoID. Sharding should be independent of the actual data and should be determined mostly by balancing storage and traffic evenly across shards.

Facebook vjuiu Jul 18, 2019

Lol, wtf... you are sharding on what exactly?

Chase piv OP Jul 19, 2019

That point aside, is my understanding regarding logical shards and multiple tables correct?

Twitter Oomnj Jul 18, 2019

You can and should have a separate ID generation service that produces IDs that are almost sortable.
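
A rough sketch of what "almost sortable" IDs can look like, assuming the 64-bit layout the Instagram article describes (41 bits of milliseconds since a custom epoch, 13 bits of logical shard ID, 10 bits of per-shard sequence). The function name and the epoch constant are assumptions for illustration, not a real service's API.

```python
import time

EPOCH_MS = 1314220021721  # assumed custom epoch, in milliseconds
SHARD_BITS = 13           # logical shard ID
SEQ_BITS = 10             # per-shard sequence, taken modulo 1024


def make_id(shard_id: int, seq: int, now_ms=None) -> int:
    """Pack (time, shard, sequence) into one integer, time in the high bits."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id does not fit in 13 bits")
    elapsed_ms = now_ms - EPOCH_MS
    return (
        (elapsed_ms << (SHARD_BITS + SEQ_BITS))
        | (shard_id << SEQ_BITS)
        | (seq % (1 << SEQ_BITS))
    )


earlier = make_id(shard_id=1341, seq=5, now_ms=EPOCH_MS + 1000)
later = make_id(shard_id=7, seq=0, now_ms=EPOCH_MS + 1001)
print(earlier < later)
```

Because the timestamp occupies the high bits, IDs created in different milliseconds sort by creation time regardless of shard. Within the same millisecond the order falls back to shard ID and sequence, which is why the ordering is only approximate.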

Google crypt_hoe Jul 18, 2019

Such a snowflake

ABC Financial LLL30987 Jul 18, 2019

This article leaves a lot to the imagination. The SQL function does not in fact set the shardId, so we can assume that it is a partial example. It does not make clear whether the ID is assigned by a database default value via next_id() or in app code. Having read the article thoroughly, I see it gives the example of using the userId to pick the shard, which means all data would effectively be sharded consistently by userId. It does not explain how the SQL function does this. So yes, I think your assumptions are correct.

Google 🐴~~~ Jul 21, 2019

Yes, it looks like they’re saying they shard by userID and include the shardID in the photoID so that you can look up the photo directly without knowing the userID. And they’re indirectly claiming this scales because you can easily map PostgreSQL logical shards to physical shards. They’re probably using a really big N for the modulus. This isn’t consistent hashing.
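
The lookup path described above can be sketched as follows, assuming the article's bit layout (41 bits time, 13 bits logical shard, 10 bits sequence). The host names, the 500-shards-per-host split, and both function names are hypothetical.

```python
SHARD_BITS = 13  # assumed layout: 41 bits time | 13 bits shard | 10 bits sequence
SEQ_BITS = 10

NUM_LOGICAL_SHARDS = 2000                 # assumed, per the article's % 2000 example
LOGICAL_SHARDS_PER_HOST = 500             # hypothetical split
HOSTS = ["pg-a", "pg-b", "pg-c", "pg-d"]  # hypothetical physical servers


def shard_from_photo_id(photo_id: int) -> int:
    """Recover the logical shard from the ID itself; no userID is needed."""
    return (photo_id >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)


def host_for_shard(shard_id: int) -> str:
    """Map a logical shard to the physical server that currently holds it."""
    if not 0 <= shard_id < NUM_LOGICAL_SHARDS:
        raise ValueError(f"no such logical shard: {shard_id}")
    return HOSTS[shard_id // LOGICAL_SHARDS_PER_HOST]


# Example ID: 1387263000 ms after the epoch, shard 1341, sequence 905.
photo_id = (1387263000 << (SHARD_BITS + SEQ_BITS)) | (1341 << SEQ_BITS) | 905
shard = shard_from_photo_id(photo_id)
print(shard, host_for_shard(shard))
```

In this sketch, adding capacity means changing the shard-to-host mapping and moving whole schemas. The modulus, and therefore every existing ID, stays the same, which is what distinguishes this scheme from consistent hashing.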

Chase piv OP Jul 21, 2019

Is this bad design?

Google 🐴~~~ Jul 21, 2019

No I think it’s fine!