In a system design interview i was asked to design the feature for "Like" function on youtube (assume "dislike" out of scope). The interviewer was particularly interested in data model for how would i store the videos liked by any user. I proposed a SQL DB design having a simple table schema that will store all Liked videos. The table will have 3 columns: UserId, VideoId, Timestamp. Everytime any user likes a video it will insert a new row in this table. I got a counter question that in this model userid will repeat for every video liked by the user. How can it be optimized further? I struggled a lot there. Can someone tell what is the ideal data model for storing the videos liked by youtube users? Also would be helpful if you can point me to some reading material to make my fundamentals strong. Didn't get the offer due to overall poor performance #engineering #software #dataengineer
It depends on the database you're using. You're probably going to be using an OLAP MPP database like Redshift, Snowflake, or Vertica for this type of offline analysis, and the columnar data on disk would optimally be run-length compressed, so stored on disk as customer_id * n, where n is the number of times it appears, which will compress very nicely.
Usually platforms like Facebook, YouTube, LinkedIn would store these in a graph dB as nodes and edges - reason is to be able to easily “suggest” videos/friends/colleagues to you based on your relations. Ex. Based on videos you liked, you get suggestions of videos - that’s derived based on what most of other users liked so also liked the video you liked. Same with FBs “people you may know” feature. For the interviewers specific question - I would say use case drives the storage. If just retrieving a list of videos you liked is the use case, any OLTP could do it. If you want to run analytics, an OLAP like Redshift, snowflake could do. If you want more functionality like suggestions, traversal, a graph dB does it. And based on the storage the model changes - OLTP, you would go for normalized, OLAP - denormalized. Graph has its own model and is the most apt - one node for user, one node per video, and based on what you watched, all those videos get linked to the user node with relationship “watched”. If you liked a few of those, an additional relation can be added like “liked” of an attribute of the relation “watched” can be “liked”.
Health & Wellness
Yesterday
982
Lasik cost
Tech Industry
14h
600
What is the minimum number of years you should spend at a company so that it doesn’t look bad?
World Conflicts
11h
462
Israeli precision-guided munition likely killed group of children playing foosball in Gaza, weapons experts say
AMA
Yesterday
1274
PM Manager, early 40s, married and ENM (Ethical Non Monogamous) AMA
India
13h
563
'Hindutva': The Radical Hindu Ideology That Seeks to 'Push Christianity Out of India’
You might want another table with like aggregates for each video so that you don’t have to do a select count(*) where video_id on that table each time you want to display total likes. Instagram had a similar problem and they solved it by storing the total somewhere and so they can do a single select vs select count()
Thanks. His specific question was- "is it ok that userId is repeating for every video liked by that user? Can there be a better data model?"
It's okay