Software Engineering CareerDec 13, 2020

Data model for videos liked on youtube

In a system design interview i was asked to design the feature for "Like" function on youtube (assume "dislike" out of scope). The interviewer was particularly interested in data model for how would i store the videos liked by any user. I proposed a SQL DB design having a simple table schema that will store all Liked videos. The table will have 3 columns: UserId, VideoId, Timestamp. Everytime any user likes a video it will insert a new row in this table. I got a counter question that in this model userid will repeat for every video liked by the user. How can it be optimized further? I struggled a lot there. Can someone tell what is the ideal data model for storing the videos liked by youtube users? Also would be helpful if you can point me to some reading material to make my fundamentals strong. Didn't get the offer due to overall poor performance #engineering #software #dataengineer

Sort by :

New

pelmeni Dec 13, 2020

You might want another table with like aggregates for each video so that you don’t have to do a select count(*) where video_id on that table each time you want to display total likes. Instagram had a similar problem and they solved it by storing the total somewhere and so they can do a single select vs select count()

Microsoft OQUG41 OP Dec 13, 2020

Thanks. His specific question was- "is it ok that userId is repeating for every video liked by that user? Can there be a better data model?"

Lyft a*kw*rd$af Dec 13, 2020

It's okay

Amazon GFbS65 Dec 13, 2020

It depends on the database you're using. You're probably going to be using an OLAP MPP database like Redshift, Snowflake, or Vertica for this type of offline analysis, and the columnar data on disk would optimally be run-length compressed, so stored on disk as customer_id * n, where n is the number of times it appears, which will compress very nicely.

Asurion tokkale Dec 13, 2020

Usually platforms like Facebook, YouTube, LinkedIn would store these in a graph dB as nodes and edges - reason is to be able to easily “suggest” videos/friends/colleagues to you based on your relations. Ex. Based on videos you liked, you get suggestions of videos - that’s derived based on what most of other users liked so also liked the video you liked. Same with FBs “people you may know” feature. For the interviewers specific question - I would say use case drives the storage. If just retrieving a list of videos you liked is the use case, any OLTP could do it. If you want to run analytics, an OLAP like Redshift, snowflake could do. If you want more functionality like suggestions, traversal, a graph dB does it. And based on the storage the model changes - OLTP, you would go for normalized, OLAP - denormalized. Graph has its own model and is the most apt - one node for user, one node per video, and based on what you watched, all those videos get linked to the user node with relationship “watched”. If you liked a few of those, an additional relation can be added like “liked” of an attribute of the relation “watched” can be “liked”.

Sort by :

New

pelmeni Dec 13, 2020

Microsoft OQUG41 OP Dec 13, 2020

Thanks. His specific question was- "is it ok that userId is repeating for every video liked by that user? Can there be a better data model?"

Lyft a*kw*rd$af Dec 13, 2020

It's okay

Amazon GFbS65 Dec 13, 2020

Asurion tokkale Dec 13, 2020

Industries

Job Groups

General Topics

Sponsored

Most Read

Data model for videos liked on youtube

Most Read