I’m preparing for a Meta on-site and trying to come up with data models for the scenarios mentioned in the subject. I feel a regular dimensional model is not going to handle the insert/update volume easily. Can anyone share some good examples? Any ideas on a spider schema? #meta #dataengineerinterview #eventdriven #eventdatamodel #tech
It usually deals with creating minutely aggregates, so your millions of events per second turn into the total number of unique entities liked in that minute, which would be in the 1000s; that’s how you minimise the number of writes. Now the 1000s of minutely aggregates can be broadcast across datacenters to achieve global consistency after a minute, or you could do database replication. The raw events can be persisted at their own cadence in a NoSQL store with fast write capabilities.
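To make that concrete, here’s a minimal sketch of the minute-level rollup (event shape, entity IDs, and user IDs are made up for illustration, not anything Meta-specific): raw like events collapse into one row per (minute, entity) holding the unique-liker count, so thousands of aggregate rows get written instead of millions of raw events.

```python
from collections import defaultdict

# Hypothetical raw like events: (entity_id, user_id, epoch_seconds).
events = [
    ("post_1", "u1", 1_700_000_005),
    ("post_1", "u2", 1_700_000_010),
    ("post_2", "u1", 1_700_000_050),  # falls in the next minute bucket
    ("post_1", "u1", 1_700_000_070),  # same minute bucket as the line above
]

def minute_bucket(epoch_seconds: int) -> int:
    """Truncate an epoch timestamp to its minute index."""
    return epoch_seconds // 60

# One aggregate cell per (minute, entity): the set of unique likers seen.
uniques: dict[tuple[int, str], set] = defaultdict(set)
for entity_id, user_id, ts in events:
    uniques[(minute_bucket(ts), entity_id)].add(user_id)

# The rows you actually write/broadcast: (minute, entity) -> unique-liker count.
aggregates = {key: len(users) for key, users in uniques.items()}
```

In a real pipeline the per-minute sets would live in a stream processor’s windowed state (or an approximate sketch like HyperLogLog for very hot entities), and only the counts would be flushed downstream each minute.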
Thanks for the reply. What you described makes sense for operational efficiency, but my use case is more geared towards analytical data models and how efficiently we can store them. I’m looking for a simple example of how normalized source-system-level information capture can be transformed into a target schema. Either way, I learned something from your reply. Thank you.
Given the scale of data flowing in, for analytical use cases you have to define your dimensions and the cardinality of each dimension to come up with a data schema. Say you want to know which cities, genders, categories, and age groups have the most pictures uploaded: the dimensions become the things you want to query on, and cardinality is the number of distinct values each dimension can take. Again, this is not flexible, since you need to come up with the dimensions well in advance, but that’s the trade-off you have to make.
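A rough sketch of that trade-off (dimension names and values below are hypothetical): upload events are rolled up into a cube keyed by the pre-chosen dimensions, so storage and query cost are bounded by the product of the dimensions’ cardinalities rather than by raw event volume, and a question outside those dimensions simply can’t be answered.

```python
from collections import Counter

# Hypothetical upload events, already joined with user attributes.
uploads = [
    {"city": "SF",  "gender": "F", "category": "food",   "age_group": "18-24"},
    {"city": "SF",  "gender": "M", "category": "travel", "age_group": "25-34"},
    {"city": "NYC", "gender": "F", "category": "food",   "age_group": "18-24"},
    {"city": "SF",  "gender": "F", "category": "food",   "age_group": "18-24"},
]

# Dimensions must be fixed up front; their combined cardinality bounds the cube size.
DIMENSIONS = ("city", "gender", "category", "age_group")

# Pre-aggregate: one count cell per combination of dimension values.
cube = Counter(tuple(e[d] for d in DIMENSIONS) for e in uploads)

# Query: which (city, gender, category, age_group) has the most uploads?
top_cell, top_count = cube.most_common(1)[0]
```

This is essentially what rollup in an OLAP store does at ingest time: raw events are dropped and only the dimension-keyed counts survive.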
We have OLAP stores like Druid and Pinot to address that use case; you may want to take a look at those. A lot also depends on whether you need real-time querying capability or batch is enough.
Need this too.