How would you design this system? System outline: 1B records coming in per day, for an advertising company. Focused on real time reporting with specific focus on how to partition the system. Data comes in on a hit stream - and is allowed to be sliced on any of the fields Each hit has the following information: Time Hit type(mobile/desktop) Account Campaign Impressions Interactions Country Browser Bid The first question they asked was how would I design a system to house said data and make it reportable within moments of new data arriving. Specifically they wanted to know how I would partition the data using spark.
Any of the NoSql solutions should work but it really depends on what kinds of metrics the system will be used for ? If say you want counts of impressions and clicks, a columnar database like Cassandra can be used(the answer depends on the what the system is catering for)
Why columnar specifically? They wanted me to describe the system in terms of spark, and I made a faulty assumption it needed to stay in HDFS.
Updating the question to respond to questions below.
Looks like a streaming analytics use case. If so, look at Apache storm. Also Cassandra or DynamoDB or any key value database.
Producer(streams) -- Kafka -- consumer(EMR)-- Dynamodb UI -- API -- Dynamodb EMR is written in Java/Scala with spark For dynamodb you need to identify the unique partitionkey..here like timestamp and hit type
Relationships were not discussed - essentially I was given the above, and then any follow up questions were discouraged