Interview question: design for storing and analyzing big data

eBay WhatTheH
Aug 17, 2018 11 Comments

In one of my interviews I was asked to design a system that handles the following:
The service gets around 20 million requests per day for storing data. We need to develop a dashboard (a graph where the x axis is a date range and the y axis is the number of requests per day, with some filters). This should be real time, i.e. all requests should be reflected on the dashboard with little or no delay. I don't have any experience with big data or NoSQL databases, so I couldn't give a correct answer. Since users can select filters on the dashboard, relational DBs will not work, as joins will be expensive and not fast enough.

How would you solve it? It would be great if someone could explain or point to any useful links.

TOP 11 Comments
  • Salesforce sZlm46
    we’re gonna need a giant series of tubes
    Aug 17, 2018 0
  • eBay BGeez
    There are 2 parts here:
    1. The storage of the data, to make the computations on the fly.
    2. The delivery of the data to the storage.

    For 2, you could use a Kafka-like pub/sub system where the topics are the filters.

    For 1, you would end up using a columnar database with pages and partitions. The partitions would map to date ranges, and you can allocate the pages on a minute-by-minute basis. Each column can map to a specific filter, and the data would be the number of requests for that minute.

    A dashboard request would then read the relevant partition and filter column, and you can compute sums, averages, etc. on the fly.

    This is with a limited understanding of the question and without knowing the interviewer's inclination. I have a lot of clarifying questions, though.

    Edit: looking for other responses/suggestions
    Aug 17, 2018 1
    • eBay WhatTheH
      OP
      Thanks. What kind of clarifying questions would you ask the interviewer?
      Aug 17, 2018
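    The partition/page/column layout described above can be sketched with in-memory dicts. This is a minimal illustration, not a real columnar store; the class name and fields are hypothetical:

    ```python
    from collections import defaultdict
    from datetime import datetime

    class MinuteCounters:
        """Sketch: date partitions -> minute 'pages' -> per-filter columns."""

        def __init__(self):
            # partition (date) -> minute-of-day -> filter value -> request count
            self.partitions = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

        def record(self, ts: datetime, filter_value: str, count: int = 1):
            # Ingest increments one counter cell; no per-request rows are kept.
            self.partitions[ts.date()][ts.hour * 60 + ts.minute][filter_value] += count

        def daily_total(self, day, filter_value):
            # Dashboard read: sum one filter column across all minute pages
            # of a single date partition.
            return sum(minutes[filter_value] for minutes in self.partitions[day].values())

    counters = MinuteCounters()
    t = datetime(2018, 8, 17, 9, 30)
    counters.record(t, "mobile")
    counters.record(t, "mobile")
    counters.record(datetime(2018, 8, 17, 9, 31), "desktop")
    print(counters.daily_total(t.date(), "mobile"))  # 2
    ```

    The point of the layout is that a dashboard query touches only the partitions inside the selected date range, so read cost scales with the range, not with total data volume.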
  • Akamai Technologies gain
    In my opinion, I would have a Kafka -> Spark -> S3 pipeline for storing the data (dumps) and integrate with the ELK stack (Elasticsearch, Logstash, Kibana) for metrics and visualization.
    Aug 17, 2018 0
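    For the Kibana-style dashboard in that setup, the per-day count with a filter maps naturally onto an Elasticsearch `date_histogram` aggregation. A rough sketch of such a query body, assuming an index of raw request documents with `timestamp` and `category` fields (both names are assumptions):

    ```python
    import json

    # Hypothetical dashboard query: daily request counts filtered by category.
    query = {
        "query": {"bool": {"filter": [{"term": {"category": "electronics"}}]}},
        "aggs": {
            "requests_per_day": {
                "date_histogram": {"field": "timestamp", "calendar_interval": "day"}
            }
        },
        "size": 0,  # return only aggregation buckets, not individual documents
    }
    print(json.dumps(query, indent=2))
    ```

    Because the filter is part of the query rather than a precomputed rollup, any combination of attributes can be filtered at read time, at the cost of scanning the matching documents.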
  • Apple / Eng cho Chang
    Ya this seems like designing telemetry.
    Aug 17, 2018 0
  • POWER Engineers / Biz Dev NtTy44
    If this isn’t your domain, why would you be asked this question?
    Aug 17, 2018 1
    • Apple Rjcricksnf
      Haven’t you done system design style interviews before?
      Aug 17, 2018
  • SAP / Other Avacafo
    I’m not an engineer, but aren’t logging stacks like Elasticsearch/Kibana built around exactly such use cases?
    Aug 17, 2018 0
  • Twitter / Eng T Rex
    Is the filter simple (so that a counter per hour for every filter is easy to store) or complex (where precomputing every filter is not feasible)? Does precomputing commonly used filters help?
    Aug 17, 2018 1
    • eBay WhatTheH
      OP
      Filters could be based on category, location, device, and so on, i.e. any number of attributes. Since the data is updated every second, I'm not sure precomputation would work here.
      Aug 17, 2018
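      The feasibility concern above can be made concrete: precomputing every filter combination grows multiplicatively with attribute cardinalities. A back-of-the-envelope check, with example cardinalities that are pure assumptions:

      ```python
      from math import prod

      def rollup_count(cardinalities):
          # Each attribute contributes (values + 1) choices per combination;
          # the +1 is "no filter applied on this attribute".
          return prod(c + 1 for c in cardinalities)

      # e.g. 30 categories, 200 locations, 5 device types:
      print(rollup_count([30, 200, 5]))  # 37386 rollup rows per day
      ```

      A few low-cardinality attributes are cheap to precompute, but each additional attribute multiplies the rollup count, which is why fully precomputed filters stop being feasible as attributes are added.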
  • I think a key detail is that you don't need to display the requests themselves, just a count per day based on the filter.
    Aug 17, 2018 0