Really heavy dataset, how do I fetch queries efficiently? Assume 1 trillion JSON records. Are there any NoSQL database optimization engines available on the market? I haven't worked with NoSQL database optimization engines, so any pointers would be helpful. Let's say the current query takes 10 seconds; how do I improve the performance to 1 second?
Schema? Is the query based on key? Based on value? It's a vague question.
Yeah, what's your query? How is your data partitioned? What are the writes to your table like? Is it a stationary dataset? How big is each record? What information are you retrieving from each query: the entire record or just the id? What is the structure of your JSON records? How much memory does your NoSQL store have? Is your table's key space partitioned by date? Do you do aggregates at write time? I think they are just looking for you to ask the right questions.
Write an ETL job to make the data fit a relational DB. Create keys and add an index. They could be testing whether you say something naive like "use MongoDB" without understanding the tradeoffs. It looks like they were specifically asking about NoSQL, though.
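A minimal sketch of that idea, using SQLite as a stand-in relational DB; the table, fields (name, salary), and sample records are made up for illustration:

```python
import sqlite3

# Hypothetical sample of the JSON records after extraction.
records = [
    {"id": 1, "name": "Ada", "salary": 120000},
    {"id": 2, "name": "Grace", "salary": 135000},
    {"id": 3, "name": "Alan", "salary": 110000},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)"
)
# Index the column the range queries filter on.
conn.execute("CREATE INDEX idx_salary ON employees (salary)")

# Transform + load: flatten each JSON record into a row.
rows = [(r["id"], r["name"], r["salary"]) for r in records]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

# The index turns the salary range scan into an index range lookup.
hits = conn.execute(
    "SELECT name FROM employees WHERE salary BETWEEN ? AND ? ORDER BY salary",
    (115000, 140000),
).fetchall()
print(hits)  # [('Ada',), ('Grace',)]
```

At 1 trillion records the load would of course be a distributed batch job rather than a single script, but the shape (flatten, key, index) is the same.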
Or, if cost isn't an issue, use AWS DocumentDB if each record is large, or AWS DynamoDB if each record is less than 4 KB. Both have SLAs for single-digit-millisecond latency.
Was thinking same thing
What is the query you are trying to optimize? Single gets? Range queries? Joins?
It's a mix of simple and complex queries. Simple, like getting the names from all records; complex, like getting a range of salaries.
If you're getting names from all records, the issue isn't query time; it's probably data transfer time. Also, you should probably be paginating such queries.
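A quick sketch of keyset (cursor-based) pagination, which is what you'd want instead of streaming the whole result set; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 11)],
)

def fetch_page(conn, after_id, page_size):
    # Seek past the last id of the previous page; unlike OFFSET, this stays
    # cheap no matter how deep into the table you are.
    return conn.execute(
        "SELECT id, name FROM people WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()

page1 = fetch_page(conn, 0, 4)              # ids 1-4
page2 = fetch_page(conn, page1[-1][0], 4)   # ids 5-8
```

The client carries the last id it saw as a cursor; the server never materializes the full result.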
You can't run joins, and they wouldn't be efficient anyway. For a DB like Cassandra, there are static columns that can store the latest state of the data and are available on every row of the partition. Maybe that will help.
Follow-up question: what are some sample queries? Naive answer direction: define a secondary key/index on your data using the query pattern in the following fashion (assuming your existing giant dataset already has a primary key): 1. Partition keys to distribute data for uniform activity across distributed nodes/partitions. 2. Sort keys to query that well-distributed data efficiently (since data inside a partition is stored in sort-key order).
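A toy sketch of the partition-key / sort-key idea: records are grouped by partition key, and within a partition kept sorted by the sort key, so a range query is a binary search plus a contiguous scan. The key names (dept, salary) and data are made up:

```python
import bisect
from collections import defaultdict

NUM_NODES = 4
# partition key -> sorted list of (sort_key, value)
partitions = defaultdict(list)

def node_for(pkey):
    # The partition key is hashed to pick a storage node, spreading load
    # uniformly across the cluster (illustrative only).
    return hash(pkey) % NUM_NODES

def put(dept, salary, name):
    # Insert keeping the partition sorted by the sort key (salary).
    bisect.insort(partitions[dept], (salary, name))

def salary_range(dept, lo, hi):
    # Range query: binary-search the sorted partition, then scan a slice.
    part = partitions[dept]
    i = bisect.bisect_left(part, (lo, ""))
    j = bisect.bisect_right(part, (hi, "\U0010ffff"))
    return [name for _, name in part[i:j]]

put("eng", 110000, "Alan")
put("eng", 120000, "Ada")
put("eng", 135000, "Grace")
put("sales", 90000, "Bob")

print(salary_range("eng", 115000, 140000))  # ['Ada', 'Grace']
```

This mirrors how DynamoDB (hash key + range key) and Cassandra (partition key + clustering column) lay data out, which is why "get range of salaries" is fast only if salary is the sort key for the access pattern.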
This and the ETL-job answer are the only ones that really answer the question. They want you to demonstrate an understanding of how a secondary index is built and maintained; either that, or you maintain a denormalized intermediate table optimized for your queries, like the Twitter guy suggested.
OP, are you a data engineer at Visa?
Nope, I am a full-stack engineer, but the interviewer's experience was more aligned with databases.
Index
Seems like a leetcode hard problem.