Tech Industry May 18, 2019
Visa Hdjrjfn

How to optimize NoSQL database queries?

Really heavy dataset; how do I fetch queries efficiently? Assume 1 trillion JSON records. Are there any NoSQL database optimization engines available on the market? I haven't worked with NoSQL database optimization engines, so any pointers would be helpful. Let's say the current query takes 10 sec; how do I improve the performance to 1 sec?

Scotiabank 00x00 May 18, 2019

Seems like a leetcode hard problem.

Google SonJnow May 18, 2019

Schema? Is the query based on key? Based on value? It's a vague question.

Visa Hdjrjfn OP May 18, 2019

Yeah, it's a vague question. That's all the input I got in the interview. I wasn't sure what else I could ask him, as I haven't worked on this stuff before.

Visa Hdjrjfn OP May 18, 2019

It's a NoSQL database. Every record is in JSON format.

Twitter Oomnj May 18, 2019

Yeah, what's your query? How's your data partitioned? What are the writes to your table? Is it a stationary dataset? How big is each record? What information are you retrieving from each query: the entire record or just an ID? What is the structure of your JSON records? How much memory does your NoSQL store have? Is your table's key space partitioned by date? Do you do aggregates at write time? I think they were just looking for you to ask the right questions.

Visa Hdjrjfn OP May 18, 2019

Thanks, makes sense. He asked me to create a DB optimization engine on top of the database. He said we cannot use MongoDB.

Twitter Oomnj May 18, 2019

Feel free to DM if you wanna talk more

Google Methane May 18, 2019

Write an ETL job to make the data fit a few relational DBs. Create keys and add an index. They could be testing whether you'd say something stupid like "use MongoDB" without understanding the tradeoffs. Looks like they were specifically asking about NoSQL, though.
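A minimal sketch of that ETL idea, assuming newline-delimited JSON and SQLite as the relational target. The employees schema and field names (id, name, salary) are guesses based on the queries mentioned later in this thread, not anything from the interview:

```python
# Hypothetical ETL sketch: flatten JSON records into a relational table
# and index the column the queries filter on. Schema and field names
# are assumptions, not from the original post.
import json
import sqlite3

conn = sqlite3.connect("records.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS employees (id TEXT PRIMARY KEY, name TEXT, salary INTEGER)"
)

def load(path):
    """Bulk-load newline-delimited JSON into the relational table."""
    with open(path) as f:
        rows = (
            (r["id"], r["name"], r["salary"])
            for r in (json.loads(line) for line in f)
        )
        conn.executemany("INSERT OR REPLACE INTO employees VALUES (?, ?, ?)", rows)
    conn.commit()

# With an index, a salary range query walks the B-tree instead of
# scanning every row.
conn.execute("CREATE INDEX IF NOT EXISTS idx_salary ON employees(salary)")
```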

Twitter Oomnj May 18, 2019

Or, if cost isn't an issue, use AWS DocumentDB if each record is large, or AWS DynamoDB if each record is less than 4 KB. Both have SLAs for single-digit millisecond latency.
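For what it's worth, the DynamoDB access pattern behind that latency is strictly key-based. A hedged boto3 sketch, where the table name and the pk/sk key schema are invented for illustration:

```python
# Hypothetical DynamoDB sketch (boto3). Assumes a table "employees"
# with partition key "pk" and sort key "sk" -- names made up here.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("employees")

# Single-item get by full primary key: the access pattern behind the
# single-digit-millisecond latency claim.
item = table.get_item(Key={"pk": "dept#sales", "sk": "emp#12345"}).get("Item")

# Query within one partition, still index-backed (no table scan).
resp = table.query(
    KeyConditionExpression=Key("pk").eq("dept#sales") & Key("sk").begins_with("emp#")
)
records = resp["Items"]
```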

Salesforce Full-Send May 18, 2019

Was thinking same thing

LinkedIn nx of pc q May 18, 2019

What is the query you are trying to optimize? Single gets? Range queries? Joins?

Visa Hdjrjfn OP May 18, 2019

It's a mix of simple and complex queries: as simple as "get names from all records," or as complex as "get a range of salaries."

Twitter Oomnj May 18, 2019

If you're getting names of all records, the issue isn't the query time; it's probably the data transfer time. Also, you should probably be paginating such queries.
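Roughly what that pagination looks like if the data lands in the relational shape sketched earlier. Keyset pagination (resume after the last key seen) stays cheap at any depth, unlike OFFSET; the employees table here is the same hypothetical one from the ETL sketch:

```python
# Keyset-pagination sketch over the hypothetical employees table from
# the ETL example above; the client streams pages instead of pulling
# every name in one response.
import sqlite3

conn = sqlite3.connect("records.db")

def page_of_names(after_id="", limit=1000):
    # WHERE id > cursor is an index seek; OFFSET would rescan
    # everything it skips.
    return conn.execute(
        "SELECT id, name FROM employees WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()

cursor = ""
while True:
    page = page_of_names(cursor)
    if not page:
        break
    cursor = page[-1][0]  # resume after the last id in this page
```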

Barclays PLC FAANGHNTER May 18, 2019

You can't run joins for sure, however useful they would be. For a DB like Cassandra, there are static columns that can help you store the latest state of the data, and they are available across all rows of the partition. Maybe that will help.
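For reference, a static column in Cassandra is stored once per partition and visible from every row in it. A sketch with the DataStax Python driver; the keyspace, table, and column names are made up:

```python
# Hypothetical Cassandra sketch (cassandra-driver). The "demo"
# keyspace and all table/column names are invented for illustration.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")

# `latest_snapshot` is STATIC: stored once per partition (per dept)
# and visible on every row of that partition -- a cheap place to keep
# the "latest state" mentioned above.
session.execute("""
    CREATE TABLE IF NOT EXISTS employees_by_dept (
        dept text,
        emp_id text,
        latest_snapshot text STATIC,
        name text,
        salary int,
        PRIMARY KEY (dept, emp_id)
    )
""")
```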

Microsoft UMbR31 May 18, 2019

Follow-up question: what are some sample queries? Naive answer direction: define a secondary key/index on your data using the query pattern, in the following fashion (assuming your existing giant dataset will already have a primary key): 1. Partition keys to distribute the data for uniform activity across distributed nodes/partitions. 2. Sort keys to query that well-distributed data efficiently (since data inside a partition is stored in sort-key order).
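A toy in-memory version of that two-level scheme, just to make the mechanics concrete (needs Python 3.10+ for bisect's key= argument). Everything here is illustrative, not a real distributed store:

```python
# Toy sketch: hash the partition key to pick a node, keep each
# partition sorted by the sort key so range queries become binary
# searches over a slice.
import bisect
import hashlib

NUM_PARTITIONS = 16
partitions = [[] for _ in range(NUM_PARTITIONS)]  # each list sorted by sort key

def node_for(partition_key):
    # Hashing spreads partitions uniformly across nodes (point 1).
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def put(partition_key, sort_key, value):
    part = partitions[node_for(partition_key)]
    bisect.insort(part, (sort_key, value), key=lambda p: p[0])  # Python 3.10+

def range_query(partition_key, lo, hi):
    # Sort-key order inside the partition makes ranges a slice (point 2).
    part = partitions[node_for(partition_key)]
    start = bisect.bisect_left(part, lo, key=lambda p: p[0])
    stop = bisect.bisect_right(part, hi, key=lambda p: p[0])
    return [v for _, v in part[start:stop]]

put("dept#sales", 85000, "Ana")
put("dept#sales", 92000, "Bo")
print(range_query("dept#sales", 80000, 90000))  # -> ['Ana']
```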

Marqeta GskP9jqS May 18, 2019

This and the ETL-job answer are the only ones that really answer the question. They want you to demonstrate an understanding of how a secondary index is built and maintained; either that, or you have a denormalized intermediate table optimized for your queries, like the Twitter guy suggested.
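To make the "built and maintained" part concrete, a minimal in-memory sketch: every write to the primary store also updates the index, which is exactly the consistency burden a real engine carries. Field names are assumptions:

```python
# Minimal secondary-index sketch: the primary store maps id -> record,
# and each write keeps a name -> ids index in sync so lookups by name
# avoid a full scan.
primary = {}   # id -> record (the JSON document)
by_name = {}   # name -> set of ids (the secondary index)

def put(rec_id, record):
    old = primary.get(rec_id)
    if old is not None:
        by_name[old["name"]].discard(rec_id)  # unindex the stale value
    primary[rec_id] = record
    by_name.setdefault(record["name"], set()).add(rec_id)

def find_by_name(name):
    return [primary[i] for i in by_name.get(name, ())]

put("e1", {"name": "Ana", "salary": 85000})
put("e2", {"name": "Bo", "salary": 92000})
print(find_by_name("Ana"))  # -> [{'name': 'Ana', 'salary': 85000}]
```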

Microsoft vxefcdrvsw May 18, 2019

OP, are you a data engineer at Visa?

Visa Hdjrjfn OP May 18, 2019

Nope, I'm a full-stack engineer. But the interviewer's experience was more aligned towards databases.

Nvidia samelove May 18, 2019

Just use notepad++

Visa Hdjrjfn OP May 18, 2019

🧐🤔

Amazon kudobear May 18, 2019

Index