System design interviews

Can anyone give me a proper explanation on the differences between document and column databases and what their use cases might be? Additionally, when every DB is tunable, (Cassandra is leaderless AP by default but could be made CP consistency tuning, vice versa for MongoDB), how do you properly vet a database solution on a system design interview. Even on YouTube system design vids, they’ll say things like “we should use dynamodb because it’s a hosted solution managed by AWS, which we are already using since I said S3 earlier”. How do folks make decisions between different noSQL dBs in their system design interviews?

Apple re:Desu Apr 6

- “Document DB” doesn’t actually mean much in terms of system design trade-offs, it’s more of a schema-on-read vs. schema-on-write term, and isn’t super consequential - idk what “column nodal databases” means, but “wide column” vs “column-oriented” vs “columnar” is stupendously confusing terminology — I think you’re referring to cassandra’s “wide column”. I don’t think it actually necessarily supports more columns than DynamoDB but it likely supports more secondary indexes than DynamoDB… DynamoDB can only handle up to around 30 secondary indexes. Another important piece is transaction support… you might be able to get CP with cassandra, but its transactions are actually pretty sh*t. In all honesty, unless you start memorizing what does snd doesn’t support r-trees, inverted indexes, and B-trees vs LSM-trees, then all of the following are basically equivalent: - DynamoDB - Cassandra - MongoDB - ElasticSearch - Riak Pro-tip 1: basically nobody uses MongoDB or riak rn (although the CRDTs in riak are cool af) you’d have to spin it up and manage it yourself, like cassandra Pro-tip 2: you’ll look foolish and rather junior if you actually use redis for anything beyond caching. I’ll admit the CRDT support is cool and it really does support an incredible variety of features, but it can’t do linearizable transactions and I just don’t trust the durability. I trust more mature solutions like postgreSQL and kafka for all the wild extra crap it does. If you’re feeling ready to dive into that advanced indexing choices stuff, I made the following which lists what index to use when and which DB solution supports each: https://youtu.be/Qhc8gFF2qS8?si=ahNSNDIHMtIMefd6

Meta spizle Apr 6

Maybe you mean column oriented vs row oriented? I don't think "nodal" is a thing

Capital One bank$y! OP Apr 7

Sorry idk why I wrote nodal I definitely just meant to say “columnar” db, but as one of the commenters pointed out, there’s a number of different types of “column” oriented databases

Apple re:Desu Apr 7

“columnar” typically refers to a trick done in OLAP / data warehouses, where you get a lot of full table scans. It’s where you store the columns on disk in the way that you’d normally store the rows. I think parquet files for trino/presto do this, as well as BigQuery/Dremel I don’t think secondary indexes are usually added, so it’s really bad for OLTP use cases

Meta spizle Apr 7

Let's say you have a table: ID fname lname dob 1 Jon Smith Jan 1 2000 2 Bob burgers Jan 2 2002 3 Bart Simpson Jan 3 2003 Row oriented is laid out on disk exactly like the above. If your fname and lname are each, say, char(20) and ints and dates each take 4 bytes then row length might be, say, 48 bytes. to get record 2 you seek in the file to bytes offset 48, read the next 48 bytes, and that gives you the full set of data for row 2. In column oriented, you store all of the IDs and then all of the fnames and then all of the lnames etc. If you want all the data for 2. Then you have to offset first 4 bytes into the ID column then 20 bytes into the fname column then 20 bytes into the lname column then 4 bytes into the dob column. That'd going to be pretty inefficient on disk. But say you want all the fnames that start with the letter B, in a large table column oriented will be more efficient to scan since it stores that column separately.