Let's say we have a users table with a few hundred million users. How do you store this table across multiple servers? Do the DBs support it natively, or is it mostly application-side logic? I know it's easier in MongoDB, but just wondering how this is done in relational DBs.
If you vertically scale it, it might all work on one server. MySQL, I believe, has some auto-sharding options (e.g. MySQL NDB Cluster). For Postgres you'd probably need application-side sharding logic based on hashes or ranges.
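A minimal sketch of what that application-side hash sharding could look like. The connection strings and shard count are made up for illustration; the point is that the router is deterministic, so every part of the application agrees on which shard owns a given user.

```python
import hashlib

# Hypothetical shard connection strings; in practice these would come
# from configuration or a service-discovery layer.
SHARDS = [
    "postgres://db0.internal/users",
    "postgres://db1.internal/users",
    "postgres://db2.internal/users",
    "postgres://db3.internal/users",
]

def shard_for(user_id: int) -> str:
    # Hash the key first so sequential ids spread evenly across shards
    # instead of all recent signups landing on the last shard.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

The application (or a thin client library) calls `shard_for(user_id)` before every query to pick the connection. The big caveat with plain modulo hashing is resharding: changing `len(SHARDS)` remaps almost every key, which is why range sharding or consistent/rendezvous hashing gets used instead when shards are expected to be added over time.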
At some point vertical scaling hits its limit, right? I wonder what those limits are. Can you store billions of rows with terabytes or petabytes of data in one DB?
Right, but a hundred million users is probably OK. Otherwise you need to shard.
How does it work in mongo?
I'm not an expert, but it's a built-in configuration. Probably a hash ring sharded on the `_id` value.
Why do you want an RDBMS? They won't scale unless you go with some of the commercial newer entries. Use a data lake.
Check out storage-disaggregated DBs like Amazon Aurora if you want to stick with an RDBMS. They should perform well for your use case.
The first thing is to determine whether your queries are stable. If you're working in a very new domain, chances are they're not. If your queries are not stable, then you first shard the database and then shoot yourself.

The second step is to differentiate between queries that are transactions and queries that are analytics (aggregations over many rows and tables). You can forget all the hard work you've done for analytics at this step; you'll need to rebuild the whole system for analytics queries after sharding.

Now, for all the queries that serve online users, consider which rows/columns are never going to be part of the same transaction. If there are two cells in the same row that will never be part of the same transaction, split the table in two. If there are two rows that will never be part of the same transaction, define the shard key based on the row differentiator. You'll end up with islands of data that are strongly connected internally but not connected to other islands.

You define an assignment and routing policy for these islands and put them on different RDBMS instances. If there are some loose threads between two different clusters of data, you may need to take the eventual-consistency scissors to them to make them proper islands. Define a thin routing-aware client library so that the complexity of routing doesn't go beyond a few lines of code. Depending on the policy, you may need a routing service as well, although in many cases consistent/rendezvous hashing can serve as a static configuration.

Note, these steps are not free in the MongoDB world either.
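The "static configuration" routing mentioned above can be sketched with rendezvous (highest-random-weight) hashing. The node names here are hypothetical; the property worth noticing is that removing a node only remaps the keys that node owned, which plain modulo hashing does not give you.

```python
import hashlib

NODES = ["shard-a", "shard-b", "shard-c"]

def rendezvous_pick(key: str, nodes=NODES) -> str:
    # Score every node by hashing (key, node) together; the highest
    # score wins. Every client computes the same answer from the same
    # static node list, so no routing service is needed.
    def score(node: str) -> int:
        return int(hashlib.sha256(f"{key}:{node}".encode()).hexdigest(), 16)
    return max(nodes, key=score)
```

If `shard-b` is decommissioned, only keys that hashed to `shard-b` get reassigned; everything else keeps its old home, so the data migration is proportional to the removed node's share.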
thanks
TL;DR: don't shard until you absolutely have to. Yes, horizontal scaling has a higher ceiling than vertical scaling, but most transactional workloads aren't truly limitless. You're not going to go from 1M users to 1B users in a week.
I wonder how systems that require high consistency, such as those handling financial transactions, work. Most of them use relational DBs.
One of the worst lies that RDBMS books tell people is that the total amount of money across all accounts needs to be invariant. No, it doesn't. Bank transfers are inherently eventually consistent. Take any two bank accounts, send money from one to the other, and check for yourself. People misunderstand eventual consistency by thinking there are times when the system goes through incorrect states. This is wrong. If I send money from account A to B, it's eventually consistent in the following order:

1. Atomically, `A -= amount` and a `TransactionIntent(A, B, amount)` gets created.
2. The TransactionIntent eventually applies, and B gets credited with the amount.
3. The intent executor has replay protection, ensuring a TransactionIntent never gets applied twice.

Eventual consistency is meant for people. The system must always be strongly consistent, no matter what your technology of choice is. You can do this in MongoDB too.
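A minimal sketch of that intent pattern, using `sqlite3` as a stand-in for the source bank's database (in reality steps 1 and 2 would typically hit different systems; table and column names here are made up):

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER);
    CREATE TABLE transfer_intents (
        intent_id TEXT PRIMARY KEY,
        src TEXT, dst TEXT, amount INTEGER,
        applied INTEGER DEFAULT 0  -- flipped exactly once: replay protection
    );
    INSERT INTO accounts VALUES ('A', 100), ('B', 0);
""")

def start_transfer(src: str, dst: str, amount: int) -> str:
    intent_id = str(uuid.uuid4())
    with db:  # step 1: debit the source AND record the intent, atomically
        cur = db.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?", (amount, src, amount))
        if cur.rowcount == 0:
            raise ValueError("insufficient funds")
        db.execute("INSERT INTO transfer_intents VALUES (?, ?, ?, ?, 0)",
                   (intent_id, src, dst, amount))
    return intent_id

def apply_intent(intent_id: str) -> None:
    with db:  # step 2: credit the destination, at most once
        cur = db.execute(
            "UPDATE transfer_intents SET applied = 1 "
            "WHERE intent_id = ? AND applied = 0", (intent_id,))
        if cur.rowcount == 0:
            return  # already applied; a replayed delivery is a no-op
        dst, amount = db.execute(
            "SELECT dst, amount FROM transfer_intents "
            "WHERE intent_id = ?", (intent_id,)).fetchone()
        db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                   (amount, dst))
```

Between `start_transfer` and `apply_intent`, the money is in neither balance, but the system is never in an *incorrect* state: the debited amount is accounted for by the pending intent, and applying the intent twice changes nothing.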
The important part here, if anyone missed it, is the A in ACID giving you both 1) deduct from the source account and 2) create the TransactionIntent, atomically. If either fails, the transaction fails entirely. Banks are distributed systems by nature: no matter what DB you choose, you won't atomically deduct from Chase and add those funds to Wells Fargo.
Shard it. It falls on the application side to decide which DB cluster to hit, though usually that's abstracted away from the application layer into its own service.
Thanks. So does that mean joins need to be handled on the application side as well, or do you need to shard all the relevant data on the same shard key?
Usually the same shard key across all the tables where that user id is stored. Sometimes people employ different shard keys for different tables for the same user (very inefficient).
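A small sketch of why the shared shard key keeps joins cheap. Table names and SQL are illustrative: because `users` and `orders` both shard on `user_id`, a user's row and all of their orders land on the same shard, so the join runs locally on one instance instead of being stitched together in the application.

```python
import hashlib

NUM_SHARDS = 4

def shard_of(user_id: int) -> int:
    # The single routing function used for EVERY table keyed by user_id;
    # that is what "same shard key across all the tables" buys you.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def user_with_orders_query(user_id: int):
    # Both sides of the join live on the same shard, so this is an
    # ordinary single-node JOIN, not a cross-shard operation.
    shard = shard_of(user_id)
    sql = ("SELECT u.*, o.* FROM users u "
           "JOIN orders o ON o.user_id = u.id WHERE u.id = ?")
    return shard, sql  # run sql against the connection for this shard
```

If `orders` were sharded on a different key (say, `order_id`), the same query would fan out to every shard and the join would move into the application, which is the inefficiency the comment above is pointing at.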