You're often asked: how much storage would you need per year for this service? So you go through the capacity estimate: 10M daily active users * X average actions per user * Y KB per action = Z terabytes of data per year. OK, so what? What are you supposed to do with that information? Are you supposed to use it to inform your database choice? Like, if size > X terabytes per year use Oracle/MySQL, else use a KV store, else use the file system, etc.? If so, what are some rough rules governing that? Or is the capacity estimate just an isolated part of system design, and you don't necessarily need that information to dictate other parts of your design?
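For concreteness, the arithmetic in the question can be sketched like this (X and Y are placeholders in the original, so the values below are purely hypothetical):

```python
# Back-of-envelope storage estimate. All inputs are assumed/hypothetical,
# standing in for the X and Y placeholders in the question.
DAU = 10_000_000          # daily active users
actions_per_user = 5      # X: average stored actions per user per day (assumed)
item_size_kb = 2          # Y: size of each stored item in KB (assumed)

daily_kb = DAU * actions_per_user * item_size_kb      # KB written per day
yearly_tb = daily_kb * 365 / 1024**3                  # KB -> TB over a year

print(f"{yearly_tb:.1f} TB/year")  # ~34.0 TB/year with these inputs
```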
COGS estimate
😼
Just make up some calculations and then say that since the volume is large, we need to scale horizontally.
Lol This ^ 😂
Why are these storage estimation questions asked to PMs?
There are various dimensions to a capacity estimate. The storage dimension can help you decide whether a single database server can handle the required storage. For example, if only 4 TB disks can be attached, then you need a partitioned DB. Similarly, estimates for throughput and RPS are all necessary for scaling out a system.
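The 4 TB-disk example above reduces to a one-liner: divide the estimated total by the per-node disk limit to get a minimum partition count. A minimal sketch, with assumed numbers:

```python
import math

# Hypothetical inputs: total comes from the capacity estimate,
# the per-node limit from whatever disks the db servers can attach.
total_storage_tb = 34      # estimated total storage (assumed)
max_disk_per_node_tb = 4   # largest disk attachable to one db server (assumed)

# Minimum number of partitions/shards needed to hold the data
partitions = math.ceil(total_storage_tb / max_disk_per_node_tb)
print(partitions)  # 9 -> a single node won't fit it; shard across >= 9 nodes
```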
Thanks. What are some rules of thumb for storage and RPS in terms of how they govern your design choices?
It depends. For RPS it generally boils down to how many CPUs your box has and whether it's throttling the CPU; roughly 50k RPS on a 16-core box is achievable. Throughput depends on your network bandwidth: go beyond it and you'll hit TCP congestion control if you use TCP, or lose packets if you use UDP. For storage, it again depends. If you're using a distributed storage engine (DynamoDB, Cassandra), do you have a good distribution of your keys? If you do, then the storage added per node per day is your throughput/s/node * 86400. If you have a TTL, you can recycle some storage. If you have an uneven traffic pattern, then your design is wrong and you need a new schema.
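The throughput/s/node * 86400 rule above can be sketched numerically. All the inputs are assumptions for illustration (the reply doesn't fix any of them):

```python
# Storage accrued per node per day, assuming an even key distribution
# across the cluster. Every number here is hypothetical.
writes_per_sec_cluster = 50_000   # cluster-wide write RPS (assumed)
nodes = 10                        # nodes in the ring (assumed)
avg_item_kb = 2                   # average item size in KB (assumed)
SECONDS_PER_DAY = 86_400

# throughput/s/node * item size * seconds per day = KB added per node per day
per_node_kb_day = writes_per_sec_cluster / nodes * avg_item_kb * SECONDS_PER_DAY
per_node_gb_day = per_node_kb_day / 1024**2   # KB -> GB

print(f"{per_node_gb_day:.1f} GB/node/day")  # ~824.0 GB/node/day here
```

With a TTL of N days, steady-state storage per node is roughly this daily figure times N, since expired data gets recycled.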