Creating a distributed object store on bare metal Kubernetes. Critique my design.

For a side project, I’m designing a bare metal, scalable object storage solution, something competitive with S3, Ceph object storage, or Google Cloud Storage. Here’s the high-level design:

* Bare metal servers running Kubernetes (Rancher).
* Server specs: Intel i9-13900K with 128GB ECC RAM. 4x 4TB NVMe SSDs plus 6x 4TB SATA SSDs = 40TB per server. 10 Gbps DAC NIC.
* A Kubernetes DaemonSet deploys a filesystem I/O pod on each node. The filesystem I/O pods inform a distributed database of the location of every file in the cluster.
* Write operation: the client establishes a connection and posts the list of files it wants to upload. The service responds with a list of signed URLs that map to different backends in the cluster, plus instructions to upload larger files in chunks (e.g. no more than 100MB). The client uploads files to multiple backends in parallel. After a successful upload, the storage service replicates the files to at least N other servers. (Rough sketch of the signed URL piece at the end of this post.)
* Read operation: the client wants to read a list of files. The service responds with a list of signed URLs corresponding to unique backend servers. The client downloads multiple files in parallel.

Of course there will be authentication and authorization for the client-server calls. What do you think? Could this design work, or is it shit? #engineering #software #swe
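
Rough sketch of what I mean by the signed URL step, assuming the coordinator and every backend I/O pod share a secret. The query parameter names, hostnames, and helper names are all made up, not a spec:

```go
// Minimal sketch of HMAC-signed, expiring URLs. Assumes the coordinator and
// every backend I/O pod share a secret; the query parameter names, hostnames,
// and helper names here are all illustrative.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"strconv"
	"time"
)

var secret = []byte("per-cluster-shared-secret") // hypothetical

// mintURL is what the coordinator would hand back for each file the client
// wants to upload or download.
func mintURL(backend, bucket, key string, ttl time.Duration) string {
	expires := time.Now().Add(ttl).Unix()
	payload := fmt.Sprintf("%s/%s/%s?expires=%d", backend, bucket, key, expires)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	sig := hex.EncodeToString(mac.Sum(nil))
	return fmt.Sprintf("https://%s/%s/%s?expires=%d&sig=%s", backend, bucket, key, expires, sig)
}

// verify is what a backend I/O pod runs before accepting the PUT/GET.
func verify(backend, bucket, key string, expires int64, sig string) bool {
	if time.Now().Unix() > expires {
		return false // link expired
	}
	payload := fmt.Sprintf("%s/%s/%s?expires=%d", backend, bucket, key, expires)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	want := hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(want), []byte(sig))
}

func main() {
	u := mintURL("node7.cluster.local", "photos", "trip/img_0001.raw", 15*time.Minute)
	fmt.Println("client uploads to:", u)

	// Backend side: parse what arrived and check the signature.
	parsed, _ := url.Parse(u)
	exp, _ := strconv.ParseInt(parsed.Query().Get("expires"), 10, 64)
	ok := verify("node7.cluster.local", "photos", "trip/img_0001.raw", exp, parsed.Query().Get("sig"))
	fmt.Println("signature valid:", ok)
}
```

The backend pod only accepts the PUT/GET if the signature checks out and the link hasn’t expired, so the coordinator never has to proxy the actual bytes.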

Workday b543 Apr 26

Out of curiosity, how is this better than a normal storage server with ZFS?

Palo Alto Networks transverse OP Apr 26

* Scalability. You’ll be able to add additional disks or nodes, and the service will know how to utilize them.
* Performance. By spreading reads and writes across multiple nodes, you’ll be able to saturate your network connection and get better throughput (rough sketch of the client-side fan-out below).
* Ability to apply some kind of MapReduce operation down the line. If I want to develop a service on top of my storage cluster that runs compute-intensive operations, it will be easier to do that if the data is already distributed.
* Availability. When creating storage buckets, you can configure how many replicas of each file will be distributed across multiple servers.
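
The performance point in practice is just the client pulling from (or pushing to) many nodes at once via the signed URLs. A minimal sketch, with hypothetical hostnames and paths:

```go
// Client-side read fan-out: pull many signed URLs in parallel so the traffic
// is spread across backend nodes. Hostnames and paths are hypothetical.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	signedURLs := []string{ // as returned by the coordinator service
		"https://node1.cluster.local/photos/a.raw?expires=...&sig=...",
		"https://node4.cluster.local/photos/b.raw?expires=...&sig=...",
		"https://node9.cluster.local/photos/c.raw?expires=...&sig=...",
	}

	var wg sync.WaitGroup
	for _, u := range signedURLs {
		wg.Add(1)
		go func(u string) { // one goroutine per object, each hitting a different node
			defer wg.Done()
			resp, err := http.Get(u)
			if err != nil {
				fmt.Println("download failed:", u, err)
				return
			}
			defer resp.Body.Close()
			n, _ := io.Copy(io.Discard, resp.Body) // a real client would write to disk
			fmt.Printf("fetched %d bytes from %s\n", n, u)
		}(u)
	}
	wg.Wait()
}
```

Same idea for uploads, just with PUTs of 100MB chunks instead of GETs.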

NetApp pitroda Apr 26

For reads, does the request have to go to the backend servers? No cache?

Palo Alto Networks transverse OP Apr 26

Why do you need a cache if everything is stored on either NVMe or SATA SSDs?

NetApp pitroda Apr 26

NVMe is costly. Would you want TBs and TBs of NVMe drives? That’s not gonna make customers happy. As for SSDs, the frequent reads and writes are gonna impact their lifetime, and since the OS reads data from disk in blocks, sequential access is gonna be faster than random access. Not a good way to use SSDs as a cache. So some logic is needed to optimise these devices for cost per GB.
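
The kind of logic I mean, very roughly: a RAM cache in front of the disks so hot objects don’t keep hitting the SSDs. Names and paths are made up, and the eviction is deliberately naive:

```go
// Very rough read-through cache: serve hot objects from RAM before touching
// the SSDs. Eviction is deliberately naive (random victim); a real tier would
// use LRU/LFU and track hit rates. All names and paths are made up.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

type objectCache struct {
	mu        sync.Mutex
	data      map[string][]byte
	used, max int64 // bytes
}

func newObjectCache(maxBytes int64) *objectCache {
	return &objectCache{data: map[string][]byte{}, max: maxBytes}
}

// get returns the object from RAM if cached, otherwise reads it from disk
// and caches it, evicting entries once over budget.
func (c *objectCache) get(key, diskRoot string) ([]byte, error) {
	c.mu.Lock()
	if b, ok := c.data[key]; ok {
		c.mu.Unlock()
		return b, nil // cache hit: zero disk I/O
	}
	c.mu.Unlock()

	b, err := os.ReadFile(filepath.Join(diskRoot, key)) // cache miss: hit the SSD
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	for c.used+int64(len(b)) > c.max && len(c.data) > 0 {
		for k, v := range c.data { // arbitrary victim, stand-in for real LRU
			delete(c.data, k)
			c.used -= int64(len(v))
			break
		}
	}
	c.data[key] = b
	c.used += int64(len(b))
	c.mu.Unlock()
	return b, nil
}

func main() {
	cache := newObjectCache(512 << 20) // ~512MB of RAM for hot objects
	if b, err := cache.get("photos/a.raw", "/var/lib/store"); err == nil {
		fmt.Println("served", len(b), "bytes")
	}
}
```

A real tier would also decide what lands on NVMe vs SATA based on access patterns, which is where the cost-per-GB optimisation comes in.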

NetApp pitroda Apr 26

This is a very high-level design, and it can work. But latency depends on many other tiny details. You’re missing many things: fault tolerance and recovery, replication, rate limiting, a management plane and counters, storing the metadata of the 100MB chunks and recovering it, etc.
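
Just the chunk metadata is a whole sub-problem on its own: for every 100MB piece you need to track something like the record below, durably, and be able to walk it when a node dies. Field names are invented for illustration:

```go
// Illustration of the chunk metadata a design like this has to keep somewhere
// durable: which chunks make up an object, where each replica lives, and a
// checksum for verifying copies during recovery. Field names are invented.
package main

import "fmt"

type ChunkMeta struct {
	ObjectKey string   // e.g. "photos/trip/img_0001.raw"
	Index     int      // position of this chunk within the object
	Size      int64    // bytes, <= 100MB per the design above
	SHA256    string   // to validate replicas and catch bit rot
	Replicas  []string // node IDs currently holding a copy
}

// underReplicated returns the chunks that need new copies after deadNode
// disappears; this is the core of the recovery path the thread is discussing.
func underReplicated(chunks []ChunkMeta, deadNode string, want int) []ChunkMeta {
	var out []ChunkMeta
	for _, c := range chunks {
		live := 0
		for _, r := range c.Replicas {
			if r != deadNode {
				live++
			}
		}
		if live < want {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	chunks := []ChunkMeta{
		{ObjectKey: "photos/a.raw", Index: 0, Size: 100 << 20, SHA256: "...", Replicas: []string{"node1", "node4", "node9"}},
		{ObjectKey: "photos/a.raw", Index: 1, Size: 42 << 20, SHA256: "...", Replicas: []string{"node1", "node4"}},
	}
	for _, c := range underReplicated(chunks, "node4", 3) {
		fmt.Printf("re-replicate %s chunk %d\n", c.ObjectKey, c.Index)
	}
}
```

That "find the under-replicated chunks and copy them again" loop is basically the recovery path you haven’t specified yet.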

Palo Alto Networks transverse OP Apr 26

Replication would be up to the end users when they create buckets. Recovery: I’ve thought a lot about that. I initially thought about using Btrfs snapshots, but then decided it would be configurable per bucket whether the user wants backups, and if so, how often. Storing the metadata of the distributed chunks and files would be its own Kubernetes service for sure. (Rough sketch of the per-bucket knobs below.)
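
Roughly the per-bucket policy I have in mind; names, limits, and defaults are made up and nothing is final:

```go
// Rough sketch of a per-bucket policy chosen at bucket-creation time:
// replication factor plus an opt-in backup schedule. All names and limits
// here are made up, nothing is final.
package main

import (
	"fmt"
	"time"
)

type BucketPolicy struct {
	Name          string
	Replicas      int           // copies of every object, spread across nodes
	BackupEnabled bool          // opt-in point-in-time backups
	BackupEvery   time.Duration // ignored unless BackupEnabled
}

// validate applies sane floors/ceilings before the metadata service accepts it.
func (p BucketPolicy) validate(clusterNodes int) error {
	if p.Replicas < 1 {
		return fmt.Errorf("bucket %q: need at least 1 replica", p.Name)
	}
	if p.Replicas > clusterNodes {
		return fmt.Errorf("bucket %q: %d replicas but only %d nodes", p.Name, p.Replicas, clusterNodes)
	}
	if p.BackupEnabled && p.BackupEvery < time.Hour {
		return fmt.Errorf("bucket %q: backup interval below 1h", p.Name)
	}
	return nil
}

func main() {
	p := BucketPolicy{Name: "photos", Replicas: 3, BackupEnabled: true, BackupEvery: 24 * time.Hour}
	if err := p.validate(5); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Printf("bucket %s: %d replicas, backups every %s\n", p.Name, p.Replicas, p.BackupEvery)
}
```

The metadata service would validate something like this at bucket creation, and whatever does the replication (a controller, say) would read it when deciding how many copies to make.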

Roblox KiLooL Apr 27

What is the problem you're trying to solve?