I was reading Grokking for Dropbox, and it mentions that a file is broken into 4 MB chunks and the SHA-256 of each chunk is calculated to check whether that chunk is already present on the server, for deduplication. If two different chunks can have the same SHA-256, how is this reliable?
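For reference, the scheme being asked about can be sketched roughly as below. This is only an illustration of the idea, not Dropbox's actual implementation; `split_chunks`, `chunks_to_upload`, and the `known_digests` set standing in for the server's index are all hypothetical names.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, as described in the question


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


def chunks_to_upload(data: bytes, known_digests: set, size: int = CHUNK_SIZE):
    """Return (digest, chunk) pairs whose digest has not been seen before.

    known_digests is a hypothetical stand-in for the server-side index.
    """
    missing = []
    for chunk in split_chunks(data, size):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in known_digests:
            known_digests.add(digest)
            missing.append((digest, chunk))
    return missing
```

Note that this version trusts the digest alone: two different chunks with the same SHA-256 would be treated as one, which is exactly the concern raised here.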
Theoretically there can be more than one preimage for a single SHA-256 hash, but in practice finding one is almost impossible. That's why we also frequently use SHA-256 digests as checksums to verify data integrity.
Since SHA-256 has only 2^256 possible outputs and the input space is unbounded, collisions must exist. So you mean a collision is still theoretically possible, but it's so unlikely that we ignore it?
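To put a rough number on "highly unlikely": assuming SHA-256 behaves like a uniformly random function and nobody is deliberately crafting collisions, the birthday (union) bound says that among n distinct chunks the chance of any collision is at most n(n-1)/2 divided by 2^256. A small sketch, where the function name is just an illustration:

```python
from fractions import Fraction


def collision_upper_bound(n_chunks: int, bits: int = 256) -> Fraction:
    """Birthday (union) bound: P(any collision) <= n(n-1)/2 / 2^bits.

    Assumes the hash output is uniformly random over 2^bits values.
    """
    return Fraction(n_chunks * (n_chunks - 1), 2 * 2**bits)


# Example: 10**18 stored chunks, far more than any real system holds.
print(float(collision_upper_bound(10**18)))
```

Even for 10^18 chunks the bound comes out below 1e-40, so under these assumptions an accidental collision is far less likely than hardware or software corrupting the data.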
Even if you have a collision, you can get around it with some chunk-id, which can be chosen uniquely for each chunk in a file... A better approach may be to make chunks immutable and compare chunk-ids...
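One possible reading of the chunk-id idea, as a minimal sketch: treat the SHA-256 digest only as a lookup key, confirm a match by comparing the actual bytes, and hand out a separate unique chunk-id. The `ChunkStore` class and its method names are hypothetical, and the byte comparison assumes the store has the candidate chunk's bytes available, which costs the upload that deduplication was meant to save.

```python
import hashlib
import itertools


class ChunkStore:
    """In-memory sketch: digest is an index, chunk-id is the identity."""

    def __init__(self):
        self._by_digest = {}            # digest -> list of chunk-ids
        self._chunks = {}               # chunk-id -> immutable bytes
        self._ids = itertools.count(1)  # simple unique id generator

    def put(self, chunk: bytes) -> int:
        """Store a chunk and return its chunk-id, reusing an existing id
        only when the stored bytes are identical."""
        digest = hashlib.sha256(chunk).digest()
        for cid in self._by_digest.get(digest, []):
            if self._chunks[cid] == chunk:   # byte-for-byte confirmation
                return cid
        cid = next(self._ids)
        self._chunks[cid] = bytes(chunk)     # copy so the stored chunk is immutable
        self._by_digest.setdefault(digest, []).append(cid)
        return cid

    def get(self, cid: int) -> bytes:
        return self._chunks[cid]
```

With this layout a digest collision would simply produce two chunk-ids under the same digest instead of silently merging different data.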