The usual pitch is that you break your file into chunks (say 5 MB each) and check which chunks' hashes have changed. Here's the part I don't understand.

Original: ABCDE|FGHIJ|KLMNO (chunks c1, c2, c3)

Now add data to the front of the first chunk: ZABCDEFGHIJKLMNO

Are the chunks now ZABCD|EFGHI|JKLMN|O (c1, c2, c3, c4), i.e. you have to change all the chunks? Or does the first chunk just grow to 6 MB and the rest stay at 5 MB? What happens?
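To make the question concrete, here is a minimal Python sketch of the naive fixed-size scheme described above (my assumption of what's meant), with a 5-byte chunk size standing in for 5 MB and SHA-256 per-chunk hashes. With fixed-size chunking, prepending a single byte shifts every boundary, so every chunk hash changes and a new trailing chunk appears:

```python
import hashlib

CHUNK_SIZE = 5  # 5 bytes here, standing in for the 5 MB in the question

def chunk_hashes(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    # Hash each fixed-size slice of the file.
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()[:8]
        for i in range(0, len(data), size)
    ]

original = b"ABCDEFGHIJKLMNO"   # ABCDE | FGHIJ | KLMNO
modified = b"Z" + original      # ZABCD | EFGHI | JKLMN | O

print(chunk_hashes(original))   # 3 hashes
print(chunk_hashes(modified))   # 4 hashes, none matching the originals
```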
Got into Meta but can't build a Dropbox clone over the weekend? The bar is so low.
I think your example is correct. Some relevant points:

1) There's no approach that handles every case perfectly, with no "bad" cases. E.g. if you use variable-length chunks instead, your storage has to work well for variable-length chunks, your algorithm is more complex (how do you decide when to add to a chunk vs. create new chunks?), you might end up with some chunks vastly larger than others, etc.

2) The most common file modification is appending to the end of the file, which changes only one chunk with this algorithm. Inserting into the middle is also going to be a lot of writes, and this algorithm does decently there too. Writing at the beginning of a file happens sometimes, but you can't win them all.
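As a rough illustration of the variable-length alternative in point 1, here is a toy content-defined chunking sketch. A plain sliding-window byte sum stands in for the cut condition purely for illustration; real systems use something like a Rabin fingerprint or FastCDC. The point is that cut decisions depend on local content rather than absolute offsets, so inserting data at the front only disturbs the leading chunk(s):

```python
import hashlib
import random

def cdc_chunks(data: bytes, window: int = 4, mask: int = 0x0F) -> list[bytes]:
    # Cut whenever the sum of the last `window` bytes matches the mask.
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        if (sum(data[i - window + 1:i + 1]) & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def hashes(chunks: list[bytes]) -> list[str]:
    return [hashlib.sha256(c).hexdigest()[:8] for c in chunks]

random.seed(0)
original = bytes(random.randrange(256) for _ in range(200))
modified = b"ZZZ" + original   # data inserted at the front

# Only the leading chunk(s) near the insertion get new hashes; later cut
# points depend only on local content, so the remaining chunks realign
# and their hashes are unchanged (unlike fixed-size chunking).
print(hashes(cdc_chunks(original)))
print(hashes(cdc_chunks(modified)))
```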
If you keep track of each chunk and its order in a metadata store, then you can modify just one of the chunks (if it's still below the allocated size, no new chunks are needed; otherwise, break that chunk apart into two chunks and update the doubly-linked-list bookkeeping in the metadata). Linux inodes do something similar.
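A hedged sketch of that bookkeeping, with made-up names (ChunkMeta, FileIndex) and a plain Python list standing in for the linked structure: a chunk grows in place until it passes a max size, then splits, and only the records that follow the split get renumbered:

```python
from dataclasses import dataclass, field

MAX_CHUNK = 8  # bytes here, standing in for e.g. an 8 MB allocation

@dataclass
class ChunkMeta:
    seq: int      # position of this chunk within the file
    data: bytes   # stand-in for "pointer to the stored chunk"

@dataclass
class FileIndex:
    chunks: list[ChunkMeta] = field(default_factory=list)

    def insert(self, chunk_idx: int, offset: int, new_data: bytes) -> None:
        """Insert bytes into one chunk; split it if it grows past MAX_CHUNK."""
        target = self.chunks[chunk_idx]
        grown = target.data[:offset] + new_data + target.data[offset:]
        if len(grown) <= MAX_CHUNK:
            target.data = grown   # only this chunk is rewritten
            return
        # Split into two chunks and renumber the chunks that follow.
        mid = len(grown) // 2
        self.chunks[chunk_idx:chunk_idx + 1] = [
            ChunkMeta(target.seq, grown[:mid]),
            ChunkMeta(target.seq + 1, grown[mid:]),
        ]
        for later in self.chunks[chunk_idx + 2:]:
            later.seq += 1

index = FileIndex([ChunkMeta(0, b"ABCDE"), ChunkMeta(1, b"FGHIJ"), ChunkMeta(2, b"KLMNO")])
index.insert(0, 0, b"Z")     # first chunk becomes ZABCDE; the others are untouched
index.insert(0, 0, b"WXY")   # now 9 bytes > MAX_CHUNK, so the first chunk splits
print([(c.seq, c.data) for c in index.chunks])
```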
This is an interview question? Isn't this how copy-on-write filesystems like btrfs work? Only the blocks with changes get written to a new block, everything else stays the same? Or something.
The first chunk would become ZABCDE, and the rest of the chunks are not updated. If the size of the first chunk exceeds the max limit, a new chunk would be created. This is my understanding, and I'm curious about other responses.
With this logic we still end up updating the sequence numbers for every chunk that comes after the new one.