Over the years I've been (mildly) fascinated by how various version control tools and file backup utilities work. Especially the core algorithm that drives many of these file send/diff/backup/de-duplication programs.
Rsync being the most widely used tools and the basis for many extensions, I naturally tried to wrap my head around it's working. But I thought the details were somewhat hazy. Maybe it was just me but I was looking for a simpler, clearer implementation of the algorithm and not a fully functioning program.
Recently, I gave it another shot. I waded through some of the material available on the interwebs and bravely set out to implement it to see how much of I had understood.
So, here is the basic implementation in Java. It may not be a faithful implementation of the paper but the gist is:
- Create a summary out of fixed blocks of input text (original)
- Use these blocks as reference against another text (modified)
- This modified text is slightly different from the original text
- Hence the assumption that the original text can be transformed to the modified text without having to send the entire modified text back
- The modified text can now be transformed into a combination of:
- References to those original blocks where there were no changes
- And any differences as simple text
Some notes on the implementation:
- It only handles Java Strings
- It uses a combination of Rabin-Karp rolling hash for quick, incremental hashing of blocks and CRC32 for hash conflict resolution. In reality a much more robust hash should be used instead of CRC32
- It assumes that the list of generated "blocks" is available on the other side to generate the patch. In reality there has to be a more clearly defined mechanism/protocol to exchange these blocks
- The overall algorithm to identify common/repeating hashes should be smarted than this
Until next time!