Tuesday, July 31, 2012

Duplicity as stateful rsync

Problem

Let's say you have a very large master mirror serving hundreds of gigabytes of data -- most of it in small files that don't change very often. Let's also say you have 100 hosts that want an exact copy of that data -- full or partial -- and they want it mirrored as quickly as possible so that any changes on the master are quickly available on the hosts.
What do you do?

Rsync

The obvious answer is rsync. However, you quickly realize that its stateless nature makes it extremely inefficient for this job, so it scales poorly.
The way rsync works is by comparing various file attributes between the master mirror and the host. If you have a million files that you must mirror, on each run it will check a million files. If you have a hundred hosts that want to replicate your master data every fifteen minutes, the master ends up performing 400,000,000 file checks each hour on data that has mostly not changed, reducing your master mirror to tears (or at least the administrator of that mirror).
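The arithmetic behind that figure, as a quick sketch. The numbers are the ones from the paragraph above; the variable names are just mine.

```python
# Back-of-the-envelope count of per-file checks the master performs each hour.
files_on_master = 1_000_000      # files rsync must examine on every run
hosts = 100                      # mirrors pulling from the master
runs_per_hour = 60 // 15         # one run every fifteen minutes

checks_per_hour = files_on_master * hosts * runs_per_hour
print(f"{checks_per_hour:,} file checks per hour")  # 400,000,000 file checks per hour
```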

Rsync --write-batch

You try running rsync --write-batch on the master mirror and copying the resulting batch file to your hosts instead. However, this quickly becomes a mess. If one of the mirroring hosts is down for maintenance and misses the propagation window, it will need to have multiple batches applied in order to catch up, or just use plain rsync to the mirror. If you have any number of hosts up or down, then you have to write thousands of lines of code just to keep track of which batch files can be applied and which must revert to rsync.
Moreover, if one of the hosts only wants a subset of your data, they will end up downloading a lot of batch data that they don't care about.
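To get a feel for the bookkeeping involved, here is a minimal sketch of just the catch-up decision for a single host. It assumes the master numbers its batch files sequentially and keeps only a limited window of them. The function name, the numbering scheme, and the retention window are all hypothetical; none of this is something rsync provides.

```python
def plan_catch_up(last_applied, available_batches):
    """Decide how a host that last applied batch `last_applied` catches up.

    `available_batches` is the set of batch sequence numbers still kept on
    the master (a hypothetical numbering scheme). Returns ("batches", [...])
    when an unbroken chain of batches exists, or ("rsync", []) when the host
    has to fall back to a plain rsync run.
    """
    if not available_batches:
        return ("rsync", [])
    newest = max(available_batches)
    if last_applied >= newest:
        return ("batches", [])          # already up to date, nothing to apply
    needed = list(range(last_applied + 1, newest + 1))
    if all(n in available_batches for n in needed):
        return ("batches", needed)      # replay every missed batch, in order
    return ("rsync", [])                # a gap in the chain makes batches useless


print(plan_catch_up(7, {6, 7, 8, 9}))   # ('batches', [8, 9])
print(plan_catch_up(2, {6, 7, 8, 9}))   # ('rsync', [])
```

This is the easy part. A batch file only applies cleanly to a tree that is identical to the one it was generated against, so a real implementation also has to deal with interrupted transfers, partially applied batches, and hosts that carry only a subset of the data.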

So, what do you do?

Basically, you are stuck. There is not a single open-source solution (that I am aware of) that lets you propagate file changes efficiently.
The solution I am proposing doesn't yet exist, but I believe that it's a feature that can be easily added to an existing tool that is already extremely good at keeping track of changes and creating bandwidth-efficient deltas.

What is duplicity

From the project's web page:
Duplicity backs directories by producing encrypted tar-format volumes and uploading them to a remote or local file server. Because duplicity uses librsync, the incremental archives are space efficient and only record the parts of files that have changed since the last backup. Because duplicity uses GnuPG to encrypt and/or sign these archives, they will be safe from spying and/or modification by the server. The duplicity package also includes the rdiffdir utility. Rdiffdir is an extension of librsync's rdiff to directories -- it can be used to produce signatures and deltas of directories as well as regular files. These signatures and deltas are in GNU tar format.
Let's ignore the encryption bits for the moment -- that is an optional switch and duplicity can create unencrypted volumes. After duplicity establishes a baseline of your backup, it will then create space-efficient deltas of any changes and store them in clearly designated, timestamped files -- "difftars". It will also create a full file manifest with checksums of all files to make it easy to figure out during the next run which files were changed or added.
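Because the file names carry the timestamps, a client can tell a lot from a plain directory listing. The sketch below classifies names of the form I believe duplicity writes for unencrypted backups (duplicity-full.TIME.* and duplicity-inc.START.to.END.*). Treat the exact naming pattern as an assumption and check it against your duplicity version; the classify helper itself is hypothetical.

```python
import re

# Timestamps look like 20120731T120000Z in the file names assumed here.
TS = r"\d{8}T\d{6}Z"
PATTERNS = {
    "full": re.compile(rf"^duplicity-full\.(?P<end>{TS})\.(?P<rest>.+)$"),
    "inc": re.compile(rf"^duplicity-inc\.(?P<start>{TS})\.to\.(?P<end>{TS})\.(?P<rest>.+)$"),
}


def classify(filename):
    """Return kind/role/timestamps for a manifest or volume name, else None.

    Signature files and anything else that does not match are ignored.
    """
    for kind, pattern in PATTERNS.items():
        match = pattern.match(filename)
        if match:
            rest = match.group("rest")
            role = "manifest" if rest.startswith("manifest") else "volume"
            return {
                "kind": kind,
                "role": role,
                "start": match.groupdict().get("start"),  # None for a full backup
                "end": match.group("end"),
            }
    return None


print(classify("duplicity-inc.20120731T120000Z.to.20120731T121500Z.vol1.difftar.gz"))
```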

How does that help?

Think of it in the following terms -- instead of publishing full trees of your files on the master, you publish a duplicity backup repository. In order to create the initial mirror, your hosts will effectively do a "duplicity restore", which will download all duplicity volumes from the master and do a full restore.
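For concreteness, this part already works with stock duplicity. Here is a rough sketch of the command lines involved, built as argument lists that could be handed to subprocess.run. The paths and the repository URL are placeholders, the helper names are mine, and you should check the exact options against the man page for your duplicity version.

```python
def publish_command(source_dir, repo_url, full=False):
    """Command the master would run to publish a full or incremental backup."""
    action = "full" if full else "incremental"
    # --no-encryption keeps the volumes readable without GnuPG keys.
    return ["duplicity", action, "--no-encryption", source_dir, repo_url]


def initial_mirror_command(repo_url, target_dir):
    """Command a host would run once to create its initial copy of the tree."""
    return ["duplicity", "restore", "--no-encryption", repo_url, target_dir]


# Placeholder locations, purely for illustration.
print(" ".join(publish_command("/srv/mirror", "file:///srv/duplicity-repo", full=True)))
print(" ".join(initial_mirror_command("rsync://master.example.org/duplicity-repo", "/srv/mirror")))
```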
Here's where we get to the functionality that doesn't yet exist. Once the initial data mirror is complete, your hosts will periodically connect to the master to see if new duplicity manifest files are available. If there are new manifest files, that means there is new data to download. The hosts will then use manifest files to download the difftars with change deltas, and effectively "replay" them on their local backup mirrors.
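The "is there anything new to replay?" decision could be sketched roughly like this. It assumes each incremental manifest on the master can be reduced to a (start, end) timestamp pair and that timestamps sort as strings; plan_replay is a hypothetical helper, not something duplicity offers today.

```python
def plan_replay(local_time, increments):
    """Work out which increments a host must replay to reach the newest state.

    `local_time` is the timestamp of the state the host currently holds.
    `increments` is a list of (start, end) timestamp pairs, one per
    incremental manifest on the master. Returns the ordered list of pairs to
    replay, or None if no unbroken chain leads from `local_time` to the
    newest state (for example after the master started a new full backup).
    """
    # Ignore malformed entries so the walk below always moves forward in time.
    by_start = {start: end for start, end in increments if end > start}
    chain = []
    current = local_time
    while current in by_start:
        chain.append((current, by_start[current]))
        current = by_start[current]
    newest = max((end for _, end in increments), default=local_time)
    return chain if current == newest else None


published = [
    ("20120731T120000Z", "20120731T121500Z"),
    ("20120731T121500Z", "20120731T123000Z"),
]
print(plan_replay("20120731T120000Z", published))  # both increments, in order
print(plan_replay("20120731T123000Z", published))  # [] -- already current
print(plan_replay("20120730T000000Z", published))  # None -- needs a fresh restore
```

A host that was offline for a while simply gets a longer chain back, as long as the master still keeps every link.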
Periodically, hosts can also go through the file signatures in manifest files to make sure that the local tree is still in sync with the master mirror. Since all manifests carry the mapping of files to difftars, if for some reason there are local discrepancies, hosts can download the necessary volumes from the master mirror in order to correct any errors.
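A consistency check along those lines might look like the sketch below. The index format is made up for illustration: I assume the manifest and signature data can be boiled down to a mapping from relative path to an expected SHA-1 digest plus the volume numbers that contain the file. How to extract that from real duplicity files is exactly the part that still needs writing.

```python
import hashlib
from pathlib import Path


def sha1_of(path):
    """SHA-1 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def volumes_to_refetch(tree_root, index):
    """Return the sorted volume numbers needed to repair the local tree.

    `index` maps a relative path to (expected_sha1, [volume numbers]); this
    structure is a hypothetical digest of the manifest data. A file counts
    as broken when it is missing or its checksum differs.
    """
    root = Path(tree_root)
    needed = set()
    for rel_path, (expected, volumes) in index.items():
        local = root / rel_path
        if not local.is_file() or sha1_of(local) != expected:
            needed.update(volumes)
    return sorted(needed)
```

Note that the hashing happens on each host against its own disk, so the master only serves the volumes that actually turn out to be needed.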

Nice! What else is awesome about this?

  • Encryption is built into duplicity, so if you wanted to, you could encrypt all your data.
  • Duplicity supports a slew of backends, including ftp and rsync, which are already provided by large mirrors.
  • Duplicity creates difftars in volumes (25MB in size by default, but easily configurable). If one of the hosts needs to carry only a subtree of the master mirror, they can use the manifest mapping to only download the difftars containing the changes they want.
  • If one of the hosts needs to update the mirror once a day instead of every 15 minutes, they can just replay multiple accumulated difftars.
  • If one of the hosts was down for a week, they can replay a week's worth of difftars to catch up.
  • If the master does a full rebase, the hosts can easily recognize this fact and use the manifest and signature files to figure out which volumes out of the new full backup they need to download.
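The subtree case from the list above can be sketched as follows. I am assuming that each volume covers a contiguous, sorted range of paths and that the manifest records the first and last path in each volume; the volume_ranges mapping and the directory names are invented for the example.

```python
def volumes_for_subtree(subtree, volume_ranges):
    """Pick the volumes whose path range can contain files under `subtree`.

    `volume_ranges` maps a volume number to (first_path, last_path), with
    paths compared component by component. This layout is an assumption
    about what the manifest provides.
    """
    prefix = tuple(subtree.strip("/").split("/"))
    n = len(prefix)
    wanted = []
    for number, (start, end) in sorted(volume_ranges.items()):
        first = tuple(start.split("/"))
        last = tuple(end.split("/"))
        # The volume overlaps the subtree if it ends at or after the start
        # of the subtree and begins before or inside it.
        if last >= prefix and first[:n] <= prefix:
            wanted.append(number)
    return wanted


volume_ranges = {
    1: ("centos/5/os/a.rpm", "centos/6/os/m.rpm"),
    2: ("centos/6/os/n.rpm", "fedora/17/a.rpm"),
    3: ("fedora/17/b.rpm", "fedora/17/z.rpm"),
    4: ("ubuntu/dists/a", "ubuntu/pool/z"),
}
print(volumes_for_subtree("fedora/17", volume_ranges))  # [2, 3]
```

The selection is conservative: a volume whose range merely spans the subtree gets downloaded even if it turns out to hold no changes under it, which costs some bandwidth but never misses data.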

So, what's the catch?

There are two that I'm aware of:
  1. Duplicity does not work like that yet. I believe it would be fairly easy to either extend duplicity proper to support this functionality, or write a separate tool that would build on top and pretty much "import duplicity" (yes, it is written in Python).
  2. You effectively need twice the room on the master mirror -- one for the actual trees and another for duplicity content. This is a small price to pay if the benefit is dramatically reduced disk thrashing.

What can I do to help?

I started a discussion on this topic on the duplicity-talk mailing list. Please feel free to join!
If you read that thread, you will see that I'm willing to obtain funding from the Linux Foundation to help develop this feature either as part of duplicity proper or as a separate add-on tool.
If you write this feature, all mirror admins in the world will instantly love you.

2 comments:

mx said...

(Sorry for my bad English.)
Sorry, I have not looked at the other posts, and there is something I do not quite understand:
Can duplicity make a backup remotely? I used to think it could, but I cannot get it to work :(

Example: duplicity rsync://xxxx /tmp

Or is it impossible?

Sol Jerome said...

Have you considered/tried Lsyncd (https://github.com/axkibe/lsyncd)? I haven't yet had the opportunity to actually try it out but it appears to fulfill your initial requirements.