From: Marcel (Felix) Giannelia
Subject: Re: [rdiff-backup-users] atomic increment files?
Date: Wed, 11 Mar 2009 03:23:41 -0700
User-agent: Thunderbird 2.0.0.16 (X11/20080726)
An interesting thing about the output tarballs from my script: if I rdiff two of them, one of them plus the patch file is significantly smaller than the two of them together (presumably because diffs on different days are nonetheless similar).* This is probably very dependent on what kind of data is being backed up, but it may lead to a way to make increment storage even more efficient (but also more fragile, since a restore would take two levels of merging). It's also very possible that this is a clear indication that I've done something very wrong in my script that's causing duplicate data in what are supposed to be separate increments. Further testing is required ;)

I've been doing some more experimenting, and I've found a partial explanation for this. Mirror metadata files are huge! In my case each one is 88MB (uncompressed), though they're only 6MB when gzipped. There are only minor differences between consecutive ones (a diff patch from one to the next is on the order of 61KB uncompressed, 11KB compressed; an rdiff patch is considerably larger, since the files are plain text), so in my example above that explains some of the saved space. File statistics files probably don't change much either, but they too don't account for much when compressed (5MB apiece).

* Example from my test set: a collected increment from 2008-10-04 is 49MB, and the one from 2008-10-05 is also 49MB (total 98MB). An rdiff delta file to turn 2008-10-04 into 2008-10-05 is only 18MB, so 2008-10-04 plus the delta file is 67MB. Another delta to turn 2008-10-05 into 2008-10-06 is also only 18MB, so the three of them together are 85MB instead of 147MB. Again, this is probably highly dependent on the kind of data that's in these increments, but I'm surprised it works as well as it does given that I'm tarring some already-gzipped files together.
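To make the footnote's arithmetic easy to re-run with other numbers, here is a small sketch. The function names are made up for illustration, and it assumes (as in the example above) that every increment has the same size and every delta has the same size.

```python
def chained_size_mb(full_mb, delta_mb, n_increments):
    """One full increment plus an rdiff delta for each later one."""
    return full_mb + delta_mb * (n_increments - 1)


def separate_size_mb(full_mb, n_increments):
    """Every increment stored in full."""
    return full_mb * n_increments


# Figures from the test set: 49MB increments, 18MB deltas.
for n in (2, 3):
    chained = chained_size_mb(49, 18, n)
    separate = separate_size_mb(49, n)
    print(f"{n} increments: {chained}MB chained vs {separate}MB separate")
```

The catch is the one already noted: a chain of n increments needs n - 1 patch steps before the last one can be read.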
In round 2 of testing, I tried uncompressing all of the files in an increment, and then re-storing that as a tar, then generating the rdiff delta, and then recompressing everything. This yielded a very slight advantage in compression, but a significant one in rdiff'ing -- the rdiff deltas are down from 18MB to only 7.4MB (and as I said, some of that can be explained away by similarities in mirror metadata files).
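For anyone who wants to repeat the round-2 procedure, a rough sketch follows. It is not the script that produced the numbers above. The helper names and the directory layout are assumptions, and it assumes an increment has already been collected into one directory whose gzipped members end in ".gz". The rdiff step is only built as a command list (`rdiff signature` and `rdiff delta` are the standard librsync subcommands), so nothing is executed unless you pass the lists to subprocess yourself.

```python
import gzip
import io
import tarfile
from pathlib import Path


def repack_uncompressed(increment_dir, tar_path):
    """Write every file under increment_dir into an uncompressed tar,
    gunzipping *.gz members first so rdiff sees the raw bytes."""
    increment_dir = Path(increment_dir)
    with tarfile.open(tar_path, "w") as tar:  # mode "w": no compression
        # Sorted order keeps member layout stable from one day to the next.
        for path in sorted(increment_dir.rglob("*")):
            if not path.is_file():
                continue
            data = path.read_bytes()
            name = path.relative_to(increment_dir).as_posix()
            if path.suffix == ".gz":
                data = gzip.decompress(data)
                name = name[:-3]  # drop the ".gz" suffix
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def rdiff_commands(old_tar, new_tar, sig_path, delta_path):
    """Command lines to produce a delta that turns old_tar into new_tar."""
    return [
        ["rdiff", "signature", str(old_tar), str(sig_path)],
        ["rdiff", "delta", str(sig_path), str(new_tar), str(delta_path)],
    ]
```

Each list can be run with `subprocess.run(cmd, check=True)`, after which the tar and the delta can be gzipped again for storage.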
That leaves, of a 49MB increment: 7.4MB of data that's different + 5MB of nearly-identical file statistics + 6.1MB of nearly-identical mirror metadata + another 30.5MB of data that must be identical between the two increments.
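The same breakdown as a quick check, using only the figures quoted above (the variable names are just labels for this sketch):

```python
increment_mb = 49.0

# Parts of the increment that are already accounted for, in MB.
explained_mb = {
    "changed data (rdiff delta)": 7.4,
    "file statistics": 5.0,
    "mirror metadata": 6.1,
}

# Whatever is left must be identical between the two increments.
unexplained_mb = round(increment_mb - sum(explained_mb.values()), 1)
share = unexplained_mb / increment_mb

print(f"unexplained: {unexplained_mb}MB ({share:.0%} of the increment)")
```

So well over half of each increment is data that rdiff finds unchanged from the previous day.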
This leads me to suspect that rdiff-backup is storing snapshots of things that it shouldn't. Even if rdiff-backup routinely stores snapshots every 10 times a file changes (as was mentioned earlier), I find it unlikely that this would coincidentally happen to enough files on 7 consecutive backup runs (I've run this experiment on 7 adjacent pairs of increments and get similar numbers for all of them) to get the kind of numbers I'm getting.
Another possibility is that these overlaps can be explained as file moves. Currently I think rdiff-backup cannot detect a file move, and stores it as a deletion plus a new file; correct? If so, then perhaps what's happening here is that part of the backup data set includes daily-rotated logfiles. Rdiff can detect the identical blocks, because when I'm using it on tarballs of the entire increment, all of the data is in one file. Supposing the rotating logs keep 10 files, then rdiff-backup is seeing 10 files change so drastically that it's cheaper to store snapshots, but rdiff sees 10 large blocks of identical data that just happen to have moved down by a unit or two in the tarball.
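The log-rotation theory can be illustrated with a toy model. This is a sketch under loud assumptions: the file names (app.log.N), sizes, and 64-byte block size are invented, the "tarball" is a plain concatenation with no tar headers, and the matcher is a naive stand-in for what rdiff does (it hashes a window at every byte offset instead of using a rolling checksum). It is not rdiff-backup's or librsync's actual code.

```python
import hashlib
import random

BLOCK = 64  # toy block size in bytes


def block_index(old):
    """Hashes of the block-aligned blocks of the old data."""
    return {hashlib.md5(old[i:i + BLOCK]).digest()
            for i in range(0, len(old) - BLOCK + 1, BLOCK)}


def matched_bytes(old, new):
    """Bytes of new covered by blocks that also occur somewhere in old."""
    index = block_index(old)
    matched = 0
    pos = 0
    while pos + BLOCK <= len(new):
        if hashlib.md5(new[pos:pos + BLOCK]).digest() in index:
            matched += BLOCK
            pos += BLOCK  # skip past the matched block
        else:
            pos += 1      # slide the window by one byte
    return matched


def make_log(rng, size=1024):
    """Random bytes standing in for one day's log."""
    return bytes(rng.randrange(256) for _ in range(size))


rng = random.Random(0)
logs = [make_log(rng) for _ in range(11)]

# Day 1 holds logs[1..10]; on day 2 a new log arrives, every file shifts
# down one name, and the oldest log is dropped.
day1 = {f"app.log.{i}": logs[i + 1] for i in range(10)}
day2 = {f"app.log.{i}": logs[i] for i in range(10)}

# Per-file view: compare each name with the same name the day before.
changed_by_name = sum(day1[name] != day2[name] for name in day1)

# Single-file view: concatenate each day's files and look for shared blocks.
blob1 = b"".join(day1[name] for name in sorted(day1))
blob2 = b"".join(day2[name] for name in sorted(day2))
shared = matched_bytes(blob1, blob2)

print(f"{changed_by_name} of {len(day1)} files differ when compared by name")
print(f"{shared} of {len(blob2)} bytes match when compared as one blob")
```

Compared by name, every file looks completely new; compared as one blob, everything except the newest log is found again at a shifted offset, which is the effect described above.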
So, perhaps my harebrained original suggestion of storing increments as single files has led to a relatively easy way to implement file move detection? (I'll be the first to point out, though, that since it requires tarballs to work from, it's not particularly efficient to create even if it is efficient to store once it's done. There might be a better way of implementing this same idea, though.)
~Felix.