[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [rdiff-backup-users] Interesting write-up of 'compare-by-hash'.
From: |
Greg Freemyer |
Subject: |
Re: [rdiff-backup-users] Interesting write-up of 'compare-by-hash'. |
Date: |
Wed, 28 Jan 2004 14:08:05 -0500 |
On Tue, 2004-01-27 at 21:33, Ben Escoto wrote:
> >>>>> Greg Freemyer <address@hidden>
> >>>>> wrote the following on Fri, 02 Jan 2004 14:39:18 -0500
>
> > The writeup at:
> >
> > http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/hash.html
> >
> > in effect says that 'compare-by-hash' is not the perfect choice for
> > a rdiff-backup type of program. Instead, it says some kind of state
> > information should be added to the algorithym to ensure accuracy.
> >
> > I don't know that I am concerned, but I am curious if rdiff-backup
> > takes any precautions against hash-collisions.
>
> No, rdiff-backup takes no special precautions. As far as I can tell,
> the paper makes the point that even cryptographically strong hashes
> may be breakable in the future. This is a pretty good objection,
> especially since librsync uses md5 instead of the stronger SHA1.
>
> But in rdiff-backup the most an attacker could do would be to produce
> a file that had the same signature, causing rdiff-backup to skip over
> parts of the file. But an attacker can already do much better than
> that: just use utime(), chmod, and such to make it look like your file
> hasn't changed. Plus this method doesn't take years of research.
>
> So anyway, rdiff-backup isn't intended to defeat an attacker trying to
> make it look like his files haven't changed.
>
My impression was the write-up was talking about an inadvertant
hash-collision, not hackers.
Since the hashing process is lossy (ie. non-reversable), then it is
possible that two totally different data sets could generate the same
hash, and in turn invalidate the backup checksum check.
I don't know what the odds are of this happening with rdiff-backup.
I assume that they are exceedingly small, but not zero.
>
> I did find this quote from the article interesting:
>
> > BitKeeper keeps state about each file under source control and knows
> > what changes have been made since the last time each tree was
> > synchronized. When synchronizing, it sends only the differences
> > since the last synchronization occurred, in compressed form. In
> > comparison to rsync, BitKeeper provides similar and sometimes better
> > bandwidth usage when simply synchronizing two trees without
> > resorting to compare-by-hash.
>
> Does anyone know how BitKeeper can send only the differences without
> using a compare-by-hash algorithm? I don't see how this is possible
> unless BitKeeper has abandoned the usual edit w/arbitrary editor and
> then checkin scheme.
>
I don't how bitkeeper does it, but I am familiar with drbd. A network
based disk mirroring solution.
With drbd, if the network/cluster is up, then all writes are mirrored
immediately.
If the network is down, a dirty table is maintained. Anytime a write is
done, the dirty table is flagged.
Then, when communication is re-established only the dirty blocks are
syncronized.
LVM snapshots also maintain a dirty block table (or Copy-On-Write
table).
It is obviously very non-trivial, but the ultimate way for a
rdiff-backup tool to work would be to somehow tie into a block level
dirty table. The table would then be cleared when the block is backed
up.
As I said, LVM already has the concept of a dirty block table, so it is
not a totally crazy idea.
Anybody need a doctorial thesis project?
Greg
--
Greg Freemyer