Re: [Arx-users] Further thoughts on ArX and simplicity


From: Walter Landry
Subject: Re: [Arx-users] Further thoughts on ArX and simplicity
Date: Wed, 27 Jul 2005 18:14:57 -0700 (PDT)

Kevin Smith <address@hidden> wrote:
> Walter Landry wrote:
> > Kevin Smith <address@hidden> wrote:
> > 4) sha256
> > 5) sha256.sig
> > 
> >   These are the SHA256 of the revision and associated gpg detached
> >   signature.  These are just the hex representation of the SHA256, not
> >   using serialization.
> 
> SHA256 of what? Of the .tar.gz file?

Of the entire project tree.  The _arx/++manifest file has the hashes
of all of the files.  The sha256 file is the hash of that file.  So it
is an end-to-end check that what you got in the end is what you think
you are getting.
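
As a rough illustration of that end-to-end check (a sketch only, not
ArX's actual code or manifest format; the function names and the
"hash  path" line layout are assumptions made up for this example),
hashing every file into a manifest and then hashing the manifest
might look like:

```python
import hashlib
import os


def file_sha256(path):
    """Return the hex SHA256 of one file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Return manifest text: one 'hash  relative/path' line per file, sorted by path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, file_sha256(full)))
    entries.sort()  # fixed order, so the same tree always gives the same text
    return "".join(f"{digest}  {rel}\n" for rel, digest in entries)


def tree_hash(root):
    """Hash of the manifest: changes if any file's content or path changes."""
    return hashlib.sha256(build_manifest(root).encode("utf-8")).hexdigest()
```

Because the manifest covers every file, comparing one short hex
string is enough to notice a change anywhere in the tree.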

If you did not have this, someone would only have to break a single
link in the long chain of patches that make up a project tree.  If you
referenced revisions by hash, it would not matter, because breaking
the chain would mean breaking the hash.  But ArX currently doesn't do
this, so it needs the end-to-end checksum.

<snip>
> > Part of the patch is adding a patch log.  We have to know where to put
> > the patch log.  The only way to guarantee that there are no conflicts
> > is to use the hash.  If we did not support cherry-picking, then it is
> > pretty simple to just order the logs in one big file.  With
> > cherry-picking, it is no longer determined whether one patch comes
> > before another.
> 
> So this cherry-picking thing sounds like a key distinction between ArX 
> and other systems. A simplistic SCM can store a single file that lists 
> all the patches that have been applied, in order. I believe that GIT and 
> darcs do something like this.
> 
> Even with cherry picking, it seems like the patches must have been 
> applied to THIS branch (and therefore project tree) in some specific 
> order. What does the last sentence of that paragraph really mean?

I am not sure what I was thinking, since I realized that they can do
the moral equivalent of what I am proposing.  So scratch that :|

> Because the archive already has known location for the patch info (right 
> next to the patch itself). But that's also true in the project tree. In 
> both cases, patch infos are named according to their branch/revision. 
> I'm definitely missing something here. [Update...I think I'm getting it, 
> as you'll see near the end of this message.]

The patch info is also inside the patch.  The file outside the patch
is for performance.

> > We have to have the patch log in the tree because we have to record
> > which patches have been applied to the tree.  So when we run "get", we
> > will get a tree that knows that certain patches have been applied.
> > The revision hash incorporates the location and contents of the patch
> > log, so we have to know the location before we can compute the hash.
> 
> I don't yet understand the compelling benefit of revision hashes. I know 
> other systems use them, but are they necessary for ArX? Wait. Don't 
> answer that until you've read the rest of this message.

Revision hashes serve two purposes:

  1) A unique identifier

    This is needed when a single branch is receiving asynchronous
    input from different sources.  That means that if you create
    slightly different revisions from the same ancestor, they cannot
    be confused with each other.  A random number (aka UUID) would
    mostly work, although you could mess yourself up by choosing the
    same random number twice.  Even if ArX itself would not do such a
    thing, once ArX gets popular(!?), I can guarantee you that third
    party tools will.

  2) Self verifier

    If I say that rev ab7324bd9ce is good, then you know that it is
    impossible for someone to replace it with something else, even if
    you downloaded it from a site in North Korea.  Digital signatures
    combined with random numbers are mostly equivalent.  However, the
    signature is not guaranteed to be with the revision, while the
    hash is an inherent quality of it.
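
Both purposes can be sketched in a few lines.  This is only an
illustration under assumptions: the choice of hashed fields (name,
parent id, manifest text), the length-prefix framing, and the
function names are invented here and are not the ArX format.

```python
import hashlib


def revision_id(name, parent_id, manifest_text):
    """Hex SHA256 over the revision's name, its parent's id, and its manifest."""
    h = hashlib.sha256()
    for part in (name, parent_id, manifest_text):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def verify(trusted_id, name, parent_id, manifest_text):
    """True only if what was downloaded hashes to the id we were told is good."""
    return revision_id(name, parent_id, manifest_text) == trusted_id


# Two slightly different revisions made from the same ancestor.
base = revision_id("proj.main,0", "", "aaa  README\n")
alice = revision_id("proj.main,1", base, "aaa  README\nbbb  alice.c\n")
bob = revision_id("proj.main,1", base, "aaa  README\nccc  bob.c\n")
```

Here alice and bob get different ids even though they share a name
and a parent, and verify() rejects any content other than the exact
content that produced the trusted id.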

> > I _think_ we could work around this by creating a random number that
> > we use to uniquify the patch log, and then have some metadata in the
> > patch that maps "random numbers" -> "hashes".  
> 
> That sounds even *more* complicated. I'm hoping for something simpler.
> 
> > Alternatively, we could just not include the location of the patch log
> > in the information that gets hashed.  Then we could put it in the
> > right place after applying a patch.  That would mean that patches
> > created with "diff" would be different from patches created with
> > "commit".  It also means that it would be a little difficult to
> > manually verify a hash (but that is really not a big deal).  The
> > mapping would still not be hashed, but that is ok, because when you
> > get the patch, you presumably know which patch you want.  Hmm.  That
> > might work.
> 
> If I understand this, you were originally thinking that the patch would 
> contain its info location, and therefore the hash of the patch would 
> depend partly on the info location. But now you're thinking that the 
> patch could be independent of its info.

Not quite.  The patch still depends on the info (commit message,
changed files, etc.), but not on the name or location of the info.  It
would be put in a generic place like "latest log".  Once the patch is
applied, the "latest log" would be put into its final resting place.
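
A toy model of that idea, with loud caveats: a tree and a patch are
both modelled as plain dicts of path -> bytes, "applying" just copies
files instead of applying diffs, and the names "latest-log" and
"patch-log/" are hypothetical.  The point is only that the hash is
computed while the log sits in a generic slot, and the log is filed
under a hash-derived name afterwards.

```python
import hashlib

GENERIC_LOG = "latest-log"  # hypothetical name for the generic slot


def patch_hash(patch_files):
    """Hash every path and its contents, with the log still at GENERIC_LOG."""
    h = hashlib.sha256()
    for path in sorted(patch_files):
        for data in (path.encode("utf-8"), patch_files[path]):
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()


def apply_patch(tree, patch_files):
    """Copy the patch's files into the tree, then file the log under its hash."""
    digest = patch_hash(patch_files)
    for path, data in patch_files.items():
        if path != GENERIC_LOG:
            tree[path] = data
    # Final resting place: named by the hash, so two patches cannot collide.
    tree[f"patch-log/{digest}"] = patch_files[GENERIC_LOG]
    return digest
```

The hash still covers the log's contents (commit message and so on),
but not its final name, so the final name can be derived from the
hash without a circular dependency.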

However, your suggestion of making the patch independent of the info
is interesting.  I need to think about it some more.

> >>I would love to figure out a way to implement remote branching without 
> >>having to overhaul the archive format.
> > 
> > Archive names are pervasive, so I think modifying the archive format
> > is inevitable.  On the other hand, since we have to modify the format
> > anyway, we can sneak in a few other improvements.
> 
> Ok, but here's my current thinking, in abstract form:
> 
> An archive is a collection of branches. Each branch is an ordered 
> collection of patches.

Right now, everything is ordered.  But with revision hashes, they
would become partially ordered.  That is inevitable if you are going
to allow two people to work independently on the same line of
development.
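
To make "partially ordered" concrete, here is a small sketch.  The
parent map and the revision names are invented for the example;
nothing here reflects how ArX stores ancestry.

```python
def ancestors(parents, rev):
    """Return the set of all revisions reachable from rev via parent links."""
    seen = set()
    stack = list(parents.get(rev, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents.get(cur, ()))
    return seen


def compare(parents, a, b):
    """Return 'before', 'after', 'same', or 'unordered' for revisions a and b."""
    if a == b:
        return "same"
    if a in ancestors(parents, b):
        return "before"
    if b in ancestors(parents, a):
        return "after"
    return "unordered"


# Alice and Bob both commit on top of "base"; "merge" joins their work.
parents = {
    "base": [],
    "alice": ["base"],
    "bob": ["base"],
    "merge": ["alice", "bob"],
}
```

With this map, compare(parents, "alice", "bob") gives "unordered":
neither revision is an ancestor of the other, which is exactly what
a single ordered list cannot express.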

> A project tree is a local working copy of a single branch. It needs to 
> know the URL of its primary archive, plus the specific branch name from 
> within that archive. These combine to point to a specific target for 
> things like get, diff, and commit.
>
> Now, to facilitate remote branches, an archive can store, for each of 
> its branches, a "fallback URL". Any information about that branch that 
> isn't found in this archive can be found in the archive located at the 
> fallback URL. If the information isn't there, check THAT archive's fallback.
> 
> If an archive moves, the URLs that referred to it must be updated.

That is exactly what I am thinking.
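
The fallback lookup described above could be sketched like this.
Everything concrete is an assumption for illustration: archives are
modelled as an in-memory dict keyed by URL rather than fetched over
the network, and the field names "revisions" and "fallback" are made
up.

```python
def find_revision(archives, url, branch, revision):
    """Follow per-branch fallback URLs until the revision is found.

    archives maps URL -> {branch: {"revisions": {...}, "fallback": URL or None}}.
    Returns (url_where_found, data), or None if the chain runs out.
    """
    visited = set()
    while url is not None and url not in visited:
        visited.add(url)  # guard against fallback loops
        info = archives.get(url, {}).get(branch)
        if info is None:
            return None  # this archive does not know the branch at all
        if revision in info["revisions"]:
            return url, info["revisions"][revision]
        url = info.get("fallback")
    return None


archives = {
    "http://example.org/upstream": {
        "proj.main": {"revisions": {0: "base", 1: "fix"}, "fallback": None},
    },
    "http://example.org/mine": {
        "proj.main": {
            "revisions": {2: "local work"},
            "fallback": "http://example.org/upstream",
        },
    },
}
```

Looking up revision 1 starting from "http://example.org/mine" misses
locally and is then found in the upstream archive.  If upstream
moves, only the one "fallback" entry needs to change.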

> Now, compare that conceptual model to what ArX has today:
> 
> An ArX archive stores each branch in its own directory. Within each 
> branch, there is a list of patches, and each patch has the data (as a 
> tarball), metadata (aka "patch info" or an ArX "patch log"), and (I 
> think) a signature-of-tarball, hash-of-tarball, and 
> signature-of-hash-of-tarball. That all seems like what I described above.

As noted above, it is signature-of-patch-tarball, hash-of-revision,
and signature-of-hash-of-revision.  Also, the patch info is strictly
for performance, so that we don't have to unpack a patch every time we
need to know something about a patch.

> An ArX project tree stores archive+branch as a single entity, rather 
> than as two separate fields. A subtle difference, but one that makes it 
> harder to move an archive to a new location.
> 
> An ArX project tree stores a patch log...which seems to be identical to 
> what's already in the archive for this branch. Is this just a speed 
> optimization for archives that aren't local?

Some of those archives might not be available.  This is really a
concern only if you want to support distributed branches.  Otherwise,
you could do what monotone does, and just get everything out of the
archive.

> Where does cherry picking fit into (or conflict with) this vision? Ah. 
> We need to know that patch (revision) 57 in branch a is actually the 
> same patch as revision 33 in branch b. If we use the hash of a patch as 
> its identifier, we can know that we have or have not already pulled that 
> patch. We want a map of hash->patch for fast checking.

Are you talking about patches or revisions? In any case, patch or
revision 57 in branch a is never the same as patch or revision 33 in
branch b.  They have different names, which is part of the hash.

> Should we hash just the patch data, or also its metadata? I would think 
> it would be just the data, because the metadata might have an additional 
> "signed off by", or a fixed typo in the commit comment. That would argue 
> for storing each patch in a file named by the SHA-1 of the patch.

But then the meta-data is no longer trusted.  The right way to fix
things like typos is to delete the old revision and commit a new
revision with corrected metadata.

> The corresponding patch info could be stored under the same name with a 
> .info appended (or whatever). But that would prevent having the same 
> patch in the system have two different sets of metadata associated with 
> it. I don't know if that's a problem. I suppose there could be .info.1, 
> .info.2, etc. if necessary.

I don't understand this part.

> The patch-log (ordered collection of patch infos) could contain an 
> ordered list of patch hashes. It could also store patch info hashes, to 
> allow verification that the patch info hasn't changed.

I am not sure why you are worried about the patch info changing in the
project tree.  Are you worried that someone will break into your
machine and change the patch info but not the corresponding checksum?

> You mentioned that each time a fork is created, that revision needs to 
> record the branch it came from. Maybe this concept could be split into 
> two parts. First, a revision can ONLY be forked from within the same 
> archive. That way, a revision only needs to store a local/relative 
> branch/subbranch,revision. Meanwhile, a *branch* can be forked from some 
> other archive. So the URL of the source archive becomes a piece of 
> branch data, which gets it out of the hash picture and makes it much 
> easier to update if the source archive moves.

Yep, this is what I was thinking.

> I think the same thing is true of a tag, but since an ArX tag is quite a 
> bit different from tags in most other SCM tools, I'm not sure. If I were 
> designing tags, I would make them merely a record that attaches a 
> symbolic name to a particular branch/subbranch,revision within the archive.

It should also work for tags.

> Thanks for bearing with me as I try to understand this stuff!

These emails take a fair amount of time to write and go through so
many revisions that I get to the point where even I don't understand
them ;)

Cheers,
Walter



