arx-users

Re: [Arx-users] Further thoughts on ArX and simplicity


From: Kevin Smith
Subject: Re: [Arx-users] Further thoughts on ArX and simplicity
Date: Tue, 26 Jul 2005 23:46:21 -0400
User-agent: Mozilla Thunderbird 1.0.2 (X11/20050404)

Walter Landry wrote:
Kevin Smith <address@hidden> wrote:

Can you point to any design docs that describe the existing archive format?

(Excellent description of ARCHIVE and PROJECT TREE data files mostly snipped)

Thanks!


The rest of the archive is the actual data.  For a revision with the
name branch.subbranch,revision, that patch will be in the directory
branch/subbranch/,revision.  The patch itself can contain up to 5 files

1) branch.subbranch,revision.patches.tar.gz
2) branch.subbranch,revision.patches.tar.gz.sig

  These are the actual patch and its associated gpg detached signature.

3) log

(snip)

4) sha256
5) sha256.sig

  These are the SHA256 of the revision and associated gpg detached
  signature.  These are just the hex representation of the SHA256, not
  using serialization.
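
For reference, here is how I read the layout described above, as a minimal Python sketch. The function name revision_files is hypothetical, and splitting the branch name on "." to get nested directories is an assumption drawn from the single example given (branch.subbranch,revision -> branch/subbranch/,revision).

```python
from typing import List


def revision_files(name: str) -> List[str]:
    """Return the archive-relative paths for one revision.

    `name` looks like "branch.subbranch,revision". Assumption: each
    dot-separated part of the branch name becomes a directory level.
    """
    branch_part, sep, revision = name.partition(",")
    if not sep or not branch_part or not revision:
        raise ValueError("expected a name like 'branch.subbranch,revision'")

    # "branch.subbranch" -> "branch/subbranch", then a ",revision" directory
    directory = "/".join(branch_part.split(".")) + "/," + revision

    return [
        f"{directory}/{name}.patches.tar.gz",      # 1) the patch itself
        f"{directory}/{name}.patches.tar.gz.sig",  # 2) detached signature
        f"{directory}/log",                        # 3) the log for this patch
        f"{directory}/sha256",                     # 4) hex SHA256
        f"{directory}/sha256.sig",                 # 5) signature of the hash
    ]
```
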

SHA256 of what? Of the .tar.gz file?

Obviously a "log" here is not what I think of as a "log". Perhaps we could come up with a better name for it to avoid confusion.


I am confused by your confusion.  What do you think a log is?

I think my main confusion is that I think of a log as being a collection of entries. ArX creates a "log" for each checkin. I would probably use the name "commit data" or "patch description". To me, the log is the concatenation of a bunch of those smaller atoms.

I notice that you use the term "patch-log" to refer to a collection of "logs of patches". That does seem confusing. From here on, I will use the term "patch info" to refer to what ArX currently calls the log for a single patch.

Part of the patch is adding a patch log.  We have to know where to put
the patch log.  The only way to guarantee that there are no conflicts
is to use the hash.  If we did not support cherry-picking, then it is
pretty simple to just order the logs in one big file.  With
cherry-picking, it is no longer determined whether one patch comes
before another.

So this cherry-picking thing sounds like a key distinction between ArX and other systems. A simplistic SCM can store a single file that lists all the patches that have been applied, in order. I believe that GIT and darcs do something like this.

Even with cherry picking, it seems like the patches must have been applied to THIS branch (and therefore project tree) in some specific order. What does the last sentence of that paragraph really mean?

The archive already has a known location for the patch info (right next to the patch itself), and the same is true in the project tree. In both cases, patch infos are named according to their branch/revision. I'm definitely missing something here. [Update...I think I'm getting it, as you'll see near the end of this message.]

We have to have the patch log in the tree because we have to record
which patches have been applied to the tree.  So when we run "get", we
will get a tree that knows that certain patches have been applied.
The revision hash incorporates the location and contents of the patch
log, so we have to know the location before we can compute the hash.

I don't yet understand the compelling benefit of revision hashes. I know other systems use them, but are they necessary for ArX? Wait. Don't answer that until you've read the rest of this message.

I _think_ we could work around this by creating a random number that
we use to uniquify the patch log, and then have some metadata in the
patch that maps "random numbers" -> "hashes".

That sounds even *more* complicated. I'm hoping for something simpler.

Alternatively, we could just not include the location of the patch log
in the information that gets hashed.  Then we could put it in the
right place after applying a patch.  That would mean that patches
created with "diff" would be different from patches created with
"commit".  It also means that it would be a little difficult to
manually verify a hash (but that is really not a big deal).  The
mapping would still not be hashed, but that is ok, because when you
get the patch, you presumably know which patch you want.  Hmm.  That
might work.

If I understand this, you were originally thinking that the patch would contain its info location, and therefore the hash of the patch would depend partly on the info location. But now you're thinking that the patch could be independent of its info.

I would love to figure out a way to implement remote branching without having to overhaul the archive format.

Archive names are pervasive, so I think modifying the archive format
is inevitable.  On the other hand, since we have to modify the format
anyway, we can sneak in a few other improvements.

Ok, but here's my current thinking, in abstract form:

An archive is a collection of branches. Each branch is an ordered collection of patches.

A project tree is a local working copy of a single branch. It needs to know the URL of its primary archive, plus the specific branch name from within that archive. These combine to point to a specific target for things like get, diff, and commit.

Now, to facilitate remote branches, an archive can store, for each of its branches, a "fallback URL". Any information about that branch that isn't found in this archive can be found in the archive located at the fallback URL. If the information isn't there, check THAT archive's fallback.
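
To make the fallback idea concrete, here is a rough Python sketch of the lookup. Everything in it is hypothetical: archives are modelled as an in-memory dict keyed by URL rather than fetched over the network, and the names find_patch, "patches", and "fallback" are mine, not anything ArX has today.

```python
from typing import Dict, Optional, Tuple

# Hypothetical model: archive URL -> branch name -> branch record.
# A branch record holds the patches stored locally in that archive and
# an optional fallback URL for anything that is missing.
Archives = Dict[str, Dict[str, dict]]


def find_patch(archives: Archives, url: Optional[str], branch: str,
               revision: str) -> Optional[Tuple[str, bytes]]:
    """Follow fallback URLs until the revision is found.

    Returns (url_of_archive_that_had_it, patch_bytes), or None.
    """
    seen = set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against fallback loops
        record = archives.get(url, {}).get(branch)
        if record is None:
            return None  # this archive knows nothing about the branch
        if revision in record["patches"]:
            return url, record["patches"][revision]
        url = record.get("fallback")  # check THAT archive next
    return None
```

In this model, moving an archive means updating the "fallback" value in each branch record that pointed at the old URL.
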

If an archive moves, the URLs that referred to it must be updated.

Now, compare that conceptual model to what ArX has today:

An ArX archive stores each branch in its own directory. Within each branch, there is a list of patches, and each patch has the data (as a tarball), metadata (aka "patch info" or an ArX "patch log"), and (I think) a signature-of-tarball, hash-of-tarball, and signature-of-hash-of-tarball. That all seems like what I described above.

An ArX project tree stores archive+branch as a single entity, rather than as two separate fields. A subtle difference, but one that makes it harder to move an archive to a new location.

An ArX project tree stores a patch log...which seems to be identical to what's already in the archive for this branch. Is this just a speed optimization for archives that aren't local?

Where does cherry picking fit into (or conflict with) this vision? Ah. We need to know that patch (revision) 57 in branch a is actually the same patch as revision 33 in branch b. If we use the hash of a patch as its identifier, we can know that we have or have not already pulled that patch. We want a map of hash->patch for fast checking.
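
A small Python sketch of that hash->patch map, under stated assumptions: the class and method names are made up, and I use SHA-256 only because the archive already stores sha256 files. Which bytes actually get hashed is still an open question in this thread.

```python
import hashlib
from typing import Dict, List, Tuple


def patch_id(patch_bytes: bytes) -> str:
    """Identify a patch by the hex SHA-256 of its bytes (assumed choice)."""
    return hashlib.sha256(patch_bytes).hexdigest()


class PatchIndex:
    """Map patch hash -> every (branch, revision) where that patch appears."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, List[Tuple[str, int]]] = {}

    def record(self, branch: str, revision: int, patch_bytes: bytes) -> str:
        """Note that this patch was applied as `revision` on `branch`."""
        h = patch_id(patch_bytes)
        self._by_hash.setdefault(h, []).append((branch, revision))
        return h

    def already_applied(self, patch_bytes: bytes) -> bool:
        """Fast check: have we pulled this exact patch before?"""
        return patch_id(patch_bytes) in self._by_hash

    def locations(self, patch_bytes: bytes) -> List[Tuple[str, int]]:
        """All branch/revision names under which this patch is known."""
        return list(self._by_hash.get(patch_id(patch_bytes), []))
```

With this, recording the same bytes as revision 57 in branch a and revision 33 in branch b yields one hash with two locations.
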

Should we hash just the patch data, or also its metadata? I would think it would be just the data, because the metadata might have an additional "signed off by", or a fixed typo in the commit comment. That would argue for storing each patch in a file named by the SHA-1 of the patch.

The corresponding patch info could be stored under the same name with a .info appended (or whatever). But that would prevent having the same patch in the system have two different sets of metadata associated with it. I don't know if that's a problem. I suppose there could be .info.1, .info.2, etc. if necessary.

The patch-log (ordered collection of patch infos) could contain an ordered list of patch hashes. It could also store patch info hashes, to allow verification that the patch info hasn't changed.
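
Roughly, the patch-log described in the last few paragraphs might look like the Python sketch below. The names LogEntry, append_entry, and info_unchanged are hypothetical, and SHA-256 is an assumption (the same idea works with SHA-1).

```python
import hashlib
from dataclasses import dataclass
from typing import List


def hex_digest(data: bytes) -> str:
    # Assumed hash function; swap in SHA-1 if that is what gets chosen.
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class LogEntry:
    patch_hash: str  # identifies the patch data only
    info_hash: str   # lets us detect edits to the patch info


def append_entry(log: List[LogEntry], patch: bytes, info: bytes) -> LogEntry:
    """Append one applied patch to the ordered patch-log."""
    entry = LogEntry(hex_digest(patch), hex_digest(info))
    log.append(entry)
    return entry


def info_unchanged(entry: LogEntry, info: bytes) -> bool:
    """True if the patch info still matches what was logged."""
    return hex_digest(info) == entry.info_hash
```

Because the patch hash covers only the data, fixing a typo in the commit comment changes info_hash but leaves the patch's identity alone.
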

You mentioned that each time a fork is created, that revision needs to record the branch it came from. Maybe this concept could be split into two parts. First, a revision can ONLY be forked from within the same archive. That way, a revision only needs to store a local/relative branch/subbranch,revision. Meanwhile, a *branch* can be forked from some other archive. So the URL of the source archive becomes a piece of branch data, which gets it out of the hash picture and makes it much easier to update if the source archive moves.
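
As a sketch of that split (Python; all names hypothetical, and the exact set of hashed fields is my assumption): the revision hash covers only a local parent reference plus the patch data, while the source archive URL lives on the branch record and can change freely.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Branch:
    name: str
    # Branch-level data: where this branch was forked from, if it came
    # from another archive. Deliberately NOT part of any revision hash.
    source_archive_url: Optional[str] = None


def revision_hash(local_parent: str, patch_bytes: bytes) -> str:
    """Hash a revision from archive-relative data only.

    `local_parent` is a relative name such as "branch/subbranch,revision".
    """
    h = hashlib.sha256()
    h.update(local_parent.encode("utf-8"))
    h.update(b"\0")  # separator so the two fields cannot run together
    h.update(patch_bytes)
    return h.hexdigest()
```

If the source archive moves, only Branch.source_archive_url is rewritten; every revision hash stays valid.
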

I think the same thing is true of a tag, but since an ArX tag is quite a bit different from tags in most other SCM tools, I'm not sure. If I were designing tags, I would make them merely a record that attaches a symbolic name to a particular branch/subbranch,revision within the archive.

Thanks for bearing with me as I try to understand this stuff!

Kevin



