Re: [Arx-users] Further thoughts on ArX and simplicity

From:	Kevin Smith
Subject:	Re: [Arx-users] Further thoughts on ArX and simplicity
Date:	Tue, 26 Jul 2005 23:46:21 -0400
User-agent:	Mozilla Thunderbird 1.0.2 (X11/20050404)

Walter Landry wrote:

Kevin Smith <address@hidden> wrote:
Can you point to any design docs that describe the existing archive format?

(Excellent description of ARCHIVE and PROJECT TREE data files mostly snipped)


Thanks!


The rest of the archive is the actual data.  For a revision with the
name branch.subbranch,revision, that patch will be in the directory
branch/subbranch/,revision.  The patch itself can contain up to 5 files

1) branch.subbranch,revision.patches.tar.gz
2) branch.subbranch,revision.patches.tar.gz.sig

  These are the actual patch and its associated gpg detached signature.

3) log


(snip)

4) sha256
5) sha256.sig

  These are the SHA256 of the revision and associated gpg detached
  signature.  These are just the hex representation of the SHA256, not
  using serialization.
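
The layout described above can be sketched in a few lines. This is only an illustration of the naming rule as stated in this message, not ArX code; the function names `revision_dir` and `revision_files` are made up for the example.

```python
from pathlib import PurePosixPath


def revision_dir(full_name: str) -> PurePosixPath:
    """Map 'branch.subbranch,revision' to 'branch/subbranch/,revision'."""
    branch_part, revision = full_name.split(",", 1)
    # Dots in the branch name become directory separators; the revision
    # directory keeps its leading comma.
    return PurePosixPath(*branch_part.split("."), "," + revision)


def revision_files(full_name: str) -> list[PurePosixPath]:
    """List the (up to) five files that can sit in a revision directory."""
    base = revision_dir(full_name)
    tarball = f"{full_name}.patches.tar.gz"
    names = [tarball, tarball + ".sig", "log", "sha256", "sha256.sig"]
    return [base / name for name in names]


for path in revision_files("branch.subbranch,revision"):
    print(path)
```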


SHA256 of what? Of the .tar.gz file?

Obviously a "log" here is not what I think of as a "log". Perhaps we could come up with a better name for it to avoid confusion.
I am confused by your confusion.  What do you think a log is?

I think my main confusion is that I think of a log as being a collection of entries. ArX creates a "log" for each checkin. I would probably use the name "commit data" or "patch description". To me, the log is the concatenation of a bunch of those smaller atoms.

I notice that you use the term "patch-log" to refer to a collection of "logs of patches". That does seem confusing. From here on, I will use the term "patch info" to refer to what ArX currently calls the log for a single patch.

Part of the patch is adding a patch log.  We have to know where to put
the patch log.  The only way to guarantee that there are no conflicts
is to use the hash.  If we did not support cherry-picking, then it is
pretty simple to just order the logs in one big file.  With
cherry-picking, it is no longer determined whether one patch comes
before another.

So this cherry-picking thing sounds like a key distinction between ArX and other systems. A simplistic SCM can store a single file that lists all the patches that have been applied, in order. I believe that GIT and darcs do something like this.

Even with cherry picking, it seems like the patches must have been applied to THIS branch (and therefore project tree) in some specific order. What does the last sentence of that paragraph really mean?

Because the archive already has a known location for the patch info (right next to the patch itself). But that's also true in the project tree. In both cases, patch infos are named according to their branch/revision. I'm definitely missing something here. [Update... I think I'm getting it, as you'll see near the end of this message.]

We have to have the patch log in the tree because we have to record
which patches have been applied to the tree.  So when we run "get", we
will get a tree that knows that certain patches have been applied.
The revision hash incorporates the location and contents of the patch
log, so we have to know the location before we can compute the hash.

I don't yet understand the compelling benefit of revision hashes. I know other systems use them, but are they necessary for ArX? Wait. Don't answer that until you've read the rest of this message.

I _think_ we could work around this by creating a random number that
we use to uniquify the patch log, and then have some metadata in the
patch that maps "random numbers" -> "hashes".


That sounds even *more* complicated. I'm hoping for something simpler.

Alternatively, we could just not include the location of the patch log
in the information that gets hashed.  Then we could put it in the
right place after applying a patch.  That would mean that patches
created with "diff" would be different from patches created with
"commit".  It also means that it would be a little difficult to
manually verify a hash (but that is really not a big deal).  The
mapping would still not be hashed, but that is ok, because when you
get the patch, you presumably know which patch you want.  Hmm.  That
might work.

If I understand this, you were originally thinking that the patch would contain its info location, and therefore the hash of the patch would depend partly on the info location. But now you're thinking that the patch could be independent of its info.
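
To make the two options concrete, here is a rough sketch of the difference. It assumes nothing about how ArX really computes its revision hash; `hash_with_location` and `hash_without_location` are hypothetical names, and SHA256 is used only because that is the hash mentioned earlier in the thread.

```python
import hashlib


def hash_with_location(patch_bytes: bytes, info_location: str) -> str:
    """Hash that covers the patch info location as well as the patch."""
    h = hashlib.sha256()
    h.update(info_location.encode("utf-8"))
    h.update(b"\0")  # separator so location and data cannot run together
    h.update(patch_bytes)
    return h.hexdigest()


def hash_without_location(patch_bytes: bytes) -> str:
    """Hash that covers only the patch data."""
    return hashlib.sha256(patch_bytes).hexdigest()


patch = b"example patch contents"
# Moving the patch info changes the first hash...
moved = hash_with_location(patch, "logs/a") != hash_with_location(patch, "logs/b")
# ...but the second hash never sees the location, so it cannot change.
stable = hash_without_location(patch) == hash_without_location(patch)
print(moved, stable)
```

With the second form, the info can be put in the right place after the patch is applied without invalidating the hash.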

I would love to figure out a way to implement remote branching without having to overhaul the archive format.
Archive names are pervasive, so I think modifying the archive format
is inevitable.  On the other hand, since we have to modify the format
anyway, we can sneak in a few other improvements.


Ok, but here's my current thinking, in abstract form:

An archive is a collection of branches. Each branch is an ordered collection of patches.

A project tree is a local working copy of a single branch. It needs to know the URL of its primary archive, plus the specific branch name from within that archive. These combine to point to a specific target for things like get, diff, and commit.
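
As a sketch of what I mean by keeping the two pieces separate (the class, its field names, and the way the two are joined into a target are all just assumptions for illustration):

```python
from dataclasses import dataclass


@dataclass
class ProjectTree:
    archive_url: str  # URL of the primary archive
    branch: str       # branch name within that archive

    def target(self) -> str:
        """Combine archive and branch into the target for get/diff/commit."""
        return f"{self.archive_url.rstrip('/')}/{self.branch}"


tree = ProjectTree("http://example.org/archive/", "main.fixes")
print(tree.target())

# If the archive moves, only one field changes; the branch name is untouched.
tree.archive_url = "http://example.net/new-home"
print(tree.target())
```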

Now, to facilitate remote branches, an archive can store, for each of its branches, a "fallback URL". Any information about that branch that isn't found in this archive can be found in the archive located at the fallback URL. If the information isn't there, check THAT archive's fallback.
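
The lookup could go roughly like this. It is a sketch only: archives are faked as an in-memory dict keyed by URL, the `Archive` class and `find_patch` are invented names, and the guard against fallback loops is my own addition.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Archive:
    # branch name -> {revision -> patch data held in this archive}
    branches: dict[str, dict[str, bytes]] = field(default_factory=dict)
    # branch name -> fallback URL for that branch
    fallbacks: dict[str, str] = field(default_factory=dict)


def find_patch(archives: dict[str, Archive], url: Optional[str],
               branch: str, revision: str) -> Optional[bytes]:
    """Look in the archive at url, then follow fallback URLs until found."""
    seen = set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against a loop of fallbacks
        archive = archives.get(url)
        if archive is None:
            return None
        data = archive.branches.get(branch, {}).get(revision)
        if data is not None:
            return data
        url = archive.fallbacks.get(branch)
    return None


archives = {
    "http://example.org/upstream": Archive(
        branches={"main": {"1": b"patch one", "2": b"patch two"}},
    ),
    "http://example.org/mine": Archive(
        branches={"main": {"3": b"patch three"}},
        fallbacks={"main": "http://example.org/upstream"},
    ),
}

# Revision 3 is local; revision 1 is only reachable through the fallback.
print(find_patch(archives, "http://example.org/mine", "main", "3"))
print(find_patch(archives, "http://example.org/mine", "main", "1"))
```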


If an archive moves, the URLs that referred to it must be updated.

Now, compare that conceptual model to what ArX has today:

An ArX archive stores each branch in its own directory. Within each branch, there is a list of patches, and each patch has the data (as a tarball), metadata (aka "patch info" or an ArX "patch log"), and (I think) a signature-of-tarball, hash-of-tarball, and signature-of-hash-of-tarball. That all seems like what I described above.

An ArX project tree stores archive+branch as a single entity, rather than as two separate fields. A subtle difference, but one that makes it harder to move an archive to a new location.

An ArX project tree stores a patch log... which seems to be identical to what's already in the archive for this branch. Is this just a speed optimization for archives that aren't local?

Where does cherry picking fit into (or conflict with) this vision? Ah. We need to know that patch (revision) 57 in branch a is actually the same patch as revision 33 in branch b. If we use the hash of a patch as its identifier, we can know that we have or have not already pulled that patch. We want a map of hash->patch for fast checking.
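
Something like the following is what I have in mind for that map. It is a toy model, not a description of ArX internals: the `Branch` class is invented, patches are plain byte strings, and the choice of SHA256 here is arbitrary.

```python
import hashlib


def patch_id(patch_bytes: bytes) -> str:
    """Identify a patch by the hash of its contents."""
    return hashlib.sha256(patch_bytes).hexdigest()


class Branch:
    def __init__(self, name: str):
        self.name = name
        self.revisions = []  # ordered patch ids; position + 1 is the revision
        self.by_hash = {}    # patch id -> revision number, for fast checking

    def has(self, patch_bytes: bytes) -> bool:
        return patch_id(patch_bytes) in self.by_hash

    def apply(self, patch_bytes: bytes) -> int:
        """Record a patch and return its revision number on this branch."""
        pid = patch_id(patch_bytes)
        if pid in self.by_hash:
            return self.by_hash[pid]  # already pulled; nothing to do
        self.revisions.append(pid)
        self.by_hash[pid] = len(self.revisions)
        return self.by_hash[pid]


a, b = Branch("a"), Branch("b")
fix, feature = b"fix typo", b"add feature"
a.apply(fix)
a.apply(feature)   # revision 2 on branch a
b.apply(feature)   # cherry-picked: revision 1 on branch b, same patch id
print(b.has(feature), b.has(fix))
```

The same patch gets different revision numbers on the two branches, but the hash lets either branch tell that it already has it.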

Should we hash just the patch data, or also its metadata? I would think it would be just the data, because the metadata might have an additional "signed off by", or a fixed typo in the commit comment. That would argue for storing each patch in a file named by the SHA-1 of the patch.

The corresponding patch info could be stored under the same name with a .info appended (or whatever). But that would prevent having the same patch in the system have two different sets of metadata associated with it. I don't know if that's a problem. I suppose there could be .info.1, .info.2, etc. if necessary.

The patch-log (ordered collection of patch infos) could contain an ordered list of patch hashes. It could also store patch info hashes, to allow verification that the patch info hasn't changed.
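
Putting the last three paragraphs together, a minimal sketch might look like this. A dict stands in for a directory of files, and `PatchStore` and its methods are names I made up; only the SHA-1 naming and the .info suffix come from the idea above.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class PatchStore:
    def __init__(self):
        self.files = {}      # filename -> contents, standing in for a directory
        self.patch_log = []  # ordered list of (patch hash, patch info hash)

    def add(self, patch: bytes, info: bytes) -> str:
        """Store a patch under its SHA-1, with its info next to it."""
        name = sha1_hex(patch)
        self.files[name] = patch
        self.files[name + ".info"] = info
        self.patch_log.append((name, sha1_hex(info)))
        return name

    def info_unchanged(self, index: int) -> bool:
        """Check the stored info against the hash recorded in the patch-log."""
        name, info_hash = self.patch_log[index]
        return sha1_hex(self.files[name + ".info"]) == info_hash


store = PatchStore()
name = store.add(b"patch data", b"Summary: fix typo\n")
print(store.info_unchanged(0))

# Editing the info (say, adding a sign-off) is detectable, but the patch
# keeps its name because the name depends only on the patch data.
store.files[name + ".info"] += b"Signed-off-by: someone\n"
print(store.info_unchanged(0), sha1_hex(store.files[name]) == name)
```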

You mentioned that each time a fork is created, that revision needs to record the branch it came from. Maybe this concept could be split into two parts. First, a revision can ONLY be forked from within the same archive. That way, a revision only needs to store a local/relative branch/subbranch,revision. Meanwhile, a *branch* can be forked from some other archive. So the URL of the source archive becomes a piece of branch data, which gets it out of the hash picture and makes it much easier to update if the source archive moves.
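
Here is a sketch of that split, under the assumption that only revision data feeds the hash. The `Revision` and `BranchInfo` classes, their fields, and `revision_digest` are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Revision:
    name: str                          # e.g. "main/fixes,1", local to the archive
    forked_from: Optional[str] = None  # local/relative name only, never a URL


@dataclass
class BranchInfo:
    name: str
    source_archive_url: Optional[str] = None  # branch data, never hashed


def revision_digest(rev: Revision) -> str:
    """Hash only what the revision itself records."""
    text = f"{rev.name}\n{rev.forked_from or ''}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


rev = Revision("main/fixes,1", forked_from="main/trunk,12")
branch = BranchInfo("main/fixes", source_archive_url="http://example.org/old")

before = revision_digest(rev)
branch.source_archive_url = "http://example.org/new"  # the source archive moved
print(revision_digest(rev) == before)
```

Because the URL lives on the branch and not in the revision, updating it after a move leaves every revision hash alone.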

I think the same thing is true of a tag, but since an ArX tag is quite a bit different from tags in most other SCM tools, I'm not sure. If I were designing tags, I would make them merely a record that attaches a symbolic name to a particular branch/subbranch,revision within the archive.


Thanks for bearing with me as I try to understand this stuff!

Kevin
