Re: [Arx-users] Further thoughts on ArX and simplicity
From: Kevin Smith
Subject: Re: [Arx-users] Further thoughts on ArX and simplicity
Date: Tue, 26 Jul 2005 23:46:21 -0400
User-agent: Mozilla Thunderbird 1.0.2 (X11/20050404)
Walter Landry wrote:
Kevin Smith <address@hidden> wrote:
Can you point to any design docs that describe the existing archive
format?
(Excellent description of ARCHIVE and PROJECT TREE data files mostly
snipped)
Thanks!
The rest of the archive is the actual data. For a revision with the
name branch.subbranch,revision, that patch will be in the directory
branch/subbranch/,revision. The patch itself can contain up to 5 files:
1) branch.subbranch,revision.patches.tar.gz
2) branch.subbranch,revision.patches.tar.gz.sig
These are the actual patch and its associated gpg detached signature.
3) log
(snip)
4) sha256
5) sha256.sig
These are the SHA256 of the revision and associated gpg detached
signature. These are just the hex representation of the SHA256, not
using serialization.
SHA256 of what? Of the .tar.gz file?
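To make the layout described above concrete, here is a minimal Python sketch of how a revision name could map to its directory and file names. The helper revision_paths is hypothetical and not part of ArX, and it assumes that every "." in the branch part becomes a directory separator, which the description only shows for the two-level case.

```python
from pathlib import PurePosixPath


def revision_paths(name: str) -> dict:
    """Map 'branch.subbranch,revision' to the archive paths described above.

    Hypothetical helper for illustration only; not ArX code.
    """
    # Split the branch part from the revision part at the first comma.
    branch_part, revision = name.split(",", 1)
    # Assumption: 'branch.subbranch' becomes the nested path 'branch/subbranch'.
    directory = PurePosixPath(*branch_part.split(".")) / ("," + revision)
    tarball = f"{name}.patches.tar.gz"
    return {
        "directory": str(directory),
        # The up-to-five files listed in the description.
        "files": [tarball, tarball + ".sig", "log", "sha256", "sha256.sig"],
    }
```

Calling revision_paths("branch.subbranch,revision") gives the directory branch/subbranch/,revision, matching the example in the text.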
Obviously a "log" here is not what I think of as a "log". Perhaps we
could come up with a better name for it to avoid confusion.
I am confused by your confusion. What do you think a log is?
I think my main confusion is that I think of a log as being a collection
of entries. ArX creates a "log" for each checkin. I would probably use
the name "commit data" or "patch description". To me, the log is the
concatenation of a bunch of those smaller atoms.
I notice that you use the term "patch-log" to refer to a collection of
"logs of patches". That does seem confusing. From here on, I will use
the term "patch info" to refer to what ArX currently calls the log for a
single patch.
Part of the patch is adding a patch log. We have to know where to put
the patch log. The only way to guarantee that there are no conflicts
is to use the hash. If we did not support cherry-picking, then it is
pretty simple to just order the logs in one big file. With
cherry-picking, it is no longer determined whether one patch comes
before another.
So this cherry-picking thing sounds like a key distinction between ArX
and other systems. A simplistic SCM can store a single file that lists
all the patches that have been applied, in order. I believe that GIT and
darcs do something like this.
Even with cherry picking, it seems like the patches must have been
applied to THIS branch (and therefore project tree) in some specific
order. What does the last sentence of that paragraph really mean?
Because the archive already has a known location for the patch info (right
next to the patch itself). But that's also true in the project tree. In
both cases, patch infos are named according to their branch/revision.
I'm definitely missing something here. [Update...I think I'm getting it,
as you'll see near the end of this message.]
We have to have the patch log in the tree because we have to record
which patches have been applied to the tree. So when we run "get", we
will get a tree that knows that certain patches have been applied.
The revision hash incorporates the location and contents of the patch
log, so we have to know the location before we can compute the hash.
I don't yet understand the compelling benefit of revision hashes. I know
other systems use them, but are they necessary for ArX? Wait. Don't
answer that until you've read the rest of this message.
I _think_ we could work around this by creating a random number that
we use to uniquify the patch log, and then have some metadata in the
patch that maps "random numbers" -> "hashes".
That sounds even *more* complicated. I'm hoping for something simpler.
Alternatively, we could just not include the location of the patch log
in the information that gets hashed. Then we could put it in the
right place after applying a patch. That would mean that patches
created with "diff" would be different from patches created with
"commit". It also means that it would be a little difficult to
manually verify a hash (but that is really not a big deal). The
mapping would still not be hashed, but that is ok, because when you
get the patch, you presumably know which patch you want. Hmm. That
might work.
If I understand this, you were originally thinking that the patch would
contain its info location, and therefore the hash of the patch would
depend partly on the info location. But now you're thinking that the
patch could be independent of its info.
I would love to figure out a way to implement remote branching without
having to overhaul the archive format.
Archive names are pervasive, so I think modifying the archive format
is inevitable. On the other hand, since we have to modify the format
anyway, we can sneak in a few other improvements.
Ok, but here's my current thinking, in abstract form:
An archive is a collection of branches. Each branch is an ordered
collection of patches.
A project tree is a local working copy of a single branch. It needs to
know the URL of its primary archive, plus the specific branch name from
within that archive. These combine to point to a specific target for
things like get, diff, and commit.
Now, to facilitate remote branches, an archive can store, for each of
its branches, a "fallback URL". Any information about that branch that
isn't found in this archive can be found in the archive located at the
fallback URL. If the information isn't there, check THAT archive's fallback.
If an archive moves, the URLs that referred to it must be updated.
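The fallback idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: archives are modelled as an in-memory dictionary keyed by URL, the field names "patches" and "fallback" are invented here, and find_patch is a hypothetical function rather than anything ArX provides.

```python
from typing import Dict, Optional

# Hypothetical stand-in for archives reachable by URL:
# url -> branch name -> {"patches": {revision: data}, "fallback": url or None}
Archives = Dict[str, Dict[str, dict]]


def find_patch(archives: Archives, url: Optional[str],
               branch: str, revision: int) -> Optional[str]:
    """Look for a patch in an archive, then follow fallback URLs."""
    seen = set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against fallback URLs that form a cycle
        entry = archives.get(url, {}).get(branch)
        if entry is None:
            return None  # this archive does not know the branch at all
        if revision in entry["patches"]:
            return entry["patches"][revision]
        url = entry.get("fallback")  # not here: check THAT archive's fallback
    return None
```

Because the fallback URL is plain branch data, moving the upstream archive would only require updating that one field.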
Now, compare that conceptual model to what ArX has today:
An ArX archive stores each branch in its own directory. Within each
branch, there is a list of patches, and each patch has the data (as a
tarball), metadata (aka "patch info" or an ArX "patch log"), and (I
think) a signature-of-tarball, hash-of-tarball, and
signature-of-hash-of-tarball. That all seems like what I described above.
An ArX project tree stores archive+branch as a single entity, rather
than as two separate fields. A subtle difference, but one that makes it
harder to move an archive to a new location.
An ArX project tree stores a patch log...which seems to be identical to
what's already in the archive for this branch. Is this just a speed
optimization for archives that aren't local?
Where does cherry picking fit into (or conflict with) this vision? Ah.
We need to know that patch (revision) 57 in branch a is actually the
same patch as revision 33 in branch b. If we use the hash of a patch as
its identifier, we can know that we have or have not already pulled that
patch. We want a map of hash->patch for fast checking.
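A rough Python sketch of that hash->patch map follows. It assumes SHA256 over the raw patch bytes as the identifier (the thread has not settled what exactly gets hashed), and the class AppliedPatches and its method names are made up for illustration.

```python
import hashlib
from typing import Dict, Optional, Tuple


def patch_id(patch_bytes: bytes) -> str:
    """Identify a patch by the hex SHA256 of its bytes (an assumption)."""
    return hashlib.sha256(patch_bytes).hexdigest()


class AppliedPatches:
    """Hypothetical record of which patches a tree has already pulled."""

    def __init__(self) -> None:
        # hash -> (branch, revision) under which the patch was first seen
        self._by_hash: Dict[str, Tuple[str, int]] = {}

    def record(self, branch: str, revision: int, patch_bytes: bytes) -> bool:
        """Return True if the patch is new, False if it was already pulled."""
        key = patch_id(patch_bytes)
        if key in self._by_hash:
            return False
        self._by_hash[key] = (branch, revision)
        return True

    def origin(self, patch_bytes: bytes) -> Optional[Tuple[str, int]]:
        """Return where an identical patch was first recorded, if anywhere."""
        return self._by_hash.get(patch_id(patch_bytes))
```

With this, recording revision 57 of branch a and then the identical bytes as revision 33 of branch b reports the second one as already applied.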
Should we hash just the patch data, or also its metadata? I would think
it would be just the data, because the metadata might have an additional
"signed off by", or a fixed typo in the commit comment. That would argue
for storing each patch in a file named by the SHA-1 of the patch.
The corresponding patch info could be stored under the same name with a
.info appended (or whatever). But that would prevent the same patch from
having two different sets of metadata associated with it in the system.
I don't know if that's a problem. I suppose there could be .info.1,
.info.2, etc. if necessary.
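Here is a minimal Python sketch of that storage idea, assuming a flat directory, SHA-1 over the patch data only, and a numbering scheme where the first metadata file is <hash>.info and later ones are <hash>.info.1, <hash>.info.2, and so on. The function store_patch is hypothetical, and the exact numbering is a guess at what "etc." would mean.

```python
import hashlib
from pathlib import Path


def store_patch(root: Path, patch: bytes, info: str) -> Path:
    """Store a patch under its SHA-1 and its info next to it.

    Returns the path of the info file that was written.
    """
    digest = hashlib.sha1(patch).hexdigest()
    # Identical patch data always lands at the same file name.
    (root / digest).write_bytes(patch)
    # Pick the first free metadata slot: <hash>.info, <hash>.info.1, ...
    info_path = root / f"{digest}.info"
    n = 0
    while info_path.exists():
        n += 1
        info_path = root / f"{digest}.info.{n}"
    info_path.write_text(info)
    return info_path
```

Storing the same patch twice with different commit comments leaves one patch file and two info files, so a fixed typo or an added "signed off by" does not change the patch's name.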
The patch-log (ordered collection of patch infos) could contain an
ordered list of patch hashes. It could also store patch info hashes, to
allow verification that the patch info hasn't changed.
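As a sketch of what such a patch-log might look like, the Python below keeps an ordered list of (patch hash, patch info hash) pairs and checks a patch info against it. The names digest, append_entry, and info_unchanged are invented for illustration, and SHA256 is an assumption.

```python
import hashlib
from typing import List, Tuple

# Ordered patch-log: one (patch hash, patch info hash) pair per applied patch.
PatchLog = List[Tuple[str, str]]


def digest(data: bytes) -> str:
    """Hex SHA256 of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def append_entry(log: PatchLog, patch: bytes, info: bytes) -> None:
    """Record a patch and its info at the end of the ordered log."""
    log.append((digest(patch), digest(info)))


def info_unchanged(log: PatchLog, index: int, info: bytes) -> bool:
    """Check that the info for entry `index` still matches the recorded hash."""
    return log[index][1] == digest(info)
```

The list order records the order in which patches were applied to this branch, while the second hash in each pair only detects later edits to the patch info.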
You mentioned that each time a fork is created, that revision needs to
record the branch it came from. Maybe this concept could be split into
two parts. First, a revision can ONLY be forked from within the same
archive. That way, a revision only needs to store a local/relative
branch/subbranch,revision. Meanwhile, a *branch* can be forked from some
other archive. So the URL of the source archive becomes a piece of
branch data, which gets it out of the hash picture and makes it much
easier to update if the source archive moves.
I think the same thing is true of a tag, but since an ArX tag is quite a
bit different from tags in most other SCM tools, I'm not sure. If I were
designing tags, I would make them merely a record that attaches a
symbolic name to a particular branch/subbranch,revision within the archive.
Thanks for bearing with me as I try to understand this stuff!
Kevin