Re: [Arx-users] The Future (long)
From: Kevin Smith
Subject: Re: [Arx-users] The Future (long)
Date: Wed, 07 Dec 2005 10:59:12 -0500
User-agent: Mozilla Thunderbird 1.0.7 (X11/20051011)
Walter Landry wrote:
I have opened up a completely new, incompatible branch of ArX
Cool. My reply is long, too!
First of all, I have changed over from the term "archive" to
"repository".
Good.
I have also converted ArX to only use url's (see [1]). So repos no
longer have names.
Great.
So, unless someone objects, I
would like to change the "#" to "," and the "," to "@". So it would be
url,address@hidden
Looks great.
At the root of the repo, there are two files: "keys" which has the
public gpg keys, and README which tells you the format of the repo and
not to touch anything :)
Smart.
For every branch, there is a directory with the same name as the
branch, but with a period "." appended. The period "." makes it easy
to distinguish branch names (which can be almost anything) from
everything else.
This might cause problems with certain tools on MS Windows (where
"empty" extensions are unusual). Otherwise, seems reasonable.
Also within (sub)*branches, there can be revisions. They are
divided in chunks of 256. So revisions 0-255 are in directory 0,
256-511 are in directory 256, etc.
These are sequential revisions to a particular repo, right? [Oh, this is
clarified later, I think.]
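The chunking rule described above is simple enough to sketch; a minimal illustration (the function name is mine, not ArX's):

```python
def chunk_dir(rev: int) -> int:
    """Return the name of the 256-revision chunk directory holding `rev`.

    Revisions 0-255 live in directory 0, 256-511 in directory 256,
    and so on, as described above.
    """
    return (rev // 256) * 256

# Revision 300 lands in directory 256; revision 255 stays in directory 0.
```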
Within each revisions directory, there are directories for each
revision. Each revision is named by its sha256 and its parent. The
idea is that a simple directory listing can give us all of the
revisions and how they are pieced together.
It seems odd to have the outer layer be sequential but the inner layer
be hashed. I would have expected either a git-like approach, where all
the revision data was stored hashed, with an external index file that
provides the ordering, OR to have the outer directories as you said
here, and then have the inner directories named 1, 2, 3, 4, etc.
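The naming scheme above implies the ancestry really can be read straight off a directory listing. A sketch, assuming (the separator is a guess) directories named "&lt;rev-hash&gt;,&lt;parent-hash&gt;":

```python
def edges_from_listing(names):
    """Recover child -> parent edges from a revision directory listing.

    Each directory is assumed to be named "<rev-hash>,<parent-hash>",
    so one listing reveals how the revisions are pieced together
    without opening any files.
    """
    return [tuple(name.split(",", 1)) for name in names]

def heads(names):
    """Revisions that no listed directory names as a parent --
    the heads visible from a single listing."""
    pairs = edges_from_listing(names)
    parents = {parent for _, parent in pairs}
    return {child for child, _ in pairs if child not in parents}

listing = ["bbb,aaa", "ccc,bbb", "eee,aaa", "ddd,ccc"]
```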
This is not the complete hash, but it is 60 bits of it. To get an
accidental collision, we would have to have 2^30 different revisions
(about 1 billion) from the same parent revision. The only danger
of accidental collisions is that it would cause you to be unable to
commit or mirror. The signatures and full hashes are still checked,
so there is no danger from a malicious replacement.
I think that's a bit of an overstatement. It's true that an attacker
couldn't just drop a fake revision in to replace one that you had
signed. However, someone could disrupt the system by signing two
different revisions that share the same hash but have different
contents. Just something to consider as a corner case.
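The 2^30 figure is the standard birthday bound for a 60-bit space; a quick numerical check:

```python
import math

def birthday_collision_prob(n_items: int, hash_bits: int) -> float:
    """Approximate probability of any collision among n_items values
    drawn uniformly from a hash_bits-bit space, using the birthday
    approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    space = 2.0 ** hash_bits
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))

# With 60 bits, roughly 2^30 (~1 billion) sibling revisions are needed
# before a collision becomes likely, matching the estimate above.
```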
Within the revision subdirectory, we can have up to four files:
1) "rev.tgz"
This is a full copy of the project tree at this revision. This is
_only_ present for the first branch of a revision.
That sounds funny. My mind wants to hear "first revision of a branch".
One thing of note is that, since the log is completely separate from
the revision, it is easy to rewrite logs to correct misspellings etc.
The revision format sounds good, based on my little knowledge of such things.
I am considering adding a non-authoritative "index" file to every
directory except the revision directories. It would contain a single
hash of its subdirectories. So you would be able to read a single
short file to see if anything has changed in the repo. Updating the
"index" file atomically is tricky over remote connections. So if the
file is missing or corrupt, ArX will look for changes the
old-fashioned way.
That sounds like a good idea, although I think "index" is the wrong
word. It's more of a cached hash.
The "index" file is separate from a "listing" file, which will still
be maintained for systems that read over plain old http. It is
"listing" instead of ".listing" to get around some restrictive ftp
upload policies. Also, it will become a serialized list of
directories instead of the one-file-per-line format, so that you can
have special characters (e.g. carriage return, NULL) in revision
names.
I strongly prefer portable data formats over serialized C++ binary
stuff. I like the freedom to write tools in other languages to access
the data.
The revisions are numbered by their maximum distance from the root.
For example, a graph with numbering
      aaa (0)
     /       \
  bbb (1)   eee (1)
    |         |
  ccc (2)    /
     \      /
      ddd (3)
So to get ddd, you can type
arx get url,address@hidden
To get eee, you have to disambiguate it
arx get url,address@hidden
Oh. Wow. That's really odd. I mean, it probably makes sense, but for
those of us who aren't yet comfortable with histories that branch and
merge in weird ways, it looks odd. I guess the good news is that for
most of my projects, which only have a mainline and an occasional fork
that never merges back, I could work with pure sequential revision
numbers. [Which you also say later.]
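The numbering rule (maximum distance from the root) can be computed with a short traversal; this sketch reproduces the numbers in the example graph above:

```python
def distance_numbers(parents):
    """Number each revision by its maximum distance from the root.

    `parents` maps a revision to the list of revisions it descends
    from (empty for the root). A merge like ddd takes the longest of
    its parents' paths, plus one.
    """
    memo = {}
    def depth(rev):
        if rev not in memo:
            ps = parents[rev]
            memo[rev] = 0 if not ps else 1 + max(depth(p) for p in ps)
        return memo[rev]
    return {rev: depth(rev) for rev in parents}

graph = {"aaa": [], "bbb": ["aaa"], "eee": ["aaa"],
         "ccc": ["bbb"], "ddd": ["ccc", "eee"]}
# distance_numbers(graph) gives aaa=0, bbb=1, eee=1, ccc=2, ddd=3.
```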
It is fairly simple to go from there to using the branch==repo
paradigm that hg, bzr, darcs, etc. have. My thought right now is that
that paradigm is sufficiently different from the separate repo and
tree paradigm that I would want a different command for it.
I'm not quite sure what you're saying, but I think the repo == branches
paradigm of ArX is one of its strengths. Mercurial allows multiple heads
within a single branch/repo, but that's not as powerful. Bzr plans to
build ArX-like repos out of their branch/repos, but it's not yet clear
to me how clean that will end up.
Skip-deltas is probably the least firm part of the new format. I just
can't think of anything else that is going to work well.
You mentioned several drawbacks of skip-deltas. What are the big
benefits they bring, and what alternatives did you consider?
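Walter doesn't spell out the delta scheme here, but Subversion-style skip-deltas (delta against the revision obtained by clearing the lowest set bit of the revision number) are one plausible reading, and they yield the O(log N) access mentioned later. A sketch under that assumption:

```python
def skip_delta_base(rev: int) -> int:
    """Pick the base revision to delta against, Subversion-style:
    clear the lowest set bit of the revision number."""
    return rev & (rev - 1)

def delta_chain(rev: int):
    """All revisions combined to reconstruct `rev` from revision 0.

    The chain length is popcount(rev) + 1, which is at most about
    log2(N) + 1 -- hence O(log N) reconstruction of any revision.
    """
    chain = [rev]
    while rev > 0:
        rev = skip_delta_base(rev)
        chain.append(rev)
    return chain

# delta_chain(60000) has 8 entries (7 deltas plus revision 0),
# comfortably under the ~16 worst case quoted later for 60000 revisions.
```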
For the project tree-format, the only real difference is that I will
get rid of the patch logs in the tree, and instead just have a file
which contains all of the patches that have been applied to the tree.
The patchlogs take up way too much space and are duplicated from the
repo. The file would just be a serialized graph of the ancestry,
making it easy to read and write.
Ok, so "project tree" is what some folks call the "working tree". The
place where a user has checked out a specific working copy of the code.
Right? If you are storing less information in the project tree, then it
would follow that it would be more important to have fast access to the
repo itself...the repo probably shouldn't be on the other side of a slow
network connection. Or maybe that's already true with ArX 2?
* Reliable: Won't break if interrupted at arbitrary times.
-
The only bad things that can happen are that there are pending
revisions left in the repo, or index files are missing. Missing
index files merely slow down ArX, and the next commit to the repo
will fix it.
So the index file would be deleted before any updates are performed in a
directory, and re-created after the updates are complete. Ok.
* Repairable: If broken, it is easy to fix. This argues for storing
everything in multiple files with simple formats.
-
The patches are still the simple tarballs of patches that they were
before. You may now be able to ignore some corruption, because
skip-deltas won't need them to construct any new revisions.
I would argue that this would push for plaintext data files, rather than
C++ serialized files. They're not THAT much harder for the simple things
you seem to be doing.
* Fast annotate:
-
This requires going through all of the patch logs to find out which
ones affect a given file, then combining the appropriate patches to
get a real delta between versions. Not particularly fast, though it
does scale as O(Number of revisions). So long histories will
suffer.
It won't be anywhere near as fast as a format that uses weaves
storage.
* Fast merging, including perhaps strategies like fast weave merges:
-
Regular, 3-way merges should not be too bad, especially with
O(log(N)) access to any revision. Weave merge will not be
particularly fast, but perhaps fast enough for the rare cases when
you need it.
You might also have a look at GIT's "recursive" merging. Slower, but
supposedly fixes some cases that confuse a 3-way merge.
The bzr folks keep talking about "knits", which are some variant of
weaves. I think those are both part of a more generic strategy of doing
merges based on annotated lines, regardless of how those are stored.
* Fast access to any revision, in particular the latest revision, even
remotely:
-
O(log(N)) access to any revision, which is not bad. For 60000
revisions, that is about 16 patches.
* No need to download the entire history just to check out the latest
version (monotone, bzr, hg, and git are all bad in this respect).
-
Yes. You can also commit into the remote repository directly.
* Easy to specify any revision: darcs is bad in this respect, because
there is no universal number for every revision.
-
For ordinary, linear development, you can use the sequence number.
Once there is parallel development, you only need to use enough of
the hash's name to uniquify it.
* Fast diffs against any revision, in particular the latest revision:
-
The latest revision is fast, and other revisions are O(log(N)).
* Fast commits:
-
Not as fast as before, because of the need to compute a skip-delta.
Applying exact patches to a tree is actually quite fast, but it is
extra work. Testing will tell whether that is a problem.
* Fast imports:
-
Much better than before. It should be mostly equivalent to moving
all of the files and tarring them up.
* Fast repo verification (e.g. svn has the "svnadmin verify" command):
-
Nothing really planned here.
* No extra directories: (Subversion has .svn subdirectories in every
directory, most other systems have a single special directory at the
top, and svk has no special directories, instead keeping that
information elsewhere)
-
There is still a _arx directory at the top of every project tree.
The bzr folks recently debated using . or _ for this. The big advantage
of .arx would be that tools like grep wouldn't recurse into the arx
metadata. I believe they decided to default to .bzr on unix and _bzr on
MS Windows, but always to allow either one. That seems sane to me.
* Efficient storage of repo and project tree: (unpacked git is
terrible here. tla/baz/arx all store the complete patch logs of all
revisions in separate files in the project tree, bloating the space
requirements for projects with long histories)
-
The project tree is very efficient since we have gotten rid of the
patch logs. The repo is somewhat efficient, although it could be
better. It depends on the number of revisions.
I think another requirement should be that the native format is
sufficiently efficient. I view GIT's "packs" as perhaps its worst
feature, as it is user-hostile. Mercurial has "bundles" which also add a
lot of complexity that I dislike.
* Can move repo and project tree around in filesystem or between
machines with tar:
-
Yes
* Fast repo syncing for both CPU, latency, and bandwidth:
-
The "index" files allow you to figure out a null sync in just one
read of a file of 32 bytes. A non-null sync has to recurse down,
requiring two round trips (one for "index", one for the directory
listing, though I may be able to do the listing asynchronously) for
every level. Certainly faster than the current method, which has to
recurse down into every branch in the repo.
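The null-sync idea can be sketched briefly; the exact input to the index hash is an assumption on my part:

```python
import hashlib

def index_hash(subdir_hashes):
    """Single hash summarizing a directory's subdirectories -- the
    cached-hash "index" idea discussed above. Hashing the sorted child
    hashes (an assumed input format) makes the result order-independent."""
    h = hashlib.sha256()
    for child in sorted(subdir_hashes):
        h.update(child)
    return h.digest()  # 32 bytes: one short read decides a null sync

def is_null_sync(local_index, remote_index):
    """Null sync check: equal 32-byte index files mean nothing changed.
    A missing (None) index forces the old-fashioned recursive scan."""
    return local_index is not None and local_index == remote_index
```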
* Complete history by default, truncated history when desired:
-
If you don't have write permissions on the original repo, then you
must either "mirror" or "fork" to commit. In either case, you will
not need to contact the initial repo except for updates. It then
becomes a matter of educating users about the proper method to use.
However, in the future, it would not be hard to implement
branch==repo, which will give you complete history by default.
* Checksums on patches and revisions.
-
sha256 on the revision and patch.
* Signatures on patches and revisions:
-
The signature on the patch log covers the sha256 of the revision and
patch. Sha256 should be good for the next 50 years or so, barring
unforeseen developments. The same cannot be said for sha1.
As long as this is fast enough, I think it's a good choice.
* Convergence when merging, so repeated merges don't create repeated
commits:
-
Yes. Patches from all inputs to the merge are stored, so updates
from the branches bring you to the merged revision.
* push/pull over dumb protocols:
-
As before, ftp and webdav servers work for free. Plain http servers
must use update-listing.
I'm still happy that plain http will be supported. I'll still grumble
that update-listing is separate, but as long as ArX has an option to
automatically keep the listing files updated, it's ok.
* Distributed:
-
Using hashes to disambiguate revisions allows people to work in
parallel, and then pull in revisions from each other.
Does this also cover the desktop/laptop case? Perhaps in combination
with merge convergence that you mentioned earler?
* Easy branching (e.g. you don't have to come up with a new name every
time you want to make a branch):
-
Yes. A branch does not have to lock anything or use a different
name. It just uses a different hash.
* Cheap branching, even on systems without hardlinks or symlinks. A
FAT32 user should be able to create a new branch of a large project as
quickly and using only as much disk space as someone on an ext3 system.
Multiple branches on a web server should not consume excess space.
* handles collections of projects:
-
Yes, same as before with "arx tag".
I still question using that word for that feature, but that's a UI issue.
* No repo maintenance (no archive caches or git's packing, or even
make-archive (as in darcs and bzr)):
-
make-repo is still required, but archive caches are not.
* Handles a large number of revisions (~60000):
-
When updating, you know which group of 256 revisions you are in, so
you usually only need to list one directory. Basically you
have to list (Number of new revisions)/256 directories,
although you may have to list a directory with (Number of
total revisions)/256 entries. For ~60000 revisions, that is ~256
entries, making a total of about 1 KB to read.
When doing an initial get, you have to list all of the revision
directories. That is about 60000*32 bytes = 2 MB. That is probably a
small number compared to the size of the project tree. You also
have to get and apply about 16 patches, as well as the initial
rev.tgz.
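The arithmetic above checks out; a quick worked version (the per-entry byte sizes are assumptions: chunk-directory names like "59904" are a few bytes, while revision names are the ~32-byte hash-plus-parent strings):

```python
total_revs = 60000
chunk = 256

# Updating: list one outer directory of chunk names (~5 bytes each).
chunk_listing_bytes = (total_revs // chunk) * 5   # ~234 entries, ~1 KB

# Initial get: list every revision name (~32 bytes each).
initial_get_bytes = total_revs * 32               # ~1.9 MB, "about 2MB"
```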
* Handles a large number of branches(~100):
-
When updating, you may spuriously notice changes that happened in a
parallel branch. However, that will only result in an extra listing
of a revision directory, which we might do anyway to cut down on
latency. The directory listings will have about (Number of
branches)*256*32 bytes. For 100 branches (a pretty extreme
example), that would be about 800 KB.
* Handles projects which have a large number of directories and files in
the working tree.
* Works simply with small and simple projects.
* Human readable revision names:
-
The default sequence numbers are human readable. Hashes only come
in when there is a need to disambiguate.
* No need to name repos or projects: (darcs/bzr/hg/git is good, tla is
ultra bad)
-
Naming is optional
* Handles cherry picking, and makes it a merge when you have applied
all of the patches:
-
Yes
* Will be able to support quilt/bzr-shelve/mq functionality.
* Can disapprove patches:
-
No
* Can host on any filesystem including 8.3 systems. This includes
running a server on a filesystem that can not store files in the
repo. Using a single file database would be one solution, although
that causes other problems.
8.3 filesystems are not so common, so you might want to try to get
away with only 31 character, case-insensitive filenames, with a max
path length of 255 and max directory depth of 8.
Also, some ftp sites have restrictive policies about what kind of
files can be uploaded. From the comcast website:
NOTE: File names must consist of characters from "a-z", "A-Z",
"0-9", '_' (underscore), '.' (period), '-' (hyphen). No other
characters (including spaces) can be included in the file
name. File names must not start with '.' or '-'.
-
8.3 filesystems will not work because of the 30 character revision
names. Similarly, illegal characters in a branch name or overly
long branch names can cause problems. I considered url-encoding
branch names, but that will make non-ascii names much longer,
possibly causing problems with length.
An interesting note is that if two branches differ only in case,
they will end up stored in the same place on case-insensitive
filesystems. Because of the hashes, the ancestry will not get
confused. But running a simple "arx get branch" will notice that
there are two heads.
I don't think support for 8.3 names is important, so I think you've made
the right choice here.
It might be worth storing branch names in a table, rather than exposing
them as raw filenames. The bzr folks are discussing something similar at
the moment. I believe that if you burn a backup on one system, and
restore it on another system, it should just work. That means you can't
allow just any character, nor can you escape only the characters that
won't work on the particular file system you are writing to at the moment.
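For reference, the upload policy quoted earlier reduces to a one-line validator (a sketch, not ArX code):

```python
import re

# The quoted policy: only a-z, A-Z, 0-9, '_', '.', '-' are allowed,
# and the name must not start with '.' or '-'.
_PORTABLE = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]*$")

def is_portable_name(name: str) -> bool:
    """True if `name` satisfies the quoted ftp upload policy."""
    return bool(_PORTABLE.match(name))
```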
* Easy to backup. DB's have their own backup scripts, but being able
to use rsync and having it do the right thing is awfully nice:
-
Yes
* Works with write-once media. (No one really has this, although
tla/baz/arx and subversion (with fsfs) could be modified to do so.
We just need a place to put the lock files.)
-
No
I thought GIT had this. I don't think supporting write-once media is a
critical feature. Requiring only append access could help in certain
high-security cases.
* Remote, multi-user, auditable, restricted repos like what CVS and
subversion offers. Then there is only one person who needs to
manage the repo, and a random user can't delete or modify old
revisions. Doing this without a smart server is painful.
-
No, but it could be added later with a smart server.
* Lightweight branches
-
Microbranches are as light as they can be, since they only have a
small patch file. However, that requires you to have the rest of
the branches' revisions.
Branches without history are not as lightweight as they used to be,
because you always have a "rev.tgz" file. However, the size of
"rev.tgz" is usually much smaller than the size of a project tree,
so it won't actually be that bad. My WAG is a 30% addition over the
size of a project tree.
Terminology check...when you say "lightweight branch", you mean a branch
that doesn't contain full history. As opposed to having multiple heads
within the same branch, as in hg, which can also be called a lightweight
branch. I still prefer the term "distributed branch" for the "no
history" case.
* Able to remove revisions, even after they have had child revisions:
-
Yes, although you will have to remove the children first. You no
longer get into the situation where you have two revisions with the
same name.
* Reasonable support for archiving large binary files
* All user data (filenames, commit comments, branch names) can be UTF-8
* Store all dates in ISO UTC format, and otherwise keep all data as
locale-independent as possible
The following issues aren't necessarily tied to the repo format, but are
valuable features in an SCM tool:
* Repo and/or branch "nicknames" or "aliases"
* Facilities to mitigate newline conversions when a project is shared by
people using different workstation OS's.
* Support for plugins (see bzr and hg), because it makes it far easier
for non-core developers to experiment with cool stuff, and to prototype
potential new features before adding them to the core.
Comments?
I like the direction you're heading. It will be interesting to see how
ArX will fit in to the SCM landscape that has changed so dramatically
since the ArX project was originally started.
Kevin