From: Kevin Smith
Subject: Re: [Arx-users] The Future (long)
Date: Wed, 07 Dec 2005 10:59:12 -0500
User-agent: Mozilla Thunderbird 1.0.7 (X11/20051011)

Walter Landry wrote:

I have opened up a completely new, incompatible branch of ArX

Cool. My reply is long, too!

First of all, I have changed over from the term "archive" to
"repository".

Good.

I have also converted ArX to only use url's (see [1]).  So repos no
longer have names.

Great.

So, unless someone objects, I
would like to change the "#" to "," and the "," to "@".  So it would be

  url,address@hidden

Looks great.

At the root of the repo, there are two files: "keys" which has the
public gpg keys, and README which tells you the format of the repo and
not to touch anything :)

Smart.

For every branch, there is a directory with the same name as the
branch, but with a period "." appended.  The period "." makes it easy
to distinguish branch names (which can be almost anything) from
everything else.

This might cause problems with certain tools on MS Windows (where "empty" extensions are unusual). Otherwise, seems reasonable.
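
To make sure I'm reading this right, the top of a repo would look
something like this ("my.project" is an invented branch name, not part
of the proposal):

  repo/
    keys          <- public gpg keys
    README        <- format description and "hands off" notice
    my.project./  <- branch directory: branch name plus trailing period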

Also within (sub)*branches, there can be revisions.  They are
divided in chunks of 256.  So revisions 0-255 are in directory 0,
256-511 are in directory 256, etc.

These are sequential revisions to a particular repo, right? [Oh, this is clarified later, I think.]
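
For anyone skimming: the chunk directory is just the revision number
rounded down to a multiple of 256. In Python terms (my sketch, not ArX
code):

  def chunk_dir(rev):
      # Groups of 256 revisions, each directory named after the
      # first revision in its group: 0, 256, 512, ...
      return (rev // 256) * 256

  chunk_dir(300)   # -> 256, so revision 300 lives under "256/"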

Within each revisions directory, there are directories for each
revision.  Each revision is named by its sha256 and its parent.  The
idea is that a simple directory listing can give us all of the
revisions and how they are pieced together.

It seems odd to have the outer layer be sequential but the inner layer be hashed. I would have expected either a git-like approach, where all the revision data was stored hashed, with an external index file that provides the ordering, OR to have the outer directories as you said here, and then have the inner directories named 1, 2, 3, 4, etc.

This is not the complete hash, but it is 60 bits of it.  To get an
accidental collision, we would have to have 2^30 different revisions
(about 1 billion) from the same parent revision.  The only danger
of accidental collisions is that it would cause you to be unable to
commit or mirror.  The signatures and full hashes are still checked,
so there is no danger from a malicious replacement.

I think that's a bit of an overstatement. It's true that an attacker couldn't just drop in a fake revision to replace one that you had signed. However, someone could disrupt the system by signing two different revisions that share the same truncated hash but have different contents. Just something to consider as a corner case.
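
To spell out the arithmetic (my own back-of-envelope; the input below
is a stand-in for whatever ArX actually hashes):

  import hashlib

  revision_bytes = b"..."          # stand-in for the hashed revision data
  digest = hashlib.sha256(revision_bytes).hexdigest()
  name = digest[:15]               # 15 hex digits = 60 bits of the hash
  # Birthday bound: ~50% odds of an accidental collision requires about
  # sqrt(2**60) = 2**30 (roughly one billion) revisions from one parent.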

Within the revision subdirectory, we can have up to four files:

  1) "rev.tgz"

    This is a full copy of the project tree at this revision.  This is
_only_ present for the first branch of a revision.

That sounds funny. My mind wants to hear "first revision of a branch".

One thing of note is that, since the log is completely separate from
the revision, it is easy to rewrite logs to correct misspellings etc.

The revision format sounds good, based on my little knowledge of such things.

I am considering adding a non-authoritative "index" file to every
directory except the revision directories.  It would contain a single
hash of its subdirectories.  So you would be able to read a single
short file to see if anything has changed in the repo.  Updating the
"index" file atomically is tricky over remote connections.  So if the
file is missing or corrupt, ArX will look for changes the
old-fashioned way.

That sounds like a good idea, although I think "index" is the wrong word. It's more of a cached hash.
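
Whatever it ends up being called, the idea seems easy to express. A
minimal sketch, assuming each directory's hash folds in its
subdirectories' hashes so that a change anywhere below bubbles up
(Walter may have something simpler in mind):

  import hashlib, os

  def index_hash(path):
      h = hashlib.sha256()
      for name in sorted(os.listdir(path)):
          child = os.path.join(path, name)
          if os.path.isdir(child):
              # Fold in the child's hash, Merkle-style.
              h.update(index_hash(child).encode("ascii"))
          h.update(name.encode("utf-8") + b"\0")
      return h.hexdigest()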

The "index" file is separate from a "listing" file, which will still
be maintained for systems that read over plain old http.  It is
"listing" instead of ".listing" to get around some restrictive ftp
upload policies.  Also, it will become a serialized list of
directories instead of the one-file-per-line format, so that you can
have special characters (e.g. carriage return, NULL) in revision
names.

I strongly prefer portable data formats over serialized C++ binary stuff. I like the freedom to write tools in other languages to access the data.
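
For instance, netstring-style length prefixes would allow arbitrary
bytes (carriage return, NULL) in names while staying trivial to parse
from any language. Just an illustration of what I mean by portable:

  def write_listing(names, f):
      # Netstring framing: "5:hello," -- length, colon, raw bytes, comma.
      for name in names:
          data = name.encode("utf-8")
          f.write(str(len(data)).encode("ascii") + b":" + data + b",")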

The revisions are numbered by their maximum distance from the root.
For example, a graph with numbering

      aaa      (0)
     /   \
   bbb    \    (1)
    |     eee  (1)
   ccc    /    (2)
     \   /
      ddd      (3)

So to get ddd, you can type

  arx get url,address@hidden

To get eee, you have to disambiguate it

  arx get url,address@hidden

Oh. Wow. That's really odd. I mean, it probably makes sense, but for those of us who aren't yet comfortable with histories that branch and merge in weird ways, it looks odd. I guess the good news is that for most of my projects, which only have a mainline and an occasional fork that never merges back, I could work with pure sequential revision numbers. [Which you also say later.]
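
The rule itself is simple to state: a revision's number is one more
than the largest number among its parents. A sketch with made-up data
(not ArX's actual representation):

  def number_revisions(parents):
      # parents maps each revision to the list of its parent revisions.
      memo = {}
      def depth(rev):
          if rev not in memo:
              ps = parents[rev]
              memo[rev] = 0 if not ps else 1 + max(depth(p) for p in ps)
          return memo[rev]
      return {rev: depth(rev) for rev in parents}

  graph = {"aaa": [], "bbb": ["aaa"], "eee": ["aaa"],
           "ccc": ["bbb"], "ddd": ["ccc", "eee"]}
  number_revisions(graph)
  # -> {'aaa': 0, 'bbb': 1, 'eee': 1, 'ccc': 2, 'ddd': 3}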

It is fairly simple to go from there to using the branch==repo
paradigm that hg, bzr, darcs, etc. have.  My thought right now is that
that paradigm is sufficiently different from the separate repo and
tree paradigm that I would want a different command for it.

I'm not quite sure what you're saying, but I think the repo == branches paradigm of ArX is one of its strengths. Mercurial allows multiple heads within a single branch/repo, but that's not as powerful. Bzr plans to build ArX-like repos out of their branch/repos, but it's not yet clear to me how clean that will end up.

Skip-deltas is probably the least firm part of the new format.  I just
can't think of anything else that is going to work well.

You mentioned several drawbacks of skip-deltas. What are the big benefits they bring, and what alternatives did you consider?
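
For context, my (possibly mistaken) understanding of the usual
skip-delta trick, as in subversion: each revision stores a delta
against the revision whose number is its own with the lowest set bit
cleared, so reconstructing any revision takes at most about log2(N)
patch applications:

  def skip_delta_chain(rev):
      # Revisions whose deltas are needed to reconstruct `rev`,
      # assuming each delta's base is rev with its lowest bit cleared.
      chain = []
      while rev:
          chain.append(rev)
          rev &= rev - 1
      return chain

  len(skip_delta_chain(60000))   # -> 7; never more than ~16 for N=60000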

For the project tree-format, the only real difference is that I will
get rid of the patch logs in the tree, and instead just have a file
which contains all of the patches that have been applied to the tree.
The patchlogs take up way too much space and are duplicated from the
repo.  The file would just be a serialized graph of the ancestry,
making it easy to read and write.

Ok, so "project tree" is what some folks call the "working tree". The place where a user has checked out a specific working copy of the code. Right? If you are storing less information in the project tree, then it would follow that it would be more important to have fast access to the repo itself...the repo probably shouldn't be on the other side of a slow network connection. Or maybe that's already true with ArX 2?

* Reliable: Won't break if interrupted at arbitrary times.
  -
  The only bad things that can happen are that there are pending
  revisions left in the repo, or index files are missing.  Missing
  index files merely slow down ArX, and the next commit to the repo
  will fix it.

So the index file would be deleted before any updates are performed in a directory, and re-created after the updates are complete. Ok.
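
Locally, that dance could look like the following (a sketch under my
assumptions; the tricky part Walter mentions is doing this atomically
over remote connections):

  import os, tempfile

  def update_directory(path, apply_updates, index_hash):
      index = os.path.join(path, "index")
      if os.path.exists(index):
          os.remove(index)         # readers now fall back to the slow path
      apply_updates(path)          # a crash here only loses the index
      digest = index_hash(path)    # e.g. the hash sketched earlier
      fd, tmp = tempfile.mkstemp(dir=path)
      with os.fdopen(fd, "w") as f:
          f.write(digest)
      os.replace(tmp, index)       # atomic rename on POSIX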

* Repairable: If broken, it is easy to fix.  This argues for storing
  everything in multiple files with simple formats.
  -
  The patches are still the simple tarballs of patches that they were
  before.  You may now be able to ignore some corruption, because
  skip-deltas won't need them to construct any new revisions.

I would argue that this would push for plaintext data files, rather than C++ serialized files. They're not THAT much harder for the simple things you seem to be doing.

* Fast annotate:
  -
  This requires going through all of the patch logs to find out which
  ones affect a given file, then combining the appropriate patches to
  get a real delta between versions.  Not particularly fast, though it
  does scale as O(Number of revisions).  So long histories will
  suffer.

  It won't be anywhere near as fast as a format that uses weave
  storage.

* Fast merging, including perhaps strategies like fast weave merges:
  -
  Regular, 3-way merges should not be too bad, especially with
  O(log(N)) access to any revision.  Weave merge will not be
  particularly fast, but perhaps fast enough for the rare cases when
  you need it.

You might also have a look at GIT's "recursive" merging. Slower, but supposedly fixes some cases that confuse a 3-way merge.

The bzr folks keep talking about "knits", which are some variant of weaves. I think those are both part of a more generic strategy of doing merges based on annotated lines, regardless of how those are stored.

* Fast access to any revision, in particular the latest revision, even
  remotely:
  -
  O(log(N)) access to any revision, which is not bad.  For 60000
  revisions, that is about 16 patches.

* No need to download the entire history just to check out the latest
  version (monotone, bzr, hg, and git are all bad in this respect).
  -
  Yes.  You can also commit into the remote repository directly.

* Easy to specify any revision: darcs is bad in this respect, because
  there is no universal number for every revision.
  -
  For ordinary, linear development, you can use the sequence number.
  Once there is parallel development, you only need to use enough of
  the hash's name to uniquify it.
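
That disambiguation rule sounds like the usual hash-prefix trick;
something like this (illustrative only):

  def shortest_unique_prefix(target, all_names):
      # Grow the prefix until it matches exactly one revision name.
      for n in range(1, len(target) + 1):
          prefix = target[:n]
          if sum(name.startswith(prefix) for name in all_names) == 1:
              return prefix
      return target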

* Fast diffs against any revision, in particular the latest revision:
  -
  The latest revision is fast, and other revisions are O(log(N)).

* Fast commits:
  -
  Not as fast as before, because of the need to compute a skip-delta.
  Applying exact patches to a tree is actually quite fast, but it is
  extra work.  Testing will tell whether that is a problem.

* Fast imports:
  -
  Much better than before.  It should be mostly equivalent to moving
  all of the files and tarring them up.

* Fast repo verification (e.g. svn has the "svnadmin verify" command):
  -
  Nothing really planned here.

* No extra directories: (Subversion has .svn subdirectories in every
  directory, most other systems have a single special directory at the
  top, and svk has no special directories, instead keeping that
  information elsewhere)
  -
  There is still a _arx directory at the top of every project tree.

The bzr folks recently debated using . or _ for this. The big advantage of .arx would be that tools like grep wouldn't recurse into the arx metadata. I believe they decided to default to .bzr on unix and _bzr on MS Windows, but always to allow either one. That seems sane to me.

* Efficient storage of repo and project tree: (unpacked git is
  terrible here.  tla/baz/arx all store the complete patch logs of all
  revisions in separate files in the project tree, bloating the space
  requirements for projects with long histories)
  -
  The project tree is very efficient since we have gotten rid of the
  patch logs.  The repo is somewhat efficient, although it could be
  better.  It depends on the number of revisions.

I think another requirement should be that the native format is sufficiently efficient. I view GIT's "packs" as perhaps its worst feature, as it is user-hostile. Mercurial has "bundles" which also add a lot of complexity that I dislike.

* Can move repo and project tree around in filesystem or between
  machines with tar:
  -
  Yes

* Fast repo syncing for both CPU, latency, and bandwidth:
  -
  The "index" files allow you to figure out a null sync in just one
  read of a file of 32 bytes.  A non-null sync has to recurse down,
  requiring two round trips (one for "index", one for the directory
  listing, though I may be able to do the listing asynchronously) for
  every level.  Certainly faster than the current method, which has to
  recurse down into every branch in the repo.

* Complete history by default, truncated history when desired:
  -
  If you don't have write permissions on the original repo, then you
  must either "mirror" or "fork" to commit.  In either case, you will
  not need to contact the initial repo except for updates.  It then
  becomes a matter of educating users about the proper method to use.

  However, in the future, it would not be hard to implement
  branch==repo, which will give you complete history by default.

* Checksums on patches and revisions.
  -
  sha256 on the revision and patch.

* Signatures on patches and revisions:
  -
  The signature on the patch log covers the sha256 of the revision and
  patch.  Sha256 should be good for the next 50 years or so, barring
  unforeseen developments.  The same cannot be said for sha1.

As long as this is fast enough, I think it's a good choice.

* Convergence when merging, so repeated merges don't create repeated
  commits:
  -
  Yes.  Patches from all inputs to the merge are stored, so updates
  from the branches bring you to the merged revision.

* push/pull over dumb protocols:
  -
  As before, ftp and webdav servers work for free.  Plain http servers
  must use update-listing.

I'm still happy that plain http will be supported. I'll still grumble that update-listing is separate, but as long as ArX has an option to automatically keep the listing files updated, it's ok.

* Distributed:
  -
  Using hashes to disambiguate revisions allows people to work in
  parallel, and then pull in revisions from each other.

Does this also cover the desktop/laptop case? Perhaps in combination with the merge convergence you mentioned earlier?

* Easy branching (e.g. you don't have to come up with a new name every
  time you want to make a branch):
  -
  Yes.  A branch does not have to lock anything or use a different
  name.  It just uses a different hash.

* Cheap branching, even on systems without hardlinks or symlinks. A FAT32 user should be able to create a new branch of a large project as quickly and using only as much disk space as someone on an ext3 system. Multiple branches on a web server should not consume excess space.

* handles collections of projects:
  -
  Yes, same as before with "arx tag".

I still question using that word for that feature, but that's a UI issue.

* No repo maintenance (no archive caches or git's packing, or even
  make-archive (as in darcs and bzr)):
  -
  make-repo is still required, but archive caches are not.

* Handles a large number of revisions (~60000):
  -
  When updating, you know which group of 256 revisions you are in, so
  you usually only need to list one directory.  Basically you
  have to list (Number of new revisions)/256 directories,
  although you may have to list a directory with (Number of
  total revisions)/256 entries.  For ~60000 revisions, that is ~256
  entries, making a total of about 1 KB to read.
  When doing an initial get, you have to list all of the revision
  directories.  That is about 60000*32 bytes=2MB.  That is probably a
  small number compared to the size of the project tree.  You also
  have to get and apply about 16 patches, as well as the initial
  rev.tgz.

* Handles a large number of branches(~100):
  -
  When updating, you may spuriously notice changes that happened in a
  parallel branch.  However, that will only result in an extra listing
  of a revision directory, which we might do anyway to cut down on
  latency.  The directory listings will have about (Number of
  branches)*256*32 bytes.  For 100 branches (a pretty extreme
  example), that would be about 800 KB.

* Handles projects which have a large number of directories and files in the working tree.

* Works simply with small and simple projects.

* Human readable revision names:
  -
  The default sequence numbers are human readable.  Hashes only come
  in when there is a need to disambiguate.

* No need to name repos or projects: (darcs/bzr/hg/git is good, tla is
  ultra bad)
  -
  Naming is optional

* Handles cherry picking, and makes it a merge when you have applied
  all of the patches:
  -
  Yes

* Will be able to support quilt/bzr-shelve/mq functionality.

* Can disapprove patches:
  -
  No

* Can host on any filesystem including 8.3 systems.  This includes
  running a server on a filesystem that can not store files in the
  repo.  Using a single file database would be one solution, although
  that causes other problems.

  8.3 filesystems are not so common, so you might want to try to get
  away with only 31 character, case-insensitive filenames, with a max
  path length of 255 and max directory depth of 8.

  Also, some ftp sites have restrictive policies about what kind of
  files can be uploaded.  From the comcast website:

   NOTE: File names must consist of characters from "a-z", "A-Z",
   "0-9", '_' (underscore), '.' (period), '-' (hyphen). No other
   characters (including spaces) can be included in the file
   name. File names must not start with '.' or '-'.

  -
  8.3 filesystems will not work because of the 30 character revision
  names.  Similarly, illegal characters in a branch name or overly
  long branch names can cause problems.  I considered url-encoding
  branch names, but that will make non-ascii names much longer,
  possibly causing problems with length.

  An interesting note is that if two branches differ only in case,
  they will end up stored in the same place on case-insensitive
  filesystems.  Because of the hashes, the ancestry will not get
  confused.  But running a simple "arx get branch" will notice that
  there are two heads.

I don't think support for 8.3 names is important, so I think you've made the right choice here.

It might be worth storing branch names in a table, rather than exposing them as raw filenames. The bzr folks are discussing something similar at the moment. I believe that if you burn a backup on one system, and restore it on another system, it should just work. That means you can't allow just any character, nor can you escape only the characters that won't work on the particular file system you are writing to at the moment.

* Easy to backup. DB's have their own backup scripts, but being able
  to use rsync and having it do the right thing is awfully nice:
  -
  Yes

* Works with write-once media. (No one really has this, although
  tla/baz/arx and subversion (with fsfs) could be modified to do so.
  We just need a place to put the lock files.)
  -
  No

I thought GIT had this. I don't think supporting write-once media is a critical feature. Requiring only append access could help in certain high-security cases.

* Remote, multi-user, auditable, restricted repos like what CVS and
  subversion offers. Then there is only one person who needs to
  manage the repo, and a random user can't delete or modify old
  revisions.  Doing this without a smart server is painful.
  -
  No, but it could be added later with a smart server.

* Lightweight branches
  -
  Microbranches are as light as they can be, since they only have a
  small patch file.  However, that requires you to have the rest of
  the branches' revisions.

  Branches without history are not as lightweight as they used to be,
  because you always have a "rev.tgz" file.  However, the size of
  "rev.tgz" is usually much smaller than the size of a project tree,
  so it won't actually be that bad.  My WAG is a 30% addition over the
  size of a project tree.

Terminology check...when you say "lightweight branch", you mean a branch that doesn't contain full history. As opposed to having multiple heads within the same branch, as in hg, which can also be called a lightweight branch. I still prefer the term "distributed branch" for the "no history" case.

* Able to remove revisions, even after they have had child revisions:
  -
  yes, although you will have to remove the children first.  You no
  longer get into the situation where you have two revisions with the
  same name.

* Reasonable support for archiving large binary files

* All user data (filenames, commit comments, branch names) can be UTF-8

* Store all dates in ISO UTC format, and otherwise keep all data as locale-independent as possible

The following issues aren't necessarily tied to the repo format, but are valuable features in an SCM tool:

* Repo and/or branch "nicknames" or "aliases"

* Facilities to mitigate newline conversions when a project is shared by people using different workstation OS's.

* Support for plugins (see bzr and hg), because it makes it far easier for non-core developers to experiment with cool stuff, and to prototype potential new features before adding them to the core.

Comments?

I like the direction you're heading. It will be interesting to see how ArX will fit into the SCM landscape that has changed so dramatically since the ArX project was originally started.

Kevin



