Re: [Arx-users] The Future (long)


From: Walter Landry
Subject: Re: [Arx-users] The Future (long)
Date: Wed, 07 Dec 2005 16:44:23 -0800 (PST)

Kevin Smith <address@hidden> wrote:
> Walter Landry wrote:
> > For every branch, there is a directory with the same name as the
> > branch, but with a period "." appended.  The period "." makes it easy
> > to distinguish branch names (which can be almost anything) from
> > everything else.
> 
> This might cause problems with certain tools on MS Windows (where 
> "empty" extensions are unusual). Otherwise, seems reasonable.

What kind of trouble?  I can add an extension easily enough (.arx?
.bra? .brc?).

> > Also within (sub)*branches, there can be revisions.  They are
> > divided in chunks of 256.  So revisions 0-255 are in directory 0,
> > 256-511 are in directory 256, etc.
> 
> These are sequential revisions to a particular repo, right? [Oh, this is 
> clarified later, I think.]

For a given branch in a repo.
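
A minimal sketch of the chunking rule quoted above, assuming the chunk
directory is simply named after the first revision number it holds. The
function name is made up for illustration and is not part of ArX.

```python
CHUNK = 256  # chunk size stated in the message


def chunk_directory(revision: int) -> int:
    """Return the name of the chunk directory that holds `revision`."""
    if revision < 0:
        raise ValueError("revision numbers start at 0")
    # Round down to the nearest multiple of the chunk size.
    return (revision // CHUNK) * CHUNK


for rev in (0, 255, 256, 511):
    print(rev, "->", chunk_directory(rev))
```

So revisions 0-255 map to directory 0 and 256-511 map to directory 256,
matching the description.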

> > Within each revisions directory, there are directories for each
> > revision.  Each revision is named by its sha256 and its parent.  The
> > idea is that a simple directory listing can give us all of the
> > revisions and how they are pieced together.
> 
> It seems odd to have the outer layer be sequential but the inner layer 
> be hashed. I would have expected either a git-like approach, where all 
> the revision data was stored hashed, with an external index file that 
> provides the ordering, OR to have the outer directories as you said 
> here, and then have the inner directories named 1, 2, 3, 4, etc.

The idea is to group related stuff together.  So you put related
revisions together, but still segregate widely separated revisions for
performance.

> > This is not the complete hash, but it is 60 bits of it.  To get an
> > accidental collision, we would have to have 2^30 different revisions
> > (about 1 billion) from the same parent revision.  The only danger
> > of accidental collisions is that it would cause you to be unable to
> > commit or mirror.  The signatures and full hashes are still checked,
> > so there is no danger from a malicious replacement.
> 
> I think that's a bit of an overstatement. It's true that an attacker 
> couldn't just drop a fake revision in to replace one that you had 
> signed. However, someone could disrupt the system by signing two 
> different revisions that share the same hash but have different 
> contents. Just something to consider as a corner case.

Could you be more specific?  I don't see how what you are describing
is different from just making a directory with the same 60 bit name
and putting junk in it.  Yes, it is disruptive, but allowing people to
modify the repository opens you up to that kind of thing.  In either
case, whatever you get won't be signed or won't validate to the
correct 256 bit hash.

Also, when you say "share the same hash", I presume you are talking
about the first 60 bits, not the entire 256 bits.  It is infeasible
to create different files with the same full 256-bit hash.
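
The 2^30 figure for accidental collisions follows from the usual
birthday approximation.  A rough sketch of the arithmetic, assuming the
60-bit prefixes behave like uniformly random values (the function name
is only for illustration):

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Birthday approximation: p = 1 - exp(-n^2 / (2 * 2^bits))."""
    space = 2 ** bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-(n_items ** 2) / (2 * space))


# About 2^30 siblings of one parent are needed before a collision
# becomes likely (roughly 0.39 at exactly 2^30).
print(collision_probability(2 ** 30, 60))

# For a more realistic 10,000 siblings the chance is negligible.
print(collision_probability(10_000, 60))
```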

> > Within the revision subdirectory, we can have up to four files:
> > 
> >   1) "rev.tgz"
> > 
> >     This is a full copy of the project tree at this revision.  This is
> >     _only_ present for the first branch of a revision.  
> 
> That sounds funny. My mind wants to hear "first revision of a branch".

You're right.

> > I am considering adding a non-authoritative "index" file to every
> > directory except the revision directories.  It would contain a single
> > hash of its subdirectories.  So you would be able to read a single
> > short file to see if anything has changed in the repo.  Updating the
> > "index" file atomically is tricky over remote connections.  So if the
> > file is missing or corrupt, ArX will look for changes the
> > old-fashioned way.
> 
> That sounds like a good idea, although I think "index" is the wrong 
> word. It's more of a cached hash.

How about "dirhash"?

> > The "index" file is separate from a "listing" file, which will still
> > be maintained for systems that read over plain old http.  It is
> > "listing" instead of ".listing" to get around some restrictive ftp
> > upload policies.  Also, it will become a serialized list of
> > directories instead of the one-file-per-line format, so that you can
> > have special characters (e.g. carriage return, NULL) in revision
> > names.
> 
> I strongly prefer portable data formats over serialized C++ binary 
> stuff. I like the freedom to write tools in other languages to access 
> the data.

I think you are overestimating the complexity of serialization.  If I
recall correctly, for a list of strings, the serialization library
would write a header, the length of the list as an ascii string
(e.g. "12"), and then the elements of the list.  Each element is again
a length and then the string itself.  You are not going to get any
simpler than that and cover all of the corner cases with embedded
nulls etc.  So a list with the elements "crate" and "barrel" would be
serialized as

  22 serialization::archive 2 5 crate 6 barrel

The serialization format is not complicated.  What you are probably
complaining about is that the _arx/++manifest file has some binary
elements.  Those are sha256's of files, and I put them in that format
for efficiency (though it may be premature optimization).
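
To show how little is needed to read such a file from another
language, here is a sketch of a reader and writer for the
list-of-strings layout described above.  It follows only the simplified
description in this message (header, element count, then
length-prefixed elements); real Boost text archives may carry
additional fields such as version numbers, so treat this as an
illustration rather than a compatible parser.  The function names are
made up.

```python
HEADER = b"22 serialization::archive"  # header as written in the example above


def dump_strings(items: list[bytes]) -> bytes:
    """Write a header, the element count, then length-prefixed elements."""
    parts = [HEADER, str(len(items)).encode("ascii")]
    for item in items:
        parts.append(str(len(item)).encode("ascii"))
        parts.append(item)
    return b" ".join(parts)


def load_strings(data: bytes) -> list[bytes]:
    """Inverse of dump_strings; lengths make embedded NULs and spaces safe."""
    if not data.startswith(HEADER + b" "):
        raise ValueError("unexpected header")
    pos = len(HEADER) + 1

    def read_int() -> int:
        nonlocal pos
        end = data.find(b" ", pos)
        if end == -1:          # number is the last token in the data
            end = len(data)
        value = int(data[pos:end])
        pos = end + 1
        return value

    items = []
    for _ in range(read_int()):
        length = read_int()
        items.append(data[pos:pos + length])
        pos += length + 1      # skip the element and the separating space
    return items


print(dump_strings([b"crate", b"barrel"]))
print(load_strings(dump_strings([b"a\x00b", b"two words"])))
```

Because every element is read by length rather than by delimiter,
names containing carriage returns or NUL bytes round-trip unchanged.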

> > The revisions are numbered by their maximum distance from the root.
> > For example, a graph with numbering
> > 
> >      aaa   (0)
> >   /       \
> >  |         |
> >  |         |
> > bbb (1)    |
> >  |        eee (1)
> >  |         /
> > ccc (2)   /
> >   \      /
> >     ddd     (3)
> > 
> > So to get ddd, you can type
> > 
> >   arx get url,address@hidden
> > 
> > To get eee, you have to disambiguate it
> > 
> >   arx get url,address@hidden
> 
> Oh. Wow. That's really odd. I mean, it probably makes sense, but for 
> those of us who aren't yet comfortable with histories that branch and 
> merge in weird ways, it looks odd. I guess the good news is that for 
> most of my projects, which only have a mainline and an occasional fork 
> that never merges back, I could work with pure sequential revision 
> numbers. [Which you also say later.]

That is exactly the idea.  It is simple for simple projects, but can
handle more complicated projects.
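
The numbering rule (maximum distance from the root) can be sketched
for the example graph as follows.  The parent table is transcribed from
the diagram above; the function and variable names are only
illustrative.

```python
from collections import defaultdict
from functools import lru_cache

# Parent lists for the example graph (child -> parents).
PARENTS = {
    "aaa": [],
    "bbb": ["aaa"],
    "eee": ["aaa"],
    "ccc": ["bbb"],
    "ddd": ["ccc", "eee"],
}


@lru_cache(maxsize=None)
def revision_number(rev: str) -> int:
    """Length of the longest path from the root to `rev`."""
    parents = PARENTS[rev]
    if not parents:
        return 0
    return 1 + max(revision_number(p) for p in parents)


by_number = defaultdict(list)
for rev in PARENTS:
    by_number[revision_number(rev)].append(rev)

for number, revs in sorted(by_number.items()):
    note = "  (ambiguous, needs a hash to pick one)" if len(revs) > 1 else ""
    print(number, revs, note)
```

Number 1 is shared by bbb and eee, which is why eee has to be
disambiguated, while ddd is the only revision numbered 3.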

> > It is fairly simple to go from there to using the branch==repo
> > paradigm that hg, bzr, darcs, etc. have.  My thought right now is that
> > that paradigm is sufficiently different from the separate repo and
> > tree paradigm that I would want a different command for it.  
> 
> I'm not quite sure what you're saying, but I think the repo == branches 
> paradigm of ArX is one of its strengths.

I assume you mean branch!=repo here?  In any case, I am just saying
that, for those who prefer branch==repo, it would be simple to create
a tool to cater to them.  Everyone would use the same master repo.

> Mercurial allows multiple heads within a single branch/repo, but
> that's not as powerful. Bzr plans to build ArX-like repos out of
> their branch/repos, but it's not yet clear to me how clean that will
> end up.
> 
> > Skip-deltas is probably the least firm part of the new format.  I just
> > can't think of anything else that is going to work well.
> 
> You mentioned several drawbacks of skip-deltas. What are the big 
> benefits they bring, and what alternatives did you consider?

It only takes O(log(Number of revisions)) patches to get a particular
revision.  So revision 63222 takes about 16 patches.  Currently, it
would take 63222 patches.  ArX gets around this somewhat with repo
caches.  But that requires repo maintenance, which I really want to
get rid of.  Even I don't update cached revisions as much as I should.
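
A sketch of where the logarithmic bound comes from.  This message does
not spell out the exact skip-delta rule, so the code below assumes one
common choice, of the kind used by Subversion's skip-deltas: each
revision is stored as a delta against the revision obtained by clearing
its lowest set bit.  The function name is made up.

```python
def skip_delta_chain(revision: int) -> list[int]:
    """Revisions visited when rebuilding `revision`, newest first."""
    chain = [revision]
    while revision > 0:
        revision &= revision - 1   # clear the lowest set bit (assumed rule)
        chain.append(revision)
    return chain


chain = skip_delta_chain(63222)
print(chain)
print(len(chain) - 1, "patches; upper bound", (63222).bit_length())
```

Under this rule the number of patches equals the number of set bits in
the revision number, so it can never exceed the bit length (16 for
63222).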

> > For the project tree-format, the only real difference is that I will
> > get rid of the patch logs in the tree, and instead just have a file
> > which contains all of the patches that have been applied to the tree.
> > The patchlogs take up way too much space and are duplicated from the
> > repo.  The file would just be a serialized graph of the ancestry,
> > making it easy to read and write.
> 
> Ok, so "project tree" is what some folks call the "working tree". The 
> place where a user has checked out a specific working copy of the code. 
> Right?

Right.

> If you are storing less information in the project tree, then it 
> would follow that it would be more important to have fast access to the 
> repo itself...the repo probably shouldn't be on the other side of a slow 
> network connection. Or maybe that's already true with ArX 2?

This will only affect "arx log", because now it will have to go to the
repo to get the information.  You can still do local diffs, and you
still need to hit the repo to commit or annotate.  It turns out that
this is the same set of operations as supported by Subversion.  Except
that with ArX you have the option of creating a no-history branch.

> > * Reliable: Won't break if interrupted at arbitrary times.
> >   -
> >   The only bad things that can happen are that there are pending
> >   revisions left in the repo, or index files are missing.  Missing
> >   index files merely slow down ArX, and the next commit to the repo
> >   will fix it.
> 
> So the index file would be deleted before any updates are performed in a 
> directory, and re-created after the updates are complete. Ok.

I wasn't quite thinking that things would happen in that order.  But
now that you mention it, it does sound more robust.  It would
completely prevent out-of-date problems.
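
The delete-first ordering can be sketched like this.  Everything here
is an assumption made for illustration: the cache file is called
"dirhash", it is computed from the sorted subdirectory names only, and
the helper names are invented.  The point is the order of operations:
an interruption at any step leaves the cache missing, never stale.

```python
import hashlib
import os
import tempfile

DIRHASH = "dirhash"  # hypothetical file name, per the suggestion in this thread


def compute_dirhash(path: str) -> str:
    """Hash the sorted names of the subdirectories of `path`."""
    names = sorted(e.name for e in os.scandir(path) if e.is_dir())
    digest = hashlib.sha256()
    for name in names:
        encoded = os.fsencode(name)
        # Length-prefix each name so different listings cannot collide.
        digest.update(str(len(encoded)).encode("ascii") + b" " + encoded)
    return digest.hexdigest()


def read_dirhash(path: str):
    """Return the cached hash, or None if it is missing or unreadable."""
    try:
        with open(os.path.join(path, DIRHASH), encoding="ascii") as f:
            return f.read().strip()
    except (FileNotFoundError, UnicodeDecodeError):
        return None


def update_directory(path: str, make_changes) -> None:
    cache = os.path.join(path, DIRHASH)
    try:
        os.remove(cache)           # 1. invalidate the cache first
    except FileNotFoundError:
        pass
    make_changes(path)             # 2. perform the real update
    with open(cache, "w", encoding="ascii") as f:
        f.write(compute_dirhash(path))   # 3. re-create the cache last


with tempfile.TemporaryDirectory() as repo:
    update_directory(repo, lambda p: os.mkdir(os.path.join(p, "0")))
    before = read_dirhash(repo)
    update_directory(repo, lambda p: os.mkdir(os.path.join(p, "256")))
    print("changed:", before != read_dirhash(repo))
```

A reader that gets None back from read_dirhash falls back to listing
the directory the old-fashioned way.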

> > * Fast merging, including perhaps strategies like fast weave merges:
> >   -
> >   Regular, 3-way merges should not be too bad, especially with
> >   O(log(N)) access to any revision.  Weave merge will not be
> >   particularly fast, but perhaps fast enough for the rare cases when
> >   you need it.
> 
> You might also have a look at GIT's "recursive" merging. Slower, but 
> supposedly fixes some cases that confuse a 3-way merge.

I have read a bit about it.  It seems a bit ill-specified and prone to
odd breakage.  But if it works, it works.  In any case, fast access to
revisions would make recursive merge less painful.  At this point, I
am still waiting for someone else to figure out the best merging
strategy ;)

> The bzr folks keep talking about "knits", which are some variant of 
> weaves. I think those are both part of a more generic strategy of doing 
> merges based on annotated lines, regardless of how those are stored.

I have seen mention of knits, but I don't really know what they are.

> > * Efficient storage of repo and project tree: (unpacked git is
> >   terrible here.  tla/baz/arx all store the complete patch logs of all
> >   revisions in separate files in the project tree, bloating the space
> >   requirements for projects with long histories)
> >   -
> >   The project tree is very efficient since we have gotten rid of the
> >   patch logs.  The repo is somewhat efficient, although it could be
> >   better.  It depends on the number of revisions.
> 
> I think another requirement should be that the native format is 
> sufficiently efficient. I view GIT's "packs" as perhaps its worst 
> feature, as it is user-hostile. Mercurial has "bundles" which also add a 
> lot of complexity that I dislike.

I agree.  That is why later on I listed "no repo maintenance" as a
desired feature.

> > * Signatures on patches and revisions:
> >   -
> >   The signature on the patch log covers the sha256 of the revision and
> >   patch.  Sha256 should be good for the next 50 years or so, barring
> >   unforeseen developments.  The same cannot be said for sha1.
> 
> As long as this is fast enough, I think it's a good choice.

It is the _only_ choice if you actually care about security.  Don't
get me started.

> > * push/pull over dumb protocols:
> >   -
> >   As before, ftp and webdav servers work for free.  Plain http servers
> >   must use update-listing.
> 
> I'm still happy that plain http will be supported. I'll still grumble 
> that update-listing is separate, but as long as ArX has an option to 
> automatically keep the listing files updated, it's ok.

An earlier version of the format used real indexes to list everything.
You really get into problems of atomicity over remote protocols.
Other systems (hg, bzr, git, darcs) don't get into this problem
because they are not writing the repository over a dumb server.

> > * Distributed:
> >   -
> >   Using hashes to disambiguate revisions allows people to work in
> >   parallel, and then pull in revisions from each other.
> 
> Does this also cover the desktop/laptop case? Perhaps in combination 
> with merge convergence that you mentioned earlier?

Yes.

> * Cheap branching, even on systems without hardlinks or symlinks. A 
> FAT32 user should be able to create a new branch of a large project as 
> quickly and using only as much disk space as someone on an ext3 system. 
> Multiple branches on a web server should not consume excess space.

Would these be microbranches or no-history branches?  Microbranches do
not consume excess space.  No-history branches do take up some space.
This is all independent of what file system you are using.

> * Handles projects which have a large number of directories and files in 
> the working tree.

That is one I forgot.  The new repo format will work as well as the
current format.  Each new revision uses 1 directory and 3 files.
Mercurial, by contrast, keeps 2 files in the repo for each tracked
file and appends to them for each revision.

The only place where the number of directories and files comes into
play is when reading the manifest, which is required for diffs.  It
seems fast enough on my machine with boost, which has about 10,000
files.  The linux kernel has about 20,000 files.

> * Works simply with small and simple projects.

This is kind of covered by other points (e.g. No repo maintenance, No
names).

> * Will be able to support quilt/bzr-shelve/mq functionality.

If I understand this functionality correctly, this is just selectively
reverting files and putting them into a changeset?  Storing revisions
as patches against complete trees (as opposed to weaves) makes this
pretty trivial.

However, I get the feeling that there is more to it than that.

> > * Can host on any filesystem including 8.3 systems.  This includes
> >   running a server on a filesystem that can not store files in the
> >   repo.  Using a single file database would be one solution, although
> >   that causes other problems.
> > 
> >   8.3 filesystems are not so common, so you might want to try to get
> >   away with only 31 character, case-insensitive filenames, with a max
> >   path length of 255 and max directory depth of 8.
> > 
> >   Also, some ftp sites have restrictive policies about what kind of
> >   files can be uploaded.  From the comcast website:
> > 
> >    NOTE: File names must consist of characters from "a-z", "A-Z",
> >    "0-9", '_' (underscore), '.' (period), '-' (hyphen). No other
> >    characters (including spaces) can be included in the file
> >    name. File names must not start with '.' or '-'.
> > 
> >   -
> >   8.3 filesystems will not work because of the 30 character revision
> >   names.  Similarly, illegal characters in a branch name or overly
> >   long branch names can cause problems.  I considered url-encoding
> >   branch names, but that will make non-ascii names much longer,
> >   possibly causing problems with length.
> > 
> >   An interesting note is that if two branches differ only in case,
> >   they will end up stored in the same place on case-insensitive
> >   filesystems.  Because of the hashes, the ancestry will not get
> >   confused.  But running a simple "arx get branch" will notice that
> >   there are two heads.
> 
> I don't think support for 8.3 names is important, so I think you've made 
> the right choice here.
> 
> It might be worth storing branch names in a table, rather than exposing 
> them as raw filenames. The bzr folks are discussing something similar at 
> the moment. I believe that if you burn a backup on one system, and 
> restore it on another system, it should just work. That means you can't 
> allow just any character, nor can you escape only the characters that 
> won't work on the particular file system you are writing to at the moment.

That introduces another place where things can fail, leaving your repo
in an inconsistent state.  Any time you update a file, you have to be
prepared to deal with it missing or corrupted.  Bzr, hg, git, etc. all
deal with local filesystems where the window for wedging your repo is
small.

> > * Works with write-once media. (No one really has this, although
> >   tla/baz/arx and subversion (with fsfs) could be modified to do so.
> >   We just need a place to put the lock files.)
> >   -
> >   No
> 
> I thought GIT had this. I don't think supporting write-once media is a 
> critical feature. Requiring only append access could help in certain 
> high-security cases.

Doesn't git have a file which tells you what HEAD is?  You need
something to serve over http.

> > * Lightweight branches
> >   -
> >   Microbranches are as light as they can be, since they only have a
> >   small patch file.  However, that requires you to have the rest of
> >   the branches' revisions.
> > 
> >   Branches without history are not as lightweight as they used to be,
> >   because you always have a "rev.tgz" file.  However, the size of
> >   "rev.tgz" is usually much smaller than the size of a project tree,
> >   so it won't actually be that bad.  My WAG is a 30% addition over the
> >   size of a project tree.
> 
> Terminology check...when you say "lightweight branch", you mean a branch 
> that doesn't contain full history. As opposed to having multiple heads 
> within the same branch, as in hg, which can also be called a lightweight 
> branch.

Correct.  I call what hg has a microbranch.  What this new repo format
really gives us is microbranches.

> I still prefer the term "distributed branch" for the "no history"
> case.

Actually, I like the term "no history".  "Truncated history" would also
work, although that is a bit longer.

> * Reasonable support for archiving large binary files

That is another good one.  Basically, you need a streamy binary diff.
ArX has a binary diff, but it is not streamy.  I think Subversion (and
thus SVK) is the only one with this.

> * All user data (filenames, commit comments, branch names) can be UTF-8

This is so basic a requirement that it did not even occur to me.  The
new formats can, of course, handle UTF-8.

> * Store all dates in ISO UTC format, and otherwise keep all data as 
> locale-independent as possible

ArX stores dates using boost's to_simple_string, which gives you dates like

  2002-Jan-01 10:00:01.123456789Z

instead of ISO

  20020101T100001,123456789

The first format is much easier to comprehend and not difficult to
parse.  Storing the time in ISO format but printing it out in
simple_string format would introduce annoying hackery into a few
places.
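
For tools written in other languages, a sketch of parsing the
simple_string layout shown above.  The parser name is made up, the
trailing "Z" is assumed to mean UTC, and the English month
abbreviations are hard-coded so that the result does not depend on the
locale.

```python
import re
from datetime import datetime, timezone

MONTHS = {name: index for index, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

PATTERN = re.compile(
    r"(\d{4})-([A-Z][a-z]{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?Z?$")


def parse_simple_string(text: str) -> datetime:
    match = PATTERN.match(text)
    if match is None or match.group(2) not in MONTHS:
        raise ValueError(f"not a simple_string timestamp: {text!r}")
    year, mon, day, hour, minute, second, frac = match.groups()
    # datetime only holds microseconds, so extra fractional digits are dropped.
    micro = int((frac or "0")[:6].ljust(6, "0"))
    return datetime(int(year), MONTHS[mon], int(day),
                    int(hour), int(minute), int(second), micro,
                    tzinfo=timezone.utc)


print(parse_simple_string("2002-Jan-01 10:00:01.123456789Z").isoformat())
```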

> The following issues aren't necessarily tied to the repo format, but are 
> valuable features in an SCM tool:
> 
> * Repo and/or branch "nicknames" or "aliases"

Are you thinking of multiple names for the same branch?  So that 

  arx get http://foo.com,address@hidden

and 

  arx get http://foo.com,address@hidden

would get the same thing?  You can certainly tag "bar" so that it
always points to the latest version of "foo".  But it won't share
revision numbers.

> * Facilities to mitigate newline conversions when a project is shared by 
> people using different workstation OS's.

I recognize the difficulties that people have, but this is such a bag
of worms that I have been unmotivated to think about it.

> * Support for plugins (see bzr and hg), because it makes it far easier 
> for non-core developers to experiment with cool stuff, and to prototype 
> potential new features before adding them to the core.

This is, indeed, nice.  ArX has python bindings, but you can't create
new commands with it.

Thanks for the comments.

Cheers,
Walter




