arx-users

Re: [Arx-users] The Future (long)


From: Kevin Smith
Subject: Re: [Arx-users] The Future (long)
Date: Wed, 07 Dec 2005 21:34:44 -0500
User-agent: Mozilla Thunderbird 1.0.7 (X11/20051011)

Walter Landry wrote:
Kevin Smith <address@hidden> wrote:

Walter Landry wrote:

For every branch, there is a directory with the same name as the
branch, but with a period "." appended.  The period "." makes it easy
to distinguish branch names (which can be almost anything) from
everything else.

This might cause problems with certain tools on MS Windows (where "empty" extensions are unusual). Otherwise, seems reasonable.


What kind of trouble?  I can add an extension easily enough (.arx?
.bra? .brc?).

I have memories of Notepad really disliking files without extensions, and vague memories (perhaps false) of running into a case where Windows Explorer couldn't tell the difference between "foo." and "foo". That may have been with Windows 3.1 and 8.3 filenames, though.

If it's easy to add an extension, I would probably do so, just to be more conventional. My first thought would be ".d" to reaffirm that it's a directory.

I think that's a bit of an overstatement. It's true that an attacker couldn't just drop a fake revision in to replace one that you had signed. However, someone could disrupt the system by signing two different revisions that share the same hash but have different contents. Just something to consider as a corner case.


Could you be more specific?  I don't see how what you are describing
is different from just making a directory with the same 60 bit name
and putting junk in it.  Yes, it is disruptive, but allowing people to
modify the repository opens you up to that kind of thing.  In either
case, whatever you get won't be signed or won't validate to the
correct 256 bit hash.

Also, when you say "share the same hash", I presume you are talking
about the first 60 bits, not the entire 256 bits.  It is infeasible
to create different files with the same 256 bits of hash.

Yes, I was referring to sharing the "first 60 bits".

You are correct that it's not a real issue, as long as you never validate the contents against the abbreviated directory name alone. As long as you also check the full hash at the same time, you would catch any problems. So: never mind.
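
To make the "check both" rule concrete, here is a minimal sketch in Python. It assumes the abbreviated directory name is simply the first 60 bits (15 hex digits) of the SHA-256 of the contents; the thread does not say how ArX actually encodes the name, and the function names are hypothetical.

```python
import hashlib

ABBREV_HEX_DIGITS = 15  # assumption: 15 hex digits * 4 bits = 60 bits


def full_hash(data: bytes) -> str:
    """Full 256-bit SHA-256 of the contents, as 64 hex digits."""
    return hashlib.sha256(data).hexdigest()


def abbreviated_name(data: bytes) -> str:
    """Hypothetical 60-bit directory name: a prefix of the full hash."""
    return full_hash(data)[:ABBREV_HEX_DIGITS]


def verify(data: bytes, dir_name: str, signed_full_hash: str) -> bool:
    """Accept the contents only if BOTH the short name and the full hash match.

    A match on the 60-bit prefix alone is not enough, because someone could
    deliberately construct two revisions that collide on the prefix.
    """
    digest = full_hash(data)
    return digest[:ABBREV_HEX_DIGITS] == dir_name and digest == signed_full_hash
```

The short name is only a lookup key here; the signed 256-bit hash is what decides whether the contents are trusted.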

That sounds like a good idea, although I think "index" is the wrong word. It's more of a cached hash.

How about "dirhash"?

Sounds fine.

I think you are overestimating the complexity of serialization.  If I
recall correctly, for a list of strings, the serialization library
would write a header, the length of the list as an ascii string
(e.g. "12"), and then the elements of the list.  Each element is again
a length and then the string itself.  You are not going to get any
simpler than that and cover all of the corner cases with embedded
nulls etc.  So a list with the elements "crate" and "barrel" would be
serialized as

  22 serialization::archive 2 5 crate 6 barrel

The serialization format is not complicated.  What you are probably
complaining about is that the _arx/++manifest file has some binary
elements.  Those are sha256's of files, and I put them in that format
for efficiency (though it may be premature optimization).
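
The length-prefixed layout described above can be sketched in a few lines of Python. This mirrors only the simplified description in this message (a length-prefixed header, an element count, then length-prefixed strings); it is not Boost's actual archive format, which may carry additional fields, and lengths are counted in characters here for simplicity.

```python
HEADER = "serialization::archive"


def serialize(strings):
    """Write header, element count, then each string with its length."""
    tokens = [str(len(HEADER)), HEADER, str(len(strings))]
    for s in strings:
        tokens += [str(len(s)), s]
    return " ".join(tokens)


def deserialize(text):
    """Inverse of serialize(); lengths make embedded spaces and nulls safe."""
    pos = 0

    def read_int():
        nonlocal pos
        end = text.find(" ", pos)
        if end == -1:
            end = len(text)
        value = int(text[pos:end])
        pos = end + 1
        return value

    def read_str():
        nonlocal pos
        n = read_int()
        s = text[pos:pos + n]
        pos += n + 1  # skip the string and the separator after it
        return s

    if read_str() != HEADER:
        raise ValueError("not a recognised archive header")
    count = read_int()
    return [read_str() for _ in range(count)]


print(serialize(["crate", "barrel"]))
```

Because every string carries its own length, the reader never has to scan for a terminator, which is why embedded spaces, newlines and nul bytes need no escaping.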

I have two wishes:

1) That the file is plain text so I can look at it with any tool. I know I shouldn't need to, but when I'm coding it's a pain to not be able to look at important data easily.

2) That the full file spec is documented. I want to be able to write a tool in any language to access the data, and reverse-engineering binary data is a royal pain. I think this is more important to me than #1, and I would shy away from a format that might change at any moment due to the whims of the boost developers. Unless they provide assurances about stability.

Your rationale was to allow special characters in revision names. You must have meant branch names. Even so, disallowing newlines and nul bytes doesn't seem like a severe limitation on branch names.


It is fairly simple to go from there to using the branch==repo
paradigm that hg, bzr, darcs, etc. have.  My thought right now is that
that paradigm is sufficiently different from the separate repo and
tree paradigm that I would want a different command for it.

I'm not quite sure what you're saying, but I think the repo == branches paradigm of ArX is one of its strengths.


I assume you mean branch!=repo here?  In any case, I am just saying
that, for those who prefer branch==repo, it would be simple to create
a tool to cater to them.  Everyone would use the same master repo.

Yes, I struggled with the wording, which is why I said repo==branchES, as opposed to repo==branch. I have not yet seen a branch==repo SCM app that allows cheap branching on non-hardlink file systems, so I remain happy about ArX repos.

You mentioned several drawbacks of skip-deltas. What are the big benefits they bring, and what alternatives did you consider?

It only takes O(log(Number of revisions)) to get a particular
revision.  So revision 63222 takes about 16 patches.  Currently, it
would take 63222 patches.  ArX gets around this somewhat with repo
caches.  But that requires repo maintenance, which I really want to
get rid of.  Even I don't update cached revisions as much as I should.
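
As an illustration of the logarithmic bound, here is a sketch of one common skip-delta rule: each revision is stored as a delta against the revision whose number is obtained by clearing the lowest set bit. That rule is an assumption borrowed from how Subversion's scheme is usually described; the thread does not spell out what ArX would use.

```python
def skip_delta_chain(revision):
    """Revisions whose deltas must be applied to rebuild `revision`.

    Assumed rule: the delta base of revision N is N with its lowest set
    bit cleared, so the chain ends at revision 0.
    """
    chain = []
    while revision > 0:
        chain.append(revision)
        revision &= revision - 1  # clear the lowest set bit
    return chain


rev = 63222
chain = skip_delta_chain(rev)
# The chain can never be longer than the number of bits in the revision number.
print(len(chain), "deltas, upper bound", rev.bit_length())
```

Under this rule the chain length equals the number of set bits in the revision number, so the bit length (16 for revision 63222) is the worst case rather than the typical case.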

Ah. You mentioned that svn uses skip-deltas. How do the other tools solve that problem? It seems like most systems are, at their core, either a vector or a linked list of revisions. I suppose the speed optimizations would be snapshots (ArX caches, darcs has something similar, maybe GIT bundles?), or some kind of b-tree index, or ???

At this point, I
am still waiting for someone else to figure out the best merging
strategy ;)

Smart.

The bzr folks keep talking about "knits", which are some variant of weaves. I think those are both part of a more generic strategy of doing merges based on annotated lines, regardless of how those are stored.

I have seen mention of knits, but I don't really know what they are.

Me neither. I think I half-understood them a few weeks ago, but it's gone now.

The bzr folks are almost talking as if bzr will have multiple back ends. One might store weaves, another knits, and another might store "delta histories".

* Signatures on patches and revisions:
 -
 The signature on the patch log covers the sha256 of the revision and
 patch.  Sha256 should be good for the next 50 years or so, barring
 unforeseen developments.  The same can not be said for sha1.

As long as this is fast enough, I think it's a good choice.


It is the _only_ choice if you actually care about security.  Don't
get me started.

Well, I could get into a whole thing about how SHA-1 might be good enough for most purposes for a while, or about how there might be some legitimate competitors to SHA-256, but I won't. SHA-256 makes sense.

* Cheap branching, even on systems without hardlinks or symlinks. A FAT32 user should be able to create a new branch of a large project as quickly and using only as much disk space as someone on an ext3 system. Multiple branches on a web server should not consume excess space.


Would these be microbranches or no-history branches?  Microbranches do
not consume excess space.  No-history branches do take up some space.
This is all independent of what file system you are using.

My concern is simply that MS-Windows FAT32 users should not be second-class citizens. They should be able to work as efficiently, using the same processes, as other folks. That's not the case right now with darcs or mercurial. Or bzr, but the bzr folks are working on it.

If I want to work on ten features a day, each in its own branch that might last an hour or two, what ArX 3 mechanism would I use?

* Will be able to support quilt/bzr-shelve/mq functionality.

If I understand this functionality correctly, this is just selectively
reverting files and putting them into a changeset?  Storing revisions
as patches against complete trees (as opposed to weaves) makes this
pretty trivial.

However, I get the feeling that there is more to it than that.

It seems that the primary use of bzr shelve is:

I have made several changes to my working tree, but they really should be two different revisions/changesets. I can "shelve" some of my changes, leaving me with a single changeset that I can test and commit. Then I can unshelve those changes, test the full result, and commit the second revision. It includes darcs-style per-hunk selection.

It seems that the primary use of quilt is:

I am tracking an upstream repo. I am maintaining several of my own patches on top of that repo. Every time I sync with the upstream repo, I can push my patches (changesets) aside, sync with upstream, and then re-apply my patches on top. The unit of work is changesets, not files or hunks.

Further, I can (or at least theoretically could) do patch refactoring:
- Combine small patches into a single large patch
- Split a large patch into several smaller patches
- Reorder patches
- Modify the patch description or other metadata

I think mq is very similar to quilt, except that since it is integrated with mercurial, it actually stores my patches in the repo itself. When necessary, those patches are ripped out of the repo, and then reapplied after the upstream sync.

There are some concerns that mq is dangerous because it can remove changesets from a repo that may already have been published. Darned handy, though.

It might be worth storing branch names in a table, rather than exposing them as raw filenames. The bzr folks are discussing something similar at the moment. I believe that if you burn a backup on one system, and restore it on another system, it should just work. That means you can't allow just any character, nor can you escape only the characters that won't work on the particular file system you are writing to at the moment.


That introduces another place where things can fail, leaving your repo
in an inconsistent state.  Any time you update a file, you have to be
prepared to deal with it missing or corrupted.  Bzr, hg, git, etc. all
deal with local filesystems where the window for wedging your repo is
small.

So if you use any unusual characters anywhere in your repo, it becomes non-portable. That would include repos stored on Samba shares, on plain http servers, and burnt onto CD-ROMs.

I understand the hassle of using indirection to avoid using branch names as filenames, but that still seems like a significant problem to me.
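
One middle ground between a lookup table and raw filenames would be to escape every character outside a small, conservative set, on every platform, so the on-disk name is identical wherever the repo is written. The sketch below is purely illustrative; the safe set and the function names are assumptions, not anything ArX or bzr does.

```python
# Assumption: lowercase ASCII letters, digits, "-" and "_" are safe everywhere.
# Uppercase letters are escaped too, so names stay distinct on
# case-insensitive file systems such as FAT32.
SAFE = set("abcdefghijklmnopqrstuvwxyz0123456789-_")


def escape_branch_name(name):
    """Map any branch name to a portable file name (always the same result)."""
    out = []
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in SAFE else "%{:02X}".format(byte))
    return "".join(out)


def unescape_branch_name(escaped):
    """Recover the original branch name from its escaped form."""
    raw = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == "%":
            raw.append(int(escaped[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(escaped[i]))
            i += 1
    return raw.decode("utf-8")
```

The mapping is reversible and needs no extra state, so there is no table that can go missing or get corrupted; the cost is that unusual branch names become hard to read on disk.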

* Works with write-once media. (No one really has this, although
 tla/baz/arx and subversion (with fsfs) could be modified to do so.
 We just need a place to put the lock files.)
 -
 No

I thought GIT had this. I don't think supporting write-once media is a critical feature. Requiring only append access could help in certain high-security cases.


Doesn't git have a file which tells you what HEAD is?  You need
something to serve over http.

You're right, although I believe HEAD is/was an optional convention. Linus resisted adding tags for a long time, instead just announcing the hash value of the latest release.

Correct.  I call what hg has a microbranch.  What this new repo format
really gives us is microbranches.

Ok. I like that term.

I still prefer the term "distributed branch" for the "no history"
case.


Actually, I like the term "no history".  "Truncated history" would also
work, although that is a bit longer.

Any of those work for me. Just not lightweight :-)

* Reasonable support for archiving large binary files


That is another good one.  Basically, you need a streamy binary diff.
ArX has a binary diff, but it is not streamy.  I think Subversion (and
thus SVK) are the only ones with this.

As long as big binary files can be stored, retrieved, and updated to a new copy, that's sufficient. Better diffing is a plus.

ArX stores dates using boost's to_simple_string, which gives you dates like

  2002-Jan-01 10:00:01.123456789Z

instead of ISO

  20020131T100001,123456789

I'm not thrilled that English text is part of the stored format, but otherwise that seems sane. It's a known set of twelve short strings, appearing at a fixed location, so it would be easy to localize in the UI.
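
Since the month is one of twelve fixed strings at a known position, a tool in another language can convert the stored form to the numeric ISO-style form with a small table. A rough sketch, assuming the stored string always looks like the to_simple_string sample above (the function name is made up for illustration):

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def simple_to_iso(stamp):
    """Convert 'YYYY-Mon-DD HH:MM:SS[.fraction][Z]' to 'YYYYMMDDTHHMMSS[,fraction]'."""
    date_part, time_part = stamp.rstrip("Z").split(" ")
    year, month_name, day = date_part.split("-")
    month = MONTHS.index(month_name) + 1  # raises ValueError on an unknown month
    clock, _, fraction = time_part.partition(".")
    hh, mm, ss = clock.split(":")
    iso = "{}{:02d}{}T{}{}{}".format(year, month, day, hh, mm, ss)
    if fraction:
        iso += "," + fraction
    return iso
```

The month table is the only English-specific part, which is also the only part a UI would need to localize.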

* Repo and/or branch "nicknames" or "aliases"


Are you thinking of multiple names for the same branch? So that
  arx get http://foo.com,address@hidden

Nope. I'm thinking of:

    arx get walter

instead of whatever long URL happens to contain the latest official ArX tree. I guess it depends on how often I have to type the URL. If it's once a month, I don't really care. If it's several times a day (as it seemed to be with ArX 2), it's important.

This could just be a lookup table stored in a .conf file. Bzr has "branch nicks" (nicknames) which seem similar, although I haven't actually used them. In ArX, I'm not sure whether having aliases for branches would be valuable or not.
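
A lookup table of that kind could be as small as the following sketch. The "[aliases]" section name, the file contents and the example URL are all hypothetical; nothing like this exists in ArX as far as this thread says.

```python
import configparser


def load_aliases(conf_text):
    """Read alias -> URL pairs from the (hypothetical) [aliases] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("aliases"):
        return {}
    return dict(parser["aliases"])


def resolve(target, aliases):
    """Return the URL for a known alias, or the argument unchanged."""
    return aliases.get(target, target)


conf = """
[aliases]
walter = http://example.org/arx/main
"""
print(resolve("walter", load_aliases(conf)))
```

Anything that is not a known alias passes through untouched, so full URLs keep working exactly as before.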

* Facilities to mitigate newline conversions when a project is shared by people using different workstation OSes.


I recognize the difficulties that people have, but this is such a bag
of worms that I have been unmotivated to think about it.

Yup. For a while, I advocated SCM tools not doing newline conversions, but enough people seem to still be using brain-dead tools that it must be supported to be a mainstream cross-platform tool. I think monotone pioneered the use of hooks for this, and I haven't heard bad things about it.
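
For what it's worth, the hook approach could look roughly like the sketch below: one function normalizes newlines on the way into the repo, another converts to the platform's convention on the way out. This is not monotone's actual hook API; the function names and the crude binary check are assumptions made for illustration.

```python
def looks_binary(data):
    """Crude heuristic (an assumption): treat anything with a nul byte as binary."""
    return b"\0" in data


def on_commit(data):
    """Hook run before storing a file: normalize CRLF to LF for text files."""
    if looks_binary(data):
        return data
    return data.replace(b"\r\n", b"\n")


def on_checkout(data, native_newline=b"\n"):
    """Hook run when writing a file to the working tree."""
    if looks_binary(data) or native_newline == b"\n":
        return data
    # Normalize first so mixed input does not end up with doubled "\r".
    return data.replace(b"\r\n", b"\n").replace(b"\n", native_newline)
```

Keeping the conversion in hooks means the core never has to guess; a project that wants no conversion at all simply installs none.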

* Support for plugins (see bzr and hg), because it makes it far easier for non-core developers to experiment with cool stuff, and to prototype potential new features before adding them to the core.


This is, indeed, nice.  ArX has python bindings, but you can't create
new commands with it.

Would it be possible to support C++ plugins? That would be better than nothing, and perhaps a framework could be built on top of that which would actually allow plugins to be written in python, ruby or other languages.

Oh, the repo format should also handle file attributes, as ArX 2 does. Handy for executable bits and other stuff.

Thanks for the comments.

You're welcome. Fun stuff.

Kevin



