guix-devel

Re: On raw strings in <origin> commit field


From: Liliana Marie Prikler
Subject: Re: On raw strings in <origin> commit field
Date: Wed, 29 Dec 2021 21:25:07 +0100
User-agent: Evolution 3.42.1

Hi,

On Wednesday, 29 Dec 2021 at 09:39 +0100, zimoun wrote:
> Hi,
> 
> On Tue, 28 Dec 2021 at 21:55, Liliana Marie Prikler
> <liliana.prikler@gmail.com> wrote:
> 
> > Consider a package being added or updated in Guix.  At the time of
> > commit, we have the tag v1.2.3 pointing towards commit deadbeef.  We
> > therefore create a guix package with version "1.2.3" pointing to
> > said commit (either directly or indirectly).  At this point, one of
> > the following holds:
> >   (1) Guix "1.2.3" -> upstream "v1.2.3" -> upstream "deadbeef"
> >   (2) Guix "1.2.3" -> upstream "deadbeef" <- upstream "v1.2.3"
> > From either, we can follow that Guix "1.2.3" = upstream "v1.2.3".  If
> > upstream keeps their tags around, then both forms are equivalent, but
> > (1) is more convenient; it allows us to derive commit from version,
> > which is often done through an affine mapping.
> 
> No, tags and commit hashes are not equivalent.  A commit hash is
> intrinsic: it only depends on the content.  Tags, by contrast, are
> extrinsic: they depend on an external choice.
The notion of equivalence I am using here is the same as in the
statement "5 ≡ 2 mod 3", wherein the ≡ symbol is ironically called
IDENTICAL TO in Unicode despite being used very differently in
mathematics.  Perhaps there is a language barrier here; in German we
read that as "5 is equivalent to 2 modulo 3", and logical equivalence
functions similarly.

For the record, one could argue that I should have used that symbol for
comparing Guix "1.2.3" to upstream "v1.2.3" because they are in fact
not equal, only equivalent, but that's beside the point.  The point
is, with an upstream behaving as we want upstreams to behave (not just
git ones, url-fetch suffers from the same issue with moving tarballs
for instance), you can substitute one for the other without a change in
meaning; both will fetch the same commit.
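
As an illustration of that notion of equivalence, here is a minimal
Python sketch.  It assumes a toy model in which upstream's tags are a
plain dictionary and a reference is a (kind, value) pair; the names
UPSTREAM_TAGS, version_to_tag, resolve and equivalent are hypothetical
and not part of Guix.  Two references count as equivalent when they
resolve to the same commit, which stops holding once a tag is moved or
deleted.

```python
from typing import Dict, Optional, Tuple

# Hypothetical upstream state: tag name -> commit it currently points at.
UPSTREAM_TAGS: Dict[str, str] = {"v1.2.3": "deadbeef"}

Ref = Tuple[str, str]  # ("tag", name) or ("commit", hash)


def version_to_tag(version: str) -> str:
    """Derive a tag from a version, assuming a plain "v" prefix convention."""
    return "v" + version


def resolve(ref: Ref, tags: Dict[str, str]) -> Optional[str]:
    """Return the commit a reference points at, or None for a missing tag."""
    kind, value = ref
    if kind == "commit":
        return value
    return tags.get(value)


def equivalent(a: Ref, b: Ref, tags: Dict[str, str]) -> bool:
    """Two references are equivalent if both resolve to the same commit."""
    commit_a = resolve(a, tags)
    commit_b = resolve(b, tags)
    return commit_a is not None and commit_a == commit_b


by_tag = ("tag", version_to_tag("1.2.3"))
by_commit = ("commit", "deadbeef")
print(equivalent(by_tag, by_commit, UPSTREAM_TAGS))            # well-behaved upstream
print(equivalent(by_tag, by_commit, {"v1.2.3": "cafebabe"}))   # tag was moved
```

With the well-behaved upstream the two forms can be substituted for one
another; with the moved tag only the raw commit still names the
original source.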

> From the content to the hash, three keys: 1) how to serialize and 2)
> how to hash and 3) how to represent the hash.  For #1, Git uses their
> own serializer and Guix, inheriting from Nix, uses another (Nar);
> although the difference is minor.  For #2, Git uses by default SHA-1 as
> hash function, although Guix uses SHA-256.  And for #3, Git uses
> hexadecimal format and Guix uses nix-base32.
> 
> The subcommand “guix hash” with the options ’-S, -H’ and ’-f’ exposes
> these 3 keys.  For instance:
> 
>         $ cat /tmp/foo.txt | git hash-object --stdin
>         557db03de997c86a4a028e1ebd3a1ceb225be238
>         $ ./pre-inst-env guix hash -S git -H sha1 -f hex /tmp/foo.txt
>         557db03de997c86a4a028e1ebd3a1ceb225be238
> 
> To make it explicit, the checksum hash of ’git-reference’ could be
> removed because it is somewhat redundant with the commit hash.
> Obviously, it cannot be, for security reasons (SHA-1 is considered
> weak).
The other way also works.  If Git used a secure hashing function such
as SHA-256 (or SHA-512 or Keccak) and Guix supported that hash, we
could generate a git hash from the Guix hash (assuming also we allow
the origin serializer to be configured, which would be required either
way).
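
The three keys (serializer, hash function, representation) can be
sketched in Python for the single-file case.  This is a rough sketch
under stated assumptions: it only reproduces Git's blob serialization
(a "blob <size>" header, a NUL byte, then the content), it does not
implement Nar, and the nix_base32 helper follows my understanding of
the nix-base32 encoding rather than any code taken from Guix.  The
function names are hypothetical.

```python
import hashlib

# Alphabet used by nix-base32 (note the missing letters e, o, t, u).
NIX_BASE32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"


def git_blob_bytes(content: bytes) -> bytes:
    """Key 1, serializer: Git's blob format for a single file."""
    return b"blob %d\0" % len(content) + content


def git_blob_hash(content: bytes, algorithm: str = "sha1") -> str:
    """Keys 2 and 3: hash the serialized blob and print it in hexadecimal."""
    return hashlib.new(algorithm, git_blob_bytes(content)).hexdigest()


def nix_base32(digest: bytes) -> str:
    """Key 3, alternative representation (assumed nix-base32 bit order).

    The digest is read as a little-endian integer and emitted as 5-bit
    digits, most significant digit first.
    """
    number = int.from_bytes(digest, "little")
    length = (len(digest) * 8 - 1) // 5 + 1
    return "".join(
        NIX_BASE32_ALPHABET[(number >> (5 * i)) & 0x1F]
        for i in reversed(range(length))
    )


content = b"hello\n"
print(git_blob_hash(content))                       # same value as git hash-object
print(git_blob_hash(content, algorithm="sha256"))   # same serializer, stronger hash
print(nix_base32(hashlib.sha256(git_blob_bytes(content)).digest()))
```

Swapping the algorithm argument shows the point made above: with a
shared serializer and a shared secure hash function, the two hashes
would differ only in how they are written down.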

The weakness of SHA-1 also flies in the face of the robustness
argument.  One could maliciously push a commit that replaces an
existing one with the same hash, though it would also break the repo in
doing so.  At least in theory, as no such attack has been carried out yet.
Note to self: theoretical attacks on Git are probably off-topic as
well.

> > Problems arise when upstreams move or delete tags.  At this
> > point, guix packages that use them break and are no longer able to
> > fetch their source code.  Raw commits are in principle resilient to
> > this kind of denial of service; instead upstreams would have to
> > actually delete the commits themselves, including also possible
> > backups such as SWH to break it.  There is certainly an argument
> > for robustness to be made here, particularly concerning
> > `guix time-machine', though as noted it is not infallible.
> 
> SWH provides ’swh:id’ which is another triplet (really close to Git).
> Basically, content means data and metadata and to make it short, SWH
> deals their way with metadata for reason of large scale.  And SWH
> does snapshots of Git repositories.
> 
> Therefore, to have something really robust, Guix has to rely on a map
> from package definition to SWH.
> 
> Using Git commit hash instead of tag makes this map.  For tag, to
> have something robust, we need an external map from checksum hash to
> SWH hash via Git commit hash.  This “external” is done by Disarchive.
I don't know too much about Disarchive here, so please enlighten me.
If it used a pair of origin file name + hash, whether or not the
git-reference uses tags would be irrelevant, no?  Do we have to take
values from the uri field?

> > Long-term, we might want to support having multiple
> > <git-references> in git-fetch -- if the first one fails due to a
> > hash mismatch, we would warn about that instead of producing an
> > error and thereafter continue with the second, third, etc.,
> > similar to how we currently have mirror:// urls for some well-known
> > mirrored repositories.  That way, we have a system to warn us about
> > naughty upstreams while also providing robustness for the time
> > machine.
> 
> I think the long term is to completely remove tag and only use commit
> hash; as done for ’guile-aiscm’.  But it will not happen for
> convenience reasons, I guess.
> 
> What you are proposing is to mix extrinsic (tag, URL, etc.) with
> intrinsic (commit hash, checksum hash, etc.).  Well, I do not know if
> this proposed fallback mechanism would ease the maintenance and would
> make Guix more robust.
I'm not sure the distinction between extrinsic and intrinsic values is
a useful one here.  The only important intrinsic value here is the
content hash, which is unlikely to break [1].  We're using extrinsic
values such as URLs all over the place, including the very line
preceding the commit value of a git-reference (almost) every time --
I'm leaving room here for some person to put the commit before the URL.
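
For concreteness, the fallback over multiple references proposed
earlier could look roughly like the following Python sketch.  It is
not how git-fetch works today; fetch_with_fallback and the fetch
callable are hypothetical, and plain SHA-256 over raw bytes stands in
for whatever serializer and hash the origin would really use.  Each
reference is tried in order, a failed lookup or a hash mismatch only
produces a warning, and an error is raised once all references are
exhausted.

```python
import hashlib
import warnings
from typing import Callable, Sequence, Tuple


def fetch_with_fallback(
    references: Sequence[str],
    expected_sha256: str,
    fetch: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Return the first reference whose content matches the expected hash.

    `fetch` is a hypothetical callable that returns the content for a
    reference and raises LookupError when the reference no longer exists.
    """
    for reference in references:
        try:
            content = fetch(reference)
        except LookupError:
            warnings.warn(f"reference {reference!r} could not be fetched")
            continue
        actual = hashlib.sha256(content).hexdigest()
        if actual == expected_sha256:
            return reference, content
        # The content hash is the only intrinsic check; a mismatch means
        # upstream moved the reference, so warn and try the next one.
        warnings.warn(f"hash mismatch for reference {reference!r}")
    raise RuntimeError("no reference produced the expected content")


# Toy upstream: the tag now points at different content, the commit does not.
original = b"source tree at deadbeef"
upstream = {"v1.2.3": b"something else", "deadbeef": original}
used, _ = fetch_with_fallback(
    ["v1.2.3", "deadbeef"],
    hashlib.sha256(original).hexdigest(),
    lambda reference: upstream[reference],
)
print(used)
```

In this model the URL, tag and commit are all just ways of locating
candidates, and the content hash alone decides which candidate is
accepted.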

> To me, robustness means make a map from intrinsic values to content;
> as Disarchive is doing for instance.
See above, I don't understand why Disarchive would need more than the
content hash as an intrinsic value to do so.

Cheers,
Liliana

[1] "Briefly stated, if you find SHA-256 collisions scary then your
priorities are wrong." https://stackoverflow.com/a/4014407


