Re: On raw strings in <origin> commit field

guix-devel

From:	Liliana Marie Prikler
Subject:	Re: On raw strings in <origin> commit field
Date:	Fri, 31 Dec 2021 01:02:43 +0100
User-agent:	Evolution 3.42.1

On Thursday, 30.12.2021 at 13:43 +0100, zimoun wrote:
> Hi Liliana,
> 
> On Wed, 29 Dec 2021 at 21:25, Liliana Marie Prikler
> <liliana.prikler@gmail.com> wrote:
> > On Wednesday, 29.12.2021 at 09:39 +0100, zimoun wrote:
> > > On Tue, 28 Dec 2021 at 21:55, Liliana Marie Prikler
> > > <liliana.prikler@gmail.com> wrote:
> 
> > The notion of equivalence I am using here is the same as in the
> > statement "5 ≡ 2 mod 3", wherein the ≡ symbol is ironically called
> > IDENTICAL TO in Unicode despite being used very differently in
> > mathematics.  Perhaps there is a language barrier here; in German
> > we read that as "5 is equivalent to 2 modulo 3" and logic
> > equivalence functions similarly.
> 
> I do not understand against what you are arguing so I skip it. :-)
I was under the impression that you and I used the word "equivalent"
differently, so I wanted to clear that up.

> > For the record, one could argue that I should have used that symbol
> > for comparing Guix "1.2.3" to upstream "v1.2.3" because they are in
> > fact not equal, only equivalent, but that's beside the point.  The
> > point is, with an upstream behaving as we want upstreams to behave
> > (not just git ones, url-fetch suffers from the same issue with
> > moving tarballs for instance), you can substitute one for the other
> > without a change in meaning; both will fetch the same commit.
> 
> If I understand you correctly:
> 
>  - Guix "1.2.3" means the field ’version’
>  - upstream “v1.2.3” means the upstream tag used by the field
> ’commit’ of ’git-reference’.
> 
> and yes it is strongly expected that both of these fields match. :-)
Well, at least we agree on something.

> But it is irrelevant, IMHO, to your initial message «commit tags are
> in principle mutable and hence can not be relied on when fetching
> sources.  I do have a few issues with that explanation».  It is
> fortunate and not robust that ’commit’ matches ’version’ via upstream
> ’tag’.
It is in fact very relevant to the issue at hand.  In principle,
versioned URLs are not robust, hence we can't have a single package
using url-fetch.  A statement like that is obviously silly, not just
because tarballs that are updated in-place are exceedingly rare, but
also because they violate how we think about versions.  The same holds
for git, with the difference being that we no longer generate a URL
from the version, but a tag.  If that tag can't serve as bridge here,
the version field loses the meaning it had from the strong expectation
that the two of them match.

> Because how ’commit’ and ’tag’ are defined is different.
> 
> I cannot put it differently than: a Git commit depends only on the
> content, whereas a ’tag’ does not.
> 
> Versions (or tags) are convenient names for humans.  It is easier to
> tell version 0.23.1 than
> 09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840.  And we can
> deduce that 0.22.3 is older than 0.23.1, which is impossible for
> commits.
Git commit hashes do not just depend on the content.  They also depend
on how much effort you put into solving a proof of work challenge that
won't ever earn you crypto coins [1].
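
A minimal sketch of that dependence, in Python rather than Guile so
that it runs without Guix: two commits pointing at the very same
(empty) tree get different ids as soon as a timestamp changes, which is
presumably the knob one keeps turning when brute-forcing a "pretty"
hash.  The author identity, message and timestamps are made up for
illustration; only the object framing and the empty-tree id come from
Git itself.

```python
import hashlib

# Well-known id of Git's empty tree, so both commits carry identical content.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over Git's object framing: kind, space, size, NUL byte, body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree: str, message: str, timestamp: int) -> str:
    """Id of a parentless commit; the identity used here is hypothetical."""
    who = f"Jane Doe <jane@example.org> {timestamp} +0000"
    body = f"tree {tree}\nauthor {who}\ncommitter {who}\n\n{message}\n"
    return git_object_id("commit", body.encode())


first = commit_id(EMPTY_TREE, "Release 0.1.0", 1640908800)
second = commit_id(EMPTY_TREE, "Release 0.1.0", 1640908801)  # one second later

print(first)
print(second)
```

Both commits describe the same content, yet `first` and `second`
differ.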

> If you prefer to keep the frame: «you can substitute one for the
> other without a change in meaning», then, for what my opinion is
> worth on that matter, my probably wrong understanding of your words
> is that perhaps you are missing a point about content-addressability.
To be fair, I did not consider content-addressability here, because my
main concern is natural intelligence based verification.  

> > > From the content to the hash, three keys: 1) how to serialize and
> > > 2) how to hash and 3) how to represent the hash.  For #1, Git
> > > uses their own serializer and Guix, inheriting from Nix, uses
> > > another (Nar); although the difference is minor.  For #2, Git
> > > uses SHA-1 as its default hash function, whereas Guix uses
> > > SHA-256.  And for #3, Git uses hexadecimal format and Guix uses
> > > nix-base32.
> 
> [...]
> 
> > > To make it explicit, the checksum hash of ’git-reference’ could
> > > be removed because it is somehow redundant with the commit hash.
> > > Obviously, it cannot be, for security reasons (SHA-1 is
> > > considered weak).
> > 
> > The other way also works.  If Git used a secure hashing function
> > such as SHA-256 (or SHA-512 or Keccak) and Guix supported that
> > hash, we could generate a git hash from the Guix hash (assuming
> > also we allow the origin serializer to be configured, which would
> > be required either way).
> 
> Yes somehow.  To be on the same wavelength, we need to be precise
> when we speak about hash here because hash means:
> 
>  - serializer: how to deal with all the bits making the full content
>    (files, folder, tree, etc.)
>  - hashing function
>  - format
> 
> So yes, on principles, instead of NAR + SHA-256 + Nix-base32, the
> Guix project could have chosen Git + SHA-1 + Hex, or Git + SHA-512 +
> Base64 or any other combinations.
> 
> (I think this choice inherited from Nix is rooted in daemon
> implementations and another triplet would have been more changes when
> starting Guix, I guess.)
> 
> However, knowing only the final Guix checksum hash (NAR + SHA-256 +
> Nix-base32), say
> 09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840,
> you can easily replace by any other formats (Hex or Base64), but it
> is not straightforward to compute the Git commit hash (here
> c78b91edb7c17c6fbf3b294452f44e91d75e3c67) from this Guix checksum
> hash, because the serializer NAR and Git have minor differences, and
> mainly because one uses SHA-256 and the other SHA-1 – and it is
> generally not possible to convert the hash from one hashing function
> to another hashing function.
> 
> To make it short, my point is: a) a Git commit hash has the same
> properties as any checksum hash and b) a string tag is obviously not
> a checksum.
Ad b), I never claimed that a string tag is a checksum.  All I'm
claiming is that *under normal circumstances* we would expect it to
point to just one commit over time, similar to how we expect a mirror
URL to point to the same tarball no matter who ends up delivering it.
Or how we expect the same substitutes from different servers ;)

Ad a) given that the Git hash (or checksum if you will) is weaker than
other checksums used in Guix, I simply wanted to reassert that it ought
to be the first to vanish if any of them vanishes, not the last.  I
understand my crypto well enough to know that we can't simply change
serializers post hash creation; if we wanted to encode our origin
hashes in an SWH-friendly fashion, we would need to change API
accordingly.
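
To make the serializer / hash function / format triplet concrete, here
is a rough Python sketch.  It is an illustration under simplifying
assumptions, not what Guix does internally: it hashes a single flat
byte string and leaves out the NAR serializer entirely, and the
function names are invented for this example.

```python
import base64
import hashlib

NIX_BASE32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"  # no e, o, t, u


def nix_base32(digest: bytes) -> str:
    """Format step: print bytes the way Nix and Guix print base32 hashes."""
    length = (len(digest) * 8 - 1) // 5 + 1
    number = int.from_bytes(digest, "little")
    return "".join(NIX_BASE32_ALPHABET[(number >> (5 * k)) & 0x1F]
                   for k in reversed(range(length)))


def git_blob_sha1_hex(content: bytes) -> str:
    """Git serializer (blob framing) + SHA-1 + hexadecimal."""
    framed = b"blob " + str(len(content)).encode() + b"\0" + content
    return hashlib.sha1(framed).hexdigest()


def flat_sha256(content: bytes) -> bytes:
    """No serializer (raw bytes) + SHA-256, returned unformatted."""
    return hashlib.sha256(content).digest()


content = b"hello\n"
raw = flat_sha256(content)
print("git blob, sha1, hex:     ", git_blob_sha1_hex(content))
print("flat, sha256, hex:       ", raw.hex())
print("flat, sha256, base64:    ", base64.b64encode(raw).decode())
print("flat, sha256, nix-base32:", nix_base32(raw))
```

The last three lines are the same digest in different formats and can
be converted into one another from the digest bytes alone.  Getting
from any of them to the first line requires the content itself, since
both the serializer and the hash function differ.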

> > I don't know too much about Disarchive here, so please enlighten
> > me. If it used a pair of origin file name + hash, whether or not
> > the git-reference uses tags would be irrelevant, no?  Do we have to
> > take values from the uri field?
> 
> I am not sure I understand the questions.  Maybe the thread starting
> here is worth a read:
> 
>     <https://yhetil.org/guix/87a6j2w1et.fsf@gnu.org>
> 
> Otherwise, could you explain more what you have in mind?
I'mma quote Ludo for a change.

> SWH records the “history of the history”.  It can tell you what the
> tag pointed to at the time of a specific snapshot.
This just reiterates my point of Guix not trying hard enough with
fallbacks.  Let's say I archive git.evil.org/malicious-repo at version
0.1.0 some 64 times, because I just keep changing the initial release.
Guix still refers to it by tag, because I am also in charge of updating
the Guix package and have not yet caught on to the fact that
revision/commit pairs are good, actually.  Since each of those 64
archives has a different NAR hash, we could try fetching all of them
from SWH and pick the one that fits.

Now obviously, in the real world, we would probably switch to a
version/revision pair for an upstream that violated our expectations
once, perhaps twice, so the overhead would not be as dramatic outside
of constructed examples.  Still, for the sake of "robustness", we might
want to decide what's a robust number of retries to not get into a DoS
loop. 
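
A minimal sketch of such a bounded fallback, in Python.  Everything
here is hypothetical: the snapshots are plain in-memory byte strings
standing in for whatever SWH would return, a SHA-256 over those bytes
stands in for the NAR hash, and the default of 8 tries is an arbitrary
placeholder rather than a proposal.

```python
import hashlib
from typing import Iterable, Optional


def pick_matching_snapshot(snapshots: Iterable[bytes],
                           expected_sha256: str,
                           max_tries: int = 8) -> Optional[bytes]:
    """Return the first snapshot matching the hash, trying at most max_tries."""
    for tries, candidate in enumerate(snapshots):
        if tries >= max_tries:
            break  # give up instead of walking an unbounded history
        if hashlib.sha256(candidate).hexdigest() == expected_sha256:
            return candidate
    return None


# 64 hypothetical re-releases of "0.1.0"; the package recorded the hash of
# the sixth one.
history = [f"malicious-repo 0.1.0, take {n}".encode() for n in range(64)]
wanted = hashlib.sha256(history[5]).hexdigest()

found = pick_matching_snapshot(history, wanted)
print(found)
```

With the default limit the sixth snapshot is still found; a snapshot
further down the history than `max_tries` is reported as missing rather
than searched for indefinitely.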

> > > To me, robustness means make a map from intrinsic values to
> > > content; as Disarchive is doing for instance.
> > 
> > See above, I don't understand why Disarchive would need more than
> > the content hash as an intrinsic value to do so.
> 
> Basically nothing more, so nothing to understand. :-)
> 
> Your initial messages started with:
> 
>         [...]
> 
> and my intent was to point out that the reason is not really the
> “mutable” part, but rather that it is better to rely on intrinsic
> values (discussed in the link above).
By content hash, I meant NAR hash or Guix hash, not commit hash.  Sorry
for the confusion.

> Obviously, intrinsic value is immutable but, IMHO, intrinsic value is
> somehow a key-point for lookup in content-address systems.  The
> Git-commit hash is one way, SWH-ID is another, IPFS uses another,
> GNUnet another, etc.  The recent ERIS [1,2] is an attempt to bridge,
> IIUC.
> 
> Addressing ’origin’ by intrinsic values implies which ones and The
> Right Thing is really hard to predict.
I don't think I agree with that assessment.  "Guix for Racket packages"
(it was called Xiden back then, but appears to have changed to denxi)
had the insane idea of allowing more than one hash in a package
definition and the source would have to match all of them.  We could do
the same in Guix, but it'd be another core-updates cycle until then.
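
As a rough illustration of the "match all of them" idea, here is a
Python sketch.  It does not reproduce Xiden's or Guix's actual
interfaces; `matches_all` and the algorithm-to-digest mapping are
assumptions made up for this example.

```python
import hashlib
from typing import Mapping


def matches_all(content: bytes, expected: Mapping[str, str]) -> bool:
    """True only if content matches every (algorithm -> hex digest) pair."""
    return all(hashlib.new(algorithm, content).hexdigest() == digest
               for algorithm, digest in expected.items())


source = b"pretend this is a source tarball"
expected = {
    "sha256": hashlib.sha256(source).hexdigest(),
    "sha512": hashlib.sha512(source).hexdigest(),
}

print(matches_all(source, expected))
print(matches_all(source + b"!", expected))
```

A single mismatching digest is enough to reject the source, so adding
a hash can only make the check stricter.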

> My opinion is that robust long-term – i.e., near future I want – is
> to rely on more intrinsic values in ’source’ or ’origin’ and less
> tags, urls, etc.  Well, I am fine if we disagree.  You asked «What do
> y'all think?», now you know what I think. :-)
> 
> Last, sorry if I am misunderstanding you, back to your initial
> message.  You provided ’guile-aiscm’ as one example of something that
> confused you.  Instead of the current definition, you would like this
> definition
> 
> --8<---------------cut here---------------start------------->8---
> 1 file changed, 1 insertion(+), 1 deletion(-)
> gnu/packages/machine-learning.scm | 2 +-
> 
> modified   gnu/packages/machine-learning.scm
> @@ -299,7 +299,7 @@ (define-public guile-aiscm
>                (method git-fetch)
>                (uri (git-reference
>                      (url "https://github.com/wedesoft/aiscm")
> -                    (commit "c78b91edb7c17c6fbf3b294452f44e91d75e3c67")))
> +                    (commit (string-append "v" version))))
>                (file-name (git-file-name name version))
>                (sha256
>                 (base32
> --8<---------------cut here---------------end--------------->8---
That would have been a perfectly fine definition in my opinion, yes.

> ?  Or something like along these lines,
> 
> --8<---------------cut here---------------start------------->8---
> (define-public guile-aiscm
>   (let ((version "0.23.1")
>         (commit "c78b91edb7c17c6fbf3b294452f44e91d75e3c67")
>         (revision "0"))
>     (package
>       (name "guile-aiscm")
>       (version (git-version version revision commit))
>       (source (origin
>                 (method git-fetch)
>                 (uri (git-reference
>                       (url "https://github.com/wedesoft/aiscm")
>                       (commit commit)))
>                 (file-name (git-file-name name version))
>                 (sha256
>                  (base32
>                   "09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840"))))
> [..]
> --8<---------------cut here---------------end--------------->8---
> 
> ?  And your point is that “0.23.1” is redundant with
> “c78b91edb7c17c6fbf3b294452f44e91d75e3c67” because Git so why not
> just use “0.23.1” in ’origin’.  Right?
We typically don't let-bind version (i.e. we only bind revision and
commit, which is probably a wise idea as version is syntax inside
package), but sure, that's also a fine definition.  I would wonder why
you are doing that for a commit that is itself a release, but if you're
explaining to me "Well, I don't trust this weird wedesoft fellow, they
sound like the kind of person/company to change their tags more often
than their underwear" or, even better, had evidence of such a change,
I'd agree and push.

> As things currently stand, I do not think any rationale can be made
> in favor of one of the three main possible definitions (addressing
> by tag, by commit, using let).  The only weak justification for
> addressing by commit hash is that the lookup when falling back to
> SWH is easier, i.e., it is easier when the Git commit hash is known
> instead of URL+tag.
In my personal opinion, the version+raw commit style can be discredited
using Cantor's diagonal argument.

> These 200 packages can also be seen as real-world experiments
> complementing the other ways of addressing in order to find The Right
> Way for robust addressing.
If a comment spanning four lines is the most reasonable way of
explaining said style to others in the source code, that alone serves
as an argument for let-binding. 

> My personal preference, for what it is worth, is an explicit
> reference to the commit, i.e., the current definition or the ’let’
> one.  Note that this was also discussed: keep convenient things such
> as url+tag for ’uri’ and couple the checksum to an external service
> such as disarchive.guix.gnu.org; but the definitions would no longer
> be self-consistent.  Heh, The Right Thing is not obvious. :-)
I have trouble understanding this.  Using origin file-names and hashes
for computing fallbacks would be a good thing, no?  We could completely
decouple that from anything related to the method; if we have a backup
elsewhere, we can use it.

> In other words, version and tag are currently first-class while
> commit is second-class, somehow.  As you said «it allows us to derive
> commit from tag» (tag is mine).  And I think it is inherited from the
> long history of releasing software, which is now somehow inadequate.
> Obviously, I do not know how to do it, but it should be the other
> way around: commit first-class, which allows us to derive version
> second-class.
Let's put humans before machines, they're not our overlords (yet).

> PS: You said in initial email «(1) is more convenient; it allows us
> to derive commit from version, which is often done through an affine
> mapping.».
> 
> I do not understand the “affine mapping”.  Why would it be an affine
> mapping?  Well, I miss what is the affine space here, I am able to
> imagine the set but what would be the vector space?  Bah you are
> probably referring to maths I have never studied. :-)
I thought affine mappings were a fine substitute for bijective ones,
but it turns out this time it was I who sucks at maths.  The original
point I was making though, is that we often just have to prepend "v" or
some other version marker to get from the Guix version to the tag, for
which it doesn't matter if that's an affine mapping or a bijective one,
as it's both affine and bijective.
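
For what it's worth, the mapping in question is small enough to spell
out.  The Python sketch below mirrors `(string-append "v" version)`
from the package definition; the function names and the default prefix
are assumptions for illustration, and upstreams with other tag schemes
would need a different prefix or rule.

```python
def version_to_tag(version: str, prefix: str = "v") -> str:
    """Derive the upstream tag from the Guix version by prepending a marker."""
    return prefix + version


def tag_to_version(tag: str, prefix: str = "v") -> str:
    """Inverse direction: strip the marker, refusing tags that lack it."""
    if not tag.startswith(prefix):
        raise ValueError(f"tag {tag!r} does not start with {prefix!r}")
    return tag[len(prefix):]


tag = version_to_tag("0.23.1")
print(tag)                  # v0.23.1
print(tag_to_version(tag))  # 0.23.1
```

Because the prefix can always be stripped again, distinct versions
yield distinct tags, which is all the derivation needs.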

Cheers
