guix-devel
Re: On raw strings in <origin> commit field


From: Liliana Marie Prikler
Subject: Re: On raw strings in <origin> commit field
Date: Fri, 31 Dec 2021 21:52:46 +0100
User-agent: Evolution 3.42.1

Hi,

On Friday, 31.12.2021 at 18:21 +0100, zimoun wrote:
> Redundancy adds one kind of robustness: resilience.  [...]  However
> this assumes all the redundant nodes of the web of nets will be still
> up, at least enough to have this…  robustness.  I too hope Guix
> will be popular and all redundancies still running when I am old
> or dead.  But I would not bet on that assumption.
I think we can live with one or two redundant nodes dying over time;
the great thing about redundancy is that things will still work out
fine if a sufficient number of them (typically one) is still around at
the time of query.  So it'd be robust enough to actually work for, let's
say, 10 years, but I have no illusions that our time machine will ever be
able to go back a lifetime (which is also a reason why I don't think
using commit hashes everywhere will magically result in the robustness
that appears to be desired here).
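
The "one surviving node is enough" property can be sketched in a few lines. This is only an illustration, not how Guix's downloaders are written: `fetch_with_fallback` and the source callables are hypothetical names, and the check against a known SHA-256 stands in for the hash recorded in an origin.

```python
import hashlib
from typing import Callable, Iterable, Optional


def fetch_with_fallback(
    sources: Iterable[Callable[[], bytes]],
    expected_sha256: str,
) -> Optional[bytes]:
    """Try each source in order; return the first payload whose SHA-256
    matches the expected value, or None if every source fails."""
    for fetch in sources:
        try:
            data = fetch()
        except Exception:
            # A dead node: skip it and try the next one.
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        # A node that answers with different content is treated like a
        # dead node, so a retagged or tampered copy is never accepted.
    return None


# Hypothetical stand-ins for an upstream server and two archives.
def dead_upstream() -> bytes:
    raise OSError("host unreachable")


def tampered_mirror() -> bytes:
    return b"not the tarball you asked for"


def surviving_archive() -> bytes:
    return b"the original tarball"


expected = hashlib.sha256(b"the original tarball").hexdigest()
result = fetch_with_fallback(
    [dead_upstream, tampered_mirror, surviving_archive], expected
)
```

Here `result` is the payload from the third source; if that one were gone as well, the function would return `None`, which is the situation where the time machine stops working.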

> What Timothy is doing with Preservation of Guix and a window of
> ~2 years shows that any web of nets is really fragile.  I do not see
> why the one we are building around Guix will be different.
> 
> Instead of trying to have robustness by adding more and more, from my
> point of view this is an occasion to rethink and try to have
> robustness with less.
> 
> I agree with you that various fallbacks are one good direction to go.
> SWH is one thing because it is currently well supported (by UNESCO
> for instance).  But many others are also worth considering, maybe
> IPFS or GNUnet.
Why not both?  Or all three, because for what it's worth SWH will also
be around for a while.  We would still need federated Disarchive
instances to match origins to SWH IDs, IPFS files and whatever GNUnet
has.

> > > It is difficult to know what information the ’uri’ field should
> > > contain for long-term robustness; it is a topic with a lot of
> > > unknowns.  Many solutions are around, but they require a strong
> > > change of habits, and changing my own habits is already hard, so a
> > > collective change is a big collective challenge. :-)
> > We're going back to Cantor's argument for raw commits.  I'm not
> > opposed to using commits as value of the commit field (let-bound
> > commits reflected in the version, that is), but let's not forget
> > that this robustness argument still presupposes that the (commit
> > tag) binding is the point of failure.  This probably holds to some
> > degree for "npm-something", but we also have a fair amount of e.g.
> > GNOME-related packages which we trust to have robust tags and the
> > only reason we don't use mirror://gnome to refer to them is because
> > it's not in GNOME mirrors (yet). 
> 
> Because this point of failure for tags potentially exists, the
> counter-measure would be to add more (check integrity, fallback to
> other servers, etc.) and even that could be impossible if the tag
> changed and the change propagated to all.
> 
> I am not saying either that we have to replace all the tags with
> commit hashes tomorrow.  My point is just that this tag in the ’uri’
> field does not appear to me to be a correct design.  For sure, I
> agree it is convenient but I think it is not The Right Thing.  Sadly,
> I do not know what The Right Thing is, and commit hash is probably not
> The Right Thing, but it seems to me a direction to explore.
I don't think there's a single Right Thing to be had here.

> > > For instance, SWH promotes swhid instead of DOI for referencing
> > > the publications.  I am not sure it is really popular outside a
> > > small French subgroup. ;-)
> > 
> > Completely off-topic, but isn't part of the point of DOIs that you
> > can fetch the revised paper as well?  I can understand putting
> > OpenData behind an SWH ID rather than a DOI, but the paper itself? 
> > Why?
> 
> If you find it off-topic, fine.  My point is to say that DOI
> (extrinsic) is known not to be The Right Thing for referencing
> and an intrinsic identifier is really better, but it seems hard to
> convince people to switch.
> 
> For instance, DOI is known to be fragile because it relies on an
> external centralized mutable index to have the bijection between the
> identifier and the content.  If today I cite doi:123abc then tomorrow
> when you reach this very same identifier doi:123abc, you have no
> guarantee that it is the same content.  Obviously, it is not an issue
> by itself, but in a scientific context where fraud is a real concern,
> once the centralized mutable index is corrupted, done!
I'm not sure to what extent there's a central index on all DOIs.  As
far as I can see most things are actually handled by DOI registration
agencies, which of course one could possibly corrupt in much the same
manner.

But you don't just cite a DOI, typically.  You also have all that
analog stuff like author, title, publisher, etc.  Assuming the
publisher (or an archive of their publications) still exists, you can
use that to cross-check.

> Because SWH-ID only depends on the content itself, it allows
> decentralization and integrity check.
> 
> Do not get me wrong, I am not comparing Git SHA-1 hash with an
> integrity check. :-)  Well, maybe the interested reader can take a
> look at:
> 
> <https://www.softwareheritage.org/2020/07/09/intrinsic-vs-extrinsic-identifiers/>
> 
> All in all, I was trying to point out that this extrinsic vs intrinsic
> thing is bigger than ’git-fetch’ and commit hash vs tag, and the root
> of it, as I see it, lies in exploring what the ’uri’ field should
> contain.  This DOI was an example to show the topic is not easy.
Point taken, "it's not easy" is something we can all easily agree on :)
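
For readers who want the distinction made concrete, here is a minimal sketch. It assumes the SWHID scheme for file contents, where the identifier is `swh:1:cnt:` followed by the Git blob SHA-1 of the bytes; the `index` dictionary and the `v1.0` key are made up and merely stand in for any mutable name-to-content mapping such as a tag or a DOI resolver.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Hash content the way Git hashes a blob: a 'blob <size>\\0' header
    followed by the raw bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def swhid_for_content(content: bytes) -> str:
    """Intrinsic identifier: derived from nothing but the bytes."""
    return "swh:1:cnt:" + git_blob_sha1(content)


# Extrinsic identifier: a name resolved through a mutable index
# (hypothetical data).
index = {"v1.0": b"original release\n"}

# Record the intrinsic identifier of what the name points to today.
intrinsic_id = swhid_for_content(index["v1.0"])

# Later, someone re-points the name at different content.
index["v1.0"] = b"silently retagged release\n"

# The name still resolves, but the intrinsic identifier exposes the change.
still_matches = swhid_for_content(index["v1.0"]) == intrinsic_id
```

After the re-pointing, `still_matches` is `False`: the intrinsic identifier detects the swap, but on its own it says nothing about where to find the original bytes.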

But the larger issue with DOIs vs SWH IDs is that I typically don't
need to refer to other papers by exact content, which those intrinsic
tagging mechanisms rely on.  If I quote a book from 2015 and you read
the 2025 edition, chances are that the main body is still the same,
with perhaps one or two typos fixed and a new foreword.  For future
academics, it might also be interesting to know whether what I claimed
back in 2022 still holds then or if it has since been superseded.

For historians, it might instead be valuable to periodically check
whether the content behind the DOI changes and if so archive a
new snapshot (similar to what archive.org, SWH et al. do).  Then, if
the DOI gets lost or some evil company or government tries to bring out
a censored version of my paper or the paper I'm citing, you can browse
the archive to check what's behind all those sections that have been
painted black.

Note that the archive must be able to be queried in much the same
manner as you'd type a query in a normal search engine.  If it only
relied on content tagging, the evil agency could simply hand you a
broken ID or even one that refers to a maliciously crafted page of
theirs.  Assuming they let you track down my paper in the first place.

TL;DR (even though you should read the full thing anyway): Despite what
archives specializing in intrinsic identifiers might tell you, they
are not a panacea.  I could go even further off-topic and show that
NaCl is a social construct, but I'd rather stop here.

Cheers


