Re: Concerns/questions around Software Heritage Archive
From: Tomas Volf
Subject: Re: Concerns/questions around Software Heritage Archive
Date: Sat, 16 Mar 2024 20:49:44 +0100
On 2024-03-16 12:06:27 -0700, Ian Eure wrote:
>
> Christopher Baines <mail@cbaines.net> writes:
>
> > [[PGP Signed Part:Undecided]]
> >
> > Ian Eure <ian@retrospec.tv> writes:
> >
> > > Hi Guixy people,
> > >
> > > I’d never heard of SWH before I started hacking on Guix last fall,
> > > and it struck me as rather a good idea. However, I’ve seen some
> > > things lately which have soured me on them.
> > >
> > > They appear to be using the archive to build LLMs:
> > > https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
> > >
> > > I was also distressed to see how poorly they treated a developer who
> > > wished to update their name:
> > > https://cohost.org/arborelia/post/4968198-the-software-heritag
> > > https://cohost.org/arborelia/post/5052044-the-software-heritag
> > >
> > > GPL’d software I’ve created has been packaged for Guix, which I
> > > assume means it’s been included in SWH. While I’m dealing with
> > > their (IMO: unethical) opt-out process, I likely also need to stop
> > > new copies from being uploaded again in the future.
> > >
> > > Is there a way to indicate, in a Guix package, that it should
> > > *never* be included in SWH?
> >
> > Not currently, and I don't really see the point in such a mechanism. If
> > you really never want them to store your code, then you need to license
> > it accordingly (and not make it free software).
> >
>
> I don’t want my code in SWH *because* it’s free. A primary use of LLMs is
> laundering freely licensed software into proprietary, commercial projects
> through "AI" code completion and generation. Any Free software in an LLM
> training set can and will be used in violation of its license, without a
> clear path for the author to seek recourse. I deleted my code off Github
> and abandoned it completely for this exact reason, and am deeply irked to be
> going through this nonsense again.
>
> A more salient question may be: Is there a process within Guix (either the
> program or the organization) which uploads source to SWH? Or does it rely
> on SWH independently?
`guix lint PKG-NAME' schedules SWH archival if possible. No code is directly
uploaded (at least currently), so assuming you have a list of SWH's IP
addresses, it should be possible to block them. At least AFAIK.

If you have the list, or know how to get it, could you share it? I would be
interested in blocking it as well from my git hosting.
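
To illustrate the kind of check a git host or reverse proxy would have to
make, here is a minimal sketch in Python, assuming such a list exists. The
ranges below are documentation placeholders, not real SWH addresses (which I
do not have), and `is_blocked' is a hypothetical helper name.

```python
import ipaddress

# Hypothetical placeholder ranges (reserved documentation blocks),
# NOT real SWH addresses; substitute the real list once it is known.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixing both families in one list is safe.
    return any(addr in net for net in BLOCKED_NETWORKS)
```

In practice the same list would more likely be fed to a firewall or web
server rule than checked in application code, but the matching logic is the
same.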
>
> If the latter, my problem is likely solved by blocking SWH at my network
> edge and opting out of their archive (or trying to) and the downstream
> training models they’ve already put it in. If the former, the only control
> I currently have to protect my license is removing packages from Guix which
> contain it. I don’t want that outcome.
>
> Noting also that the path here seems to be SWH->huggingface->bigcode
> training set, and the opt-out process for the training set appears to be a
> complete sham. To opt-out, you must create a Github Issue; only one opt-out
> has *ever* been processed, and there are 200+ sitting there, many with no
> response for nearly a year[1]. I want no part of any of this.
>
>
> > > Is there a way to tell Guix to never download source from SWH?
> >
> > Also no, and it's probably best to do this at the network level on your
> > systems/network if you want this to be the case.
> >
>
> I’ll investigate this, though I’d prefer if there was a way to configure
> source mirrors in the Guix daemon.
>
>
> > Skipping back to this though:
> >
> > > I was also distressed to see how poorly they treated a developer who
> > > wished to update their name:
> > > https://cohost.org/arborelia/post/4968198-the-software-heritag
> > > https://cohost.org/arborelia/post/5052044-the-software-heritag
> >
> > This is probably worth thinking about as Guix is in a similar situation
> > regarding publishing source code, and people potentially wanting to
> > change historical source code both in things Guix packages and Guix
> > itself.
> >
> > As with Software Heritage, there are cryptographic implications for
> > rewriting the Git history and modifying source tarballs or nars that
> > contain source code.
> >
> > We have 17TiB of compressed source code and built software stored for
> > bordeaux.guix.gnu.org now and we should probably work out how to handle
> > people asking for things to be removed or changed (for any and all
> > reasons).
> >
> > It's probably worth working out our position on this in advance of
> > someone asking.
> >
>
> Yes, I agree that Guix needs a better solution for this.
>
> Thanks,
>
> — Ian
>
> [1]: https://github.com/bigcode-project/opt-out-v2/issues
>
T.
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.