From: | Ian Eure |
Subject: | Re: Concerns/questions around Software Heritage Archive |
Date: | Sat, 16 Mar 2024 12:06:27 -0700 |
User-agent: | mu4e 1.8.13; emacs 28.2 |
Christopher Baines <mail@cbaines.net> writes:
> [[PGP Signed Part:Undecided]]
> Ian Eure <ian@retrospec.tv> writes:
>
>> Hi Guixy people,
>>
>> I’d never heard of SWH before I started hacking on Guix last fall, and it struck me as rather a good idea. However, I’ve seen some things lately which have soured me on them. They appear to be using the archive to build LLMs:
>> https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
>>
>> I was also distressed to see how poorly they treated a developer who wished to update their name:
>> https://cohost.org/arborelia/post/4968198-the-software-heritag
>> https://cohost.org/arborelia/post/5052044-the-software-heritag
>>
>> GPL’d software I’ve created has been packaged for Guix, which I assume means it’s been included in SWH. While I’m dealing with their (IMO: unethical) opt-out process, I likely also need to stop new copies from being uploaded again in the future.
>>
>> Is there a way to indicate, in a Guix package, that it should *never* be included in SWH?
>
> Not currently, and I don't really see the point in such a mechanism. If you really never want them to store your code, then you need to license it accordingly (and not make it free software).
I don’t want my code in SWH *because* it’s free. A primary use of LLMs is laundering freely licensed software into proprietary, commercial projects through "AI" code completion and generation. Any Free software in an LLM training set can and will be used in violation of its license, without a clear path for the author to seek recourse. I deleted my code off Github and abandoned it completely for this exact reason, and am deeply irked to be going through this nonsense again.
A more salient question may be: Is there a process within Guix (either the program or the organization) which uploads source to SWH? Or does it rely on SWH independently?
If the latter, my problem is likely solved by blocking SWH at my network edge and opting out of their archive (or trying to) and the downstream training models they’ve already put it in. If the former, the only control I currently have to protect my license is removing packages from Guix which contain it. I don’t want that outcome.
Noting also that the path here seems to be SWH->huggingface->bigcode training set, and the opt-out process for the training set appears to be a complete sham. To opt-out, you must create a Github Issue; only one opt-out has *ever* been processed, and there are 200+ sitting there, many with no response for nearly a year[1]. I want no part of any of this.
>> Is there a way to tell Guix to never download source from SWH?
>
> Also no, and it's probably best to do this at the network level on your systems/network if you want this to be the case.
I’ll investigate this, though I’d prefer if there was a way to configure source mirrors in the Guix daemon.
> Skipping back to this though:
>
>> I was also distressed to see how poorly they treated a developer who wished to update their name:
>> https://cohost.org/arborelia/post/4968198-the-software-heritag
>> https://cohost.org/arborelia/post/5052044-the-software-heritag
>
> This is probably worth thinking about as Guix is in a similar situation regarding publishing source code, and people potentially wanting to change historical source code both in things Guix packages and Guix itself. Like Software Heritage, there's cryptographical implications for rewriting the Git history and modifying source tarballs or nars that contain source code.
>
> We have 17TiB of compressed source code and built software stored for bordeaux.guix.gnu.org now and we should probably work out how to handle people asking for things to be removed or changed (for any and all reasons).
>
> It's probably worth working out our position on this in advance of someone asking.
Yes, I agree that Guix needs a better solution for this.

Thanks,

— Ian

[1]: https://github.com/bigcode-project/opt-out-v2/issues