guix-devel

Re: Next Steps For the Software Heritage Problem


From: Ian Eure
Subject: Re: Next Steps For the Software Heritage Problem
Date: Thu, 27 Jun 2024 08:30:39 -0700
User-agent: mu4e 1.8.13; emacs 28.2

Hi Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

> Ian Eure <ian@retrospec.tv> skribis:
>
>> Guix sends archive requests to SWH. SWH gives that source code to
>> HuggingFace.  HuggingFace demonstrably violates the licenses.
>
> Which licenses? As has been said previously, and you can verify for
> yourself, it does not ingest code under copyleft licenses.


While this is what their paper claims[1], it doesn’t appear to be true, since I can see my own GPL’d code in the training set. I’ve since moved nearly all of my code off GitHub, but if you visit their "Am I in The Stack?" page[2] and enter my old username ("ieure"), you will see pretty much every repository I ever hosted there, including both unlicensed and GPL’d code. Some examples are hyperspace-el, nssh-el, tl1-mode, etc. While there aren’t LICENSE files in those repos, the file headers of all clearly indicate that they’re GPL’d.

Unfortunately, there is no way to check for the presence of code in the training set except by GitHub username.

What I don’t know for certain is whether these are in the training set because they came from SWH, or because HuggingFace obtained them through other means. Given that all the links for my GitHub username on the "Am I in The Stack?" page point back to SWH, it seems very likely that they came from SWH.

Thanks,

 — Ian

[1]: https://arxiv.org/pdf/2402.19173 "We also exclude copyleft-licensed code..."
[2]: https://huggingface.co/spaces/bigcode/in-the-stack
