Re: Concerns/questions around Software Heritage Archive
From: Simon Tournier
Subject: Re: Concerns/questions around Software Heritage Archive
Date: Mon, 18 Mar 2024 10:28:55 +0100
Hi,
On Sat, 16 Mar 2024 at 08:52, Ian Eure <ian@retrospec.tv> wrote:
> They appear to be using the archive to build LLMs:
> https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
About LLMs, Software Heritage has made a clear statement:
https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code
Quoting:
We feel that the question is no longer whether LLMs for code
should be built. They are already being built, independently of
what we do, and there is no turning back. The real question is
how they should be built and whom they should benefit.
Principles:
1. Knowledge derived from the Software Heritage archive must be
given back to humanity, rather than monopolized for private
gain. The resulting machine learning models must be made available
under a suitable open license, together with the documentation and
toolings needed to use them.
2. The initial training data extracted from the Software Heritage
archive must be fully and precisely identified by, for example,
publishing the corresponding SWHID identifiers (note that, in the
context of Software Heritage, public availability of the initial
training data is a given: anyone can obtain it from the
archive). This will enable use cases such as: studying biases
(fairness), verifying if a code of interest was present in the
training data (transparency), and providing appropriate attribution
when generated code bears resemblance to training data (credit),
among others.
3. Mechanisms should be established, where possible, for authors to
exclude their archived code from the training inputs before model
training begins.
I hope this addresses your concerns to some extent.
Moreover, you wrote: « I want absolutely nothing to do with them. »
Maybe there is a misunderstanding on your side about what “free
software” and the GPL mean: once software is “free software”, you cannot
prevent people from using “your” free software for purposes you dislike.
If you want to restrict how the software you create may be used, you need
to state that explicitly in the license. And if you do, your software
will not be considered “free software”.
That’s the double-edged sword of “free software”. :-)
Cheers,
simon