guix-devel

Re: Concerns/questions around Software Heritage Archive


From: Kaelyn
Subject: Re: Concerns/questions around Software Heritage Archive
Date: Mon, 18 Mar 2024 16:27:24 +0000

On Monday, March 18th, 2024 at 2:28 AM, Simon Tournier 
<zimon.toutoune@gmail.com> wrote:

> 
> Hi,
> 
> On sam., 16 mars 2024 at 08:52, Ian Eure ian@retrospec.tv wrote:
> 
> > They appear to be using the archive to build LLMs:
> > https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
> 
> 
> About LLM, Software Heritage made a clear statement:
> 
> https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code
> 
> Quoting:
> 
> We feel that the question is no longer whether LLMs for code
> should be built. They are already being built, independently of
> what we do, and there is no turning back. The real question is
> how they should be built and whom they should benefit.
> 
> Principles:
> 
> 1. Knowledge derived from the Software Heritage archive must be
> given back to humanity, rather than monopolized for private
> gain. The resulting machine learning models must be made available
> under a suitable open license, together with the documentation and
> toolings needed to use them.
> 
> 2. The initial training data extracted from the Software Heritage
> archive must be fully and precisely identified by, for example,
> publishing the corresponding SWHID identifiers (note that, in the
> context of Software Heritage, public availability of the initial
> training data is a given: anyone can obtain it from the
> archive). This will enable use cases such as: studying biases
> (fairness), verifying if a code of interest was present in the
> training data (transparency), and providing appropriate attribution
> when generated code bears resemblance to training data (credit),
> among others.
> 
> 3. Mechanisms should be established, where possible, for authors to
> exclude their archived code from the training inputs before model
> training begins.
> 
> I hope it clarifies your concerns to some extent.
> 
> 
> Moreover, you wrote: « I want absolutely nothing to do with them. »
> 
> Maybe there is a misunderstanding on your side about what “free
> software” and GPL means because once “free software”, you cannot prevent
> people to use “your” free software for any purposes you dislike.
> 
> If you want to bound the use cases of the software you create, you need
> to explicitly specify that in the license. And if you do, your software
> will not be considered as “free software”.
> 
> That’s the double sword of “free software”. :-)

Hi,

I want to stress that I am not a lawyer, but my (possibly outdated)
understanding of what machine learning models can and cannot do with regard to
their training data, and a reading of parts of the GPL 2 and 3, suggest that at
best the SWH's LLM is in a legal grey area and at worst directly violates the
license of GPL code that it ingests for training. As such, I don't think it is
accurate to say "you cannot prevent people to use “your” free software for any
purposes you dislike" in response to concerns about automatic inclusion of free
software into LLM training sets. Specifically, my understanding (as of a few
years ago) is that LLMs have difficulty tracing and attributing various
aspects of their training to specific inputs, which seems to be in violation
of e.g. Sections 5 and 6 of the GPL. Specific quotes from those sections at
https://www.gnu.org/licenses/gpl-3.0.html:

From Section 5:
> You may convey a work based on the Program, or the modifications to produce 
> it from the Program, in the form of source code under the terms of section 4, 
> provided that you also meet all of these conditions:
> 
>     a) The work must carry prominent notices stating that you modified it, 
> and giving a relevant date.
>     b) The work must carry prominent notices stating that it is released 
> under this License and any conditions added under section 7. This requirement 
> modifies the requirement in section 4 to “keep intact all notices”.
>     c) You must license the entire work, as a whole, under this License to 
> anyone who comes into possession of a copy. This License will therefore 
> apply, along with any applicable section 7 additional terms, to the whole of 
> the work, and all its parts, regardless of how they are packaged. This 
> License gives no permission to license the work in any other way, but it does 
> not invalidate such permission if you have separately received it.
>     d) If the work has interactive user interfaces, each must display 
> Appropriate Legal Notices; however, if the Program has interactive interfaces 
> that do not display Appropriate Legal Notices, your work need not make them 
> do so.

and from Section 6:
> You may convey a covered work in object code form under the terms of sections 
> 4 and 5, provided that you also convey the machine-readable Corresponding 
> Source under the terms of this License, in one of these ways:
> 
>     a) Convey the object code in, or embodied in, a physical product 
> (including a physical distribution medium), accompanied by the Corresponding 
> Source fixed on a durable physical medium customarily used for software 
> interchange.
>     b) Convey the object code in, or embodied in, a physical product 
> (including a physical distribution medium), accompanied by a written offer, 
> valid for at least three years and valid for as long as you offer spare parts 
> or customer support for that product model, to give anyone who possesses the 
> object code either (1) a copy of the Corresponding Source for all the 
> software in the product that is covered by this License, on a durable 
> physical medium customarily used for software interchange, for a price no 
> more than your reasonable cost of physically performing this conveying of 
> source, or (2) access to copy the Corresponding Source from a network server 
> at no charge.
>     c) Convey individual copies of the object code with a copy of the written 
> offer to provide the Corresponding Source. This alternative is allowed only 
> occasionally and noncommercially, and only if you received the object code 
> with such an offer, in accord with subsection 6b.
>     d) Convey the object code by offering access from a designated place 
> (gratis or for a charge), and offer equivalent access to the Corresponding 
> Source in the same way through the same place at no further charge. You need 
> not require recipients to copy the Corresponding Source along with the object 
> code. If the place to copy the object code is a network server, the 
> Corresponding Source may be on a different server (operated by you or a third 
> party) that supports equivalent copying facilities, provided you maintain 
> clear directions next to the object code saying where to find the 
> Corresponding Source. Regardless of what server hosts the Corresponding 
> Source, you remain obligated to ensure that it is available for as long as 
> needed to satisfy these requirements.
>     e) Convey the object code using peer-to-peer transmission, provided you 
> inform other peers where the object code and Corresponding Source of the work 
> are being offered to the general public at no charge under subsection 6d.

And from the GPL 2 text at 
https://www.gnu.org/licenses/old-licenses/gpl-2.0.html:

> 2. You may modify your copy or copies of the Program or any portion of it, 
> thus forming a work based on the Program, and copy and distribute such 
> modifications or work under the terms of Section 1 above, provided that you 
> also meet all of these conditions:
> 
>     a) You must cause the modified files to carry prominent notices stating 
> that you changed the files and the date of any change. 
>     b) You must cause any work that you distribute or publish, that in whole 
> or in part contains or is derived from the Program or any part thereof, to be 
> licensed as a whole at no charge to all third parties under the terms of this 
> License. 
>     c) If the modified program normally reads commands interactively when 
> run, you must cause it, when started running for such interactive use in the 
> most ordinary way, to print or display an announcement including an 
> appropriate copyright notice and a notice that there is no warranty (or else, 
> saying that you provide a warranty) and that users may redistribute the 
> program under these conditions, and telling the user how to view a copy of 
> this License. (Exception: if the Program itself is interactive but does not 
> normally print such an announcement, your work based on the Program is not 
> required to print an announcement.) 
> 
> These requirements apply to the modified work as a whole. If identifiable 
> sections of that work are not derived from the Program, and can be reasonably 
> considered independent and separate works in themselves, then this License, 
> and its terms, do not apply to those sections when you distribute them as 
> separate works. But when you distribute the same sections as part of a whole 
> which is a work based on the Program, the distribution of the whole must be 
> on the terms of this License, whose permissions for other licensees extend to 
> the entire whole, and thus to each and every part regardless of who wrote it.
> 
> Thus, it is not the intent of this section to claim rights or contest your 
> rights to work written entirely by you; rather, the intent is to exercise the 
> right to control the distribution of derivative or collective works based on 
> the Program.
> 
> In addition, mere aggregation of another work not based on the Program with 
> the Program (or with a work based on the Program) on a volume of a storage or 
> distribution medium does not bring the other work under the scope of this 
> License.
> 
> 3. You may copy and distribute the Program (or a work based on it, under 
> Section 2) in object code or executable form under the terms of Sections 1 
> and 2 above provided that you also do one of the following:
> 
>     a) Accompany it with the complete corresponding machine-readable source 
> code, which must be distributed under the terms of Sections 1 and 2 above on 
> a medium customarily used for software interchange; or, 
>     b) Accompany it with a written offer, valid for at least three years, to 
> give any third party, for a charge no more than your cost of physically 
> performing source distribution, a complete machine-readable copy of the 
> corresponding source code, to be distributed under the terms of Sections 1 
> and 2 above on a medium customarily used for software interchange; or, 
>     c) Accompany it with the information you received as to the offer to 
> distribute corresponding source code. (This alternative is allowed only for 
> noncommercial distribution and only if you received the program in object 
> code or executable form with such an offer, in accord with Subsection b 
> above.) 
> 
> The source code for a work means the preferred form of the work for making 
> modifications to it. For an executable work, complete source code means all 
> the source code for all modules it contains, plus any associated interface 
> definition files, plus the scripts used to control compilation and 
> installation of the executable. However, as a special exception, the source 
> code distributed need not include anything that is normally distributed (in 
> either source or binary form) with the major components (compiler, kernel, 
> and so on) of the operating system on which the executable runs, unless that 
> component itself accompanies the executable.
> 
> If distribution of executable or object code is made by offering access to 
> copy from a designated place, then offering equivalent access to copy the 
> source code from the same place counts as distribution of the source code, 
> even though third parties are not compelled to copy the source along with the 
> object code.
> 
> 4. You may not copy, modify, sublicense, or distribute the Program except as 
> expressly provided under this License. Any attempt otherwise to copy, modify, 
> sublicense or distribute the Program is void, and will automatically 
> terminate your rights under this License. However, parties who have received 
> copies, or rights, from you under this License will not have their licenses 
> terminated so long as such parties remain in full compliance. 

Again, I want to emphasize IANAL. As a layman, my understanding of ML model 
training is that it cannot maintain enough of a trace between GPLed input code 
and its (modified) use in the output to maintain the licensing and distribution 
requirements from either the GPL 3 sections above or the GPL 2 sections 2 and 
3. I also believe that section 4 of the GPL 2 directly applies to these LLM 
code models.

There are also the potential licensing issues of mixing (potentially)
incompatible licenses in the training data sets, such as GPL and CDDL code,
with no way to distinguish or separate the (arguably) modified sources from
each other.

Just my $0.02 USD on the LLM side of the matter, as much of the discussion
seems to be around the cost vs benefit of rewriting the git history for
updating personally identifying information.

Cheers,
Kaelyn

> 
> Cheers,
> simon


