Re: Next Steps For the Software Heritage Problem
From: Simon Tournier
Subject: Re: Next Steps For the Software Heritage Problem
Date: Thu, 20 Jun 2024 16:40:57 +0200
Hi MSavoritias, all,
On Thu, 20 Jun 2024 at 09:51, MSavoritias <email@msavoritias.me> wrote:
>> Not to avoid the question but from a pragmatic point of view, one
>> might ask if the source code you write and do not want to be included
>> in the training dataset, if this source code is concretely part of
>> that training dataset.
[...]
> Thats all fair and valid. Sadly tho SWH:
> -
> there is provenance. (unless i start searching
> HuggingFace.
Being concrete and explicit, could you please share:
1. Which part of your code is included in the pretraining dataset?
It’s easy: you can copy/paste a snippet and the search returns the
location it comes from.
https://huggingface.co/spaces/bigcode/search-v2a
2. Which of your code is included in the SWH archive?
Again, it’s easy: check out some commit of your repository, then
inside this repository, you can run:
echo "https://archive.softwareheritage.org/swh:1:dir:$(guix hash -S git -f hex -H sha1 .)"
Do not miss the ’.’ (dot) once you have entered the repository. This
command returns the SWHID. In other words, using this identifier, you
can find out whether the repository is stored by SWH. (Be careful with
temporary artifacts such as .go files.)
Or you can also check for one specific content:
$ echo "https://archive.softwareheritage.org/swh:1:cnt:$(guix hash -S git -f hex -H sha1 COPYING)"
https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
And the URL displays the content of the file COPYING; here, the GPL 3
license, for instance.
3. Where is such source code from #1 and #2 packaged by Guix?
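For the single-file check in item 2, here is a minimal sketch that computes the same kind of identifier without Guix. It relies on the fact that the content (cnt) SWHID uses the same SHA-1 as a Git blob: the hash of "blob <size>\0" followed by the file bytes. The function names swhid_for_content and archive_url are mine, for illustration only; directory (dir) identifiers are more involved and not covered here.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Return the swh:1:cnt: identifier for raw file bytes.

    Uses the Git blob hashing scheme: SHA-1 over the header
    b"blob <size>\\0" followed by the content itself.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def archive_url(swhid: str) -> str:
    """Build the archive URL in the same form as the echo command above."""
    return f"https://archive.softwareheritage.org/{swhid}"
```

To try it on a file, read it in binary mode, for instance archive_url(swhid_for_content(open("COPYING", "rb").read())), and open the resulting URL to see whether the archive knows that content.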
That said, if the source is hosted on GitHub or GitLab.com or SourceHut
or CodeBerg or some other popular forge, or even mirrored without your
consent on one of these, please consider that your code has been
ingested by ChatGPT without any means to verify. Obviously, that’s not
an argument to accept the situation with HuggingFace and I understand
that you do not want your publicly released copyleft source code
to be reused by any LLM.
However, as said several times, whether this wish for non-inclusion can
be honoured goes beyond your own wishes once you have publicly released
such source code under some copyleft license. I hope we agree on that.
Again, I am not trying to avoid something. And again, we all have heard
your points. Nothing is ignored. To my knowledge, the path forward is
not yet well-defined.
Since we are discussing at length with various different inputs, it
means that a common understanding and/or opinion does not seem obvious.
>> Well, I do not know if the outcome will be aligned with your current
>> opinion, but be sure that your concerns as the others raised by Guix
>> community members are taking into account.
>
> Thank you for giving me an honest and detailed answer.
I feel you are being pushy on the topic and, for what my opinion is
worth, it is not helpful to raise again and again that you want a way to
opt out. Yeah, people got it. :-) And you are probably not alone, I guess.
It would help if you could provide some source code that you wrote and
answer the three criteria above: included in the pretraining dataset,
included in SWH, packaged by Guix.
I do not have special information from SWH but I am sure SWH people are
working on the topic. And again, maybe the outcome will not be aligned
with your opinion. Another story.
Now, the other question you ask of Guix: do we continue to help SWH in
harvesting? You propose to stop, IIUC. Ok, we got it, too. :-) From my
point of view, the path forward is not to speak in the abstract but to
ground the discussion in concrete numbers; it would help in bounding
what we are speaking about.
Concretely, if you would like to be able to opt out, could you point to:
1. the piece of the Guix source code of which you are the author?
2. source code of which you are the author that is packaged by Guix?
Again, I am not trying to avoid the discussion. Instead, I would prefer
to ground the discussion in concrete examples. Then it would seem easier
to me to make progress.
As Greg or Ekaitz also wrote: opting out has implications for the meaning
of freedom behind “free software”.
IMHO, wanting to opt out does not mean that we could, would be able to,
or would be allowed to. Therefore, instead of holding opinions in the
abstract, let’s try to make progress and start with the concrete: which
piece of source code are we speaking about?
Cheers,
simon