From: Simon Tournier
Subject: Re: Guidelines for pre-trained ML model weight binaries (Was re: Where should we put machine learning model parameters?)
Date: Tue, 11 Apr 2023 10:37:08 +0200

Hi Nathan,

Maybe there is a misunderstanding. :-)

The subject is “Guidelines for pre-trained ML model weight binaries”.  My
opinion is that such a guideline should consider only the license of the
data.  Other considerations appear to me hard to make conclusive.


What I am trying to express is that:

 1) Bit-identical rebuilds are worthwhile, for sure!, and they address a
    class of attacks (e.g., the “trusting trust” attack described in
    1984 [1]).  As an aside, I find this message by John Gilmore [2]
    very instructive about the history of bit-identical rebuilds.
    (Bit-identical rebuilds had already been considered by GNU in the
    early 90s.)

 2) Bit-identical rebuilds are *not* the solution to everything.
    Obviously.  Many attacks survive bit-identical builds.  Consider
    the package ’python-pillow’: it builds bit-identically, but before
    commit c16add7fd9 it was subject to CVE-2022-45199.  Only the human
    expertise that produced the patch [3] protects against the attack.

Considering this, I am claiming that:

 a) Bit-identical re-training of ML models is similar to point 2; in
    other words, bit-identical re-training of ML model weights does not
    protect much against biased training.  The only protection against
    biased training is human expertise.

    Note that if the re-training is not bit-identical, what would we
    conclude about trust?  It falls under the same cases as the
    non-bit-identical rebuilds of packages such as Julia or even Guile
    itself.

 b) The resources (human, financial, hardware, etc.) for re-training
    are, in most cases, not affordable.  Not because the task would be
    difficult or complex (that is covered by point a), but because the
    resource requirements are simply too high.

    Consider that, in some cases where we do not have the resources, we
    already do not bootstrap from source.  See the GHC compiler (*) or
    genomic references.  And I am not saying it is impossible or that
    we should not try; instead, I am saying we have to be pragmatic in
    some cases.


Therefore, my opinion is that pre-trained ML model weight binaries
should be included like any other data, and the lack of bootstrapping
is not an issue for inclusion in these particular cases.

The question for the inclusion of such pre-trained ML model weight
binaries is the license.

Last, from my point of view, a tangential question is the size of such
pre-trained ML model weight binaries.  I do not know whether they fit
in the store.

Well, that’s my opinion on this “Guidelines for pre-trained ML model
weight binaries”. :-)



(*) And Ricardo is training hard!  See [4]; part 2 is not yet
published, IIRC.

1: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
2: https://lists.reproducible-builds.org/pipermail/rb-general/2017-January/000309.html
3: https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/python-pillow-CVE-2022-45199.patch
4: https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html

Cheers,
simon
