Re: Conda environments and reproducibility

guix-science

From:	Simon Tournier
Subject:	Re: Conda environments and reproducibility
Date:	Mon, 13 Mar 2023 13:38:52 +0100

Hi,

On Mon, 13 Mar 2023 at 12:00, Ricardo Wurmus <rekado@elephly.net> wrote:

>> If the process of reproducing the environment is going to fail at some
>> point, I wonder if we could accelerate this process by defining a more
>> complex environment.  Any ideas?

Maybe something using PyTorch or some other ML framework.

> A more complex environment would increase the chance of failure because
> it increases the complexity of the challenge to the resolver.  While it
> would be a useful demonstration to see the resolver fail I think it is
> the least damning kind of failure.

Yes, I agree the solver will be the last thing to break.  Well, from my
understanding of [1], the breakage of the Conda solver depends on the
state of their index.  Quoting [1]:

        This is where the SAT solver will act. It will use the list of MatchSpec
        objects to pick a number of PackageRecord entries from the index, thus
        building the “final state of the solved environment”. This is detailed
        later in this deep dive guide, if you need more info. 

so the more complex the environment, the harder the SAT problem becomes.
And finding a solution can be slow.  That’s why they implemented various
solvers [2].  And it is not clear to me whether the solvers described in
[1] and [2] always lead to the same environment.
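
One way to check that last point would be to resolve the same
specification with each solver and compare the results.  Below is a
minimal sketch in Python, assuming each resolved environment has been
dumped as "name=version=build" lines (the format printed by
"conda list --export").  The two listings and the helper names
parse_export and diff_environments are invented for illustration; they
are not real solver output nor part of Conda.

```python
def parse_export(text):
    """Parse 'name=version=build' lines into {name: (version, build)}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        name, version, build = line.split("=")
        packages[name] = (version, build)
    return packages


def diff_environments(a, b):
    """Return {name: (entry_in_a, entry_in_b)} for each differing package."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Hypothetical listings, made up for this example only.
classic = """\
# solver: classic
numpy=1.24.2=py311h0
openssl=3.0.8=h0
"""
libmamba = """\
# solver: libmamba
numpy=1.24.2=py311h0
openssl=3.1.0=h1
"""

drift = diff_environments(parse_export(classic), parse_export(libmamba))
print(drift)
```

An empty result would mean both solvers picked the same packages for
that specification; it would say nothing about other specifications or
about a later state of the index.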

To my knowledge, the hardness of this problem is well identified, for
instance by the Mancoosi project [3]; in short, it reads:

        4.2 Package installation is NP-Complete

        Theorem 1: Checking whether a single package P can be installed,
                   given a repository R, is NP-complete.

        4.2.4 Conclusions

        Despite the apparent differences, the constraint languages in DEB and
        RPM are sensibly equivalent in expressiveness, and the associated
        installation problems are both NP-complete. 

        This means that automatic package installation tools like APT, URPMI or
        SMART live dangerously on the edge of intractability, and must carefully
        apply heuristics that may be either safe (the approach advocated by
        SMART), and hence still not guaranteed to avoid intractability, or
        unsafe, thus accepting the risk of not always finding a solution when it
        exists. 
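
To make the theorem more tangible, here is a toy sketch in Python of the
installability check.  It is not how APT or Conda work: the repository
layout ("depends" as a list of alternatives, "conflicts" as a list of
names), the package names and the function names are all assumptions of
mine.  It simply tries every subset of the repository, which is
exponential in the number of packages; real tools rely on SAT solvers
or heuristics precisely to avoid this.

```python
from itertools import combinations


def is_consistent(selection, repo):
    """Check that every dependency is met and no conflict is present."""
    for pkg in selection:
        meta = repo[pkg]
        # Each dependency clause is a list of alternatives; one must be selected.
        if any(not (set(clause) & selection) for clause in meta["depends"]):
            return False
        if set(meta["conflicts"]) & selection:
            return False
    return True


def installable(target, repo):
    """Return a smallest consistent set containing target, or None."""
    others = [p for p in repo if p != target]
    for size in range(len(others) + 1):
        for extra in combinations(others, size):
            selection = {target, *extra}
            if is_consistent(selection, repo):
                return selection
    return None


# Hypothetical repository, invented for this example.
repo = {
    "app-1": {"depends": [["lib-1", "lib-2"], ["ssl-3"]], "conflicts": []},
    "lib-1": {"depends": [], "conflicts": ["ssl-3"]},
    "lib-2": {"depends": [["ssl-3"]], "conflicts": []},
    "ssl-3": {"depends": [], "conflicts": []},
    "old-1": {"depends": [["lib-1"], ["ssl-3"]], "conflicts": []},
}

print(installable("app-1", repo))
print(installable("old-1", repo))
```

Here "app-1" is installable together with "lib-2" and "ssl-3", while
"old-1" is not installable at all, because it needs both "lib-1" and
"ssl-3" and these two conflict.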

Therefore, I do not see why Conda would be different.  However, it could
indeed be hard to construct a concrete example of a failure in the SAT
solver part.  Moreover, the Conda documentation reads [1],

        Explicit package installs

        These commands do not need a solver because the requested packages are
        expressed with a direct URL or path to a specific tarball. Instead of a
        MatchSpec, we already have a PackageRecord-like entity! For this to
        work, all the requested packages need to be URLs or paths. They can be
        typed in the command line or in a text file including a @EXPLICIT line. 

        Since the solver is not involved, the dependencies of the explicit
        package(s) are not processed at all. This can leave the environment in
        an inconsistent state, which can be fixed by running conda update --all,
        for example. 

        Explicit installs are taken care of by the explicit function.
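
As a rough illustration of what such a text file carries, here is a
small Python sketch that reads a file with an @EXPLICIT line and checks
that every entry is a URL or a path, as the quoted passage requires.
The layout (comment lines, the marker, then one entry per line) is my
reading of the documentation; the URLs are placeholders and the function
names are hypothetical, not Conda's API.

```python
def parse_explicit(text):
    """Return the entries listed after the @EXPLICIT marker."""
    entries = []
    seen_marker = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        if line == "@EXPLICIT":
            seen_marker = True
            continue
        if seen_marker:
            entries.append(line)
    if not seen_marker:
        raise ValueError("no @EXPLICIT line found")
    return entries


def is_url_or_path(entry):
    """Accept direct URLs and absolute paths only (an assumption of this sketch)."""
    return entry.startswith(("http://", "https://", "file://", "/"))


# Placeholder content; these URLs do not point to real packages.
spec = """\
# platform: linux-64
@EXPLICIT
https://example.org/channel/linux-64/numpy-1.24.2-py311h0.tar.bz2
https://example.org/channel/linux-64/openssl-3.0.8-h0.tar.bz2
"""

entries = parse_explicit(spec)
print(all(is_url_or_path(e) for e in entries), len(entries))
```

Such a file names exact tarballs, so no solver is needed, but
re-creating the environment then depends on each of these URLs still
being served.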

For sure, the failure of Conda is by design.  And as with many things in
life, people only believe what they see with their own eyes. :-)

1: https://docs.conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html
2: https://conda.github.io/conda-libmamba-solver/libmamba-vs-classic/
3: https://www.mancoosi.org/edos/algorithmic/

>> Simon Tournier <zimon.toutoune@gmail.com> writes:
>>
>>> 1. also use the image continuumio/miniconda3:latest
>>> 2. install Miniconda on the top of the Docker image of Debian
>>>   unstable and run "apt update && apt upgrade"
>>> 
>>> And I expect that #2 will break first, then #1 and last the current
>>> one.
>>
>> Could you elaborate on this? For context the current pipeline pulls a
>> pinned miniconda image then updates conda (=conda update conda=).  Do
>> you expect system libraries (I mean software installed through apt,
>> not managed by conda) to influence the conda environment creation?  My
>> current understanding is that conda brings its own copies of these
>> libraries without relying on whatever was/will be installed through
>> other ways (e.g. apt).
>
> This depends on the packages.  There are packages that do link with
> system libraries, and these are provided by a base image in which the
> binary artefacts are built.

As Ricardo explained, sometimes Conda relies on system libraries.  Guix
makes the assumption of a compatible Linux kernel.  Conda also makes
assumptions and, to my knowledge, they are less strict about isolated
environments.

That’s why replacing the base image could also help to expose examples
where it breaks.

As an aside, I was at a Reproducible Research workshop last week and I
talked with the developer of BenchOpt [4].  Their aim is to keep the
computational stack of some ML framework working as time passes, by
making their benchmarks evolve along with it.  In other words, they go
in the opposite direction to Guix.  If they do that, it is because it is
not possible to simply run the old stack again. :-)

4: https://benchopt.github.io/

It is hard to predict beforehand where Conda will break. :-)  From my
point of view, from most to least probable:

 1. because of the underlying Linux distribution base
 2. because of the SAT solver

Well, for testing #1, I propose:

 a) to also run the pipeline using continuumio/miniconda3:latest
 b) to run an installation of Conda
     i) on top of Debian
     ii) on top of Ubuntu
    and then run the script

As a corollary, it will also test #2. ;-)

The current script is about NumPy; switching it to PyTorch might
accelerate the process.

Thanks for the discussion about this topic.  If no one beats me to it, I
will adapt .gitlab-ci.yml.  Well, do not hold your breath… holidays
first! ;-)

Cheers,
simon
