guix-devel

From: zimoun
Subject: Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator)
Date: Mon, 23 Nov 2020 23:56:07 +0100

Hi,

(Disclaimer: I am biased, since I have been Mathieu’s rubber duck [1]
for the “new CI design” presented in his talk, and I have read his
first experimental implementations.)

1: <https://en.wikipedia.org/wiki/Rubber_duck_debugging>


Thank you Chris for this detailed email and your inspiring talk.  Really
interesting!  The discussion has been really fruitful.  At least for me.

From my understanding, the Guix Build Coordinator is designed to
distribute the workload across heterogeneous contexts (distant machines).

IIUC, the design of the GBC could implement some of Andreas’s ideas,
because the GBC is designed to support unreliable networks and it even
has an experimental trust mechanism for the workers.  The un-queue’ing
algorithm implemented in the GBC is not clear to me; it appears to be
“work stealing”, but I have not read the code.

Mathieu’s offload is designed for a cluster, with the architecture of
Berlin in mind, reusing as much as possible the existing parts of Guix.

Since Berlin is a cluster, the workers are already trusted.  So Avahi
can be used to discover them; adding or removing machines should be hot
swapping, with no reconfiguration involved.  In other words, the
controller/coordinator (master) does not need the list of workers.
That’s the first dynamic part.

The second dynamic part is “work stealing”.  To do so, ZeroMQ is used
both for communication and for un-queue’ing (work stealing).  This
library is used because it allows focusing on the design, avoiding a
reimplementation of the scheduling strategy, and probably bugs, on top
of Fibers for the communication.  Well, that’s how I understand it.
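
To make the idea concrete, here is a minimal sketch in Python of what I
understand “work stealing” to mean here: idle workers pull the next job
themselves, instead of the controller assigning jobs to workers.  It is
only an illustration under assumptions: it uses an in-process queue and
threads rather than ZeroMQ, and the names (run_workers, the pkg-N.drv
jobs, the worker names) are hypothetical.  Note that textbook work
stealing uses one queue per worker; this is the simpler shared-queue
variant.

```python
import queue
import threading


def run_workers(jobs, worker_names):
    """Let idle workers pull jobs from a shared queue until it is empty."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    # Each worker only appends to its own list, so no extra locking is needed.
    done = {name: [] for name in worker_names}

    def worker(name):
        while True:
            try:
                job = pending.get_nowait()  # the worker asks for work
            except queue.Empty:
                return  # nothing left, the worker goes idle
            done[name].append(job)  # stand-in for actually building the job

    threads = [threading.Thread(target=worker, args=(name,))
               for name in worker_names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done


jobs = [f"pkg-{i}.drv" for i in range(10)]  # hypothetical derivation names
result = run_workers(jobs, ["w1", "w2", "w3"])
```

Which worker ends up with which job is not deterministic; the only
guarantee is that every job is taken exactly once, and that a faster
worker naturally takes more jobs.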

For sure, the ’guile-simple-zmq’ wrapper is not bullet-proof; but it is
simply a wrapper and ZeroMQ is well-tested, AFAIK.  Well, we could
imagine replacing, in a second step, this ZMQ library with Fibers plus a
Scheme reimplementation of the scheduling strategy, once the design is a
bit tested.


> I've been using the Guix Build Coordinator build substitutes for
> guix.cbaines.net, which is my testing ground for providing
> substitutes. I think it's working reasonably well.

What is the configuration of this machine?  Size of the store?  Number
of workers where the agents are running?


> The Guix Build Coordinator supports prioritisation of builds. You can
> assign a priority to builds, and it'll try to order builds in such a way
> that the higher priority builds get processed first. If the aim is to
> serve substitutes, doing some prioritisation might help building the
> most fetched things first.

This is really cool!  How does it work?  Do you manually tag some
specific derivations?
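
For reference, one simple way such an ordering could work is a priority
queue.  The Python sketch below is an assumption on my side and not a
description of what the GBC does; the class BuildQueue and the
derivation names are hypothetical.

```python
import heapq
import itertools


class BuildQueue:
    """Pop the highest-priority build first; ties keep insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, derivation, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), derivation))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


builds = BuildQueue()
builds.add("hello.drv")                 # default priority
builds.add("glibc.drv", priority=10)    # e.g. something fetched very often
builds.add("emacs.drv")
order = [builds.pop() for _ in range(len(builds))]
```

With such a structure, the open question remains where the priorities
come from: manual tags, download statistics, or something else.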


> Another feature supported by the Guix Build Coordinator is retries. If a
> build fails, the Guix Build Coordinator can automatically retry it. In a
[…]
> perfect world, everything would succeed first time, but because the
> world isn't perfect, there still can be intermittent build
> failures. Retrying failed builds even once can help reduce the chance
> that a failure leads to no substitutes for that builds as well as any
> builds that depend on that output.

Yeah, the current infrastructure lacks a way to distinguish between an
error (“the build completes but returns an error”) and a failure
(“something went wrong along the way”).
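
A small Python sketch of why the distinction between error and failure
matters for retries.  The policy shown (retry only on a failure, never
on an error) is my assumption, not what the GBC or Cuirass implement,
and all the names are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    ERROR = "error"      # the build completed and returned an error
    FAILURE = "failure"  # something went wrong around the build


def run_with_retries(build, max_retries=1):
    """Run build(); retry only when the outcome is a FAILURE."""
    attempts = 0
    while True:
        attempts += 1
        outcome = build()
        if outcome is not Outcome.FAILURE:
            return outcome, attempts  # success or a genuine build error
        if attempts > max_retries:
            return outcome, attempts  # give up after the allowed retries


# A hypothetical flaky build: fails once, then succeeds.
flaky = iter([Outcome.FAILURE, Outcome.SUCCESS])
flaky_result = run_with_retries(lambda: next(flaky))

# A hypothetical build with a real error: retrying would not help.
broken_result = run_with_retries(lambda: Outcome.ERROR)
```

Without the distinction, the only options are to retry everything,
which wastes time on builds that are really broken, or to retry
nothing.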


> Because the build results don't end up in a store (they could, but as
> set out above, not being in the store is a feature I think), you can't
> use `guix gc` to get rid of old store entries/substitutes. I have some
> ideas about what to implement to provide some kind of GC approach over a
> bunch of nars + narinfos, but I haven't implemented anything yet.

Where do they end up, then?  I missed your answer in the Question/Answer
session.


Speaking about Berlin, the builds should be in the workers’ stores (with
a GC policy to be defined; keep them for debugging purposes?) and the
main store should hold only the minimum.  The items should really be
stored only in the cache of publish.  IMHO.

Maybe I am missing something.


> There could be issues with the implementation… I'd like to think it's
> relatively simple, but that doesn't mean there aren't issues. For some
[…]
> reason or another, getting backtraces for exceptions rarely works. Most
> of the time the coordinator tries to print a backtrace, the part of
> Guile doing that raises an exception. I've managed to cause it to
> segfault, through using SQLite incorrectly, which hasn't been obvious to
> fix at least for me. Additionally, there are some places where I'm
> fighting against bits of Guix, things like checking for substitutes
> without caching, or substituting a derivation without starting to build
> it.

I am confused by all the SQL involved.  And I feel it would be hard to
maintain at large scale.  I do not know.  I am a newbie.


> Finally, the instrumentation is somewhat reliant on Prometheus, and if
> you want a pretty dashboard, then you might need Grafana too. Both of
> these things aren't packaged for Guix, Prometheus might be feasible to
> package within the next few months, I doubt the same is true for Grafana
> (due to the use of NPM).

Really cool!  For sure, knowing how healthy it is (or not) is really nice.

Cheers,
simon


