bug#24496: offloading should fall back to local build after n tries


From: zimoun
Subject: bug#24496: offloading should fall back to local build after n tries
Date: Sat, 18 Dec 2021 01:10:49 +0100

Hi,

I have not checked all the details: the code of “guix offload” is run
by root, IIUC, so it is not as easy to debug as usual. :-)

On Fri, 17 Dec 2021 at 16:57, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote:

>> However, I think this behavior was unintentionally lost in
>> efbf5fdd01817ea75de369e3dd2761a85f8f7dd5.  Maxim, WDYT?
>
> I just reviewed this commit, and don't see anywhere where the behavior
> would have changed.  The discarding happens here:

[...]

> previously load could be set to +inf.0.  Now it is a float between 0.0
> and 1.0, with threshold defaulting to 0.6.

My /etc/guix/machines.scm contains only one machine and --max-jobs=0.

Because the machine is unreachable, IIUC, ’node’ is (or should be) false
and ’load’ is thus not involved, I guess.  Indeed, ’report-load’
displays nothing, and instead I get:

--8<---------------cut here---------------start------------->8---
The following derivation will be built:
   /gnu/store/c1qicg17ygn1a0biq0q4mkprzy4p2x74-hello-2.10.drv
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
waiting for locks or build slots...
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
  C-c C-c
--8<---------------cut here---------------end--------------->8---


Well, if the machine is not reachable, then ’session’ is false, right?

--8<---------------cut here---------------start------------->8---
@@ -472,11 +480,15 @@ (define (machine-faster? m1 m2)
        (let* ((session (false-if-exception (open-ssh-session best
                                                              %short-timeout)))
               (node    (and session (remote-inferior session)))
-              (load    (and node (normalized-load best (node-load node))))
+              (load    (and node (node-load node)))
+              (threshold (build-machine-overload-threshold best))
               (space   (and node (node-free-disk-space node))))
+         (when load (report-load best load))
          (when node (close-inferior node))
          (when session (disconnect! session))
-         (if (and node (< load 2.) (>= space %minimum-disk-space))
+         (if (and node
+                  (or (not threshold) (< load threshold))
+                  (>= space %minimum-disk-space))
[...]
             (begin
               ;; BEST is unsuitable, so try the next one.
               (when (and space (< space %minimum-disk-space))
                 (format (current-error-port)
                         "skipping machine '~a' because it is low \
on disk space (~,2f MiB free)~%"
                         (build-machine-name best)
                         (/ space (expt 2 20) 1.)))
               (release-build-slot slot)
               (loop others)))))
--8<---------------cut here---------------end--------------->8---

Therefore, the ’else’ branch is taken and so the code does ’(loop others)’.

However, I do not see why ’others’ is not empty (there is only one
machine in /etc/guix/machines.scm).  Well, the message «waiting for locks
or build slots...» suggests that something is restarted, and that it is
not this ’loop’ we are observing but another one.

On the daemon side, I do not know what ’waitingForAWhile’ and
’lastWokenUp’ mean.

--8<---------------cut here---------------start------------->8---
    /* If we are polling goals that are waiting for a lock, then wake
       up after a few seconds at most. */
    if (!waitingForAWhile.empty()) {
        useTimeout = true;
        if (lastWokenUp == 0)
            printMsg(lvlError, "waiting for locks or build slots...");
        if (lastWokenUp == 0 || lastWokenUp > before) lastWokenUp = before;
        timeout.tv_sec = std::max((time_t) 1,
            (time_t) (lastWokenUp + settings.pollInterval - before));
    } else lastWokenUp = 0;
--8<---------------cut here---------------end--------------->8---


Bah, it requires more investigation, and I agree with Maxim that
efbf5fdd01817ea75de369e3dd2761a85f8f7dd5 is probably not the issue
here.

Cheers,
simon