[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadloc
From: |
Ludovic Courtès |
Subject: |
bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock |
Date: |
Fri, 27 Dec 2024 23:45:10 +0100 |
User-agent: |
Gnus/5.13 (Gnus v5.13) |
Hello Reepca,
So glad to get your message. :-)
Reepca Russelstein <reepca@russelstein.xyz> skribis:
> I assume you're not using cached build failures, right? It defaults to
> off, so it should only be in use if you explicitly passed
> --cache-failures to guix-daemon (or had your client pass a value of
> "true" for "build-cache-failure", but I don't see that string anywhere
> in the guile side of guix).
Right, not using cached build failures on these machines.
> One detail that worries me is that SubstitutionGoal.outputLock is a
> std::shared_ptr<PathLocks> instead of a PathLocks. I cannot for the
> life of me figure out why this is the case, since no sharing of
> ownership of the underlying PathLocks seems to ever be done.
> outputLock.reset() seems to be used as if it were a synonym for
> outputLock.unlock() in many places, so whoever wrote those uses probably
> thought the same. As far as I can tell, it *should* always cause
> PathLocks::unlock to be called, so it shouldn't be an issue, it just
> worries me that I don't understand why it's done that way.
Yes, I had the same thoughts. I checked the “shared_ptr<PathLocks>
outputLock;” member is here since at least 2005, when the file was
renamed to build.cc.
> Ahh, I think I may have just found another place in
> DerivationGoal::tryToBuild that might be the source of your issues.
> First, note the comment:
>
> /* Obtain locks on all output paths. The locks are automatically
> released when we exit this function or the client crashes. ... */
>
> Well, as I noticed earlier in the cached build failures case, this isn't
> exactly accurate, and there's probably a reason why. The comment seems
> to believe specifically that leaving the scope of tryToBuild will
> automatically release the locks, which suggests to me that at one point
> outputLocks was a local variable, rather than a field of DerivationGoal.
> This theorized older version would be consistent with what we see in
> this (current) snippet:
>
> if (buildMode != bmCheck && validPaths.size() == drv.outputs.size()) {
> debug(format("skipping build of derivation `%1%', someone beat us to
> it") % drvPath);
> outputLocks.setDeletion(true);
> done(BuildResult::AlreadyValid);
> return;
> }
Uh! Sounds like a plausible problem.
> If leaving the scope of tryToBuild would cause outputLocks.unlock to
> run, then outputLocks.setDeletion is all that would need to be called.
> But since that isn't the case, this will cause the lock to be held for
> as long as the DerivationGoal exists.
Yep.
[...]
> From 2030f198892f972ee6bc2fb8529cbd6afb6bc9ad Mon Sep 17 00:00:00 2001
> From: Reepca Russelstein <reepca@russelstein.xyz>
> Date: Tue, 24 Dec 2024 05:40:58 -0600
> Subject: [PATCH] nix: build.cc: explicitly unlock in the has-become-valid
> case.
>
> (hopefully) fixes #31785. Similar to Nix issue #178, fixed in
> https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83b8.
>
> We can't rely on Goal deletion to release our locks in a timely manner. In
> the case in which multiple guix-daemon processes simultaneously try producing
> an output path path1, the one that gets there first (P1) will get the lock,
> and the second one (P2) will continue trying to acquire the lock until it is
> released. Once it has acquired the lock, it checks to see whether the path
> has already become valid in the meantime, and if so it reports success to
> those Goals waiting on its completion and finishes. Unfortunately, it fails
> to release the locks it holds first, so those stay held until that Goal gets
> deleted.
>
> Suppose we have the following store path dependency graph:
>
> path4
> / | \
> path1 path2 path3
>
> P2 is now sitting on path1's lock, and will continue to do so until path4 is
> completed. Suppose there is also a P3, and it has been blocked while P1
> builds path2. Now P3 is sitting on path2's lock, and can't acquire path1's
> lock to determine that it has been completed. Likewise P2 is sitting on
> path1's lock, and now can't acquire path2's lock to determine that it has been
> completed. Finally, P3 completes path3 while P2 is blocked.
>
> Now:
>
> - P1 knows that path1 and path2 are complete, and holds no locks, but can't
> determine that path3 is complete
> - P2 knows that path1 and path3 are complete, and holds locks on path1 and
> path3, but can't determine that path2 is complete
> - P3 knows that path2 and path3 are complete, and holds a lock on path2, but
> can't determine that path1 is complete
>
> And none of these locks will be released until path4 is complete. Thus, we
> have a deadlock.
>
> To resolve this, we should explicitly release these locks as soon as they
> should be released.
>
> * nix/libstore/build.cc
> (DerivationGoal::tryToBuild, SubstitutionGoal::tryToRun):
> Explicitly release locks in the has-become-valid case.
>
> Change-Id: Ie510f84828892315fe6776c830db33d0f70bcef8
I’m very much convinced by the patch. Yet it bothers me that I cannot
reproduce the problem. I tried first with this test, which attempts to
reproduce what you describe in the commit log above:
--8<---------------cut here---------------start------------->8---
;; https://issues.guix.gnu.org/31785
(use-modules (guix)
((gnu packages) #:select (specification->package))
(srfi srfi-1)
(ice-9 threads)
(ice-9 match)
(ice-9 textual-ports))
(define (nonce)
(logxor (car (gettimeofday)) (cdr (gettimeofday))
(getpid)))
(define drv1
(computed-file "drv1" #~(begin #$(nonce) (sleep 3) (mkdir #$output))))
(define drv2
(computed-file "drv2" #~(begin #$(nonce) (sleep 3) (mkdir #$output))))
(define drv3
(computed-file "drv3" #~(begin #$(nonce) (sleep 3) (mkdir #$output))))
(define drv4
(computed-file "drv4" #~(begin #$(nonce) (sleep 3)
(pk 'deps: #$drv1 #$drv2 #$drv3)
(mkdir #$output))))
(%graft? #f)
(define (log-port prefix)
"Return a logging port that prefixes each line with PREFIX."
(make-custom-textual-output-port
"log"
(lambda (str start count)
(let ((str (substring str start (+ start count))))
(format (current-error-port) "~a: ~a" prefix str)
count))
#f #f #f))
(setvbuf (current-error-port) 'line)
(set-port-encoding! (current-error-port) "UTF-8")
(let* ((builder (lambda (name lst)
(call-with-new-thread
(lambda ()
(parameterize ((current-build-output-port
(log-port name)))
(with-store store
(set-build-options store #:verbosity 4)
(run-with-store store
(mlet %store-monad ((lst (mapm %store-monad
lower-object lst)))
(built-derivations lst)))))))))
(thread1 (builder 'thread1 (list drv4)))
(thread2 (builder 'thread2 (list drv4 drv1 drv2)))
(thread3 (builder 'thread3 (list drv4 drv3 drv2))))
(join-thread thread1)
(join-thread thread2)
(join-thread thread3))
--8<---------------cut here---------------end--------------->8---
But there’s no deadlock, and I think that’s because the problem we’re
seeing has to do with substitute goals, and there’s no such goal here.
So then I tried this:
diff --git a/tests/store.scm b/tests/store.scm
index 45948f4f43..224b9867c4 100644
--- a/tests/store.scm
+++ b/tests/store.scm
@@ -1,5 +1,5 @@
;;; GNU Guix --- Functional package management for GNU
-;;; Copyright © 2012-2021, 2023 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2012-2021, 2023, 2024 Ludovic Courtès <ludo@gnu.org>
;;;
;;; This file is part of GNU Guix.
;;;
@@ -35,6 +35,7 @@ (define-module (test-store)
#:use-module (gnu packages bootstrap)
#:use-module (ice-9 match)
#:use-module (ice-9 regex)
+ #:use-module (ice-9 threads)
#:use-module (rnrs bytevectors)
#:use-module (rnrs io ports)
#:use-module (web uri)
@@ -1042,6 +1043,52 @@ (define %shell
(ensure-path s item)
(path-info-nar-size (query-path-info s item)))))
+(test-assert "substitute deadlock"
+ (with-store s
+ (let* ((guile (package-derivation s %bootstrap-guile (%current-system)))
+ (c (random-text)) ;contents of the output
+ (drv1 (build-expression->derivation
+ s "substitute-me1"
+ `(begin ,c (exit 1)) ;would actually fail
+ #:guile-for-build guile))
+ (drv2 (build-expression->derivation
+ s "substitute-me2"
+ `(begin ,c (exit 1)) ;would actually fail
+ #:guile-for-build guile))
+ (drv3 (build-expression->derivation
+ s "depends-on-substitutable1"
+ `(call-with-output-file %output
+ (lambda (p)
+ (display ,c p)))
+ #:inputs `(("drv1" ,drv1))
+ #:guile-for-build guile))
+ (drv4 (build-expression->derivation
+ s "depends-on-substitutable2"
+ `(call-with-output-file %output
+ (lambda (p)
+ (display ,c p)))
+ #:inputs `(("drv2" ,drv2))
+ #:guile-for-build guile)))
+ (with-derivation-substitute drv1 c
+ (with-derivation-substitute drv2 c
+ (system* "ls" "-l" "/tmp/subst")
+ (let* ((builder (lambda (drv)
+ (call-with-new-thread
+ (lambda ()
+ (with-store store
+ (set-build-options store
+ #:use-substitutes? #t
+ #:substitute-urls
+ (%test-substitute-urls))
+ (build-things store
+ (list (derivation-file-name
drv))))))))
+ (thread1 (builder drv3))
+ (thread2 (builder drv4)))
+ (join-thread thread1)
+ (join-thread thread2)
+ (and (valid-path? s (derivation->output-path drv3))
+ (valid-path? s (derivation->output-path drv4)))))))))
+
(test-assert "export/import several paths"
(let* ((texts (unfold (cut >= <> 10)
(lambda _ (random-text))
But I think we’d have to be lucky for it to trigger the deadlock (lucky
because the two substitutes should be fetched at the same instant,
roughly).
Given that it’s a nasty issue, I’d prefer to have an (ideally automated)
test.
Thoughts? Ideas?
Anyway, thanks for thinking through it and for explaining your reasoning
and coming up with a patch!
And: merry end-of-year holidays too. :-)
Ludo’.