From: Reepca Russelstein
Subject: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
Date: Tue, 24 Dec 2024 07:08:52 -0600
User-agent: Gnus/5.13 (Gnus v5.13)

(note: the following includes some "thinking out loud"; skip to the
patch if you're in a hurry)

I'm a bit puzzled how exactly a true "forward-progress-not-possible"
deadlock can even occur, especially in the manner you describe.  The
derivation graph is inherently a directed acyclic graph, so as long as
locks are only acquired for a given derivation after all of its inputs
(or references, for substitutions) are present, and are released once
the desired outputs are present, it should only even theoretically be
possible to have a deadlock between different outputs of the same
derivation.

These requirements /seem/ to be satisfied by build.cc.  Path locks are
only acquired in DerivationGoal::tryToBuild and
SubstitutionGoal::tryToRun, and the former is only reachable through
DerivationGoal::inputsRealised (and DerivationGoal::buildDone in case of
multiple rounds), and the latter is only reachable through
SubstitutionGoal::referencesValid.  As their names imply, having made it
through them should suffice to prove that the inputs and references are
present.

That leaves the requirement that the locks are released once the desired
outputs are present.  It /is/ possible for a lock to stick around longer
than desired (along with many other bad things) in
DerivationGoal::tryToBuild, but only if a cached build failure is
encountered, to my knowledge (notice how there is no call to done,
amDone, or any rescheduling procedure in that case; the running
DerivationGoal just drops off the face of the earth and tells nobody,
including any goal waiting on it).  This also happens in
DerivationGoal::haveDerivation.

I assume you're not using cached build failures, right?  It defaults to
off, so it should only be in use if you explicitly passed
--cache-failures to guix-daemon (or had your client pass a value of
"true" for "build-cache-failure", but I don't see that string anywhere
in the guile side of guix).

One detail that worries me is that SubstitutionGoal.outputLock is a
std::shared_ptr<PathLocks> instead of a PathLocks.  I cannot for the
life of me figure out why this is the case, since no sharing of
ownership of the underlying PathLocks seems to ever be done.
outputLock.reset() seems to be used as if it were a synonym for
outputLock->unlock() in many places, so whoever wrote those uses
probably thought the same.  As far as I can tell, resetting it *should*
always run the PathLocks destructor, and hence PathLocks::unlock, so it
shouldn't be an issue; it just worries me that I don't understand why
it's done that way.
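
Just to convince myself that reset() really is equivalent to unlocking
here, a minimal standalone sketch of the shared_ptr semantics I'm
relying on (FakePathLocks is obviously just a stand-in, not the real
PathLocks):

    #include <cstdio>
    #include <memory>

    /* Stand-in for PathLocks: releases the lock in its destructor (RAII). */
    struct FakePathLocks {
        void unlock() { std::printf("unlock()\n"); }
        ~FakePathLocks() { unlock(); }
    };

    int main() {
        std::shared_ptr<FakePathLocks> outputLock(new FakePathLocks);
        /* With a single owner, reset() destroys the FakePathLocks right
           away, which unlocks; so reset() does act as a synonym for
           unlocking. */
        outputLock.reset();
        std::printf("after reset()\n");
    }

The catch is that this only holds while nothing else keeps a strong
reference to the same PathLocks alive, which as far as I can tell is
always the case here.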

Ahh, I think I may have just found another place in
DerivationGoal::tryToBuild that might be the source of your issues.
First, note the comment:

/* Obtain locks on all output paths.  The locks are automatically
   released when we exit this function or the client crashes. ... */

Well, as I noticed earlier in the cached build failures case, this isn't
exactly accurate, and there's probably a reason why.  The comment seems
to believe specifically that leaving the scope of tryToBuild will
automatically release the locks, which suggests to me that at one point
outputLocks was a local variable, rather than a field of DerivationGoal.
This theorized older version would be consistent with what we see in
this (current) snippet:

    if (buildMode != bmCheck && validPaths.size() == drv.outputs.size()) {
        debug(format("skipping build of derivation `%1%', someone beat us to it")
              % drvPath);
        outputLocks.setDeletion(true);
        done(BuildResult::AlreadyValid);
        return;
    }

If leaving the scope of tryToBuild would cause outputLocks.unlock to
run, then outputLocks.setDeletion is all that would need to be called.
But since that isn't the case, this will cause the lock to be held for
as long as the DerivationGoal exists.

(note that I haven't actually checked the commit history to see whether
outputLocks was or wasn't at some point a local variable - I have no
idea why the commenter thought returning would automatically release it)
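
If that diagnosis is right, the fix is simply to release the lock
explicitly instead of relying on the destructor, along these lines
(just a sketch of the idea; the attached patch is the authoritative
version):

    if (buildMode != bmCheck && validPaths.size() == drv.outputs.size()) {
        debug(format("skipping build of derivation `%1%', someone beat us to it")
              % drvPath);
        outputLocks.setDeletion(true);
        outputLocks.unlock();   /* release now rather than whenever the
                                   DerivationGoal happens to be destroyed */
        done(BuildResult::AlreadyValid);
        return;
    }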

Perhaps this same error was made elsewhere.  C-s to see where else
setDeletion is used without an accompanying unlock and... sure enough,
in SubstitutionGoal::tryToRun, we see:

    /* Check again whether the path is invalid. */
    if (!repair && worker.store.isValidPath(storePath)) {
        debug(format("store path `%1%' has become valid") % storePath);
        outputLock->setDeletion(true);
        amDone(ecSuccess);
        return;
    }

Now outputLock will be held for as long as this SubstitutionGoal exists.
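
The analogous change here would again be an explicit release before
amDone; since outputLock is a shared_ptr<PathLocks>, either of the
following should do (again, sketch only):

    outputLock->setDeletion(true);
    outputLock->unlock();   /* or: outputLock.reset(); */
    amDone(ecSuccess);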

The Nix issue that you identified as closely resembling what you were
encountering was resolved with this commit message
(https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83b8):

-------------------------------------------------------------------
Fix deadlock in SubstitutionGoal

We were relying on SubstitutionGoal's destructor releasing the lock,
but if a goal is a top-level goal, the destructor won't run in a
timely manner since its reference count won't drop to zero.  So
release it explicitly.
-------------------------------------------------------------------

Hmmm.  Destructors not running in a timely manner... and in this
particular case (which is guaranteed to arise whenever one guix-daemon
process gets blocked by another holding a lock, and the one initially
holding the lock goes on to produce the output path), the destructors
are the only way the locks ever get released.  I do believe this may be
related!

One other possible reason for destructors not running in as timely a
manner as one might hope (though probably not as severely as for
top-level goals) would be that Worker::run uses a ready queue, awake2,
of type Goals, which contains strong references.  This means that any
Goal scheduled on a given turn will not be destroyed, even if it has
already run, until the end of that scheduler turn is reached.  This
probably isn't the cause for what you're seeing, but this kind of detail
matters a lot when so much is implicitly tied to the lifetime of
references.  This could be corrected by popping off Goals one at a time
from awake2.
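
To illustrate what I mean about strong references in the ready queue
delaying destruction (and therefore lock release), here's a tiny
standalone sketch; it has nothing to do with the real Worker, it just
shows the shared_ptr mechanics:

    #include <cstdio>
    #include <memory>
    #include <vector>

    struct Lock {                    /* RAII: released in the destructor */
        ~Lock() { std::printf("lock released\n"); }
    };

    struct Goal {
        Lock lock;                   /* held as long as the Goal object lives */
        void work() { std::printf("goal ran\n"); }
    };

    int main() {
        std::vector<std::shared_ptr<Goal>> ready;   /* like awake2: strong refs */
        ready.push_back(std::make_shared<Goal>());
        ready.push_back(std::make_shared<Goal>());

        /* Iterating over the whole queue keeps every goal (and hence its
           lock) alive until the end of the "scheduler turn"... */
        for (auto & g : ready) g->work();
        std::printf("turn over\n");
        ready.clear();                              /* ...locks released only here */

        /* Popping goals off one at a time instead would release each
           goal's lock as soon as that goal had run. */
    }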

Patch attached for DerivationGoal::tryToBuild and
SubstitutionGoal::tryToRun.  I haven't tested it beyond verifying that
it compiles, but I've got a pretty good feeling™ about it.  The commit
message includes discussion of a concrete scenario in which this could
cause a deadlock.  Interestingly enough, it seems to require at least 3
guix-daemon processes running during the "inciting event", though now
that I think about it, the third doesn't need to be a persistent part of
the deadlock.

I took a brief glance at what I could find of Nix's corresponding code
(of course much has changed), and it looks like there is a chance that
at least their DerivationGoal::tryToBuild in
src/libstore/build/derivation-goal.cc is affected by the same issue (the
same pattern of calling outputLocks.setDeletion but not
outputLocks.unlock appears in the same case).  I couldn't quickly
determine whether their substitution code is affected as well, as it has
diverged more significantly and I see no mention of locks in it.  You
may want to give them a nudge and see if this information is of use to
them.

Merry Christmas.

- reepca

Attachment: 0001-nix-build.cc-explicitly-unlock-in-the-has-become-val.patch
Description: Text Data

Attachment: signature.asc
Description: PGP signature

