guix-patches
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug#41119] [PATCH] fix some issues with (guix nar)


From: Caleb Ristvedt
Subject: [bug#41119] [PATCH] fix some issues with (guix nar)
Date: Thu, 28 May 2020 03:50:36 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> This change had an undesirable effect: the connection would be kept for
> the body of ‘with-temporary-store-file’, during which we’d call:
>
>   finalize-store-file -> register-path
>
> which accesses the database.  At this point, for each ‘guix offload’
> process, we’d thus have the database open twice: once for the session’s
> guix-daemon, and once for that ‘register-path’ call.

If the connection wasn't kept for the body of with-temporary-store-file,
the temporary store file wouldn't be protected from GC during the body
(the daemon treats unlocked temproots files as "stale"), thus rather
defeating the purpose. It makes sense, then, that the connection was
also kept for the body prior to this patch - indeed, unless emacs's
parenthesis-matching capabilities are failing me, it appears that the
body is solidly within the 'with-store' form in
37edbc91e34fb5658261e637e6ffccdb381e5271.

> On berlin, the effect is that we see many ‘guix offload’ processes
> stalled because the SQLite database is busy:

... which makes this quite the mystery indeed. I assume you've tested
with the patch reverted and found that this issue goes away? If so, I am
very puzzled. One would expect that "database open twice" would tend to
have *less* contention issues than "database open at least twice".

<wild brainstorming starts here>

AFAIK just having the database open doesn't by itself impose any
locks. The daemon process we're connected to should have it open, but
should just be blocked waiting for our next RPC. Database locks happen
when transactions are started (either explicitly or implicitly), and
implicitly-started transactions are automatically committed by sqlite
(specifically when the statement that started the transaction is either
reset or finalized). The only loose end I can think of right now is that
call-with-transaction only catches exceptions of type 'sqlite-error, so
in theory if a different type of exception were to be thrown, it could
exit in a broken state where neither a commit nor a rollback has been
performed. Really it should catch all exception types, and use match in
the handler to pick out the sqlite-errors. If that were causing the
problems, though, we'd expect to see some errors appearing in the
offload output.

Actually, come to think of it, there could be another issue with
call-with-transaction: if somehow it's possible for SQLITE_BUSY errors
to occur despite the connection having succeeded with a 'begin
immediate;' (which immediately starts a write transaction), then the
rollback wouldn't occur, and what should be a failed transaction
followed by a successful transaction becomes one long,
restarted-in-the-middle transaction. I'm not sure if that's a problem in
practice, though.

And now that I look at it again, it turns out that most of our database
query procedures in (guix store database) aren't finalizing their
statements in case of a nonlocal exit... which would tend to happen if,
for example, an SQLITE_BUSY error occurred. Which would cause the
statement to not be finalized until the garbage collector got ahold of
it. But due to statement caching the garbage collector likely won't get
ahold of it until the database connection itself is destroyed. The
wording at https://www.sqlite.org/lang_transaction.html makes me think
that this shouldn't be an issue because the errors we'd expect all seem
to roll back automatically, but if we somehow got one that didn't roll
back automatically, there would potentially be an extended amount of
time before the statement was finalized and the implicit transaction
committed.

Also, I've noticed that with the way that finalize-store-file is
written, we actually already have a database open when we call
register-path. This is because it's needed in order to call path-id, but
the scope of that with-database form is rather larger than it needs to
be.

We may have a situation here where things go fine until a single
SQLITE_BUSY error is produced by chance, and that causes more
SQLITE_BUSY errors, and so on.

</wild brainstorming>

In summary, there are many things I could imagine going wrong to cause /
contribute to the observed behavior, but the patch, barring some absurd
guile compilation bug, is not one of them. I do, however, think that
(guix store database) needs some love.

> They loop pretty much indefinitely on this and nothing (or very little)
> happens on the system.

To be clear, the nothing-happening status is common to all processes
that use the database, including daemon processes? That's quite severe.

> I’ll revert this patch but I’m happy to hear what you think, Caleb.

If the data says it's causing those problems, I'd tend to agree with
that. I would really like to understand how, though, because even after
a few hours of brainstorming bizarre edge cases I still can't come up
with a satisfying explanation.

> Another reason to implement temp roots in Scheme, as it would allow us
> to not open a connection to the daemon from ‘guix offload’!

Soon™. Conceptually the code is there, I'm working towards a rebase that
tries to first make the rest of daemon-side guix compatible with fibers
- thread pools✓, eval-with-container✓, fibers-friendly waitpid✓, etc.

- reepca





reply via email to

[Prev in Thread] Current Thread [Next in Thread]