guix-patches
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug#41119] [PATCH] fix some issues with (guix nar)


From: Ludovic Courtès
Subject: [bug#41119] [PATCH] fix some issues with (guix nar)
Date: Thu, 28 May 2020 12:32:51 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)

Hi!

Caleb Ristvedt <caleb.ristvedt@cune.org> skribis:

>> This change had an undesirable effect: the connection would be kept for
>> the body of ‘with-temporary-store-file’, during which we’d call:
>>
>>   finalize-store-file -> register-path
>>
>> which accesses the database.  At this point, for each ‘guix offload’
>> process, we’d thus have the database open twice: once for the session’s
>> guix-daemon, and once for that ‘register-path’ call.
>
> If the connection wasn't kept for the body of with-temporary-store-file,
> the temporary store file wouldn't be protected from GC during the body
> (the daemon treats unlocked temproots files as "stale"), thus rather
> defeating the purpose. It makes sense, then, that the connection was
> also kept for the body prior to this patch - indeed, unless emacs's
> parenthesis-matching capabilities are failing me, it appears that the
> body is solidly within the 'with-store' form in
> 37edbc91e34fb5658261e637e6ffccdb381e5271.

Oh you’re right, sorry for the confusion.

>> On berlin, the effect is that we see many ‘guix offload’ processes
>> stalled because the SQLite database is busy:
>
> ... which makes this quite the mystery indeed. I assume you've tested
> with the patch reverted and found that this issue goes away?

No.  I observed the behavior and looked for recent changes that could
cause the problem.  But I guess I was tired and jumped to silly
conclusions.

> <wild brainstorming starts here>
>
> AFAIK just having the database open doesn't by itself impose any
> locks. The daemon process we're connected to should have it open, but
> should just be blocked waiting for our next RPC. Database locks happen
> when transactions are started (either explicitly or implicitly), and
> implicitly-started transactions are automatically committed by sqlite
> (specifically when the statement that started the transaction is either
> reset or finalized). The only loose end I can think of right now is that
> call-with-transaction only catches exceptions of type 'sqlite-error, so
> in theory if a different type of exception were to be thrown, it could
> exit in a broken state where neither a commit nor a rollback has been
> performed. Really it should catch all exception types, and use match in
> the handler to pick out the sqlite-errors. If that were causing the
> problems, though, we'd expect to see some errors appearing in the
> offload output.

Good point but yes, we’d see an error, and ‘guix offload’ would probably
exit right away.

> Actually, come to think of it, there could be another issue with
> call-with-transaction: if somehow it's possible for SQLITE_BUSY errors
> to occur despite the connection having succeeded with a 'begin
> immediate;' (which immediately starts a write transaction), then the
> rollback wouldn't occur, and what should be a failed transaction
> followed by a successful transaction becomes one long,
> restarted-in-the-middle transaction. I'm not sure if that's a problem in
> practice, though.
>
> And now that I look at it again, it turns out that most of our database
> query procedures in (guix store database) aren't finalizing their
> statements in case of a nonlocal exit... which would tend to happen if,
> for example, an SQLITE_BUSY error occurred. Which would cause the
> statement to not be finalized until the garbage collector got ahold of
> it. But due to statement caching the garbage collector likely won't get
> ahold of it until the database connection itself is destroyed. The
> wording at https://www.sqlite.org/lang_transaction.html makes me think
> that this shouldn't be an issue because the errors we'd expect all seem
> to roll back automatically, but if we somehow got one that didn't roll
> back automatically, there would potentially be an extended amount of
> time before the statement was finalized and the implicit transaction
> committed.
>
> Also, I've noticed that with the way that finalize-store-file is
> written, we actually already have a database open when we call
> register-path. This is because it's needed in order to call path-id, but
> the scope of that with-database form is rather larger than it needs to
> be.
>
> We may have a situation here where things go fine until a single
> SQLITE_BUSY error is produced by chance, and that causes more
> SQLITE_BUSY errors, and so on.

Hmm, sounds plausible.

> In summary, there are many things I could imagine going wrong to cause /
> contribute to the observed behavior, but the patch, barring some absurd
> guile compilation bug, is not one of them. I do, however, think that
> (guix store database) needs some love.

Yeah.

>> They loop pretty much indefinitely on this and nothing (or very little)
>> happens on the system.
>
> To be clear, the nothing-happening status is common to all processes
> that use the database, including daemon processes? That's quite severe.

I just did a random sample, but several offload processes were stuck
like the one I showed, and clients would usually get “database is
locked” messages from the daemon.

>> I’ll revert this patch but I’m happy to hear what you think, Caleb.
>
> If the data says it's causing those problems, I'd tend to agree with
> that. I would really like to understand how, though, because even after
> a few hours of brainstorming bizarre edge cases I still can't come up
> with a satisfying explanation.

No you’re right, my analysis was wrong.  Further investigation needed…

>> Another reason to implement temp roots in Scheme, as it would allow us
>> to not open a connection to the daemon from ‘guix offload’!
>
> Soon™. Conceptually the code is there, I'm working towards a rebase that
> tries to first make the rest of daemon-side guix compatible with fibers
> - thread pools✓, eval-with-container✓, fibers-friendly waitpid✓, etc.

Neat!  For master we could do with a simpler implementation, but we’ll
see.

Thanks,
Ludo’.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]