bug-gnu-emacs

bug#72165: 31.0.50; Intermittent crashing with recent emacs build


From: Eli Zaretskii
Subject: bug#72165: 31.0.50; Intermittent crashing with recent emacs build
Date: Thu, 18 Jul 2024 12:52:28 +0300

> From: Dima Kogan <dima@secretsauce.net>
> Cc: 72165@debbugs.gnu.org
> Date: Thu, 18 Jul 2024 00:25:14 -0700
> 
> Here's what I see in the core dump:
> 
>   (gdb) p current_thread->m_current_buffer->text->z
>   $22 = 32192
> 
>   (gdb) p current_thread->m_current_buffer->text->z_byte
>   $23 = 32178
> 
>   (gdb) p current_thread->m_current_buffer->pt
>   $24 = 32192
> 
>   (gdb) p current_thread->m_current_buffer->pt_byte
>   $25 = 32178
> 
> So that tells me that the failing condition isn't the one gdb flagged,
> but the one immediately after:
> 
>   if (BYTEPOS (opoint) < CHARPOS (opoint))
>     emacs_abort ();

Yes.

> The compiler optimizations could be responsible for the discrepancy.

Yes, this happens frequently in optimized code.

> Am I understanding correctly that this check makes sure that BYTEPOS >=
> CHARPOS, which must always be true because sizeof(emacs character) is
> always >= 1 byte?

Yes.
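
To make the invariant concrete, here is a minimal sketch, not the actual
Emacs code.  It assumes a UTF-8-style encoding in which every character
starts with a byte that is not of the form 10xxxxxx; the function names
count_chars and positions_consistent are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count characters in NBYTES bytes of UTF-8-style text by counting
   the bytes that are not continuation bytes (10xxxxxx).  */
size_t count_chars(const char *s, size_t nbytes)
{
  assert(s != NULL);
  size_t nchars = 0;
  for (size_t i = 0; i < nbytes; i++)
    if (((unsigned char) s[i] & 0xC0) != 0x80)
      nchars++;
  return nchars;
}

/* Every character occupies at least one byte, so a byte position
   can never be smaller than the matching character position.  */
bool positions_consistent(long charpos, long bytepos)
{
  return bytepos >= charpos;
}
```

With the values from the core dump, positions_consistent (32192, 32178)
is false, which is the condition that leads to emacs_abort.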

> The buffer name:
> 
>   (gdb) p current_thread->m_current_buffer->name_
>   $26 = XIL(0x7fc685b24c1c)
> 
>   (gdb) xstring
>   $27 = (struct Lisp_String *) 0x7fc685b24c18
>   "*Messages*"

And the *Messages* buffer was displayed in some window when this
happened?

> The full structure:
> 
>   (gdb) p current_thread->m_current_buffer->own_text
>   $45 = {
>     beg                        = 0x561d7100f800 ...
>     z                          = 32192,
>     z_byte                     = 32178,
>     gpt                        = 32191,
>     gpt_byte                   = 32177,

That's the bug: in these two pairs, the character and byte values
should be identical.

The question is: which code modified Z and GPT without updating the
corresponding _BYTE variables, or the other way around?

> Let's look just at the last little bit, to count the bytes:
> 
>   (gdb) printf "%.200s\n", &current_thread->m_current_buffer->text->beg[32000]
>   mail>" 1 9 (face mu4e-context-face help-echo "mu4e context: fastmail")) 10)
>   Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [5 times]
> 
> I asked for at most 200 bytes (up to byte 32200). I got exactly 176
> bytes, so the text ends where the gap supposedly begins. That makes
> sense.

This means Z_BYTE and GPT_BYTE are correct, but the corresponding Z
and GPT values are incorrect.
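
The arithmetic behind this can be spelled out as a small sketch.  It
assumes that buffer byte positions are 1-based, so the byte at C offset
0 from `beg` has byte position 1; the helper names are hypothetical.

```c
#include <assert.h>

/* Assumption: buffer positions start at 1, C offsets at 0.  */
long offset_to_bytepos(long offset)
{
  assert(offset >= 0);
  return offset + 1;
}

/* Byte position just past NBYTES bytes printed starting at OFFSET.  */
long end_bytepos(long offset, long nbytes)
{
  return offset_to_bytepos(offset + nbytes);
}

/* How far a character position runs ahead of its byte position;
   this is never positive in a consistent buffer.  */
long char_excess(long charpos, long bytepos)
{
  return charpos - bytepos;
}
```

Here end_bytepos (32000, 176) gives 32177, which matches gpt_byte.  The
character positions are ahead of the byte positions by the same amount
in both pairs: 32192 - 32178 = 14 for Z and 32191 - 32177 = 14 for GPT.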

My suggestion is to run Emacs under GDB with a watchpoint on Z_BYTE,
with the condition that Z_BYTE and Z differ.

This watchpoint needs to be defined when the current buffer is the
*Messages* buffer.  One way of doing that is as follows:

  $ gdb ./emacs
  ...
  (gdb) break Frecenter
  (gdb) run

After Emacs starts, type "C-x b *Messages* RET" to display *Messages*
in a window, then type C-l to trigger the Frecenter breakpoint, and
when GDB kicks in, type the following at the GDB prompt:

  (gdb) n
  (gdb) n
  (gdb) p buf
  (gdb) watch $1->text->z_byte if $1->text->z_byte != $1->text->z

This relies on the fact that our code always changes Z_BYTE _after_
the corresponding change to Z.  The only exception to this rule that I
found is in insdel.c:del_range_2, where we do it in the opposite
order.  So for the above to work, you need to edit that function and
transpose the line of code that modifies Z_BYTE with the one that
modifies Z.  Then rebuild Emacs and use the resulting binary to debug
this with the above watchpoint.
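
Why the update order matters for a conditional watchpoint on Z_BYTE can
be shown with a toy model.  This is only a sketch, not the code in
insdel.c: struct text here is a stand-in with two fields, the helper
names are hypothetical, and the deleted text is assumed to be pure
ASCII, so that N characters are N bytes.

```c
#include <assert.h>
#include <stdbool.h>

struct text { long z; long z_byte; };

/* Stand-in for the watchpoint: the condition is evaluated at the
   moment z_byte is written.  Returns true if it would stop.  */
bool write_z_byte(struct text *t, long value)
{
  t->z_byte = value;
  return t->z_byte != t->z;
}

/* Usual order: Z first, then Z_BYTE.  When the write to z_byte
   happens, z already has its new value, so the two agree.  */
bool delete_z_first(struct text *t, long n)
{
  assert(n >= 0);
  t->z -= n;
  return write_z_byte(t, t->z_byte - n);
}

/* Opposite order: Z_BYTE first, then Z.  The write to z_byte sees
   the old z, so the watchpoint stops although nothing is wrong.  */
bool delete_z_byte_first(struct text *t, long n)
{
  assert(n >= 0);
  bool stopped = write_z_byte(t, t->z_byte - n);
  t->z -= n;
  return stopped;
}
```

Starting from z = z_byte = 100 and deleting 10 characters, the first
variant does not trigger the watchpoint and the second one does, even
though both leave the buffer in the same consistent state.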

> Theory: there's a race condition between error handling that ends up
> writing to *Messages* and the logic that aggregates duplicated messages
> into things like [5 times].

I don't see how this could happen, for two reasons:

 . Emacs is a single-threaded program, so how can two pieces of code
   that run in the same thread produce a race condition?
 . in this particular case, both writing to *Messages* and aggregation
   of identical messages happen in the same function, one after the
   other; see xdisp.c:message_dolog.

> I saw the crashing once every 20min maybe, so reproducing it is probably
> possible, but not very quick and easy. Does it make sense to try to fix
> the (condition-case) problem first, since that's easily reproducible?

I don't see how fixing that problem could help.  It might even
interfere, if that problem somehow triggers this one.  Or did I miss
something?




