bug-gnu-emacs

bug#69611: 30.0.50; Long bidi line with control characters freezes Emacs


From: Eli Zaretskii
Subject: bug#69611: 30.0.50; Long bidi line with control characters freezes Emacs
Date: Thu, 07 Mar 2024 17:42:44 +0200

> Date: Thu, 07 Mar 2024 14:42:37 +0100
> From:  Stephen Berman via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
> 
> When I visited a certain elisp file generated by a program of mine and
> typed `M-v', it took some time (see below for details) for the display to
> scroll to 4% from the top (according to the mode line) and then there
> was no further change and Emacs froze, using 100% of a CPU core.  I
> found no way to unfreeze it within Emacs and after about 15 minutes
> terminated the emacs process from the shell.  This is reliably
> reproducible with this file.
> 
> The file in question is only about 50k bytes long, but it contains one
> line of more than 37k characters, consisting of a mix of ASCII and
> non-ASCII characters, including properly shaped Arabic script.  The file
> itself has base paragraph direction LTR.
> 
> Most of the Arabic words in this file are enclosed in the bidirectional
> control characters POP DIRECTIONAL FORMATTING (#x202c) and RIGHT-TO-LEFT
> EMBEDDING (#x202b).  I did not add these characters, but I had
> copy-&-pasted most of the Arabic from a PDF file I did not create.  I
> don't know if PDFs of Arabic text normally contain these control
> characters, but the consequences for Emacs were dramatic.  When I simply
> visited this file in Emacs (started with -Q) there was an immediate
> slowdown, and in top I could see Emacs using 100% of a CPU thread.  I
> ran `M-: (benchmark-run nil (end-of-buffer))' on this file, and the
> result was:
> 
> (27.962602113 2 0.0226042269999999977)

This is a crazy file.  UBA, the Unicode Bidirectional Algorithm,
allows the RLE..PDF embeddings to nest.  The nesting is allowed to be
up to 125 deep(!), but I have never seen a text file using more than a
couple of nested embeddings.  This file goes up to 111 nested
embedding levels!  Moreover, quite a few embeddings are invalid: there
are 1021 RLE control characters in this file, but only 971 PDF
controls, so they don't pair as they should.  This causes the
reordering algorithm to examine extremely long stretches of characters
each time we need to redisplay even a small portion of the window,
because reordering must always find where each nested level ends to do
its job.
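
For anyone who wants to check a file for this kind of structure, here is a
minimal Python sketch of the counting described above.  It is not how Emacs
does it: the function name and the returned keys are made up for
illustration, the whole text is treated as a single paragraph, and only RLE
and PDF are considered (no LRE, overrides, or isolates).

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embedding_stats(text):
    """Count RLE/PDF controls and track how deeply the embeddings nest.

    Assumption of this sketch: the text is one paragraph, and RLE/PDF are
    the only explicit bidi controls that matter.
    """
    depth = 0       # embeddings currently open
    max_depth = 0   # deepest nesting seen so far
    stray_pdf = 0   # PDFs with no open embedding to close
    for ch in text:
        if ch == RLE:
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == PDF:
            if depth > 0:
                depth -= 1
            else:
                stray_pdf += 1
    return {
        "rle": text.count(RLE),
        "pdf": text.count(PDF),
        "max_depth": max_depth,
        "unclosed": depth,      # RLEs still open at the end of the text
        "stray_pdf": stray_pdf,
    }
```

Calling `embedding_stats(open(path, encoding="utf-8").read())` on a file
gives a rough picture of whether its controls pair up and how deep they go.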

My suggestion is to remove all the RLE and PDF controls from the file.
They are not needed, not in Emacs anyway.  I'm guessing the program
which created this file uses bidi controls because it wants to be
compatible with incomplete implementations of the UBA, which don't
support implicit embedding levels (those caused by bidirectional
properties of characters, as opposed to explicit bidi controls like
RLE and PDF).  With full UBA implementations, the bidi controls are
needed only when the reordering using implicit levels produces wrong
results, which is quite rare.
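
One way to do the removal outside Emacs is sketched below in Python.  This
is only an illustration under stated assumptions: the function names are
hypothetical, the file is assumed to be UTF-8, and only RLE and PDF are
stripped, so any other bidi controls are left alone.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Mapping a code point to None makes str.translate delete that character.
_STRIP_TABLE = {ord(RLE): None, ord(PDF): None}


def strip_embedding_controls(text):
    """Return TEXT with every RLE and PDF control removed."""
    return text.translate(_STRIP_TABLE)


def strip_file(src, dst):
    """Write a copy of SRC to DST without RLE/PDF (assumes UTF-8)."""
    with open(src, encoding="utf-8") as f:
        cleaned = strip_embedding_controls(f.read())
    with open(dst, "w", encoding="utf-8") as f:
        f.write(cleaned)
```

Writing to a separate destination file keeps the original around, in case
some of the controls turn out to be needed after all.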

> The display of the benchmark result only appeared in the echo area after
> more than a minute (I timed it with a stopwatch).  At that point the
> mode line showed the buffer at 4% from the top, and the display remained
> frozen afterwards.  After several minutes during which Emacs consumed
> 100% CPU, and I had switched the focus away from the Emacs frame, the
> CPU consumption stopped, but as soon as I switched focus back to that
> frame, it went back to 100%.  The display never changed from showing the
> buffer at 4%, apparently being in some kind of infinite loop.  After
> about 15 minutes I started gdb, attached the Emacs process and produced
> a backtrace, which I've attached, in the hope it helps to diagnose the
> problem.

The extremely deep nesting of embeddings in the file, coupled with the
fact that the first embedding starts near the beginning of the file,
but ends very near its end, causes the algorithm that finds where to
position the cursor to fail, because it cannot cope with the situation
where, after C-f or C-b, the position of point is very far outside of
the window.  I guess this causes some infloop (even though I don't see
it here, I just see that the cursor doesn't move although point does
move).  It could also be just a very long calculation, not an infloop,
because finding where to place the window-start point in this case is
also very expensive.

> Nevertheless, there seems to be something else besides the control
> characters involved in this issue, because as a further test, I created
> a buffer consisting of more than 1000 copies of the test string
> concatenating the Arabic example in etc/HELLO and "Hello" (see bug#69385
> for more on such test buffers), and manually enclosed each Arabic word
> in the above control characters, but the benchmark result in this buffer
> was not significantly different from the result without the control
> characters (and similar to the above result for the copy of the
> problematic file without the control characters), and the display did
> not freeze.

Yes, because your test never used such deeply nested embeddings, and
its embeddings never spanned as many characters as the ones in this
file do.
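
To make the difference concrete, the Python sketch below builds both kinds
of test text: one where each word gets its own RLE..PDF pair, and one where
the embeddings open one inside another and close only at the very end.
The function names and parameters are hypothetical, and I have not checked
that text generated this way reproduces the freeze; it only illustrates the
structural difference.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def flat_sample(word, count, filler):
    """COUNT copies of WORD, each in its own RLE..PDF pair (depth 1)."""
    return (RLE + word + PDF + " " + filler + " ") * count


def nested_sample(word, depth, filler):
    """DEPTH embeddings opened one inside another, all closed at the end.

    The outermost embedding spans the whole string, so every level
    covers a long stretch of characters.
    """
    return (RLE + word + " " + filler + " ") * depth + PDF * depth
```

For example, `flat_sample("سلام", 1000, "Hello")` resembles the test
described in the quoted report, while `nested_sample("سلام", 100, "Hello")`
is closer in shape to the problematic file.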

This file is an interesting curiosity, as far as I'm concerned, but I
doubt whether I will find enough time and motivation to try to speed
up Emacs when such crazy files are visited.




