bug-gnu-emacs

bug#64735: 29.0.92; find invocations are ~15x slower because of ignores


From: Dmitry Gutov
Subject: bug#64735: 29.0.92; find invocations are ~15x slower because of ignores
Date: Tue, 12 Sep 2023 21:48:37 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 12/09/2023 19:32, Eli Zaretskii wrote:
Date: Tue, 12 Sep 2023 17:23:53 +0300
Cc: luangruo@yahoo.com, sbaugh@janestreet.com, yantar92@posteo.net,
  64735@debbugs.gnu.org
From: Dmitry Gutov <dmitry@gutov.dev>

- Dropping most of the setup in read_and_dispose_of_process_output
(which creates some consing too) and calling
Finternal_default_process_filter directly (call_filter_directly.diff),
when it is the filter to be used anyway.

- Going around that function entirely, skipping the creation of a Lisp
string (CHARS -> TEXT) and inserting into the buffer directly (when the
filter is set to the default, of course). Copied and adapted some code
from 'call_process' for that (read_and_insert_process_output.diff).

Neither is intended as a complete proposal, but here are some
comparisons. Note that either of these patches can only help the
implementations that don't set up a custom process filter (the naive
first one, and the new parallel number 5 above).

For testing, I used two different repo checkouts that are large enough
to not finish too quickly: gecko-dev and torvalds-linux.

master

| Function                                         | gecko-dev | linux |
|--------------------------------------------------+-----------+-------|
| find-directory-files-recursively                 |      1.69 |  0.41 |
| find-directory-files-recursively-2               |      1.16 |  0.28 |
| find-directory-files-recursively-3               |      0.92 |  0.23 |
| find-directory-files-recursively-5               |      1.07 |  0.26 |
| find-directory-files-recursively (rpom 409600)   |      1.42 |  0.35 |
| find-directory-files-recursively-2 (rpom 409600) |      0.90 |  0.25 |
| find-directory-files-recursively-5 (rpom 409600) |      0.89 |  0.24 |

call_filter_directly.diff (basically, not much difference)

| Function                                         | gecko-dev | linux |
|--------------------------------------------------+-----------+-------|
| find-directory-files-recursively                 |      1.64 |  0.38 |
| find-directory-files-recursively-5               |      1.05 |  0.26 |
| find-directory-files-recursively (rpom 409600)   |      1.42 |  0.36 |
| find-directory-files-recursively-5 (rpom 409600) |      0.91 |  0.25 |

read_and_insert_process_output.diff (noticeable differences)

| Function                                         | gecko-dev | linux |
|--------------------------------------------------+-----------+-------|
| find-directory-files-recursively                 |      1.30 |  0.34 |
| find-directory-files-recursively-5               |      1.03 |  0.25 |
| find-directory-files-recursively (rpom 409600)   |      1.20 |  0.35 |
| find-directory-files-recursively-5 (rpom 409600) | (!!) 0.72 |  0.21 |

So it seems like we have at least two potential ways to implement an
asynchronous file listing routine that is as fast or faster than the
synchronous one (if only thanks to starting the parsing in parallel).

Combining the last patch with a very large value of
read-process-output-max seems to yield the most benefit, though I'm
not sure it's appropriate to just raise that value in our code.

Thoughts?

I'm not sure what exactly is here to think about.  Removing portions
of read_and_insert_process_output, or bypassing it entirely, is not
going to fly, because AFAIU it basically means we don't decode text,
which can only work with plain ASCII file names, and/or don't move the
markers in the process buffer, which also cannot be avoided.

That one was really a test to see whether the extra handling added enough consing to affect GC. Removing it didn't make a difference (see table 2), so no.

If you
want to conclude that inserting the process's output into a buffer
without consing Lisp strings is faster (of which I'm not sure, see
below, but it could be true),

That's what my tests seem to show, see table 3 (the last one).

then we could try extending
internal-default-process-filter (or writing a new filter function
similar to it) so that it inserts the stuff into the gap and then uses
decode_coding_gap,

Can that work at all? By the time internal-default-process-filter is called, we have already turned the output from a char * into the Lisp_Object TEXT that we pass to it. So the consing has already happened, IIUC.

which converts inserted bytes in-place -- that, at
least, will be correct and will avoid consing intermediate temporary
strings from the process output, then decoding them, then inserting
them.  Other than that, the -2 and -3 variants are very close
runners-up of -5, so maybe I'm missing something, but I see no reason
be too excited here?  I mean, 0.89 vs 0.92? really?

The important part is not 0.89 vs 0.92 (that would indeed be meaningless), but that we have an _asynchronous_ implementation of the feature that is as fast as the existing synchronous one (or faster: if we also bind read-process-output-max to a large value, the time is 0.72).

The possible applications for that range from simple (printing a progress bar while the scan is happening) to more advanced (launching a concurrent process that pipes the received file names to 'xargs grep'), including visuals (an xref buffer that shows intermediate search results right away, updating them gradually, all without blocking the UI).

About inserting into the buffer: what we do is insert into the gap,
and when the gap becomes full, we enlarge it.  Enlarging the gap
involves: (a) enlarging the chunk of memory allocated to buffer text
(which might mean we ask the OS for more memory), and (b) moving the
characters after the gap to the right to free space for inserting more
stuff.  This is pretty fast, but still, with a large pipe buffer and a
lot of output, we do this many times, so it could add up to something
pretty tangible.  It's hard for me to tell whether this is
significantly faster than consing strings and inserting them; only
measurements can tell.

See the benchmark tables and the POC patch in my previous email. Using a better filter function would be ideal, but it seems like that's not going to fit the current design. Happy to be proven wrong, though.




