bug-gnu-emacs
bug#64735: 29.0.92; find invocations are ~15x slower because of ignores


From: Spencer Baugh
Subject: bug#64735: 29.0.92; find invocations are ~15x slower because of ignores
Date: Sat, 22 Jul 2023 16:53:05 -0400
User-agent: Gnus/5.13 (Gnus v5.13)

Eli Zaretskii <eliz@gnu.org> writes:

>> From: sbaugh@catern.com
>> Date: Sat, 22 Jul 2023 17:18:19 +0000 (UTC)
>> Cc: sbaugh@janestreet.com, yantar92@posteo.net, rms@gnu.org,
>>      dmitry@gutov.dev, michael.albinus@gmx.de, 64735@debbugs.gnu.org
>> 
>> First my results:
>> 
>> (my-bench 100 "~/public_html" "")
>> (("built-in" . "Elapsed time: 1.140173s (0.389344s in 5 GCs)")
>>  ("with-find" . "Elapsed time: 0.643306s (0.305130s in 4 GCs)"))
>> 
>> (my-bench 10 "~/.local/src/linux" "")
>> (("built-in" . "Elapsed time: 2.402341s (0.937857s in 11 GCs)")
>>  ("with-find" . "Elapsed time: 1.544024s (0.827364s in 10 GCs)"))
>> 
>> (my-bench 100 "/ssh:catern.com:~/public_html" "")
>> (("built-in" . "Elapsed time: 36.494233s (6.450840s in 79 GCs)")
>>  ("with-find" . "Elapsed time: 4.619035s (1.133656s in 14 GCs)"))
>> 
>> 2x speedup on local files, and almost a 10x speedup for remote files.
>
> Thanks, that's impressive.  But you omitted some of the features of
> directory-files-recursively, see below.
>
>> And my implementation *isn't even using the fact that find can run in
>> parallel with Emacs*.  If I did start using that, I expect even more
>> speed gains from parallelism, which aren't achievable in Emacs itself.
>
> I'm not sure I understand what you mean by "in parallel" and why it
> would be faster.

I mean having Emacs read output from the process and turn it into
strings while find is still running and walking the directory tree, so
that the two parts run in parallel.  This, specifically:

(defun find-directory-files-recursively
    (dir regexp &optional include-directories _predicate follow-symlinks)
  (cl-assert (null _predicate) t
             "find-directory-files-recursively can't accept arbitrary predicates")
  (cl-assert (not (file-remote-p dir)))
  (let* (buffered
         result
         (proc
          (make-process
           :name "find" :buffer nil
           :connection-type 'pipe
           :noquery t
           :sentinel (lambda (_proc _state))
           ;; Output arrives in arbitrary chunks, so a file name can be
           ;; split across two chunks; keep the incomplete tail in
           ;; `buffered' until its terminating NUL shows up.
           :filter (lambda (_proc data)
                     (let ((start 0))
                       (when-let (end (string-search "\0" data start))
                         (push (concat buffered (substring data start end))
                               result)
                         (setq buffered "")
                         (setq start (1+ end))
                         (while-let ((end (string-search "\0" data start)))
                           (push (substring data start end) result)
                           (setq start (1+ end))))
                       (setq buffered
                             (concat buffered (substring data start)))))
           :command (append
                     (list "find")
                     ;; -L is a global option and has to come before
                     ;; the starting directory.
                     (when follow-symlinks '("-L"))
                     (list (file-local-name dir))
                     (unless follow-symlinks
                       '("!" "(" "-type" "l" "-xtype" "d" ")"))
                     ;; Must be a list: `append' would otherwise splice
                     ;; in the characters of the regexp string.
                     (unless (string-empty-p regexp)
                       (list "-regex" (concat ".*" regexp ".*")))
                     (unless include-directories
                       '("!" "-type" "d"))
                     '("-print0")))))
    (while (accept-process-output proc))
    result))
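
To make the chunk handling in the filter above easier to follow, here
is the same idea as a small pure function.  This is only a sketch for
illustration; my-split-nul-chunk is a made-up name and not part of the
proposed change.  It takes the leftover text from the previous chunk
plus a new chunk, and returns the complete NUL-terminated records
together with the new leftover:

```elisp
;; Hypothetical helper, for illustration only.
(defun my-split-nul-chunk (carry chunk)
  "Split CARRY followed by CHUNK at NUL bytes.
Return (RECORDS . NEW-CARRY): RECORDS are the complete NUL-terminated
strings, NEW-CARRY is the trailing text with no NUL after it yet."
  (let ((parts (split-string (concat carry chunk) "\0")))
    ;; `split-string' keeps empty strings here, so the last element is
    ;; always whatever followed the final NUL (possibly "").
    (cons (butlast parts) (car (last parts)))))

;; A file name split across two chunks is reassembled:
;; (my-split-nul-chunk "" "a/b\0c")   ; => (("a/b") . "c")
;; (my-split-nul-chunk "c" "/d\0")    ; => (("c/d") . "")
```

The filter in the function above avoids the extra concat of the whole
chunk by only joining the first record with the leftover, which
presumably matters when chunks are large.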

Can you try this further change on your Windows (and GNU/Linux) box?  I
just tested on a different box and my original change gets:

(("built-in" . "Elapsed time: 4.506643s (2.276269s in 21 GCs)")
 ("with-find" . "Elapsed time: 4.114531s (2.848497s in 27 GCs)"))

while this parallel implementation gets

(("built-in" . "Elapsed time: 4.479185s (2.236561s in 21 GCs)")
 ("with-find" . "Elapsed time: 2.858452s (1.934647s in 19 GCs)"))

so it might have a favorable impact on Windows and your other GNU/Linux
box.

>> So can we add something like this (with the appropriate fallbacks to
>> directory-files-recursively), since it has such a big speedup even
>> without parallelism?
>
> We can have an alternative implementation, yes.  But it should support
> predicate, and it should sort the files in each directory like
> directory-files-recursively does, so that it's a drop-in replacement.
> Also, I believe that Find does return "." in each directory, and your
> implementation doesn't filter them, whereas
> directory-files-recursively does AFAIR.
>
> And I see no need for any fallback: that's for the application to do
> if it wants.
>
>>   (cl-assert (null _predicate) t
>>     "find-directory-files-recursively can't accept arbitrary predicates")
>
> It should.

This is where I think a fallback would be useful: it's basically
impossible to support arbitrary predicates efficiently here, since that
would require putting Lisp in control of whether find descends into a
directory.  So I'm thinking I would just fall back to running the old
directory-files-recursively whenever there's a predicate.  Or we could
just not support predicates at all...
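
A rough sketch of what such a fallback could look like.  The wrapper
name my-directory-files-recursively is hypothetical, and the exact
conditions for falling back (a predicate, a remote directory, no find
executable, or the find-based function not being loaded) are my
assumptions, not a settled design:

```elisp
;; Hypothetical wrapper, for illustration only.
(defun my-directory-files-recursively
    (dir regexp &optional include-directories predicate follow-symlinks)
  "Like `directory-files-recursively', using find(1) when that is possible."
  (if (or predicate                     ; find can't call back into Lisp
          (file-remote-p dir)           ; the find-based version asserts local
          (not (executable-find "find"))
          (not (fboundp 'find-directory-files-recursively)))
      ;; Fall back to the existing pure-Lisp implementation.
      (directory-files-recursively
       dir regexp include-directories predicate follow-symlinks)
    (find-directory-files-recursively
     dir regexp include-directories nil follow-symlinks)))
```

Note that the two branches do not return files in the same order,
since find's output is not sorted per directory, so this is not yet a
drop-in replacement in the sense discussed above.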

>>           (if follow-symlinks
>>               '("-L")
>>             '("!" "(" "-type" "l" "-xtype" "d" ")"))
>>           (unless (string-empty-p regexp)
>>             "-regex" (concat ".*" regexp ".*"))
>>           (unless include-directories
>>             '("!" "-type" "d"))
>>           '("-print0")
>
> Some of these switches are specific to GNU Find.  Are we going to
> support only GNU Find?

POSIX find doesn't support -regex, so I think we have to.  We could
stick to just POSIX find if we only allowed globs in
find-directory-files-recursively, instead of full regexes.

>>           ))
>>         (remote (file-remote-p dir))
>>         (proc
>>          (if remote
>>              (let ((proc (apply #'start-file-process
>>                                 "find" (current-buffer) command)))
>>                (set-process-sentinel proc (lambda (_proc _state)))
>>                (set-process-query-on-exit-flag proc nil)
>>                proc)
>>            (make-process :name "find" :buffer (current-buffer)
>>                          :connection-type 'pipe
>>                          :noquery t
>>                          :sentinel (lambda (_proc _state))
>>                          :command command))))
>>       (while (accept-process-output proc))
>
> Why do you call accept-process-output here?  It could interfere with
> reading output from async subprocesses running at the same time.
> Come to think of it, why use async subprocesses here and not
> call-process?

See my new iteration which does use the async-ness.
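
For comparison, the synchronous call-process approach asked about in
the quoted question might look roughly like the following.  This is a
sketch only: my-find-files-sync is a made-up name, it handles local
directories only, and it assumes a find that supports -print0.  Emacs
does nothing else until find has exited, so there is no overlap
between walking the tree and building the strings:

```elisp
;; Hypothetical synchronous variant, for illustration only.
(defun my-find-files-sync (dir)
  "Return the non-directory files under local DIR using one `find' call."
  (with-temp-buffer
    ;; Destination (t nil): stdout goes to this buffer, stderr is discarded.
    (let ((status (call-process "find" nil '(t nil) nil
                                (expand-file-name dir)
                                "!" "-type" "d" "-print0")))
      (unless (eq status 0)
        (error "find exited with status %s" status))
      ;; Every file name is NUL-terminated; drop the empty trailing piece.
      (split-string (buffer-string) "\0" t))))
```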




