Re: dired-do-find-regexp failure with latin-1 encoding


From: Dmitry Gutov
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Sat, 28 Nov 2020 23:04:10 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

On 28.11.2020 22:29, Eli Zaretskii wrote:

Ah, so this way the user explicitly searches for a regexp encoded as
latin-1?

More accurately, this is how to search in files encoded in Latin-1.
(The regexp also gets encoded in latin-1, but the important part is
the files' encoding.)

Right. So when there are files in different encodings, the results will not be great, as expected.
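
To illustrate why the files' encoding is the important part: the same non-ASCII pattern turns into different bytes under different encodings, so a byte-level search only matches files stored in the encoding used for the pattern. A minimal Python sketch, with made-up sample strings (nothing here is taken from Emacs or Grep):

```python
# Hypothetical sample data: the same line of text stored in two encodings.
pattern = "café"
line = "un café noir"

latin1_file = line.encode("latin-1")
utf8_file = line.encode("utf-8")

# The pattern encoded as Latin-1, as when the search is told to use that encoding.
latin1_pattern = pattern.encode("latin-1")  # 'é' becomes the single byte 0xE9
utf8_pattern = pattern.encode("utf-8")      # 'é' becomes the two bytes 0xC3 0xA9

# A byte-level search with the Latin-1 pattern only hits the Latin-1 file.
found_in_latin1 = latin1_pattern in latin1_file
found_in_utf8 = latin1_pattern in utf8_file
```

With these inputs the Latin-1 pattern is found in the Latin-1 bytes but not in the UTF-8 bytes, which is the mixed-encodings problem in miniature.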

Adding -a probably cannot do any harm, but its support should be
detected, since I don't think it's portable enough (it isn't in the
latest Posix spec, at least).

Are you sure about that? Are we sure it won't make searching binary
files slower, for example?

It will be slower, but more useful: by default Grep just says "Binary
file foo matches".

Do we want to search the "binary" files at all? Right now we simply filter such matches out (see the definition of xref-matches-in-files), and I have seen no complaints.

Also, the manual has this warning:

Warning: The -a option might output binary garbage, which can have
nasty side effects if the output is a terminal and if the terminal
driver interprets some of it as commands.

...which might conceivably mess up our parsing of Grep output sometimes?

This is not relevant, since we read that output, there's no terminal
device driver to interpret it and get messed up.

Our interpreter is our regexp with which we parse. But I suppose as long as Grep doesn't insert unexpected newlines, the parser will be fine.
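
A rough sketch of that kind of line-oriented parsing, in Python for brevity. The regexp and the function name are hypothetical and are not what xref-matches-in-files actually uses; the assumed input shape is FILE:LINE:TEXT, one match per line:

```python
import re

# Assumed output shape: FILE:LINE:TEXT, one match per line.
MATCH_RE = re.compile(r"^(?P<file>[^:\n]+):(?P<line>\d+):(?P<text>.*)$")

def parse_grep_output(output):
    """Return (file, line_number, text) tuples; skip lines that don't fit."""
    hits = []
    for raw in output.split("\n"):
        m = MATCH_RE.match(raw)
        if m is None:
            # E.g. "Binary file foo matches", or a fragment left over
            # after an unexpected newline inside binary output.
            continue
        hits.append((m.group("file"), int(m.group("line")), m.group("text")))
    return hits

sample = (
    "a.txt:3:caf\u00e9 au lait\n"
    "Binary file foo matches\n"
    "b.txt:10:another caf\u00e9\n"
)
hits = parse_grep_output(sample)
```

Here the "Binary file foo matches" notice is dropped simply because it does not fit the pattern. A stray newline in binary output would split one match into a valid-looking head and an unparseable tail, which is the failure mode in question.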

I actually don't think I understand why we need -a in this case, since
Grep looks for null bytes to decide this is a binary file, and encoded
non-ASCII characters don't have null bytes (except if they are in
UTF-16).

Good question.
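
The null-byte observation is easy to check. The Python sketch below only models the check as described above (is there a NUL byte anywhere in the data?) and does not claim to reproduce Grep's full heuristic; the helper name and sample string are made up:

```python
def has_nul(data: bytes) -> bool:
    """Model of the check described above: does the data contain a NUL byte?"""
    return b"\x00" in data

text = "Gr\u00fc\u00dfe"  # "Grüße", a sample non-ASCII string

latin1 = text.encode("latin-1")
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

nul_in_latin1 = has_nul(latin1)
nul_in_utf8 = has_nul(utf8)
nul_in_utf16 = has_nul(utf16)
```

Only the UTF-16 bytes contain NULs (every ASCII character gets a zero high byte), so by this check alone Latin-1 and UTF-8 text would not be classified as binary.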

P.S. Or we can forgo all that and ask the users who want to search for
non-ASCII strings to install ripgrep.

We should support Grep regardless, since not everyone will have
ripgrep.  And in any case, "C-x RET c" will be needed with it as well,
no?

I'd have to test it explicitly to say for sure, but:

  ripgrep supports searching files in text encodings other than UTF-8,
  such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some
  support for automatically detecting UTF-16 is provided. Other text
  encodings must be specifically specified with the -E/--encoding flag.)

https://blog.burntsushi.net/ripgrep/#pitch

So if the file encoding is UTF-8, UTF-16, or latin-1 (AND the current system locale matches that encoding), the search should work fine across such files in different encodings, and without 'C-x RET c'.

Which doesn't cover all situations, of course, but it's about as much as can be expected. And more than Grep can.


