bug#33205: 26.1; unibyte/multibyte missing in rx.el
From: Mattias Engdegård
Subject: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Wed, 7 Nov 2018 19:08:43 +0100
On 5 Nov 2018, at 17:49, Eli Zaretskii <eliz@gnu.org> wrote:
> After looking into this, my conclusion is that what I wrote above was
> not too wrong. Indeed, currently [:ascii:]/[:nonascii:] cannot be
> distinguished from [:unibyte:]/[:multibyte:]. In a nutshell, it turns
> out [:unibyte:] is not what one might think it is, you can see that in
> re_wctype_to_bit, for example.
Thank you very much for taking the time to look at this, and for the detailed
answer.
My apologies for severely complicating what I initially thought was quite a
trifle!
> That ^[:ascii:] is not the same as [:nonascii:], and the same with
> [:unibyte:] vs ^[:multibyte:], is arguably a bug. The reason for that
> becomes clear if you look at how we generate the fastmap in each of
> these cases and how we set the bits in the work-area of the range
> table, but I don't know enough to say how easy it would be to fix
> that.
>
> An alternative is to use an explicit character class, as in \000-\377,
> that works as you'd expect.
I'm not sure what I expected [\000-\377] to mean in a multibyte string; one
endpoint is ASCII and the other is a raw byte. It does work, as you noted,
because two ranges are generated, as if written [\000-\177\200-\377].
In old Emacs versions (I tried 22.1.1), [:unibyte:] appears to match raw
bytes in multibyte strings/buffers and everything in unibyte strings/buffers
(that is, [\000-\377] in both cases), with [:multibyte:] being its complement.
Thus, at some point the behaviour changed, but I cannot find any NEWS entry
for it. It could have been an accident.
Perhaps those char classes didn't see much use.
The old behaviour seems a little more intuitive, but it must be rare to need
regex matching of rubbish bytes in multibyte strings. If you could argue that
the status quo is fine then I wouldn't necessarily object, but would suggest
that at least the code be made explicit about it (and the documentation, as
well).
> Well, what do you think now? Is it worth adding those to rx.el? I'm
> not sure. How important is it to find unibyte characters in a string,
> anyway?
Unless we manage to make [:unibyte:]/[:multibyte:] more useful in their own
right, it's fine to leave rx.el as is, as far as I'm concerned. There is no
loss of expressivity.
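For illustration, a minimal sketch of why nothing seems to be lost: if, as described above, [:unibyte:]/[:multibyte:] currently cannot be distinguished from [:ascii:]/[:nonascii:], then the `ascii' and `nonascii' forms that rx already provides cover the same ground. The test string is only an assumption for the example, and no results are claimed here; evaluate the form to compare them on your own Emacs version.

```elisp
;; Sketch only: compare the existing rx forms with the raw bracket classes.
;; The sample string "aé" is a hypothetical test input, not from the bug report.
(require 'rx)

(let ((s "aé"))   ; multibyte string: one ASCII char, one non-ASCII char
  (list
   (rx ascii)                             ; regexp that rx generates for `ascii'
   (rx nonascii)                          ; regexp that rx generates for `nonascii'
   (string-match-p (rx nonascii) s)       ; position of first non-ASCII match
   (string-match-p "[[:multibyte:]]" s)   ; expected to agree with the line above
   (string-match-p (rx ascii) s)          ; position of first ASCII match
   (string-match-p "[[:unibyte:]]" s)))   ; expected to agree with the line above
```

If the paired results agree, writing (rx nonascii) in place of a hypothetical (rx multibyte) expresses the same match.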