From: Eli Zaretskii
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Tue, 01 Jun 2010 21:38:41 +0300

> Date: Mon, 31 May 2010 20:24:00 -0400
> From: MON KEY <monkey@sandpframing.com>
> Cc: 6283@debbugs.gnu.org
>
> If I evaluate the following:
>
> (progn
> (save-excursion
> (insert-byte (multibyte-char-to-unibyte 4194221) 1)
> (insert-byte (multibyte-char-to-unibyte 4194303) 1))
> (search-forward-regexp "ÿ" nil t))
>
> I don't match.

Because ÿ is a character, whereas
`(insert-byte (multibyte-char-to-unibyte 4194303) 1)' inserts a raw
byte. Emacs can distinguish between the two because it uses a special
multibyte representation for raw bytes, which is different from that
of any Unicode character. See this fragment from the ELisp manual:

    Emacs defines several special character sets. The character set
    `unicode' includes all the characters whose Emacs code points are in
    the range `0..#x10FFFF'. The character set `emacs' includes all ASCII
    and non-ASCII characters. Finally, the `eight-bit' charset includes
    the 8-bit raw bytes; Emacs uses it to represent raw bytes encountered
    in text.

and also this one:

    To support this multitude of characters and scripts, Emacs closely
    follows the "Unicode Standard". The Unicode Standard assigns a unique
    number, called a "codepoint", to each and every character. The range
    of codepoints defined by Unicode, or the Unicode "codespace", is
    `0..#x10FFFF' (in hexadecimal notation), inclusive. Emacs extends this
    range with codepoints in the range `#x110000..#x3FFFFF', which it uses
    for representing characters that are not unified with Unicode and "raw
    8-bit bytes" that cannot be interpreted as characters. Thus, a
    character codepoint in Emacs is a 22-bit integer number.

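The mapping between raw bytes and their codepoints can be checked from
Lisp. The following is only a sketch, assuming Emacs 23 or later; the
helper `my-raw-byte-char-p' is a made-up name for illustration, not
part of Emacs:

```elisp
;; Raw bytes #x80..#xFF are represented by the codepoints
;; #x3FFF80..#x3FFFFF, i.e. byte value + #x3FFF00.
(defun my-raw-byte-char-p (char)
  "Return non-nil if CHAR is the multibyte representation of a raw byte.
Hypothetical helper, for illustration only."
  (eq (char-charset char) 'eight-bit))

(list (unibyte-char-to-multibyte #xFF)     ; raw byte 255 as a codepoint
      (multibyte-char-to-unibyte 4194303)  ; and back to the byte value
      (my-raw-byte-char-p 4194303)         ; the raw byte
      (my-raw-byte-char-p #xFF))           ; LATIN SMALL LETTER Y WITH DIAERESIS
;; => (4194303 255 t nil)
```

So 4194303 (#x3FFFFF) and 255 (#xFF) are two different characters as
far as Emacs is concerned, even though both relate to octal 377.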
> Whereas if I evaluate:
>
> (progn
> (save-excursion (insert 10 #o377))
> (search-forward-regexp "ÿ" nil t))
>
> I get a match.

Because `(insert 10 #o377)' inserts LATIN SMALL LETTER Y WITH
DIAERESIS, by design.

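The difference between inserting the character and inserting the raw
byte can be reproduced in a temporary buffer. This is a sketch, not a
definitive test; `my-search-after-insert' is a hypothetical helper
introduced only for this example, and "\u00FF" is used for ÿ so the
result does not depend on the coding system of the file holding the
code:

```elisp
(defun my-search-after-insert (insert-fn regexp)
  "Call INSERT-FN in a fresh multibyte buffer, then search for REGEXP.
Hypothetical helper; return t if REGEXP matches, nil otherwise."
  (with-temp-buffer
    (set-buffer-multibyte t)
    (funcall insert-fn)
    (goto-char (point-min))
    (and (re-search-forward regexp nil t) t)))

(list
 ;; `insert' with 255 inserts the character U+00FF, so ÿ matches.
 (my-search-after-insert (lambda () (insert 10 #o377)) "\u00FF")
 ;; `insert-byte' inserts the raw byte, which is not U+00FF.
 (my-search-after-insert (lambda () (insert-byte #o377 1)) "\u00FF")
 ;; The unibyte pattern "\377" does match the raw byte.
 (my-search-after-insert (lambda () (insert-byte #o377 1)) "\377"))
;; => (t nil t)
```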
> Likewise, if I evaluate
>
> (progn (save-excursion (insert 10 4194303))
> (search-forward-regexp "\377" nil t))
>
> I get a match.
>
> Which is to say, given the example regexp from the manual, i.e:
>
> ,----
> | You cannot always match all non-ASCII characters with the regular
> | expression `"[\200-\377]"'
> `----
>
> I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
> LATIN SMALL LETTER Y WITH DIAERESIS

Sounds like a bug to me --- not in the conventions used by the
manual, but rather in regexp search in Emacs. Feel free to file a
separate bug about that.

> To be clear, my issue isn't that I am not able to match `ÿ' but rather
> that I am able to match the raw-byte character representation with a
> visual appearance which coincides with the octal value for the `ÿ'
> character code i.e. #o377 this being otherwise widely understood as
> `octal 0377'.
>
> I hope this is more clear than the previous mail. I apologize if it is not.

I hope my answers make this issue clearer. (Did I say that use of
raw bytes is complicated and full of subtleties?)