Re: Surrogate pairs for addwstr?
From: Tim Allen
Subject: Re: Surrogate pairs for addwstr?
Date: Mon, 11 Oct 2021 15:05:49 +1100
On Sun, Oct 10, 2021 at 11:38:22AM -0400, Bill Gray wrote:
> The other way to put this would be to ask : if you're on a
> system with 32-bit wchar_ts, what should happen for this line?
>
> mvaddwstr( 0, 2, L"\xd83d\xdd1e Treble clef with a surrogate pair");
Honestly, what I'd *expect* to happen is a compile-time or run-time
error. This is what, for example, Rust does:
    error: invalid unicode character escape
     --> src/main.rs:2:34
      |
    2 |     println!("Treble clef: {}", "\u{d83d}\u{dd1e}");
      |                                  ^^^^^^^^ invalid escape
      |
      = help: unicode escape must not be a surrogate
...and also what Python 3 does:
    >>> print("\ud83d\udd1e")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
I guess C/C++ compilers don't report this as a problem because
technically the wide-character encoding is a property of libc, not of
the compiler, and they don't want to assume that libc is Unicode-based.
> #ifdef USING_UTF_16
Apparently the incantation is:
#if WCHAR_MAX == 65535
> At present, I've got surrogate pairs combining regardless of
> encoding in PDCursesMod; is there really a situation where I ought
> to instead be displaying glyphs of some sort for U+D800 to U+DFFF?
Printing gibberish is never particularly helpful, but encouraging people
to assume wide-string literals (or wide-strings in general) use UTF-16
encoding seems like a bad idea. Sure, you can make it work transparently
for curses, but there are other libraries (like libc) that are likely to
get tripped up, and that seems like a foot-gun waiting to happen. Even
if you provide a utf16towcs() helper, people are going to forget to call
it since the input and output types are both wchar_t*.
The absolute simplest and safest thing a portable program could do is to
restrict itself to the Basic Multilingual Plane. The second simplest and
safest thing would probably be to store strings as UTF-8 (narrow) string
literals, and provide some kind of utf8stowcs() that decodes to UTF-16
or to UTF-32 depending on the value of WCHAR_MAX.
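
To make that second option concrete, here is a rough sketch of what such a
helper might look like. To be clear, utf8stowcs() is a made-up name, not
something libc or any curses library provides, and the signature (destination
buffer, buffer length, source string) is just my assumption. It also assumes
that a 16-bit wchar_t means UTF-16 and anything wider means UTF-32, using the
WCHAR_MAX test from earlier in this mail. It rejects malformed UTF-8,
overlong encodings, and encoded surrogates rather than passing them through.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper (not a real libc function): decode NUL-terminated
 * UTF-8 from src into dst, writing at most dst_len wchar_t elements
 * including the terminating L'\0'.
 * Returns the number of elements written, not counting the terminator,
 * or (size_t)-1 on malformed input or if dst is too small. */
size_t utf8stowcs(wchar_t *dst, size_t dst_len, const char *src)
{
    /* Smallest code point that legitimately needs 0..3 continuation bytes;
     * anything below that is an overlong encoding. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(dst != NULL && src != NULL);
    if (dst_len == 0)
        return (size_t)-1;

    while (*s != 0) {
        uint32_t cp;
        int extra;  /* number of continuation bytes expected */
        int i;

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;  /* stray continuation or invalid lead byte */
        s++;

        /* A NUL here fails the 10xxxxxx test, so we never run past the end. */
        for (i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
        }

        /* Reject overlong forms, out-of-range values, and surrogates. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

#if WCHAR_MAX == 65535
        /* 16-bit wchar_t: assume UTF-16 and emit a surrogate pair. */
        if (cp >= 0x10000) {
            if (n + 2 >= dst_len)
                return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (wchar_t)(0xD800 + (cp >> 10));
            dst[n++] = (wchar_t)(0xDC00 + (cp & 0x3FF));
            continue;
        }
#endif
        /* Wider wchar_t (assumed UTF-32), or a BMP character either way. */
        if (n + 1 >= dst_len)
            return (size_t)-1;
        dst[n++] = (wchar_t)cp;
    }

    dst[n] = L'\0';
    return n;
}
```

A caller would keep its text as narrow UTF-8 literals, decode into a local
wchar_t buffer, and hand that buffer to mvaddwstr() only if the return value
is not (size_t)-1. The same source then works whichever width wchar_t
happens to be, without ever writing a surrogate escape in a wide literal.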
Tim.