Re: Surrogate pairs for addwstr?
From: Bill Gray
Subject: Re: Surrogate pairs for addwstr?
Date: Sun, 10 Oct 2021 11:38:22 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0
Hi Thomas, Tim,
On 10/9/21 7:04 PM, Tim Allen wrote:
Surrogate pairs only combine to create a single character in UTF-16
encoded data, or on platforms (Windows, Java, JavaScript, macOS Cocoa)
that use UTF-16 as an internal representation. Code-points in the
surrogate pair range are not allowed to appear in un-encoded Unicode
data, so if they show up, at best they'll be ignored, but they might
show up as blanks or as U+FFFD � REPLACEMENT CHARACTER.
ncurses' wide mode might use the locale's encoding (UTF-8, almost
universally) or might just hard-code UTF-8 as the internal
representation, since it's generally the best choice for the kind of
data ncurses handles. The behaviour you describe is within the range of
behaviour I'd expect.
Thank you. I see your points; in theory, U+D834 and U+DD1E
should only happen with UTF-16 data. And in theory, theory and
practice are the same thing. In practice, they aren't.
The other way to put this would be to ask: if you're on a
system with 32-bit wchar_ts, what should happen for this line?
mvaddwstr( 0, 2, L"\xd834\xdd1e Treble clef with a surrogate pair");
At least when I run it in xterm, the cursor advances twice for
the surrogate pair. I agree that doing so is certainly within what
you could expect for UTF-16. But it is slightly problematic, and I
don't see a drawback to recognizing the obvious intention and merging
the pair into U+1D11E.
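To make the merging concrete, here is a minimal sketch of the arithmetic. It is not the actual ncurses or PDCursesMod code; the function names are made up for illustration, and it works on uint32_t values so that it does not depend on the size of wchar_t. A high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF) is folded into one code point, and anything unpaired is passed through untouched. For U+1D11E, the pair is 0xD834, 0xDD1E.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* U+D800..U+DBFF: first (high) half of a UTF-16 surrogate pair. */
static int is_high_surrogate(uint32_t c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

/* U+DC00..U+DFFF: second (low) half of a UTF-16 surrogate pair. */
static int is_low_surrogate(uint32_t c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

/* Combine a valid high/low pair into the code point it encodes. */
static uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

/* Hypothetical helper: merge every surrogate pair in s[0..n-1] in place.
 * Unpaired surrogates are copied through unchanged.
 * Returns the new number of elements. */
static size_t merge_surrogate_pairs(uint32_t *s, size_t n)
{
    size_t in = 0, out = 0;

    while (in < n) {
        if (in + 1 < n && is_high_surrogate(s[in])
                       && is_low_surrogate(s[in + 1])) {
            s[out++] = combine_surrogates(s[in], s[in + 1]);
            in += 2;
        } else {
            s[out++] = s[in++];
        }
    }
    return out;
}
```

Calling merge_surrogate_pairs() on { 0xD834, 0xDD1E, ' ' } leaves { 0x1D11E, ' ' } in the array and returns 2.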
I will grant you that one can write, say,
#ifdef USING_UTF_16
#define TREBLE_CLEF L"\xd834\xdd1e"
#else
#define TREBLE_CLEF L"\x1d11e"
#endif
....
mvaddwstr( 0, 2, TREBLE_CLEF " Treble clef for your platform");
and work around the problem. (With, I _think_, USING_UTF_16
basically meaning "is sizeof( wchar_t) == 2", but you can't use
sizeof in an #if or #ifdef.) But for a cross-platform solution, it would
certainly be easier just to provide the surrogate pair.
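One way the size test might be spelled, as a sketch only: <stdint.h> and <wchar.h> define WCHAR_MAX as a constant usable in #if, so it can stand in for the sizeof check. This assumes that a 16-bit wchar_t implies UTF-16, which holds on Windows but is not guaranteed by the C standard. treble_clef_units() is a hypothetical helper, there only to show what the macro expands to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>   /* WCHAR_MAX, usable in #if */
#include <wchar.h>

/* Assumption: a wchar_t that tops out at 0xFFFF holds UTF-16 code units. */
#if WCHAR_MAX <= 0xFFFF
   #define TREBLE_CLEF L"\xd834\xdd1e"   /* surrogate pair for U+1D11E */
#else
   #define TREBLE_CLEF L"\x1d11e"        /* the code point itself */
#endif

/* Adjacent string literals are concatenated after escapes are processed. */
static const wchar_t treble_clef_label[] =
            TREBLE_CLEF L" Treble clef for your platform";

/* Number of wchar_t units the clef occupies:
 * 2 with a 16-bit wchar_t, 1 otherwise. */
static size_t treble_clef_units(void)
{
    size_t n = wcslen(TREBLE_CLEF);

    assert(n == 1 || n == 2);
    return n;
}
```

With that in place, the label can be handed to mvaddwstr() as before, without a separate USING_UTF_16 define.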
At present, I've got surrogate pairs combining regardless of
encoding in PDCursesMod; is there really a situation where I ought
to instead be displaying glyphs of some sort for U+D800 to U+DFFF?
-- Bill