sed-devel

Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode


From: Eli Zaretskii
Subject: Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode
Date: Thu, 19 Jan 2017 18:26:30 +0200

> From: Assaf Gordon <address@hidden>
> Date: Thu, 19 Jan 2017 00:46:47 -0500
> 
> > 5. How do we handle MinGW and Cygwin where wchar_t is 16 bits, vs. 32
> > bits just about everywhere else?
> 
> The parsing is the same (i.e. "\uHHHH" to internal 'unsigned int' or 'ucs4_t' 
> from gnulib).
> 
> The conversion to multibyte will use gnulib's (or the system's native)
> widechar-to-multibyte functions.
> 
> In case of cygwin/mingw, an extra step of converting the 'uint32' to two
> 'uint16' is needed, and then two calls for wctomb are needed.

I don't see how this could work: AFAIK the MS-Windows wctomb accepts a
single wchar_t value, so it can only support Unicode codepoints inside
the BMP.  You cannot call it with 2 wchar_t values one after the other
to get support for the full Unicode range.  (This is relevant to
MinGW; I think Cygwin doesn't have this problem.)
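
To make the problem concrete, here is a minimal sketch of the split
from one 32-bit codepoint into UTF-16 code units.  The function name
ucs4_to_utf16 is made up for illustration; it is not taken from
gnulib, sed, or the Windows runtime.  For a codepoint outside the BMP,
each of the two resulting values is a surrogate, which is not a
character on its own, so a function that converts one wchar_t per call
has nothing meaningful to convert.

```c
#include <stdint.h>

/* Hypothetical helper: split a Unicode codepoint into UTF-16 code units.
   Returns the number of units written to OUT (1 or 2), or 0 if CP is
   not a valid Unicode scalar value.  */
static int
ucs4_to_utf16 (uint32_t cp, uint16_t out[2])
{
  /* Reject values beyond U+10FFFF and lone surrogates.  */
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;

  /* Inside the BMP: a single 16-bit unit is enough.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }

  /* Outside the BMP: 20 payload bits split into two surrogates.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

For example, U+1F600 splits into 0xD83D and 0xDE00; neither of those
is a codepoint that can be converted by itself.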

Really, to have decent support for Unicode on MS-Windows, you will
need to abandon the Windows runtime support for wchar_t, and instead
use your own 32-bit data type and conversion functions.
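
As a rough sketch of what such a conversion function could look like,
the following encodes a 32-bit codepoint directly as UTF-8 without
going through wchar_t or wctomb.  The name ucs4_to_utf8 and its
interface are assumptions for illustration only, not an existing API.
It covers UTF-8 output only; converting to some other codeset would
need separate code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode codepoint CP as UTF-8 into OUT.
   Returns the number of bytes written (1 to 4), or 0 if CP is not a
   valid Unicode scalar value.  */
static size_t
ucs4_to_utf8 (uint32_t cp, unsigned char out[4])
{
  /* Reject values beyond U+10FFFF and surrogates.  */
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;

  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}
```

With this, U+20AC comes out as the bytes E2 82 AC and U+1F600 as
F0 9F 98 80, regardless of how wide wchar_t is on the platform.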

One more quirk of MS-Windows is that no locale can use UTF-8 as its
codeset, so the assumption of "UTF-8 locale everywhere" is not useful
on Windows.


