Re: BUG? RFE? printf lacking unicode support in multiple areas
From: Eric Blake
Subject: Re: BUG? RFE? printf lacking unicode support in multiple areas
Date: Fri, 20 May 2011 14:49:22 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110428 Fedora/3.1.10-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.10
On 05/20/2011 02:30 PM, Linda Walsh wrote:
> i.e. it's showing me a 16-bit value: 0x203c, which I thought would be the
> wide-char value for the double-exclamation. Going from the wchar definition
> on NT, it is a 16-bit value. Perhaps it is different under POSIX? but
> 0x203c taken as 32 bits with 2 high bytes of zeros would seem to specify
> the same codepoint for the Dbl-EXcl.
POSIX allows wchar_t to be either 2 bytes or 4 bytes wide, but only a
4-byte wchar_t can represent all of Unicode. With a 2-byte wchar_t, as
on Windows or Cygwin, you are inherently restricted from using any
Unicode character above 0xffff if you want to maintain POSIX
compliance, since each character must fit in a single wchar_t.
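To make the 0xffff limit concrete, here is a rough Python sketch. It assumes UTF-16 as a stand-in for a 2-byte wchar_t and UTF-32 as a stand-in for a 4-byte wchar_t; the function names are mine, not part of any library. It counts how many fixed-size units a single code point needs in each.

```python
# Sketch only: UTF-16 and UTF-32 are used here as stand-ins for a
# 2-byte and a 4-byte wchar_t (an assumption for illustration).

def utf16_units(codepoint: int) -> int:
    """Number of 16-bit units needed to hold the code point in UTF-16."""
    return len(chr(codepoint).encode("utf-16-le")) // 2


def utf32_units(codepoint: int) -> int:
    """Number of 32-bit units needed to hold the code point in UTF-32."""
    return len(chr(codepoint).encode("utf-32-le")) // 4


DOUBLE_EXCLAMATION = 0x203C  # U+203C, the character discussed in this thread
GRINNING_FACE = 0x1F600      # an example code point above 0xffff

if __name__ == "__main__":
    for cp in (DOUBLE_EXCLAMATION, GRINNING_FACE):
        print(f"U+{cp:04X}: {utf16_units(cp)} x 16-bit, {utf32_units(cp)} x 32-bit")
```

U+203C fits in one 16-bit unit, so 0x203c is the same code point whether it is stored in 16 or 32 bits. A code point above 0xffff needs two 16-bit units, which is exactly what "one wchar_t per character" rules out.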
>
>> Since there is no way to produce a word containing a NUL character it is
>> impossible to support %lc in any useful way.
> ----
> That's annoying. How can one print out unicode characters
> that are supposed to be 1 char long?
I think you are misunderstanding the difference between wide characters
(exactly one wchar_t per character) and multi-byte characters (1 or more
char [byte] per character).
Unicode can be represented in two different ways. One way is with wide
characters (every wchar_t holds exactly one Unicode code point, and
code points < 0x100 have embedded NUL bytes if you view the memory
containing those wchar_t as an array of bytes). The other way is with
multi-byte encodings, such as UTF-8 (every character occupies a variable
number of bytes, and the only character that can contain an embedded NUL
byte is the NUL character at code point 0).
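The difference is easy to see by dumping the bytes. The following Python sketch assumes little-endian UTF-32 as a stand-in for a 4-byte wchar_t array; the helper names are made up for this example.

```python
# Sketch only: UTF-32-LE stands in for an array of 4-byte wchar_t on a
# little-endian machine (an assumption for illustration).

def wide_bytes(text: str) -> bytes:
    """Bytes of the text when every character is one fixed-width unit."""
    return text.encode("utf-32-le")


def multibyte_bytes(text: str) -> bytes:
    """Bytes of the text in the variable-width UTF-8 encoding."""
    return text.encode("utf-8")


def has_nul(data: bytes) -> bool:
    """True if the byte string contains at least one NUL byte."""
    return 0 in data


if __name__ == "__main__":
    for ch in ("A", "\u203c"):
        print(f"U+{ord(ch):04X}",
              "wide:", wide_bytes(ch).hex(" "), "NUL" if has_nul(wide_bytes(ch)) else "",
              "| UTF-8:", multibyte_bytes(ch).hex(" "),
              "NUL" if has_nul(multibyte_bytes(ch)) else "")
```

Both "A" and U+203C carry NUL bytes in the wide form, while their UTF-8 forms (one byte and three bytes respectively) carry none.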
Bash _only_ uses multi-byte characters for input and output. %lc only
uses wchar_t. Since wchar_t output is not useful for a shell that does
not do input in wchar_t, that explains why bash printf need not support
%lc. POSIX doesn't require it, at any rate, but it also doesn't forbid
it as an extension.
> This isn't just a bash problem given how well most of the unix "character"
> utils work with unicode -- that's something that really needs to be solved
> if those character utils are going to continue to be _as useful_ in the
> future.
> Sure they will have their current functionality which is of use in many
> ways, but for anyone not processing ASCII text it becomes a problem, but
> this isn't really a bash is.
Most utilities that work with Unicode work with UTF-8 (that is, with
multi-byte characters that occupy a variable number of bytes), and NOT
with wide characters (that is, with every character occupying the same
fixed width). But you can switch between encodings using the iconv(1)
utility, so converting from one encoding type to another shouldn't
really be a problem in practice.
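As an illustration of that kind of conversion, here is a small Python sketch that re-encodes bytes from one encoding to another, roughly what an invocation like `iconv -f UTF-8 -t UTF-32LE` does on a stream. The `convert` helper is hypothetical, and the encoding names are Python codec names rather than iconv's.

```python
# Sketch only: a hypothetical helper that mimics the idea behind iconv(1)
# by decoding from one encoding and re-encoding in another.

def convert(data: bytes, from_enc: str, to_enc: str) -> bytes:
    """Re-encode data from from_enc to to_enc (Python codec names)."""
    return data.decode(from_enc).encode(to_enc)


if __name__ == "__main__":
    utf8 = "\u203c".encode("utf-8")               # multi-byte form of U+203C
    wide = convert(utf8, "utf-8", "utf-32-le")    # fixed-width form
    back = convert(wide, "utf-32-le", "utf-8")    # and back again
    print(utf8.hex(" "), "->", wide.hex(" "), "->", back.hex(" "))
```

The round trip is lossless because both encodings can represent every code point; only the byte layout changes.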
> That said, it was my impression that a wchar was 16-bits (at least it
> is on MS. Is it different under POSIX?
POSIX allows 16-bit wchar_t, but if you have a 16-bit wchar_t, you
cannot support all of Unicode.
--
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org