Re: [lmi] Help with unicode "REGISTERED SIGN"


From: Vadim Zeitlin
Subject: Re: [lmi] Help with unicode "REGISTERED SIGN"
Date: Thu, 1 Oct 2015 18:44:07 +0200

On Thu, 01 Oct 2015 15:54:52 +0000 Greg Chicares <address@hidden> wrote:

GC> On 2015-10-01 13:18, Vadim Zeitlin wrote:
GC> 
GC> >  The short reason this doesn't work currently is that only ASCII
GC> > characters are really supported in INPUT -- and this is not one of them.
GC> 
GC> I'm not sure whether "INPUT" here means
GC>  - generically speaking, any data fed into a program; or
GC>  - lmi's class Input

 I meant the data fed into the program from external sources (such as
files), but in this particular case the data does end up in the Input
class, which is why the word "input" seemed appropriate to me.

GC> >  It's not your mistake but, immodest as it sounds, I actually saw it
GC> > immediately after looking at the output above: the problem is that UTF-8
GC> > input is being interpreted as ASCII, i.e. each of the two bytes of the
GC> > UTF-8 representation of REGISTERED SIGN (U+00AE) is output separately.
GC> 
GC> Here's what still puzzles me: writing (U+00AE) as two bytes should give
GC>   0x00, 0xAE
GC> so where does the "LATIN CAPITAL LETTER A WITH CIRCUMFLEX" come from?
GC> Just curious.

 The input file contains "c2 ae", which is the UTF-8 representation of
U+00AE. When it is converted to wxString using the default encoding, which
is probably CP1252 (the Windows-specific equivalent of Latin-1), it
becomes "U+00C2 U+00AE" because each byte is converted individually
(CP1252, just like Latin-1, is a fixed-width encoding using one byte per
code point). And this is what you see, because U+00C2 is just "Â". However,
originally this byte wasn't a character at all: it was the lead byte of a
multibyte UTF-8 sequence. So while it is not a mirage, it's definitely a
conversion artefact.
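
 To make this concrete, here is a minimal standalone sketch (not lmi code,
just an illustration) of interpreting each UTF-8 byte as a Latin-1/CP1252
code point, which yields exactly the "Â®" pair:

    #include <cstdio>

    int main()
    {
        // The two bytes of the UTF-8 encoding of U+00AE (REGISTERED SIGN).
        unsigned char const utf8[] = {0xc2, 0xae};

        // Latin-1 maps byte value B directly to code point U+00B, so
        // 0xc2 shows up as U+00C2 ("Â") and 0xae as U+00AE ("®").
        for(int i = 0; i < 2; ++i)
        {
            unsigned int const b = utf8[i];
            std::printf("byte 0x%02x -> U+%04X\n", b, b);
        }
    }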

GC> Why does it look like "Â" rather than, say, "Ϋ"?

 Because "Ϋ", i.e. 'GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA', is U+03AB
and so, lying outside the single-byte range, can never result from
interpreting one byte of a UTF-8 sequence as Latin-1. Other characters
could, e.g. U+00C3 ("Ã") is also commonly seen when UTF-8 data is
misinterpreted as being Latin-1.

GC> > GC> which looks good at first glance, but apparently "&#xAE;" there is not
GC> > GC> an entity reference but instead literal characters { & # x A E }:
GC> > GC> 
GC> > GC> /home/earl[0]$grep SUPREME /opt/lmi/data/sample.policy |od -t c
GC> > GC> 0000000                   <   d   a   t   u   m   >   U   L       S   U
GC> > GC> 0000020   P   R   E   M   E   &   #   x   A   E   ;   <   /   d   a   t
GC> > GC> 
GC> > GC> Why? The product files are not HTML but XML, which doesn't reserve
GC> > GC> "REGISTERED SIGN".
GC> > 
GC> >  Sorry, I'm not sure what you would expect to happen here?
GC> 
GC> Instead of this: S   U   P   R   E   M   E   &    #   x   A   E   ;
GC> I expected this: S   U   P   R   E   M   E 0x0 0xAE
GC> 
GC> AIUI, libxml uses UTF-8, and XML substitutes entities only for
GC> {" & ' < >}, so I thought 0x00AE would become just 0x0 0xAE
GC> instead of "&reg;" or "&#xAE;".

 But you actually wrote yourself that the entity "&reg;" is not defined
for XML -- so that is why it's not used here; to use it, we'd need to also
provide its definition, which doesn't seem to be worth it. And the file is
not encoded in UTF-16-BE, which would be the only encoding in which it
could possibly contain the byte sequence "00 ae". I could be wrong here,
but I think using "c2 ae" could work here too; still, generating a
pure-ASCII file, with all other characters written as character
references, is definitely the safest option, so it doesn't surprise me at
all that this is done here, and I don't see anything wrong with it.
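
 In case it helps, here is a minimal libxml2 sketch (just an illustration
of the escaping behaviour, not the actual lmi serialization code; the file
name is arbitrary) showing how "&#xAE;" ends up in the output when the
requested encoding is ASCII:

    #include <libxml/tree.h>

    int main()
    {
        xmlDocPtr doc = xmlNewDoc(BAD_CAST "1.0");
        xmlNodePtr root = xmlNewNode(0, BAD_CAST "datum");
        xmlDocSetRootElement(doc, root);

        // In memory, libxml2 strings are UTF-8, so U+00AE is "c2 ae".
        xmlAddChild(root, xmlNewText(BAD_CAST "UL SUPREME\xc2\xae"));

        // Requesting ASCII output forces the non-ASCII code point to be
        // written as the character reference "&#xAE;"; requesting UTF-8
        // would write the raw "c2 ae" bytes instead.
        xmlSaveFormatFileEnc("sample.xml", doc, "ASCII", 1);

        xmlFreeDoc(doc);
        return 0;
    }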

GC> AIUI, std::wstring is templated on wchar_t, which is two bytes on msw,
GC> so it sounds like UTF-16. I'm not sure why that would be preferable
GC> to something like Glib::ustring except that ustring is not standard.

 Exactly because of this. One implication is that the conversion from
std::wstring to wxString is implicit, always correct (unlike with
std::string, whose encoding we have to guess), and fast, as the two use
the same encoding. Converting anything else to wxString would require
writing a lot of explicit conversions and would be less efficient too (and
this _might_ actually matter: conversions are not instantaneous,
especially when many long strings are being converted).
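
 A small sketch of what I mean (assuming wxWidgets is built with
wxUSE_STD_STRING enabled, which is the default; the string contents are
just an example):

    #include <wx/string.h>

    #include <string>

    void example()
    {
        // std::wstring and wxString use the same wchar_t representation
        // (UTF-16 under MSW), so this conversion is implicit and lossless.
        std::wstring w(L"UL SUPREME\u00ae");
        wxString s1 = w;

        // A narrow string has no intrinsic encoding, so the same bytes
        // give different results depending on the conversion chosen.
        char const* bytes = "UL SUPREME\xc2\xae"; // UTF-8 data
        wxString s2 = wxString::FromUTF8(bytes);  // correct: ends in "®"
        wxString s3 = wxString(bytes);            // current locale, e.g.
                                                  // CP1252: ends in "Â®"
    }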

 Also, when we do switch to C++11, we should probably use std::u32string
rather than ustring anyhow.
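
 E.g. something like this would then be possible (a minimal C++11 sketch;
std::wstring_convert was later deprecated in C++17, but it shows the
idea):

    #include <codecvt>
    #include <locale>
    #include <string>

    int main()
    {
        // Decode UTF-8 bytes into one char32_t per Unicode code point.
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
        std::u32string u = conv.from_bytes("UL SUPREME\xc2\xae");
        // Here u.back() == U'\u00ae': "c2 ae" decoded as one code point.
        return 0;
    }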

 Regards,
VZ
