
[lmi] Fwd: Re: Help with unicode "REGISTERED SIGN"


From: Greg Chicares
Subject: [lmi] Fwd: Re: Help with unicode "REGISTERED SIGN"
Date: Wed, 07 Oct 2015 13:56:52 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.3.0

[resending because original was missing from html archive]


-------- Forwarded Message --------
Subject: Re: [lmi] Help with unicode "REGISTERED SIGN"
Date: Thu, 1 Oct 2015 15:18:58 +0200
From: Vadim Zeitlin <address@hidden>
To: Let me illustrate... <address@hidden>
CC: Greg Chicares <address@hidden>

On Thu, 01 Oct 2015 04:38:49 +0000 Greg Chicares <address@hidden> wrote:

GC> [Copying Vadim again--sorry, mailing list remains sluggish.]
[It seems to work again now but I'm still cc'ing this to you just in case]

GC> I really, really want a "short" product name that contains "REGISTERED SIGN".

 The short reason this doesn't work currently is that only ASCII
characters are really supported in input -- and REGISTERED SIGN is not one
of them. To make the PDF generation code work with arbitrary UTF-8
characters, we need to decode them somewhere, and currently we just don't
do this at all AFAICS. The attached patch does it just for the PDF
generation code, which might be good enough if the fields containing
non-ASCII characters are only used in the PDF output (as I suspect is the
case), but they still wouldn't appear correctly elsewhere in the UI.


GC> When the short product name is rendered in the group-quote PDF, here:
GC>     wxString const image_text
GC>         (report_data_.short_product_
GC>          + "\nPremium & Benefit Summary"
GC>         );
GC> I want: UL SUPREME®
GC> I get:  UL SUPREMEÂ®
GC> Probably if I say exactly what I'm doing you can spot my mistake.

 It's not your mistake but, immodest as it sounds, I actually saw it
immediately after looking at the output above: the problem is that the
UTF-8 input is being interpreted as a one-byte-per-character encoding,
i.e. each of the two bytes of the UTF-8 representation of REGISTERED SIGN
(U+00AE) is output separately.

 The especially confusing thing is that the second byte is the same as the
Unicode code point of this character, so it looks like you've somehow got
an "extra" "Â". But actually you didn't.
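
 To make the byte-level picture concrete, here is a tiny standalone sketch
(not lmi code) that dumps the bytes in question:

    #include <cstdio>

    int main()
    {
        // REGISTERED SIGN (U+00AE) is 0xC2 0xAE in UTF-8. Rendered one
        // byte at a time in a Latin-1-style encoding, 0xC2 shows up as
        // 'Â' and 0xAE as '®' -- hence the apparent extra 'Â'.
        char const* utf8 = "\xC2\xAE";
        for(char const* p = utf8; *p; ++p)
            std::printf("0x%02X ", static_cast<unsigned char>(*p));
        std::printf("\n"); // prints: 0xC2 0xAE
        return 0;
    }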

GC> That code generates this product file:
GC>
GC> grep SUPREME /opt/lmi/data/sample.policy
GC>     <datum>UL SUPREME&#xAE;</datum>
GC>
GC> which looks good at first glance, but apparently "&#xAE;" there is not
GC> an entity reference but instead literal characters { & # x A E }:
GC>
GC> /home/earl[0]$grep SUPREME /opt/lmi/data/sample.policy |od -t c
GC> 0000000                   <   d   a   t   u   m   >   U   L       S   U
GC> 0000020   P   R   E   M   E   &   #   x   A   E   ;   <   /   d   a   t
GC>
GC> Why? The product files are not HTML but XML, which doesn't reserve
GC> "REGISTERED SIGN".

 Sorry, I'm not sure what you would expect to happen here?

GC> And 0xC3 0x82 is..."LATIN CAPITAL LETTER A WITH CIRCUMFLEX". Well, yes,
GC> so I had discerned. How do I get rid of it?

 The attached patch fixes the problem (tested with a different field, but
this really shouldn't matter). It's simple and hopefully complete, though
I could have missed something; this is another place where it would have
been great to have the compiler detect my errors for me, but currently it
can't. Please let me know if you have any problems with it, other than its
being a horrible hack, and feel free to skip the rest of this email if you
don't have time to discuss the real problem right now (as, again, I
suspect is the case).
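
 For concreteness, the change in the group-quote code quoted above is of
this general shape (a sketch of the idea only; the actual attached patch
may touch more call sites):

    wxString const image_text
        (wxString::FromUTF8(report_data_.short_product_.c_str())
         + "\nPremium & Benefit Summary"
        );

wxString::FromUTF8() decodes the UTF-8 bytes explicitly instead of letting
the implicit std::string-to-wxString conversion handle them one byte at a
time.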


 But, just for reference, this patch is clearly a bad solution. The real
problem is that we work with std::strings containing UTF-8 data, and we
must be careful not to forget to decode them before doing anything with
them.

 A first idea might be to avoid UTF-8 and use some legacy encoding like
Latin-1 (a.k.a. ISO-8859-1) for the product files instead, as it does
contain U+00AE. I'm happy to say that, in addition to being totally
misguided (as you can see, Unicode is important even for US-only software,
and using anything other than UTF-8 for external textual data has been a
horrible idea for the last 10+ years), this just plain doesn't work:
whatever the encoding of the input file, libxml2 always returns its
contents encoded in UTF-8. So we just have to admit that we've got to deal
with UTF-8 (which is, again, a good thing, as we're not even tempted by
any lesser hacks, since they wouldn't work anyhow).
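
 To see the last point concretely, here is a minimal standalone sketch
(not lmi code; the file name sample.policy is just an example) showing
that libxml2 hands back UTF-8 regardless of the on-disk encoding:

    #include <libxml/parser.h>
    #include <cstdio>

    int main()
    {
        // Whatever encoding the XML prolog declares, libxml2's in-memory
        // representation (xmlChar*) is by definition UTF-8.
        xmlDoc* doc = xmlReadFile("sample.policy", NULL, 0);
        if(!doc) return 1;
        xmlChar* text = xmlNodeGetContent(xmlDocGetRootElement(doc));
        std::printf("%s\n", reinterpret_cast<char const*>(text));
        xmlFree(text);
        xmlFreeDoc(doc);
        return 0;
    }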

 So the question is where it should be decoded. For me, the ideal would be
to do it directly when loading the data. However, this would require two
big changes:

1. We would need to use std::wstring instead of std::string for the field
   values.
2. We would need to have a working way to convert from UTF-8 to wchar_t
   (which, helpfully, can be either UTF-16 or UTF-32 in C++, of course)
   not involving wxWidgets as this needs to be done in non-GUI code.

Realistically, I think the only way for this to happen is to switch to
C++11, as the currently used MinGW 3.4.5 compiler doesn't have any wchar_t
support at all, so it would be much better to upgrade it before doing
anything Unicode-related. And if we do, and go directly to C++11, then
we'd get std::codecvt_utf8 "for free" and could just use it.
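
 For instance, with C++11 the non-GUI conversion could be as simple as
this (a minimal sketch, assuming a conforming <codecvt> implementation):

    #include <codecvt>
    #include <locale>
    #include <string>

    // Decode a UTF-8 std::string into a std::wstring without wxWidgets.
    // Caveat: with 16-bit wchar_t (Windows), std::codecvt_utf8 covers
    // only the BMP, which is enough for U+00AE; covering all of UTF-16
    // would need std::codecvt_utf8_utf16 instead.
    std::wstring decode_utf8(std::string const& s)
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
        return conv.from_bytes(s);
    }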

 Needless to say, this is not going to happen in the immediate future
(although the reasons to do it sooner rather than later do keep piling up).
Until then we have to live with std::string containing UTF-8 and ensure
that it is decoded correctly before being used. Unfortunately I don't see
any good way to do this everywhere: the conversion from std::string to
wxString is implicit, and while it's very convenient (and indeed not only
convenient but always the right thing to do when using std::wstring, so it
will be very useful to us if/when we switch from std::string to
std::wstring), it has the big disadvantage of being invisible in the code,
so we can't easily find all the places where it happens. The only way I do
see is to make the implicit conversion do the right thing for us, i.e.
always interpret std::string as being UTF-8-encoded, by setting the global
wxConvLibc "variable" to wxConvUTF8. But, as mentioned above, this is not
a _good_ way: wxConvLibc, as its name indicates, is used for everything
related to the standard library, i.e. all strings passed to and returned
by standard library functions. And as the C runtime never uses a UTF-8
locale under Windows, those functions wouldn't work correctly for any
strings containing non-ASCII characters. And changing the global
wxConvLibc before/after every standard function call just doesn't seem
very appealing.
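
 A less invasive variant of the same idea is to name the conversion
explicitly at each std::string/wxString boundary instead of changing the
global, e.g. (a sketch; to_wx is a hypothetical helper, not existing lmi
code):

    #include <wx/string.h>
    #include <string>

    // Convert a UTF-8 std::string to wxString explicitly, bypassing the
    // implicit conversion through wxConvLibc.
    wxString to_wx(std::string const& utf8)
    {
        return wxString(utf8.c_str(), wxConvUTF8);
    }

This avoids touching wxConvLibc, but sprinkling such calls everywhere has
the same drawback as the FromUTF8() approach discussed below.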

 The final solution is even more global: build wxWidgets with the
--enable-utf8 option so that all strings are stored as UTF-8 internally.
In theory this should make everything work correctly, but in practice the
UTF-8 build is not well-tested, especially under Windows (where it makes
less sense, as the native encoding is UTF-16, unlike with e.g. GTK+, which
uses UTF-8), so I wouldn't be surprised if we ran into some problems.

 I think that switching to std::wstring is clearly the best solution, but
if we need something sooner than we can implement it (which, again,
depends on the compiler upgrade), then using the UTF-8 build of wxWidgets
is probably the best we can do: littering all the code with FromUTF8()
calls, as the attached patch does, is not only ugly but also very
error-prone, as we're just bound to miss some.

 Anyhow, please let me know if this patch helps at least for now,
VZ



Attachment: 0001-Interpret-all-input-strings-as-being-in-UTF-8-in-gro.patch
Description: Text document

