
Re: [lmi] Help with unicode "REGISTERED SIGN"


From: Greg Chicares
Subject: Re: [lmi] Help with unicode "REGISTERED SIGN"
Date: Tue, 06 Oct 2015 16:51:01 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.3.0

On 2015-10-01 13:18, Vadim Zeitlin wrote:
> On Thu, 01 Oct 2015 04:38:49 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> [Copying Vadim again--sorry, mailing list remains sluggish.]
> [It seems to work again now but I'm still cc'ing this to you just in case]

And I don't see this message in either the September or the October
archives, where a search for "REGISTERED" finds only this:
  http://lists.nongnu.org/archive/html/lmi/2015-10/msg00000.html
so I won't snip any of this valuable discussion. The only new things
I have to say for now are that the attached patch has been committed
20151006T1648Z (revision 6324), and that I'm not sure whether we'll
also want the following one-line change that I had in my local tree:

     wxString const image_text
-        (report_data_.company_
+        (wxString::FromUTF8(report_data_.short_product_.c_str())
          + "\nPremium & Benefit Summary"

Nothing new below, just quotation...

> GC> I really, really want a "short" product name that contains
> GC> "REGISTERED SIGN".
> 
>  The short reason this doesn't work currently is that only ASCII
> characters are really supported in INPUT, and this is not one of them. To
> make the PDF generation code work with arbitrary UTF-8 characters, we need
> to decode the UTF-8 somewhere, and currently we just don't do this at all
> AFAICS. The attached patch does it just for the PDF generation code, which
> might be good enough if the fields containing non-ASCII characters are
> only used in the PDF output (as I suspect is the case), but they still
> wouldn't appear correctly elsewhere in the UI.
> 
> 
> GC> When the short product name is rendered in the group-quote PDF, here:
> GC>     wxString const image_text
> GC>         (report_data_.short_product_
> GC>          + "\nPremium & Benefit Summary"
> GC>         );
> GC> I want: UL SUPREME®
> GC> I get:  UL SUPREMEÂ®
> GC> Probably if I say exactly what I'm doing you can spot my mistake.
> 
>  It's not your mistake but, immodest as it sounds, I actually saw it
> immediately after looking at the output above: the problem is that UTF-8
> input is being interpreted as ASCII, i.e. each of the two bytes in the
> UTF-8 representation of REGISTERED SIGN (U+00AE) is output separately.
> 
>  The especially confusing thing is that the second byte is the same as the
> Unicode code point of this character, so it looks like you've somehow got
> an "extra" "Â". But actually you didn't.
> 
> GC> That code generates this product file:
> GC> 
> GC> grep SUPREME /opt/lmi/data/sample.policy
> GC>     <datum>UL SUPREME&#xAE;</datum>
> GC> 
> GC> which looks good at first glance, but apparently "&#xAE;" there is not
> GC> an entity reference but instead literal characters { & # x A E }:
> GC> 
> GC> /home/earl[0]$grep SUPREME /opt/lmi/data/sample.policy |od -t c
> GC> 0000000                   <   d   a   t   u   m   >   U   L       S   U
> GC> 0000020   P   R   E   M   E   &   #   x   A   E   ;   <   /   d   a   t
> GC> 
> GC> Why? The product files are not HTML but XML, which doesn't reserve
> GC> "REGISTERED SIGN".
> 
>  Sorry, I'm not sure what you would expect to happen here.
> 
> GC> And 0xC3 0x82 is..."LATIN CAPITAL LETTER A WITH CIRCUMFLEX". Well, yes,
> GC> so I had discerned. How do I get rid of it?
> 
>  The attached patch fixes the problem (tested with a different field, but
> this really shouldn't matter). It's simple and hopefully complete (though
> I could have missed something; this is another place where it would have
> been great to have the compiler detect my errors for me, but currently it
> can't). Please let me know if you have any problems with it, other than it
> being a horrible hack, and feel free to skip the rest of this email if you
> don't have time to discuss the real problem right now (as, again, I
> suspect is the case).
> 
> 
>  But, just for reference, this patch is clearly a bad solution. The real
> problem is that we work with std::strings containing UTF-8 data, and we
> must be careful not to forget to decode them before doing anything with
> them.
> 
>  A first idea might be to avoid using UTF-8 and use some legacy encoding
> like Latin-1 (a.k.a. ISO-8859-1) for the product files instead, as it does
> contain U+00AE. I'm happy to say that, in addition to being totally
> misguided (as you can see, Unicode matters even for US-only software, and
> using anything other than UTF-8 for external textual data has been a
> horrible idea for the last 10+ years), this just plain doesn't work:
> whatever the encoding of the input file, libxml2 always returns its
> contents encoded in UTF-8. So we just have to admit that we've got to
> deal with UTF-8 (which is, again, a good thing, as we're not even tempted
> by any lesser hacks, since they wouldn't work anyhow).
> 
>  So the question is where it should be decoded. For me, the ideal would
> be to do it directly when loading the data. However, this would require
> two big changes:
> 
> 1. We would need to use std::wstring instead of std::string for the field
>    values.
> 2. We would need to have a working way to convert from UTF-8 to wchar_t
>    (which, helpfully, can be either UTF-16 or UTF-32 in C++, of course)
>    not involving wxWidgets as this needs to be done in non-GUI code.
> 
> Realistically, I think the only way for this to happen is to switch to
> C++11 as the currently used MinGW 3.4.5 compiler doesn't have any wchar_t
> support at all, so it would be much better to upgrade it before doing
> anything Unicode-related. And if we do, and go directly to C++11, then we'd
> get std::codecvt_utf8 "for free" and could just use it.
> 
>  Needless to say, this is not going to happen in the immediate future
> (although the reasons to do it sooner rather than later keep piling up).
> Until then we have to live with std::string containing UTF-8 and ensure
> that it is decoded correctly before being used. Unfortunately I don't see
> any good way to do this everywhere: the conversion from std::string to
> wxString is implicit, and while that's very convenient (and is always the
> right thing to do when using std::wstring, so it will be very useful to
> us if/when we switch from std::string to wstring), it has the big
> disadvantage of being invisible in the code, so we can't easily find it.
> The only way I do see is to make the implicit conversion do the right
> thing for us, i.e. always interpret std::string as being UTF-8-encoded,
> by setting the global wxConvLibc "variable" to wxConvUTF8. But, as
> mentioned above, this is not a _good_ way: wxConvLibc, as its name
> indicates, is used for everything related to the standard library, i.e.
> all strings passed to and returned by standard library functions. As
> those functions never use a UTF-8 locale under Windows, they would no
> longer work correctly for any strings containing non-ASCII characters.
> And changing the global wxConvLibc before/after calling any standard
> function just doesn't seem very appealing.
> 
>  A final solution is even more global: build wxWidgets with the
> --enable-utf8 option so that all strings are UTF-8 internally. In theory
> this should make everything work correctly, but in practice the UTF-8
> build is not well-tested, especially under Windows (where it makes less
> sense, as the native encoding is UTF-16, unlike with e.g. GTK+ which uses
> UTF-8), so I wouldn't be surprised if we ran into some problems.
> 
>  I think that switching to std::wstring is clearly the best solution, but
> if we need something sooner than we can implement it (which, again,
> depends on the compiler upgrade), then using the UTF-8 build of wxWidgets
> is probably the best we can do: littering all the code with FromUTF8()
> calls, as the attached patch does, is not only ugly but also very
> error-prone, as we're bound to miss some.
> 
>  Anyhow, please let me know if this patch helps at least for now,
> VZ
> 



