Re: [lmi] Help with unicode "REGISTERED SIGN"


From: Greg Chicares
Subject: Re: [lmi] Help with unicode "REGISTERED SIGN"
Date: Thu, 01 Oct 2015 15:54:52 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.3.0

On 2015-10-01 13:18, Vadim Zeitlin wrote:
> On Thu, 01 Oct 2015 04:38:49 +0000 Greg Chicares <address@hidden> wrote:
[...]
> GC> I really, really want a "short" product name that contains
> GC> "REGISTERED SIGN".

You gave me what I want--what I really, really want...

>  The short reason this doesn't work currently is that only ASCII characters
> are really supported in INPUT -- and this is not one of them.

I'm not sure whether "INPUT" here means
 - generically speaking, any data fed into a program; or
 - lmi's class Input.
Just to be very clear, the string with "REGISTERED SIGN" comes from lmi's
product files. It does not come from class Input (lmi's data-entry GUI),
and we don't need GUI support right now for anything other than 7-bit
US-ASCII.

Of course, someday it would be nice to have UTF-8 everywhere. The product
files mentioned above can in principle be viewed and modified in the
product editor, but this comment in 'product_data.hpp'
    // TODO ?? Most of the following are missing from the GUI.
precedes and applies to the new field for which we want "REGISTERED SIGN".

> To make the
> PDF generation code work with arbitrary UTF-8 characters, we need to decode
> it somewhere and currently we just don't do this at all AFAICS. The
> attached patch does it just for the PDF generation code which might be good
> enough if the fields containing non-ASCII characters are only used in the
> PDF output (as I suspect is the case), but they still wouldn't appear
> correctly elsewhere in the UI.

Patched, built successfully, and now I get:
  UL SUPREME®
which is perfect, so the immediate problem is solved.
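
A rough sketch of the kind of decoding step involved, assuming the
product-file string reaches the PDF code as UTF-8 bytes in a std::string.
This is not the patch itself: decode_utf8 is a hypothetical helper written
only for illustration, and in wx code something like wxString::FromUTF8()
would presumably do this job instead.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: turn UTF-8 bytes into Unicode code points.
// Deliberately simple: it rejects bad lead bytes, bad continuation bytes,
// and truncated sequences, but does not reject overlong encodings.
std::u32string decode_utf8(std::string const& s)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char const lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (s.size() - i <= extra) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char const c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F); // append six payload bits
        }
        out += cp;
        i += extra + 1;
    }
    assert(i == s.size()); // every byte was consumed exactly once
    return out;
}
```

Given the eleven-byte-plus-one string "UL SUPREME" followed by 0xC2 0xAE,
this yields eleven code points, the last being U+00AE, rather than twelve
separate one-byte characters.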

> GC> When the short product name is rendered in the group-quote PDF, here:
> GC>     wxString const image_text
> GC>         (report_data_.short_product_
> GC>          + "\nPremium & Benefit Summary"
> GC>         );
> GC> I want: UL SUPREME®
> GC> I get:  UL SUPREMEÂ®
> GC> Probably if I say exactly what I'm doing you can spot my mistake.
> 
>  It's not your mistake but, immodest as it sounds, I actually saw it
> immediately after looking at the output above: the problem is that UTF-8
> input is being interpreted as ASCII, i.e. each of the two bytes in
> REGISTERED SIGN (U+00AE) UTF-8 representation is output separately.

Here's what still puzzles me: writing (U+00AE) as two bytes should give
  0x00, 0xAE
so where does the "LATIN CAPITAL LETTER A WITH CIRCUMFLEX" come from?
Just curious.

>  The especially confusing thing is that the second byte is the same as the
> Unicode code point of this character, so it looks like you've somehow got
> an "extra" "Â". But actually you didn't.

That's what I don't understand. I can see an "extra" "Â", but are
you saying it's a mirage? Again, I'm just curious: how does that
illusion arise? Why does it look like "Â" rather than, say, "Ϋ"?
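
For reference, a minimal sketch of how the "extra" character could arise,
assuming (as the explanation quoted above suggests) that U+00AE is stored
as UTF-8 and each resulting byte is then displayed as a separate one-byte
character, e.g. as ISO-8859-1. The names encode_utf8 and misread_as_latin1
are hypothetical; they are not lmi or wx functions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
std::string encode_utf8(char32_t cp)
{
    // Only valid scalar values: at most U+10FFFF, and not a surrogate.
    assert(cp <= 0x10FFFF && !(0xD800 <= cp && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: treat every byte as one ISO-8859-1 character,
// where the byte value is also the Unicode code point.
std::vector<char32_t> misread_as_latin1(std::string const& bytes)
{
    std::vector<char32_t> cps;
    for (unsigned char b : bytes) {
        cps.push_back(b);
    }
    return cps;
}
```

With these, encode_utf8(0xAE) gives the two bytes 0xC2 0xAE, not 0x00 0xAE.
Misreading them one byte at a time gives U+00C2 ("Â") followed by U+00AE
("®"): the second byte happens to equal the code point, so the first looks
like an extra character. Encoding U+00C2 itself gives 0xC3 0x82.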

> GC> That code generates this product file:
> GC> 
> GC> grep SUPREME /opt/lmi/data/sample.policy
> GC>     <datum>UL SUPREME&#xAE;</datum>
> GC> 
> GC> which looks good at first glance, but apparently "&#xAE;" there is not
> GC> an entity reference but instead literal characters { & # x A E }:
> GC> 
> GC> /home/earl[0]$grep SUPREME /opt/lmi/data/sample.policy |od -t c
> GC> 0000000                   <   d   a   t   u   m   >   U   L       S   U
> GC> 0000020   P   R   E   M   E   &   #   x   A   E   ;   <   /   d   a   t
> GC> 
> GC> Why? The product files are not HTML but XML, which doesn't reserve
> GC> "REGISTERED SIGN".
> 
>  Sorry, I'm not sure what you would expect to happen here?

Instead of this: S   U   P   R   E   M   E   &    #   x   A   E   ;
I expected this: S   U   P   R   E   M   E 0x0 0xAE

AIUI, libxml uses UTF-8, and XML substitutes entities only for
{" & ' < >}, so I thought 0x00AE would become just 0x0 0xAE
instead of "&reg;" or "&#xAE;".
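
In XML terms "&#xAE;" is a numeric character reference rather than a named
entity reference, and od shows the file as written, before any parser has
replaced the reference with the character it denotes. A small sketch of
that replacement step, under the assumption that only the two XML forms
"&#NNN;" and "&#xHHH;" need handling; parse_numeric_char_ref and
digit_value are hypothetical names, not libxml functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Value of one digit, or -1 if c is not a digit in the chosen base.
int digit_value(char c, bool hex)
{
    if ('0' <= c && c <= '9') return c - '0';
    if (hex && 'a' <= c && c <= 'f') return c - 'a' + 10;
    if (hex && 'A' <= c && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: return the code point named by a numeric character
// reference such as "&#xAE;" or "&#174;", or -1 if s is not one.
long parse_numeric_char_ref(std::string const& s)
{
    if (s.size() < 4 || s[0] != '&' || s[1] != '#' || s.back() != ';') {
        return -1;
    }
    bool const hex = (s[2] == 'x'); // XML allows only a lowercase 'x'
    std::size_t const first = hex ? 3 : 2;
    std::size_t const last = s.size() - 1; // index of the ';'
    if (first >= last) {
        return -1; // no digits at all, e.g. "&#x;"
    }
    long value = 0;
    for (std::size_t i = first; i < last; ++i) {
        int const d = digit_value(s[i], hex);
        if (d < 0) {
            return -1;
        }
        assert(d < 16);
        value = value * (hex ? 16 : 10) + d;
        if (value > 0x10FFFF) {
            return -1; // beyond the Unicode range
        }
    }
    return value;
}
```

Both "&#xAE;" and "&#174;" map to code point 0xAE here, while "&reg;" is
rejected because it is a named entity, which plain XML does not predefine.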

> GC> And 0xC3 0x82 is..."LATIN CAPITAL LETTER A WITH CIRCUMFLEX". Well, yes,
> GC> so I had discerned. How do I get rid of it?
> 
>  The attached patch fixes the problem

Indeed it does. Thanks very much.

> [...] and feel free to skip the rest of this email if
> you don't have time to discuss the real problem right now (as, again, I
> suspect is the case).

I do agree that we should be using UTF-8, because characters outside
7-bit ASCII are useful and this is the twenty-first century.

AIUI, std::wstring is templated on wchar_t, which is two bytes on msw,
so it sounds like UTF-16. I'm not sure why that would be preferable
to something like Glib::ustring except that ustring is not standard.
But that (and my questions above) can wait--what's important is that
you've solved the immediate problem that we care about right now.
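
As a footnote to the wchar_t remark, a minimal sketch of UTF-16 encoding,
assuming 16-bit code units such as wchar_t on msw (wchar_t is typically
32 bits on GNU/Linux, so this uses std::uint16_t instead). The name
encode_utf16 is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(char32_t cp)
{
    // Only valid scalar values: at most U+10FFFF, and not a surrogate.
    assert(cp <= 0x10FFFF && !(0xD800 <= cp && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit equal to the code point.
        return {static_cast<std::uint16_t>(cp)};
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    char32_t const v = cp - 0x10000;
    return
        {static_cast<std::uint16_t>(0xD800 + (v >> 10))
        ,static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))
        };
}
```

Here U+00AE becomes the single unit 0x00AE, which is where a byte pair like
0x00 0xAE would come from in big-endian UTF-16, whereas a code point above
U+FFFF needs two units.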



