[lmi] Fwd: Re: Help with unicode "REGISTERED SIGN"
From: Greg Chicares
Subject: [lmi] Fwd: Re: Help with unicode "REGISTERED SIGN"
Date: Wed, 07 Oct 2015 13:58:55 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.3.0
[resending because original was missing from html archive]
-------- Forwarded Message --------
Subject: Re: [lmi] Help with unicode "REGISTERED SIGN"
Date: Thu, 01 Oct 2015 15:54:52 +0000
From: Greg Chicares <address@hidden>
Reply-To: Let me illustrate... <address@hidden>
To: Vadim Zeitlin <address@hidden>, Let me illustrate... <address@hidden>
On 2015-10-01 13:18, Vadim Zeitlin wrote:
> On Thu, 01 Oct 2015 04:38:49 +0000 Greg Chicares <address@hidden> wrote:
[...]
> GC> I really, really want a "short" product name that contains
> GC> "REGISTERED SIGN".
You gave me what I want--what I really, really want...
> The short reason this doesn't work currently is that only ASCII characters
> are really supported in INPUT -- and this is not one of them.
I'm not sure whether "INPUT" here means
- generically speaking, any data fed into a program; or
- lmi's class Input
Just to be very clear, the string with "REGISTERED SIGN" comes from lmi's
product files. It does not come from class Input (lmi's data-entry GUI),
and we don't need GUI support right now for anything other than 7-bit
US-ASCII.
Of course, someday it would be nice to have UTF-8 everywhere. The product
files mentioned above can in principle be viewed and modified in the
product editor, but this comment in 'product_data.hpp'
// TODO ?? Most of the following are missing from the GUI.
precedes and applies to the new field for which we want "REGISTERED SIGN".
> To make the
> PDF generation code work with arbitrary UTF-8 characters, we need to decode
> it somewhere and currently we just don't do this at all AFAICS. The
> attached patch does it just for the PDF generation code which might be good
> enough if the fields containing non-ASCII characters are only used in the
> PDF output (as I suspect is the case), but they still wouldn't appear
> correctly elsewhere in the UI.
Patched, built successfully, and now I get:
UL SUPREME®
which is perfect, so the immediate problem is solved.
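The patch itself is not reproduced in this message. As a rough idea of what "decoding" means in the paragraph quoted above, here is a minimal sketch that turns UTF-8 bytes into Unicode code points using only the standard library. The name decode_utf8 is hypothetical and is not taken from lmi or from the patch; wxWidgets offers wxString::FromUTF8() for this kind of conversion, but whether the patch uses it is not shown here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into code points.
// Sketch only: malformed sequences become U+FFFD, and overlong
// encodings and surrogates are not rejected.
std::u32string decode_utf8(std::string const& s)
{
    std::u32string out;
    std::size_t i = 0;
    while(i < s.size())
        {
        unsigned char const lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if     ((lead & 0xE0) == 0xC0) {len = 2; cp = lead & 0x1F;}
        else if((lead & 0xF0) == 0xE0) {len = 3; cp = lead & 0x0F;}
        else if((lead & 0xF8) == 0xF0) {len = 4; cp = lead & 0x07;}
        else if(lead >= 0x80)          {cp = 0xFFFD;} // stray or invalid byte
        if(i + len > s.size())
            {
            out += char32_t(0xFFFD); // truncated sequence at end of input
            break;
            }
        bool ok = true;
        for(std::size_t k = 1; k < len; ++k)
            {
            unsigned char const c = static_cast<unsigned char>(s[i + k]);
            if((c & 0xC0) != 0x80) {ok = false; break;}
            cp = (cp << 6) | (c & 0x3F);
            }
        out += ok ? cp : char32_t(0xFFFD);
        i += ok ? len : 1;
        }
    return out;
}
```

Fed the eleven-character product name, whose last character occupies two bytes, this should yield eleven code points, the last being U+00AE.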
> GC> When the short product name is rendered in the group-quote PDF, here:
> GC>     wxString const image_text
> GC>         (report_data_.short_product_
> GC>         + "\nPremium & Benefit Summary"
> GC>         );
> GC> I want: UL SUPREME®
> GC> I get: UL SUPREMEÂ®
> GC> Probably if I say exactly what I'm doing you can spot my mistake.
>
> It's not your mistake but, immodest as it sounds, I actually saw it
> immediately after looking at the output above: the problem is that UTF-8
> input is being interpreted as ASCII, i.e. each of the two bytes in
> REGISTERED SIGN (U+00AE) UTF-8 representation is output separately.
Here's what still puzzles me: writing (U+00AE) as two bytes should give
0x00, 0xAE
so where does the "LATIN CAPITAL LETTER A WITH CIRCUMFLEX" come from?
Just curious.
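For what it's worth, the pair 0x00, 0xAE is what a 16-bit big-endian encoding would give; UTF-8 packs the bits differently. A minimal sketch of the standard UTF-8 bit layout follows, assuming no validation is needed; the name encode_utf8 is hypothetical and not part of lmi.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Sketch only: surrogates and values above U+10FFFF are not rejected.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if(cp < 0x80)
        {
        // 0xxxxxxx
        out += static_cast<char>(cp);
        }
    else if(cp < 0x800)
        {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    else if(cp < 0x10000)
        {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    else
        {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    return out;
}
```

With this layout, U+00AE falls in the two-byte range: its top two bits go into a lead byte 0xC2 and its low six bits into a continuation byte 0xAE, so the result is 0xC2 0xAE rather than 0x00 0xAE.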
> The especially confusing thing is that the second byte is the same as the
> Unicode code point of this character, so it looks like you've somehow got
> an "extra" "Â". But actually you didn't.
That's what I don't understand. I can see an "extra" "Â", but are
you saying it's a mirage? Again, I'm just curious: how does that
illusion arise? Why does it look like "Â" rather than, say, "Ϋ"?
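One plausible account, sketched here under the assumption that each byte is being read as a single ISO-8859-1 (Latin-1) character: the UTF-8 lead byte for U+00AE is 0xC2, and U+00C2 is "Â", whereas "Ϋ" is U+03AB and would need a different byte. The function below is hypothetical (not from lmi or the patch); it performs that byte-per-character misreading and re-encodes the result as UTF-8.

```cpp
#include <string>

// Hypothetical helper: treat every byte of 'bytes' as one Latin-1
// character and re-encode the result as UTF-8. Applying it to text
// that is already UTF-8 reproduces the "extra" character.
std::string latin1_to_utf8(std::string const& bytes)
{
    std::string out;
    for(unsigned char b : bytes)
        {
        if(b < 0x80)
            {
            // 7-bit ASCII is unchanged.
            out += static_cast<char>(b);
            }
        else
            {
            // U+0080..U+00FF always need two UTF-8 bytes.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
            }
        }
    return out;
}
```

Under that assumption, the two bytes 0xC2 0xAE come out as 0xC3 0x82 0xC2 0xAE, i.e. "Â" followed by "®", and the second byte only looks familiar because it happens to equal the code point.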
> GC> That code generates this product file:
> GC>
> GC> grep SUPREME /opt/lmi/data/sample.policy
> GC> <datum>UL SUPREME&#xAE;</datum>
> GC>
> GC> which looks good at first glance, but apparently "&#xAE;" there is not
> GC> an entity reference but instead literal characters { & # x A E }:
> GC>
> GC> /home/earl[0]$grep SUPREME /opt/lmi/data/sample.policy |od -t c
> GC> 0000000 < d a t u m > U L S U
> GC> 0000020 P R E M E & # x A E ; < / d a t
> GC>
> GC> Why? The product files are not HTML but XML, which doesn't reserve
> GC> "REGISTERED SIGN".
>
> Sorry, I'm not sure what you would expect to happen here?
Instead of this: S U P R E M E & # x A E ;
I expected this: S U P R E M E 0x0 0xAE
AIUI, libxml uses UTF-8, and XML substitutes entities only for
{" & ' < >}, so I thought 0x00AE would become just 0x0 0xAE
instead of "&reg;" or "&#xAE;".
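For reference, "&#xAE;" is a numeric character reference, which XML permits for any character, so a conforming parser should read it back as U+00AE. A minimal sketch of that expansion follows; the helper names are hypothetical, this is not how libxml is implemented, and only hexadecimal references of up to four digits are handled.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of a code point
// below U+10000 to 'out'.
void append_utf8(std::string& out, std::uint32_t cp)
{
    if(cp < 0x80)
        {
        out += static_cast<char>(cp);
        }
    else if(cp < 0x800)
        {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    else
        {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        }
}

// Hypothetical helper: replace each "&#xH;" .. "&#xHHHH;" reference
// with UTF-8 bytes. Decimal references and named entities such as
// "&amp;" are copied through unchanged.
std::string expand_hex_char_refs(std::string const& s)
{
    std::string out;
    std::size_t i = 0;
    while(i < s.size())
        {
        if(0 == s.compare(i, 3, "&#x"))
            {
            std::size_t const semi = s.find(';', i + 3);
            std::size_t const n = (std::string::npos == semi) ? 0 : semi - (i + 3);
            bool valid = 1 <= n && n <= 4;
            std::uint32_t cp = 0;
            for(std::size_t j = i + 3; valid && j < semi; ++j)
                {
                unsigned char const c = static_cast<unsigned char>(s[j]);
                if(std::isdigit(c))
                    cp = cp * 16 + static_cast<std::uint32_t>(c - '0');
                else if(std::isxdigit(c))
                    cp = cp * 16 + static_cast<std::uint32_t>(std::tolower(c) - 'a' + 10);
                else
                    valid = false;
                }
            if(valid)
                {
                append_utf8(out, cp);
                i = semi + 1;
                continue;
                }
            }
        out += s[i];
        ++i;
        }
    return out;
}
```

Applied to the <datum> line from the product file, this replaces the six characters of the reference with the two bytes 0xC2 0xAE, which is the raw UTF-8 form rather than 0x0 0xAE.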
> GC> And 0xC3 0x82 is..."LATIN CAPITAL LETTER A WITH CIRCUMFLEX". Well, yes,
> GC> so I had discerned. How do I get rid of it?
>
> The attached patch fixes the problem
Indeed it does. Thanks very much.
> [...] and feel free to skip the rest of this email if
> you don't have time to discuss the real problem right now (as, again, I
> suspect is the case).
I do agree that we should be using UTF-8, because characters outside
7-bit ASCII are useful and this is the twenty-first century.
AIUI, std::wstring is std::basic_string<wchar_t>, and wchar_t is two
bytes on msw, so it sounds like UTF-16. I'm not sure why that would be
preferable to something like Glib::ustring except that ustring is not
standard.
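To make the size comparison concrete, here is a small sketch using the fixed-width standard string types rather than wchar_t, whose size is implementation-defined (two bytes on msw, commonly four on GNU/Linux). The variable names are illustrative only.

```cpp
#include <string>

// REGISTERED SIGN (U+00AE) in two encodings.
std::string    const reg_utf8  = "\xC2\xAE"; // UTF-8: two 8-bit code units
std::u16string const reg_utf16 = u"\u00AE";  // UTF-16: one 16-bit code unit

// True where wchar_t is 16 bits wide, as on msw.
constexpr bool wchar_is_16_bit = (2 == sizeof(wchar_t));
```

The single UTF-16 code unit has the value 0x00AE, which written big-endian is the byte pair 0x00, 0xAE.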
But that (and my questions above) can wait--what's important is that
you've solved the immediate problem that we care about right now.