Re: [lmi] Help with unicode "REGISTERED SIGN"
From: Greg Chicares
Subject: Re: [lmi] Help with unicode "REGISTERED SIGN"
Date: Thu, 01 Oct 2015 18:42:23 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.3.0
On 2015-10-01 16:44, Vadim Zeitlin wrote:
> On Thu, 01 Oct 2015 15:54:52 +0000 Greg Chicares <address@hidden> wrote:
>
> GC> On 2015-10-01 13:18, Vadim Zeitlin wrote:
> GC>
> GC> > The short reason this doesn't work currently is that only ASCII
> GC> > characters are really supported in INPUT -- and this is not one
> GC> > of them.
> GC>
> GC> I'm not sure whether "INPUT" here means
> GC> - generically speaking, any data fed into a program; or
> GC> - lmi's class Input
>
> I meant the data fed into the program from external sources (such as
> files), but in this particular case this data does end up in the Input
> class, which is why the word "input" seemed to be appropriate to me.
Just to be sure there's no confusion...
The string that contains "REGISTERED SIGN" actually resides in class
product_data, and is copied thence into class LedgerInvariant, which
class group_quote_pdf_generator_wx uses to create a PDF file. That's
the only usage I was trying to support.
I'm not trying to allow end users to enter anything but ASCII in
Input::Comments, e.g., to be echoed in that group-quote "Summary" box.
(It's okay if that becomes possible as a result of this change:
I don't want to forbid it, but it's not my goal to allow it either.)
Specifically, I'm not concerned with the way non-ASCII characters
flow through the lmi MVC classes, class Input, and wx.
> GC> Here's what still puzzles me: writing (U+00AE) as two bytes should give
> GC> 0x00, 0xAE
> GC> so where does the "LATIN CAPITAL LETTER A WITH CIRCUMFLEX" come from?
> GC> Just curious.
U+00AE  "REGISTERED SIGN"
C2 AE   its UTF-8 encoding: "C2" is just the UTF-8 "leading byte" 11000010
C2 AE <-> "Â®": reading those two bytes in a Latin codepage works like
an erroneous reinterpret_cast
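The byte-level reading above can be checked with a minimal Python sketch. It assumes the "Latin codepage" in question behaves like ISO-8859-1 for these two bytes; the variable names are illustrative only and nothing here is lmi code.

```python
# Encode U+00AE as UTF-8, then misread the bytes as Latin-1 (assumed codepage).
REGISTERED = "\u00ae"                      # REGISTERED SIGN

encoded = REGISTERED.encode("utf-8")       # two bytes: C2 AE
lead_bits = format(encoded[0], "08b")      # bit pattern of the leading byte
misread = encoded.decode("latin-1")        # each byte becomes one character

print(" ".join(f"{b:02X}" for b in encoded))  # C2 AE
print(lead_bits)                              # 11000010
print(misread)                                # Â®
```

The misreading loses nothing: encoding `misread` back to Latin-1 and decoding as UTF-8 recovers the original character, which is why it resembles a bad cast rather than data corruption.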
> GC> Why does it look like "Â" rather than, say, "Ϋ"?
>
> Because "Ϋ", i.e. 'GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA' is U+03AB
> and so can't appear as a prefix of UTF-8 sequence.
Now I understand. That's why Cyrillic used to look like a random
assortment of Latin capital vowels with diacritics when viewed in a
Latin codepage (and conversely Russians saw кракозябры). I can't tell
you how long I've wondered why those particular letters appeared.
Now I see that Â, Ã, etc. just happen to be placed in a Latin
codepage in locations corresponding to UTF-8 leading-byte bit
patterns. I really appreciate your taking the time to answer my
бНОПНЯ.
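As a rough illustration of the leading-byte observation, the Python sketch below lists what the two-byte UTF-8 leading bytes (0xC2 through 0xDF) look like in ISO-8859-1, and shows which of them Cyrillic text produces. The helper `as_latin1` is a hypothetical name, and ISO-8859-1 is assumed as the Latin codepage. The last line rests on a further assumption: that the closing word above is "Вопрос" written in CP1251 and read as KOI8-R.

```python
def as_latin1(text: str) -> str:
    """Show UTF-8 bytes the way a Latin-1 viewer would (hypothetical helper)."""
    return text.encode("utf-8").decode("latin-1")

# Every leading byte of a two-byte UTF-8 sequence, viewed as Latin-1.
lead_glyphs = bytes(range(0xC2, 0xE0)).decode("latin-1")

# U+03AB itself never appears; its leading byte 0xCE shows up as U+00CE.
upsilon_seen = as_latin1("\u03ab")

# Cyrillic letters encode with leading bytes 0xD0/0xD1, so every other
# character of the garbled text is one of just two Latin letters.
cyrillic_leads = set(as_latin1("кракозябры")[::2])

# Assumption: the sign-off is "Вопрос" in CP1251 misread as KOI8-R.
sign_off = "Вопрос".encode("cp1251").decode("koi8_r")

print(lead_glyphs)
print(upsilon_seen, sorted(cyrillic_leads), sign_off)
```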