Re: [lmi] Help with unicode "REGISTERED SIGN"


From: Greg Chicares
Subject: Re: [lmi] Help with unicode "REGISTERED SIGN"
Date: Thu, 01 Oct 2015 18:42:23 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.3.0

On 2015-10-01 16:44, Vadim Zeitlin wrote:
> On Thu, 01 Oct 2015 15:54:52 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> On 2015-10-01 13:18, Vadim Zeitlin wrote:
> GC> 
> GC> >  The short reason this doesn't work currently is that only ASCII
> GC> > characters are really supported in INPUT -- and this is not one of them.
> GC> 
> GC> I'm not sure whether "INPUT" here means
> GC>  - generically speaking, any data fed into a program; or
> GC>  - lmi's class Input
> 
>  I meant the data fed into the program from external sources (such as
> files), but in this particular case this data does end up in the Input
> class, which is why the word "input" seemed to be appropriate to me.

Just to be sure there's no confusion...

The string that contains "REGISTERED SIGN" actually resides in class
product_data, and is copied thence into class LedgerInvariant, which
class group_quote_pdf_generator_wx uses to create a PDF file. That's
the only usage I was trying to support.

I'm not trying to allow end users to enter anything but ASCII in
Input::Comments, e.g., to be echoed in that group-quote "Summary" box.
(It's okay if that becomes possible as a result of this change:
I don't want to forbid it, but it's not my goal to allow it either.)
Specifically, I'm not concerned with the way non-ASCII characters
flow through the lmi MVC classes, class Input, and wx.

> GC> Here's what still puzzles me: writing (U+00AE) as two bytes should give
> GC>   0x00, 0xAE
> GC> so where does the "LATIN CAPITAL LETTER A WITH CIRCUMFLEX" come from?
> GC> Just curious.

U+00AE   "REGISTERED SIGN"
  C2AE   UTF-8 encoding: "C2" is just the UTF-8 "leading byte" 11000010
 C2 AE <-> "Â®" when those two bytes are misread in a single-byte latin
           codepage, where C2 is "Â": it works like an erroneous
           reinterpret_cast
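
To make the byte arithmetic concrete, here is a minimal sketch, not lmi code: the helper names encode_utf8() and misread_as_latin1() are hypothetical, and it assumes the misreading codepage is ISO-8859-1, where each byte value equals its code point. It encodes a code point as UTF-8, then reinterprets every resulting byte as a separate latin character.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
std::string encode_utf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::string s;
    if (cp < 0x80) {
        s += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: leading byte 110xxxxx, continuation byte 10xxxxxx.
        s += static_cast<char>(0xC0 | (cp >> 6));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        s += static_cast<char>(0xE0 | (cp >> 12));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (cp >> 18));
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return s;
}

// Hypothetical helper: treat each byte as an ISO-8859-1 character
// (assumption: byte value == code point) and re-encode the result
// as UTF-8 so that it can be displayed.
std::string misread_as_latin1(std::string const& bytes)
{
    std::string s;
    for (unsigned char b : bytes) {
        s += encode_utf8(b);
    }
    return s;
}
```

Under those assumptions, encode_utf8(0x00AE) yields the two bytes C2 AE, and misread_as_latin1() turns them into U+00C2 U+00AE, i.e. "Â®". By the same arithmetic, U+03AB encodes as CE AB, so its leading byte would be misread as a different latin letter.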

> GC> Why does it look like "Â" rather than, say, "Ϋ"?
> 
>  Because "Ϋ", i.e. 'GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA' is U+03AB
> and so can't appear as a prefix of UTF-8 sequence.

Now I understand. That's why Cyrillic used to look like a random
assortment of latin capital vowels with diacritics with a latin
codepage (and conversely Russians saw кракозябры). I can't tell
you how long I've wondered why those particular letters appeared.
Now I see that Â, Ã, etc. just happen to be placed in a latin
codepage in locations corresponding to UTF-8 leading-byte bit
patterns. I really appreciate your taking the time to answer my
бНОПНЯ.



