octave-maintainers

Re: Unicode support in io Forge package


From: Markus Mützel
Subject: Re: Unicode support in io Forge package
Date: Sat, 19 Oct 2019 22:26:50 +0200

On 19 October 2019 at 20:35, "Andrew Janke" wrote:
> The io code uses native2unicode as an alternative if it's available,
> using a feature test. Here's an example from xls2oct.m:
>
>
>    ## Convert from UTF-8 and strip characters that are not supported by Octave
>    ## (any chars < 32 or > 255).
>    if (! strcmp (xls.xtype, "COM") && (spsh_opts.convert_utf))
>      if (exist ("native2unicode", "file"))
>        conv_fcn = @(str) unicode2native (native2unicode (str, "UTF-8"));
>      else
>        conv_fcn = @utf82unicode;
>      endif
>      rawarr = tidyxml (rawarr, conv_fcn);
>    endif
>
> This is leaving me even more confused: I'm not sure what the round trip
> through both native2unicode and unicode2native accomplishes, especially
> since native2unicode converts from the specified code page to UTF-8, so
> doing native2unicode(str, "UTF-8") should basically be a no-op.
>
> Putting aside the first native2unicode call, I _think_ the use of
> unicode2native here is incorrect, because even on Windows, Octave's
> internal strings are now UTF-8 and not the system default code page. I'm
> going to do some more research and set up some test spreadsheets, but I
> suspect all the encoding conversion logic here should just be removed.
>

Please ignore my previous messages.
I think you are right! I also believe the encoding conversion logic should be 
removed completely. The XML in the .xlsx files is encoded in UTF-8 (always?), 
and that is also Octave's internal encoding. No transcoding should be done at all.
The code was originally introduced for bug #49222:
https://savannah.gnu.org/bugs/?49222
It's embarrassing to re-read how I initially completely misunderstood the 
issue and came up with a fix that seemed to work (on a Western-locale Windows 
system) back then.
If I correctly understand the last few comments, the problem was (or is?) that 
UTF-8 encoded strings weren't displayed correctly on legacy Windows. But I 
don't think that the io package should interfere with the encoding of the 
strings it reads to work around this.
If this is still an issue (it isn't on Windows 10), it should be resolved 
differently.
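
To illustrate why no transcoding is needed, here is a minimal sketch. It assumes
an Octave version that ships native2unicode and unicode2native and that stores
strings internally as UTF-8. The sample string and the variable names are made
up for illustration and are not taken from the io package.

```octave
## "Mützel" written out as UTF-8 bytes, so the example does not depend
## on the encoding of the source file.
utf8_bytes = uint8 ([77 195 188 116 122 101 108]);

## Decoding UTF-8 input yields a char array holding the very same bytes,
## because Octave's internal encoding is already UTF-8.
str = native2unicode (utf8_bytes, "UTF-8");
assert (double (str(:).'), double (utf8_bytes));

## Encoding back to UTF-8 is likewise a no-op at the byte level.
round_trip = unicode2native (str, "UTF-8");
assert (round_trip(:).', utf8_bytes);

## Transcoding to a legacy code page, by contrast, changes the bytes:
## "ü" (195 188 in UTF-8) becomes the single byte 252 in ISO-8859-1.
latin1_bytes = unicode2native (str, "ISO-8859-1");
assert (latin1_bytes(:).', uint8 ([77 252 116 122 101 108]));
```

Under those assumptions the assertions should pass silently. The last one shows
the kind of change a conversion to a single-byte code page makes to text that
was already valid UTF-8.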

Markus




