Re: char type in Octave

From:

Michael D Godfrey

Subject:

Re: char type in Octave
Date:

Sun, 20 May 2018 07:43:48 +0100

User-agent:

Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0

On 05/17/2018 10:55 PM, Rik wrote:

On 05/17/2018 04:05 AM, address@hidden wrote:
Subject:
Re: Handle encoding of Octave strings

From:
mmuetzel <address@hidden>

Date:
05/17/2018 03:52 AM

To:
address@hidden

What does Matlab do?  If your choice is different, I am sure that we
will see bug reports about it.
In Matlab:
 str = 'aäbc'
str =
aäbc
str(1)
ans =
a
str(2)
ans =
ä
str(3)
ans =
b
str(4)
ans =
c
whos str
  Name      Size            Bytes  Class    Attributes
  str       1x4                 8  char               


So in Matlab one "char" has a size of 2 bytes. On the other hand, in Octave
one "char" has 1 byte.
This is a known difference. Matlab stores chars as 16-bit code units (like wchar_t on Windows; wchar_t is 32 bits on most Linux systems), rather than regular 8-bit chars.
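The byte counts above can be reproduced outside either program. What follows is a minimal sketch in Python, not Octave or Matlab code; the helper name code_units is made up for illustration. It counts the storage units that the same four-character string needs in UTF-8 and in UTF-16:

```python
def code_units(text, encoding):
    """Return (number of code units, number of bytes) for text in encoding."""
    # Size in bytes of one code unit for the two encodings discussed here.
    unit_size = {"utf-8": 1, "utf-16-le": 2}[encoding]
    raw = text.encode(encoding)
    return len(raw) // unit_size, len(raw)


s = "a\u00e4bc"  # the string 'aäbc' from the session above

print(code_units(s, "utf-8"))      # (5, 5): ä takes two 1-byte units
print(code_units(s, "utf-16-le"))  # (4, 8): every character is one 2-byte unit
```

The UTF-16 result matches the 1x4 size and 8 bytes that whos reports in Matlab. With UTF-8 storage the same string occupies 5 units, so index 2 lands on the first byte of ä rather than on the whole character.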
Do we want to change the way Octave stores its char class? Initially I was
in favor of keeping the relation of 1 byte = 1 char (hence using UTF-8). But
it would make indexing more straightforward if we changed to UTF-16 (1
"char" = 2 bytes), at least for the BMP (Basic Multilingual Plane), which
encompasses characters from most current scripts.
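The BMP caveat can be made concrete with a small sketch, again in Python rather than Octave; utf16_units is a hypothetical helper. A character inside the BMP is a single 16-bit unit, while a character outside it is encoded as a surrogate pair, so "one index = one character" no longer holds there:

```python
def utf16_units(text):
    """Return the list of 16-bit UTF-16 code units that encode text."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


bmp_char = "\u00e4"         # ä, inside the BMP
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_units(bmp_char)])     # ['0xe4']
print([hex(u) for u in utf16_units(astral_char)])  # ['0xd834', '0xdd1e']
```

Indexing the second string by code unit would return half of a surrogate pair, which is the same kind of problem UTF-8 indexing has for ä, only pushed out to rarer characters.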

A first step towards this could be to add "from_u8", "to_u8", ("from_u16",
"to_u16") methods to our char class. 
Then we would need to identify all places in the code where we construct
char arrays from external sources (.m files, terminal, reading from files,
...) and where we pass strings to external sources (library functions,
writing to files, ...).
When this is done we might be able to switch the internal representation
from C-"char" to "uint16_t" without breaking everything...
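The idea of converting only at the boundaries could look roughly like the sketch below. It is a toy model in Python, not Octave's C++ char class: the class name CharArray is invented, and the behavior given to from_u8 and to_u8 is an assumption about what such methods might do (UTF-8 bytes on the outside, 16-bit code units on the inside).

```python
class CharArray:
    """Toy stand-in for a char array stored as 16-bit code units."""

    def __init__(self, units):
        self.units = list(units)

    @classmethod
    def from_u8(cls, data):
        """Build the array from UTF-8 bytes (e.g. read from a file or terminal)."""
        raw = data.decode("utf-8").encode("utf-16-le")
        return cls(int.from_bytes(raw[i:i + 2], "little")
                   for i in range(0, len(raw), 2))

    def to_u8(self):
        """Return UTF-8 bytes to hand to an external library or file."""
        raw = b"".join(u.to_bytes(2, "little") for u in self.units)
        return raw.decode("utf-16-le").encode("utf-8")

    def __len__(self):
        return len(self.units)


external = "a\u00e4bc".encode("utf-8")  # 5 bytes, as they would arrive from outside
s = CharArray.from_u8(external)

print(len(s))                   # 4
print(hex(s.units[1]))          # 0xe4
print(s.to_u8() == external)    # True
```

Every interface that reads or writes text would need one of these two calls, which is where the bulk of the work described above would go.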

Do you think that this is feasible?
If we want perfect compatibility we may be driven this way, but it will be a lot of work. Part of the point of Octave is to rely on good quality code found in external libraries, so there are a lot of interfaces (regexp in PCRE, file operations in stdlib, font rendering libraries, external programs via pipes like gnuplot, etc.). Is the gain in compatibility going to be worth the pain of implementing this?

--Rik

The arguments against include:

1. A LOT of work.
2. Residual induced bugs lasting probably for years.
3. Compatibility with UTF-8 packages, etc.

Does anyone know what the specific Matlab compatibility cases are?

Michael
