octave-maintainers
Re: char type in Octave


From: Rik
Subject: Re: char type in Octave
Date: Thu, 17 May 2018 14:55:02 -0700

On 05/17/2018 04:05 AM, address@hidden wrote:
Subject: Re: Handle encoding of Octave strings
From: mmuetzel <address@hidden>
Date: 05/17/2018 03:52 AM

What does Matlab do?  If your choice is different, I am sure that we
will see bug reports about it.
In Matlab:
 str = 'aäbc'
str =
aäbc
str(1)
ans =
a
str(2)
ans =
ä
str(3)
ans =
b
str(4)
ans =
c
whos str
  Name      Size            Bytes  Class    Attributes
  str       1x4                 8  char               


So in Matlab one "char" has a size of 2 bytes. On the other hand, in Octave
one "char" has 1 byte.
This is a known difference.  Matlab stores each char as a 16-bit code unit (UTF-16), whereas Octave uses a regular 8-bit char.  (Note that wchar_t is only 16 bits wide on Windows; on Linux it is typically 32 bits.)
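
To make the size difference concrete, here is a small sketch in Python (chosen only because it makes the byte arithmetic easy to check; it is not Octave or Matlab code). It assumes 'ä' is the precomposed character U+00E4 and compares the same four-character string in UTF-8 and UTF-16:

```python
# The string from the Matlab session above, with 'ä' written as U+00E4.
s = "a\u00e4bc"

utf8 = s.encode("utf-8")        # 8-bit code units, like an Octave char
utf16 = s.encode("utf-16-le")   # 16-bit code units, no byte order mark

n_utf8_units = len(utf8)          # 5: 'ä' needs two bytes in UTF-8
n_utf16_units = len(utf16) // 2   # 4: one unit per character here
n_utf16_bytes = len(utf16)        # 8: matches the "Bytes" column of whos

# Taking the 2nd element of the byte sequence splits 'ä' in half ...
second_byte = utf8[1:2]
# ... while the 2nd 16-bit unit is the whole character.
second_unit = utf16[2:4].decode("utf-16-le")
```

With 8-bit storage the second element is only the lead byte of 'ä', which is why str(2) cannot return the same thing as in the Matlab session.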

Do we want to change the way Octave stores its char class? Initially I was
in favor of keeping the relation of 1 byte = 1 char (hence using UTF-8). But
it would make indexing more straightforward if we changed to UTF-16 (1
"char" = 2 bytes), at least when it comes to the BMP (Basic Multilingual
Plane), which encompasses characters from most current scripts.
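
The BMP caveat can be illustrated with a short sketch, again in Python rather than Octave. The helper name utf16_units is made up for this example. It lists the 16-bit code units of a string and shows that a character outside the BMP occupies two of them (a surrogate pair), so "one index = one character" only holds inside the BMP:

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

bmp = "\u00e4"           # ä, inside the BMP
astral = "\U0001D11E"    # MUSICAL SYMBOL G CLEF, outside the BMP

bmp_units = utf16_units(bmp)        # a single code unit
astral_units = utf16_units(astral)  # a high and a low surrogate

# Three characters, but four 16-bit elements to index.
mixed_len = len(utf16_units("a" + astral + "b"))
```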

A first step towards this could be to add "from_u8", "to_u8", ("from_u16",
"to_u16") methods to our char class. 
Then we would need to identify all places in the code where we construct
char arrays from external sources (.m files, terminal, reading from files,
...) and where we pass strings to external sources (library functions,
writing to files, ...).
When this is done we might be able to switch the internal representation
from C-"char" to "uint16_t" without breaking everything...
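
As a rough illustration of the boundary-conversion idea, the following Python sketch models a char array that stores 16-bit code units and converts only at the edges. The class CharArray is hypothetical and is not Octave's actual char class; only the method names from_u8 and to_u8 are taken from the proposal above. Indexing is 1-based to mimic Octave.

```python
class CharArray:
    """Hypothetical char array holding UTF-16 code units (not Octave's API)."""

    def __init__(self, units):
        self.units = list(units)

    @classmethod
    def from_u8(cls, data):
        # Entry point for external sources: UTF-8 bytes in, 16-bit units stored.
        raw = data.decode("utf-8").encode("utf-16-le")
        return cls(int.from_bytes(raw[i:i + 2], "little")
                   for i in range(0, len(raw), 2))

    def to_u8(self):
        # Exit point for external consumers: 16-bit units back to UTF-8 bytes.
        raw = b"".join(u.to_bytes(2, "little") for u in self.units)
        return raw.decode("utf-16-le").encode("utf-8")

    def __len__(self):
        return len(self.units)

    def __getitem__(self, index):
        # 1-based element access, one 16-bit unit per element.
        return CharArray([self.units[index - 1]])


external = "a\u00e4bc".encode("utf-8")   # e.g. bytes read from a .m file
s = CharArray.from_u8(external)
count = len(s)                 # number of 16-bit elements
second = s[2].to_u8()          # UTF-8 bytes of the 2nd element
round_trip = s.to_u8() == external
```

In this sketch, to_u8 raises an error if an element is one half of a surrogate pair, which is one of the corner cases a real implementation would have to decide how to handle.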

Do you think that this is feasible?

If we want perfect compatibility we may be driven this way, but it will be a lot of work.  Part of the point of Octave is to rely on good quality code found in external libraries, so there are a lot of interfaces (regexp in PCRE, file operations in stdlib, font rendering libraries, external programs via pipes like gnuplot, etc.).  Is the gain in compatibility going to be worth the pain of implementing this?

--Rik
