


From: Benjamin Kowarsch
Subject: Re: Inquiring about the status of the proposed UNICODE library
Date: Thu, 7 Mar 2024 23:05:36 +0900



On Thu, 7 Mar 2024 at 18:02, Alice Osako wrote:

> Even if this were not the case, supporting only a single UNICODE
> encoding is potentially problematic, even if it is pretty common.

A standard should pick only what is absolutely necessary; it should not be an egg-laying wool-milk sow that tries to do everything at once.

In our revision we defined type UNICODE as a set of 32-bit values in ISO 10646 UCS-4 representation.

With this representation, every possible Unicode code point can be represented, and there are no variable-length codes, which would make scanning strings and accessing characters by index extremely cumbersome.

On modern hardware there is absolutely no need to save a few bytes when processing text in memory. The need for compactness only arises when transmitting text over communications channels and when storing text on persistent storage media. For this, the IO system can be designed to do a conversion to and from UTF-8 on the fly.

Consequently there is also no need to perform string operations on UTF-8 text in memory.
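
To illustrate the point, here is a minimal ISO Modula-2 sketch (the names Octet, CharAt and UTF8Offset are illustrative only, not from our revision): with fixed-width UCS-4 the i-th character is a direct array access, whereas with UTF-8 the same lookup requires scanning over lead and continuation octets.

MODULE IndexDemo;

FROM STextIO IMPORT WriteString, WriteLn;
FROM SWholeIO IMPORT WriteCard;

TYPE
  Octet = CARDINAL [0..255];

(* UCS-4: every character occupies exactly one 32-bit cell,
   so the i-th character is a direct array access. *)
PROCEDURE CharAt ( VAR s : ARRAY OF CARDINAL; i : CARDINAL ) : CARDINAL;
BEGIN
  RETURN s[i]
END CharAt;

(* UTF-8: locating the i-th character requires a scan from the start,
   skipping continuation octets of the form 10xxxxxx.
   Assumes well-formed input and i within range. *)
PROCEDURE UTF8Offset ( VAR s : ARRAY OF Octet; i : CARDINAL ) : CARDINAL;
VAR
  pos, count : CARDINAL;
BEGIN
  pos := 0; count := 0;
  WHILE count < i DO
    REPEAT
      INC(pos)
    UNTIL (s[pos] < 128) OR (s[pos] >= 192);
    INC(count)
  END;
  RETURN pos (* octet offset of the i-th character *)
END UTF8Offset;

VAR
  ucs4 : ARRAY [0..2] OF CARDINAL;
  utf8 : ARRAY [0..3] OF Octet;
BEGIN
  (* "aéz" as UCS-4 code points, and as UTF-8 octets 61, C3 A9, 7A *)
  ucs4[0] := 97; ucs4[1] := 233; ucs4[2] := 122;
  utf8[0] := 61H; utf8[1] := 0C3H; utf8[2] := 0A9H; utf8[3] := 7AH;
  WriteString("UCS-4, character 2 : ");
  WriteCard(CharAt(ucs4, 2), 1); WriteLn;     (* 122, direct access *)
  WriteString("UTF-8, octet offset of character 2 : ");
  WriteCard(UTF8Offset(utf8, 2), 1); WriteLn  (* 3, found by scanning *)
END IndexDemo.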

And if another format is required for compatibility with legacy systems, it should be provided in the form of a library. It should not be part of the language, especially not when the language is based on a philosophy of simplicity.


> Indeed, the fact that JSON is defined with UTF-8 as the sole supported
> encoding is where I got to this point in the first place.

<snip>

> Just contemplating how I would implement UTF-8's variable-size encoding
> is intimidating enough, never mind grasping the intricacies of UNICODE
> code points across multiple encodings.

I would recommend the same approach we chose in our revision:

* read and write UTF-8 from and to disk
* convert on the fly between UTF-8 and in-memory format
* use ISO 10646 UCS-4 as in-memory format when processing text

If any other encodings are needed, convert between in-memory UCS-4 and those formats while doing IO.
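
For the decoding direction of that on-the-fly conversion, a minimal ISO Modula-2 sketch might look like the following. DecodeUTF8 and Octet are again illustrative names, not the actual IO library of our revision, and handling of malformed input is omitted for brevity.

MODULE UTF8Demo;

FROM STextIO IMPORT WriteString, WriteLn;
FROM SWholeIO IMPORT WriteCard;

TYPE
  Octet = CARDINAL [0..255];

(* Decodes one UTF-8 sequence starting at buf[pos], returns the
   code point and advances pos past the sequence. *)
PROCEDURE DecodeUTF8
  ( VAR buf : ARRAY OF Octet; VAR pos : CARDINAL ) : CARDINAL;
VAR
  lead, cp, trailing, i : CARDINAL;
BEGIN
  lead := buf[pos]; INC(pos);
  IF lead < 128 THEN                 (* 0xxxxxxx : one octet *)
    RETURN lead
  ELSIF lead >= 240 THEN             (* 11110xxx : four octets *)
    cp := lead MOD 8; trailing := 3
  ELSIF lead >= 224 THEN             (* 1110xxxx : three octets *)
    cp := lead MOD 16; trailing := 2
  ELSE                               (* 110xxxxx : two octets *)
    cp := lead MOD 32; trailing := 1
  END;
  FOR i := 1 TO trailing DO          (* 10xxxxxx continuation octets *)
    cp := cp * 64 + buf[pos] MOD 64;
    INC(pos)
  END;
  RETURN cp
END DecodeUTF8;

VAR
  input : ARRAY [0..5] OF Octet;
  pos : CARDINAL;
BEGIN
  (* "A", "é" and the euro sign in UTF-8: 41, C3 A9, E2 82 AC *)
  input[0] := 41H;  input[1] := 0C3H; input[2] := 0A9H;
  input[3] := 0E2H; input[4] := 082H; input[5] := 0ACH;
  pos := 0;
  WHILE pos <= 5 DO
    WriteCard(DecodeUTF8(input, pos), 1); WriteLn (* 65, 233, 8364 *)
  END
END UTF8Demo.

The encoding direction is symmetrical: depending on the magnitude of the code point, emit one, two, three or four octets.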

For sorting on strings (both CHAR and UNICHAR), we chose to define an overloadable collation table of 92 ordinal values. A new built-in function COLLATION() takes a CHAR or UNICHAR value and returns its index in the collation order defined by the overloadable collation table. Tabulator and linefeed are treated as whitespace, all other control codes are ignored, and for any accented letter the collation index of the non-accented letter is returned. All other characters return their code point value.

https://github.com/m2sf/m2bsk/wiki/Language-Specification-(9)-:-Predefined-Identifiers#collation
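
As a rough illustration of the lookup-table principle, consider the sketch below. It is illustrative only, not M2BSK's actual 92-entry table nor the built-in COLLATION(); the table here covers ASCII only and omits the accent folding described above.

MODULE CollationDemo;

FROM STextIO IMPORT WriteString, WriteLn;
FROM SWholeIO IMPORT WriteCard;

CONST
  TableSize = 128;
  Ignore = 0; (* sentinel: character does not participate in collation *)

VAR
  table : ARRAY [0..TableSize-1] OF CARDINAL;

(* Returns the collation index for code point cp. *)
PROCEDURE Collation ( cp : CARDINAL ) : CARDINAL;
BEGIN
  IF cp < TableSize THEN
    RETURN table[cp]
  ELSE
    RETURN cp (* outside the table: fall back to the code point value *)
  END
END Collation;

PROCEDURE InitTable;
VAR
  c : CARDINAL;
BEGIN
  FOR c := 0 TO TableSize-1 DO
    table[c] := Ignore (* control codes are ignored by default *)
  END;
  table[9] := 1; table[10] := 1; table[32] := 1; (* TAB, LF, SPACE *)
  (* digits before letters; letters case-insensitive for illustration *)
  FOR c := ORD('0') TO ORD('9') DO table[c] := 2 + c - ORD('0') END;
  FOR c := ORD('A') TO ORD('Z') DO table[c] := 12 + c - ORD('A') END;
  FOR c := ORD('a') TO ORD('z') DO table[c] := 12 + c - ORD('a') END
END InitTable;

BEGIN
  InitTable;
  WriteString("collation index of 'a' : ");
  WriteCard(Collation(ORD('a')), 1); WriteLn; (* 12 *)
  WriteString("collation index of 'A' : ");
  WriteCard(Collation(ORD('A')), 1); WriteLn  (* 12, same as 'a' *)
END CollationDemo.

A string comparison routine would then compare collation indices character by character, skipping any character whose table entry carries the ignore sentinel.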

However, this does not permit lexicographical sorting for non-Latin character sets. We will need to add some extension for those in the future, likely based on the same principle (overloadable collation tables).

In any event, lookup tables are the most straightforward approach to determining the sort value of a character, no matter what encoding system and no matter what character set is used.

Good luck.

regards
benjamin
