gm2
Re: Inquiring about the status of the proposed UNICODE library


From: Benjamin Kowarsch
Subject: Re: Inquiring about the status of the proposed UNICODE library
Date: Fri, 8 Mar 2024 17:45:51 +0900


On Fri, 8 Mar 2024 at 06:36, Alice Osako wrote:
 
I think I will fork the Lilley code as a starting point for a library implementation, modifying it to use UTF-32 rather than UTF-16. This will mean using the GPL v.3 as its license, of course, but I have no objection to that. While it is certainly not necessary so long as I follow the GPL requirements, I would feel much better about this if I were to hear from Chris Lilley about it.

I think you are overthinking the task at hand.

Write down your requirements first, like so:

(1) read UTF-8 from a file, convert every UTF-8 value read to an equivalent UCS-4 value
(2) write UTF-8 to a file, convert every UCS-4 value to be written to an equivalent UTF-8 value.

Then ask yourself questions like

Do I need uppercase/lowercase conversion on the UCS-4 text?
Do I need to match accented to non-accented UCS-4 characters and vice versa?
Do I need lexicographical sorting of UCS-4 text? If so, in which realm (Latin, Greek, Cyrillic...)?
etc etc

You will find that converting between UTF-8 and UCS-4 is rather straightforward.

The complexity that you seem to find intimidating lies in the text processing bits.

But how much of the latter do you really need for a JSON parser?

The way I see it, your task is to write a UTF-8 to UCS-4 decoder, and a UCS-4 to UTF-8 encoder.

And that task will be easier to do from scratch than to modify an existing UTF-8/UCS-2 decoder/encoder.

I would recommend the same approach we chose in our revision:

* read and write UTF-8 from and to disk
* convert on the fly between UTF-8 and in-memory format
* use ISO 10646 UCS-4 as in-memory format when processing text

If any other encodings are needed, convert between in-memory UCS-4 and those formats while doing IO.

This does sound like the best approach, though I am honestly trepidatious about the whole project. I have only a basic grasp of UNICODE in general, and am hesitant to dive into the standard(s) to the level needed for this project.

Like I said, you are overthinking this.

What made Unicode so nightmarish to work with were the 16-bit formats because they didn't cover the entire Unicode range and weren't linear. Also, they brought endianness issues with them.

But UTF-8 is a stream of single octets, and UCS-4 covers the entire Unicode range and is linear. All the complexities related to encoding and decoding that exist with 16-bit formats are avoided altogether.

Your UTF-8 I/O char buffer will be an ARRAY [0 .. 5] OF CARDINAL [0 .. 255], since every octet of a multi-octet sequence has its high bit set.
Your UNICHAR type will be a subrange CARDINAL [0 .. 10FFFFH].

The UTF-8 specification (IETF RFC 2279, since obsoleted by RFC 3629, which restricts UTF-8 to four octets and code points up to 10FFFF) defines a very simple encoding and decoding algorithm for converting between these two representations.

https://datatracker.ietf.org/doc/html/rfc2279

The mappings between UTF-8 and UCS-4 are as follows:

0000 0000 .. 0000 007F   0xxxxxxx
0000 0080 .. 0000 07FF   110xxxxx 10xxxxxx
0000 0800 .. 0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
0001 0000 .. 001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000 .. 03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 .. 7FFF FFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
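
To make one row of the table concrete, here is a minimal sketch in C (chosen here only because it is easy to test; the logic ports to Modula-2 with DIV and MOD). It handles just the three-octet row, and the function name encode_three_octets is hypothetical, not from any library.

```c
#include <stdint.h>

/* Three-octet row of the table: 1110xxxx 10xxxxxx 10xxxxxx.
   It carries 4 + 6 + 6 = 16 payload bits, so it covers 0800 .. FFFF.
   The caller is assumed to pass a value inside that range. */
void encode_three_octets(uint32_t cp, uint8_t out[3])
{
    out[0] = (uint8_t)(0xE0 | (cp >> 12));          /* marker 1110 + top 4 bits  */
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));  /* marker 10 + middle 6 bits */
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));         /* marker 10 + low 6 bits    */
}
```

For U+20AC (the euro sign) this yields the octets E2 82 AC.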

So, there are six cases to be distinguished and then simply mapped.

There are three steps:

(1) input data verification

For UTF-8 to UCS-4 decoding, assert that there are no leading indicator bits that don't fit the above pattern.
For UCS-4 to UTF-8 encoding, if you so desire, you can use a lookup table to filter out (yet) unassigned code points.
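
The decoding half of this check might look roughly like the following C sketch. The function names are hypothetical, and it only tests the indicator bits described above; it does not filter unassigned code points or reject overlong forms.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A first octet fits none of the six patterns if it is a continuation
   octet (10xxxxxx) or has seven or more leading one bits (FE, FF). */
bool utf8_lead_ok(uint8_t octet)
{
    return (octet & 0xC0) != 0x80 && octet < 0xFE;
}

/* Every octet after the first must carry the 10xxxxxx marker.
   'count' is the total length of the sequence, including the first octet. */
bool utf8_continuations_ok(const uint8_t *seq, size_t count)
{
    for (size_t i = 1; i < count; i++) {
        if ((seq[i] & 0xC0) != 0x80) {
            return false;
        }
    }
    return true;
}
```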

(2) determining which of the six cases applies

For UTF-8 to UCS-4 decoding, test the leading bits of the first octet to determine which case.
For UCS-4 to UTF-8 encoding, test the range in which the UCS-4 value lies to determine which case.
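
Both tests can be sketched in C as below. The names utf8_case_of_lead and utf8_case_of_value are made up for illustration, and returning 0 for "no case applies" is an assumption of this sketch. The case number doubles as the length of the sequence in octets.

```c
#include <stdint.h>

/* Decoding: case 1..6 from the leading bits of the first octet,
   or 0 if no pattern fits (10xxxxxx, FE, FF). */
int utf8_case_of_lead(uint8_t lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    if ((lead & 0xFC) == 0xF8) return 5;  /* 111110xx */
    if ((lead & 0xFE) == 0xFC) return 6;  /* 1111110x */
    return 0;
}

/* Encoding: case 1..6 from the range the UCS-4 value lies in,
   or 0 if the value needs more than 31 bits. */
int utf8_case_of_value(uint32_t value)
{
    if (value <= 0x0000007F) return 1;
    if (value <= 0x000007FF) return 2;
    if (value <= 0x0000FFFF) return 3;
    if (value <= 0x001FFFFF) return 4;
    if (value <= 0x03FFFFFF) return 5;
    if (value <= 0x7FFFFFFF) return 6;
    return 0;
}
```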

(3) copying the payload bits

Then all you have to do is collect the payload bits and copy them into their positions in the target value.
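
For the decoding direction, the three steps combine into something like this C sketch. The 'avail' parameter and the returned octet count are assumptions added for illustration, and overlong forms are not rejected.

```c
#include <stdint.h>

/* Decode one UTF-8 sequence (RFC 2279 form, up to 6 octets) from 'seq',
   of which 'avail' octets are readable. Stores the value in *ch and
   returns the number of octets consumed, or 0 on malformed input. */
int utf8_to_unichar(const uint8_t *seq, int avail, uint32_t *ch)
{
    /* Payload mask of the first octet, indexed by sequence length. */
    static const uint8_t lead_mask[7] = {0, 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01};
    int n;

    if (avail < 1) return 0;
    uint8_t lead = seq[0];

    /* Steps 1 and 2: reject bad leading bits, pick the case. */
    if      ((lead & 0x80) == 0x00) n = 1;
    else if ((lead & 0xE0) == 0xC0) n = 2;
    else if ((lead & 0xF0) == 0xE0) n = 3;
    else if ((lead & 0xF8) == 0xF0) n = 4;
    else if ((lead & 0xFC) == 0xF8) n = 5;
    else if ((lead & 0xFE) == 0xFC) n = 6;
    else return 0;
    if (n > avail) return 0;

    /* Step 3: collect the payload bits, six per continuation octet. */
    uint32_t value = lead & lead_mask[n];
    for (int i = 1; i < n; i++) {
        if ((seq[i] & 0xC0) != 0x80) return 0;  /* not 10xxxxxx */
        value = (value << 6) | (uint32_t)(seq[i] & 0x3F);
    }
    *ch = value;
    return n;
}
```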

Basically you need two functions

TYPE Utf8Octets = ARRAY [0..5] OF CARDINAL [0..255]; (* formal parameters need a named type; the name is only illustrative *)

PROCEDURE utf8ToUnichar ( utf8 : Utf8Octets; VAR ch : UNICHAR );

PROCEDURE unicharToUtf8 ( ch : UNICHAR; VAR utf8 : Utf8Octets );
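
The encoding direction could be sketched in C along these lines. Unlike the signature above, this version returns the number of octets written; that is an assumption of the sketch, as is the lack of any filtering of unassigned code points.

```c
#include <stdint.h>

/* Encode one UCS-4 value as UTF-8 (RFC 2279 form, up to 6 octets).
   Returns the number of octets written to 'utf8', or 0 if the value
   does not fit in 31 bits. */
int unichar_to_utf8(uint32_t ch, uint8_t utf8[6])
{
    /* Indicator bits of the first octet, indexed by sequence length. */
    static const uint8_t lead_marker[7] = {0, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};
    int n;

    /* Pick the case from the range the value lies in. */
    if      (ch <= 0x0000007F) n = 1;
    else if (ch <= 0x000007FF) n = 2;
    else if (ch <= 0x0000FFFF) n = 3;
    else if (ch <= 0x001FFFFF) n = 4;
    else if (ch <= 0x03FFFFFF) n = 5;
    else if (ch <= 0x7FFFFFFF) n = 6;
    else return 0;

    /* Fill the continuation octets from the last one backwards,
       six payload bits each. */
    for (int i = n - 1; i > 0; i--) {
        utf8[i] = (uint8_t)(0x80 | (ch & 0x3F));
        ch >>= 6;
    }
    /* What is left of the value fits beside the indicator bits. */
    utf8[0] = (uint8_t)(lead_marker[n] | ch);
    return n;
}
```

Encoding U+00E9 with this sketch gives C3 A9, and U+1F600 gives F0 9F 98 80.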

That's it. Straightforward. There will be tons of things you have done without much prior knowledge that were significantly more complex. Don't overthink. Don't be intimidated. It's not rocket science.

regards
benjamin
