gm2

Re: Inquiring about the status of the proposed UNICODE library


From: Alice Osako
Subject: Re: Inquiring about the status of the proposed UNICODE library
Date: Fri, 8 Mar 2024 05:45:02 -0500
User-agent: Mozilla Thunderbird



Benjamin Kowarsch:
On Fri, 8 Mar 2024 at 06:36, Alice Osako wrote:
 
I think I will fork the Lilley code as a starting point for a library implementation, modifying it to use UTF-32 rather than UTF-16. This will mean using the GPL v.3 as its license, of course, but I have no objection to that. While it is certainly not necessary so long as I follow the GPL requirements, I would feel much better if I were to hear from Chris Lilley about this.

I think you are overthinking the task at hand.

Write down your requirements first, like so:

(1) read UTF-8 from a file, convert every UTF-8 value read to an equivalent UCS-4 value
(2) write UTF-8 to a file, convert every UCS-4 value to be written to an equivalent UTF-8 value.

Then ask yourself questions like:

Do I need uppercase/lowercase conversion on the UCS-4 text?
Do I need to match accented to non-accented UCS-4 characters and vice versa?
Do I need lexicographical sorting of UCS-4 text? If so, in which realm (Latin, Greek, Cyrillic...)?
etc etc

For the JSON parser, no, I really don't need any of those things. I will want to make it easy to add them later, should I need to, but for my immediate purposes, I can do as you say below.

You will find that converting between UTF-8 and UCS-4 is rather straightforward.

The complexity that you seem to find intimidating lies in the text processing bits.

But how much of the latter do you really need for a JSON parser?

The way I see it, your task is to write a UTF-8 to UCS-4 decoder, and a UCS-4 to UTF-8 encoder.

And that task will be easier to do from scratch than to modify an existing UTF-8/UCS-2 decoder/encoder.

That's a very good point. As it happens, Lilley had really only gotten as far as the internal representation and a few utility procedures to support it, anyway; adapting that part is trivial.

I would recommend the same approach we chose in our revision:

* read and write UTF-8 from and to disk
* convert on the fly between UTF-8 and in-memory format
* use ISO 10646 UCS-4 as in-memory format when processing text
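
The "convert on the fly" step in the list above can be sketched as an incremental decoder that is fed one octet at a time as it arrives from disk. This is only an illustration written in C, not code from the thread or from any existing library; the names `Utf8Stream` and `utf8_feed` are hypothetical, and the sketch assumes the four-octet UTF-8 form defined by RFC 3629.

```c
#include <stdint.h>

/* Hypothetical incremental UTF-8 to UCS-4 decoder (RFC 3629 form assumed). */
typedef struct {
    uint32_t cp;       /* code point assembled so far */
    uint32_t min;      /* smallest value legal for this sequence length */
    int      pending;  /* continuation octets still expected */
} Utf8Stream;

enum { UTF8_ERROR = -1, UTF8_MORE = 0, UTF8_DONE = 1 };

/* Feed one octet. Returns UTF8_DONE and stores a code point in *out when a
   sequence completes, UTF8_MORE when more octets are needed, UTF8_ERROR on
   malformed input. The state must start zeroed: Utf8Stream s = {0, 0, 0}; */
int utf8_feed(Utf8Stream *s, unsigned char octet, uint32_t *out)
{
    if (s->pending == 0) {
        if (octet < 0x80) {              /* single octet: 7-bit range */
            *out = octet;
            return UTF8_DONE;
        }
        if ((octet & 0xE0) == 0xC0) {
            s->cp = octet & 0x1F; s->min = 0x80;    s->pending = 1;
        } else if ((octet & 0xF0) == 0xE0) {
            s->cp = octet & 0x0F; s->min = 0x800;   s->pending = 2;
        } else if ((octet & 0xF8) == 0xF0) {
            s->cp = octet & 0x07; s->min = 0x10000; s->pending = 3;
        } else {
            return UTF8_ERROR;           /* stray continuation or invalid lead */
        }
        return UTF8_MORE;
    }
    if ((octet & 0xC0) != 0x80) {        /* expected a continuation octet */
        s->pending = 0;
        return UTF8_ERROR;
    }
    s->cp = (s->cp << 6) | (octet & 0x3F);
    if (--s->pending > 0)
        return UTF8_MORE;
    /* Reject overlong forms, values past U+10FFFF, and surrogates. */
    if (s->cp < s->min || s->cp > 0x10FFFF ||
        (s->cp >= 0xD800 && s->cp <= 0xDFFF))
        return UTF8_ERROR;
    *out = s->cp;
    return UTF8_DONE;
}
```

A reader loop would call `utf8_feed` once per octet obtained from the file and append each completed code point to the in-memory UCS-4 text; a non-zero `pending` at end of file indicates a truncated sequence.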

If any other encodings are needed, convert between in-memory UCS-4 and those formats while doing IO.

This does sound like the best approach, though I am honestly trepidatious about the whole project. I have only a basic grasp of UNICODE in general, and am hesitant to dive into the standard(s) to the level needed for this project.

Like I said, you are overthinking this.

What made Unicode so nightmarish to work with were the 16-bit formats because they didn't cover the entire Unicode range and weren't linear. Also, they brought endianness issues with them.

But UTF-8 is a stream of single octets, and UCS-4 covers the entire Unicode range and is linear. All the complexities related to encoding and decoding that exist with 16-bit formats are avoided altogether.
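
To make the "weren't linear" point concrete, here is a small sketch in C of what a UTF-16 encoder has to do. It is only an illustration; the name `ucs4_to_utf16` is hypothetical. A code point above U+FFFF does not fit in one 16-bit unit and must be split into a surrogate pair, whereas in UCS-4 the stored value is simply the code point itself. UCS-2, which has no surrogates, cannot represent such code points at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16 code units.
   Returns the number of units written (1 or 2), or 0 if cp is not
   a valid Unicode scalar value. */
size_t ucs4_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                        /* range reserved for surrogates */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;           /* fits in a single unit */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                        /* beyond the Unicode range */
    cp -= 0x10000;                       /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+20AC comes out as the single unit 0x20AC, while U+1F600 comes out as the pair 0xD83D 0xDE00. The byte order of each unit on disk is a further choice, which is where the endianness issue comes from.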

<detailed implementation recommendations snipped>

Basically you need two functions:

PROCEDURE utf8ToUnichar ( utf8 : ARRAY [0..5] OF CARDINAL [0..255]; VAR ch : UNICHAR );

PROCEDURE unicharToUtf8 ( ch : UNICHAR; VAR utf8 : ARRAY [0..5] OF CARDINAL [0..255] );

That's it. Straightforward. There will be tons of things you have done without much prior knowledge that were significantly more complex. Don't overthink. Don't be intimidated. It's not rocket science.
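
As a rough illustration of how small the two procedures are, here is a sketch of both directions in C. It is not the implementation discussed in the thread; the names `utf8_to_ucs4` and `ucs4_to_utf8` are hypothetical. Two assumptions differ from the signatures above: octets are held as `unsigned char` (0..255), because every lead and continuation octet of a multi-octet sequence has its high bit set, and sequences are limited to four octets and U+10FFFF as in RFC 3629, rather than the six octets of the original ISO 10646 definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from buf[0..len-1]. Returns the number of octets
   consumed (1 to 4), or 0 if the input is truncated or malformed. */
size_t utf8_to_ucs4(const unsigned char *buf, size_t len, uint32_t *out)
{
    /* Smallest code point that legitimately needs n octets (index = n). */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t cp;

    if (len == 0) return 0;
    if (buf[0] < 0x80)                { n = 1; cp = buf[0]; }
    else if ((buf[0] & 0xE0) == 0xC0) { n = 2; cp = buf[0] & 0x1F; }
    else if ((buf[0] & 0xF0) == 0xE0) { n = 3; cp = buf[0] & 0x0F; }
    else if ((buf[0] & 0xF8) == 0xF0) { n = 4; cp = buf[0] & 0x07; }
    else return 0;                    /* stray continuation or invalid lead */

    if (len < n) return 0;            /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80) return 0;   /* not a continuation octet */
        cp = (cp << 6) | (buf[i] & 0x3F);        /* append 6 payload bits */
    }
    if (cp < min_value[n]) return 0;             /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                 /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate, not a character */
    *out = cp;
    return n;
}

/* Encode one code point into out. Returns the number of octets written
   (1 to 4), or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t ucs4_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With these definitions, the octets E2 82 AC decode to U+20AC, and U+1F600 encodes to F0 9F 98 80. Most of the decoder is validation; the conversion itself is a handful of shifts and masks.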

Thank you, this does help put it all into perspective.

