Re: [Openexr-devel] UTF-8


From: David Aguilar
Subject: Re: [Openexr-devel] UTF-8
Date: Wed, 14 Nov 2012 14:32:10 -0800

On Wed, Nov 14, 2012 at 11:47 AM, Florian Kainz <address@hidden> wrote:
>
> The ACES image container specification, meant to be compatible with OpenEXR,
> prescribes UTF-8 for the representation of strings.  Therefore I suggest
> that OpenEXR adopt the following rules:
>
> - All text strings are to be interpreted as Unicode, encoded as UTF-8.
>   This includes attribute names and strings contained in attributes,
>   for example, as channel names.
>
> - Text strings stored in files must be in Normalization Form C (NFC,
>   canonical decomposition followed by canonical composition).

I would stay far away from dealing with normalization issues.

Poke around on OS X and its broken HFS filesystem to see why:

http://radsoft.net/rants/20080405,00.shtml

If the library verified utf-8 that would be enough IMO.
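
To make "verified" concrete, here is a minimal sketch in Python (not IlmImf code) of a check that only accepts or rejects and never rewrites the bytes. The name is_valid_utf8 and the sample values are made up for illustration.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8. The input is never modified."""
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong encodings, and encoded surrogates.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical attribute values as they might be handed to a library.
good = "Änderung".encode("utf-8")  # 0xC3 0x84 followed by ASCII
bad = b"\xc4nderung"               # the same word encoded as Latin-1

print(is_valid_utf8(good))
print(is_valid_utf8(bad))
```

A check like this can reject bad input at write time while leaving valid strings byte-for-byte untouched.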

Imagine some poor sucker who goes and stores unicode filenames in a
header.  It's not fun to have a library silently "fix" things for you.

What's the upside of doing the normalization?  How about just leave it
as-is?  That way the code can stay simple.  Whatever you put in can be
byte-for-byte identical to what you get out.
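
A small Python sketch of what I mean, assuming a filename that arrives in decomposed form (base letter plus combining accent, the way some filesystems hand names back). The helper to_nfc_bytes and the filename are made up for illustration.

```python
import unicodedata


def to_nfc_bytes(raw: bytes) -> bytes:
    """Decode UTF-8, normalize to NFC, and re-encode (what an auto-"fix" would do)."""
    return unicodedata.normalize("NFC", raw.decode("utf-8")).encode("utf-8")


# "e" followed by U+0301 COMBINING ACUTE ACCENT: valid UTF-8, but not NFC.
stored = "cafe\u0301.exr".encode("utf-8")
normalized = to_nfc_bytes(stored)

print(stored)      # bytes as the caller supplied them
print(normalized)  # bytes after normalization
print(stored == normalized)
```

Both strings render the same on screen, but the byte sequences differ, so a name normalized inside the header may no longer match the name on disk when compared byte-wise.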

Other than that, UTF-8 all the way as the "recommended" encoding.

> - Where text strings need to be collated, strcmp() is used to compare
>   the corresponding char sequences:  string A comes before (or is less
>   than) string B if
>
>     strcmp(A,B) < 0
>
>   (Note: this is not ambiguous; the C99 standard specifies that strcmp()
>   interprets the bytes that make up a string as unsigned.)
>
> - Text strings passed to the IlmImf library must be encoded as UTF-8
>   and in Normalization Form C.
>
> As far as I can tell, these rules are entirely compatible with all
> existing versions of the IlmImf library.  Users whose writing system
> includes non-ASCII Unicode characters can continue to employ the
> existing library versions without change.
>
> Future versions of the library should verify that text strings are
> valid UTF-8.  In addition, the library should either verify that
> strings are normalized to NFC, or normalize to NFC on the fly.

If we treat them like raw bytes then we really don't care about the
encoding, do we?  (that's why I said, "recommended")
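
To illustrate the raw-bytes view, here is a rough Python sketch of strcmp()-style collation. Python compares bytes objects by unsigned byte value, which stands in for strcmp() here (strcmp() additionally stops at the first NUL byte, which this sketch does not model, and it only guarantees the sign of its result). The function compare_bytes and the sample names are made up for illustration.

```python
def compare_bytes(a: bytes, b: bytes) -> int:
    """Return -1, 0, or 1 by unsigned byte-wise comparison (only the sign matters)."""
    return (a > b) - (a < b)


names = ["Z", "a", "é", "R", "中"]

# Order by the raw UTF-8 bytes, with no knowledge of the encoding.
by_bytes = sorted(names, key=lambda s: s.encode("utf-8"))

# Order by Unicode code point (how Python compares str objects).
by_code_point = sorted(names)

print(by_bytes)
print(by_bytes == by_code_point)

# The comparison is just as well defined for bytes that are not UTF-8.
print(compare_bytes(b"\xc4nderung", b"zebra"))
```

For valid UTF-8 the byte order happens to coincide with code point order, but the comparison itself never needs to know which encoding produced the bytes.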

It would be nice if the thing stayed agnostic.

Is there a reason why it needs to enforce the encoding,
or is a strong recommendation to use UTF-8 good enough?
-- 
David


