openexr-devel

Re: [Openexr-devel] UTF-8


From: Florian Kainz
Subject: Re: [Openexr-devel] UTF-8
Date: Wed, 14 Nov 2012 14:57:35 -0800
User-agent: Thunderbird 2.0.0.24 (X11/20100428)


The problem is that a channel or attribute name such as
"grün" could be represented as the character sequence

    0067 0072 0075 0308 006E
    (g, r, u, combining diaeresis, n)

or as

    0067 0072 00FC 006E
    (g, r, u with diaeresis, n).

Typographically the representations look identical, but
string comparisons would treat them as different.
I can't imagine users being happy to be told that a file
contains, for example, a "grün" channel of type HALF, and
a "grün" channel of type FLOAT, where the only difference
between the names is how they are represented as Unicode.
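
To make the mismatch concrete, here is a small sketch; the byte values
are simply the UTF-8 encodings of the two sequences above:

    #include <cstdio>
    #include <cstring>

    int
    main ()
    {
        // Both strings render as "grün", but the bytes differ:
        const char decomposed[]  = "gru\xcc\x88n"; // g r u U+0308 n (NFD)
        const char precomposed[] = "gr\xc3\xbcn";  // g r U+00FC n (NFC)

        // strcmp sees different byte sequences, so the names differ.
        std::printf ("%s\n",
                     std::strcmp (decomposed, precomposed) == 0 ?
                     "same name" : "different names");
        return 0;
    }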

As far as I can tell, either string comparison needs to
perform some normalization on the fly, or the strings that
are compared must already be normalized.

Yes, normalization is a headache, but with Unicode there is
not a one-to-one correspondence between the character sequence
stored in a string and the typographical representation of
that string.
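
For what it's worth, a normalization helper could be built on ICU.
This is only a sketch, with error handling kept minimal; IlmImf itself
has no ICU dependency:

    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>
    #include <string>

    // Normalize a UTF-8 string to NFC, e.g. before using it as a
    // channel or attribute name.
    std::string
    toNFC (const std::string &utf8)
    {
        UErrorCode status = U_ZERO_ERROR;
        const icu::Normalizer2 *nfc =
            icu::Normalizer2::getNFCInstance (status);

        if (U_FAILURE (status) || nfc == 0)
            return utf8;   // fall back to the original string

        icu::UnicodeString s = icu::UnicodeString::fromUTF8 (utf8);
        icu::UnicodeString n = nfc->normalize (s, status);

        std::string result;
        n.toUTF8String (result);
        return result;
    }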

Florian


David Aguilar wrote:
On Wed, Nov 14, 2012 at 11:47 AM, Florian Kainz <address@hidden> wrote:
The ACES image container specification, meant to be compatible with OpenEXR,
prescribes UTF-8 for the representation of strings.  Therefore I suggest
that OpenEXR adopt the following rules:

- All text strings are to be interpreted as Unicode, encoded as UTF-8.
  This includes attribute names and strings contained in attributes,
  for example, channel names.

- Text strings stored in files must be in Normalization Form C (NFC,
  canonical decomposition followed by canonical composition).

I would stay far away from dealing with normalization issues.

Poke around on OS X and its broken HFS filesystem to see why:

http://radsoft.net/rants/20080405,00.shtml

If the library verified UTF-8, that would be enough IMO.

Imagine some poor sucker who goes and stores Unicode filenames in a
header.  It's not fun to have a library silently "fix" things for you.

What's the upside of doing the normalization?  How about just leaving it
as-is?  That way the code can stay simple.  Whatever you put in can be
byte-for-byte identical to what you get out.

Other than that, UTF-8 all the way as the "recommended" encoding.

- Where text strings need to be collated, strcmp() is used to compare
  the corresponding char sequences:  string A comes before (or is less
  than) string B if

    strcmp(A,B) < 0

  (Note: this is not ambiguous; the C99 standard specifies that strcmp()
  interprets the bytes that make up a string as unsigned char.  A
  comparator sketch follows after these rules.)

- Text strings passed to the IlmImf library must be encoded as UTF-8
  and in Normalization Form C.
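
To illustrate the collation rule above, a byte-wise comparator along
these lines could be used; the names here are illustrative, not
IlmImf's actual types:

    #include <cstring>
    #include <map>
    #include <string>

    // Order names by their raw bytes, interpreted as unsigned chars,
    // which matches the strcmp-based ordering described above.
    struct ByteLess
    {
        bool operator() (const std::string &a, const std::string &b) const
        {
            return std::strcmp (a.c_str (), b.c_str ()) < 0;
        }
    };

    // Example: channel names mapped to pixel types, collated byte-wise.
    // std::map<std::string, int, ByteLess> channels;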

As far as I can tell, these rules are entirely compatible with all
existing versions of the IlmImf library.  Users whose writing system
includes non-ASCII Unicode characters can continue to employ the
existing library versions without change.

Future versions of the library should verify that text strings are
valid UTF-8.  In addition, the library should either verify that
strings are normalized to NFC, or normalize to NFC on the fly.
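
For illustration only, such a UTF-8 validity check might take roughly
this shape (over-long encodings and surrogates are not rejected here;
a real check would be stricter):

    #include <string>

    // Return true if s is a structurally valid UTF-8 byte sequence.
    bool
    isValidUTF8 (const std::string &s)
    {
        for (size_t i = 0; i < s.size ();)
        {
            unsigned char c = s[i];
            size_t extra;

            if      (c < 0x80)           extra = 0; // ASCII
            else if ((c & 0xE0) == 0xC0) extra = 1; // 2-byte sequence
            else if ((c & 0xF0) == 0xE0) extra = 2; // 3-byte sequence
            else if ((c & 0xF8) == 0xF0) extra = 3; // 4-byte sequence
            else return false;                      // invalid lead byte

            if (i + extra >= s.size ())
                return false;                       // truncated sequence

            for (size_t j = 1; j <= extra; ++j)
                if ((static_cast<unsigned char> (s[i + j]) & 0xC0) != 0x80)
                    return false;                   // bad continuation byte

            i += extra + 1;
        }

        return true;
    }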

If we treat them like raw bytes then we really don't care about the
encoding, do we?  (that's why I said, "recommended")

It would be nice if the thing stayed agnostic.

Is there a reason why it needs to enforce the encoding,
or is a strong recommendation to use UTF-8 good enough?


