On Wed, Nov 14, 2012 at 11:47 AM, Florian Kainz <address@hidden> wrote:
> The ACES image container specification, meant to be compatible with
> OpenEXR, prescribes UTF-8 for the representation of strings. Therefore
> I suggest that OpenEXR adopt the following rules:
>
> - All text strings are to be interpreted as Unicode, encoded as UTF-8.
>   This includes attribute names and strings contained in attributes,
>   for example, channel names.
>
> - Text strings stored in files must be in Normalization Form C (NFC,
>   canonical decomposition followed by canonical composition).
I would stay far away from dealing with normalization issues.
Poke around on OS X and its broken HFS filesystem to see why:
http://radsoft.net/rants/20080405,00.shtml
If the library verified utf-8 that would be enough IMO.
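Verification of that sort stays simple: a well-formedness check never rewrites bytes, it only accepts or rejects them. A minimal sketch of such a check (my own illustration, not code from IlmImf) could look like this; it enforces shortest-form sequences and rejects surrogates and out-of-range code points, per the Unicode standard's definition of well-formed UTF-8:

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if s[0..n) is well-formed UTF-8: shortest-form
// sequences only, no UTF-16 surrogates, nothing above U+10FFFF.
bool isValidUtf8 (const unsigned char* s, std::size_t n)
{
    std::size_t i = 0;
    while (i < n)
    {
        unsigned char c = s[i];
        if (c < 0x80) { i += 1; continue; }               // ASCII byte

        std::size_t len;
        std::uint32_t cp, min;
        if      ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80;    }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800;   }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;                 // stray continuation or invalid lead byte

        if (i + len > n) return false;     // truncated sequence
        for (std::size_t j = 1; j < len; ++j)
        {
            if ((s[i + j] & 0xC0) != 0x80) return false;   // bad continuation byte
            cp = (cp << 6) | (s[i + j] & 0x3F);
        }
        if (cp < min) return false;                        // overlong encoding
        if (cp > 0x10FFFF) return false;                   // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

Rejecting an invalid string at the API boundary keeps the round-trip guarantee intact: anything the library accepts comes back byte-for-byte.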
Imagine some poor sucker who goes and stores Unicode filenames in a
header. It's not fun to have a library silently "fix" things for you.
What's the upside of doing the normalization? How about just leave it
as-is? That way the code can stay simple. Whatever you put in can be
byte-for-byte identical to what you get out.
Other than that, UTF-8 all the way as the "recommended" encoding.
> - Where text strings need to be collated, strcmp() is used to compare
>   the corresponding char sequences: string A comes before (or is less
>   than) string B if
>
>       strcmp(A, B) < 0
>
>   (Note: this is not ambiguous; the C99 standard specifies that
>   strcmp() interprets the bytes that make up a string as unsigned
>   char, and guarantees only the sign of a nonzero result, not the
>   specific value -1.)
>
> - Text strings passed to the IlmImf library must be encoded as UTF-8
>   and in Normalization Form C.
> As far as I can tell, these rules are entirely compatible with all
> existing versions of the IlmImf library. Users whose writing system
> includes non-ASCII Unicode characters can continue to employ the
> existing library versions without change.
>
> Future versions of the library should verify that text strings are
> valid UTF-8. In addition, the library should either verify that
> strings are normalized to NFC, or normalize to NFC on the fly.
If we treat them like raw bytes then we really don't care about the
encoding, do we? (that's why I said, "recommended")
It would be nice if the thing stayed agnostic.
Is there a reason why it needs to enforce the encoding,
or is a strong recommendation to use UTF-8 good enough?