Re: [Openexr-devel] UTF-8


From: Jim Atkinson
Subject: Re: [Openexr-devel] UTF-8
Date: Thu, 15 Nov 2012 14:45:37 -0800

You're on a slippery slope once you decide that some strings should be 
normalized and others should not.  Should we keep a table of names that should 
or shouldn't be normalized?  Or change "comment" from a std::string attribute 
to a new "unnormalized string" type?

And I don't think you can solve the problem of strings that look 
typographically the same but are in fact different.  If you assert that the 
names "grün" as:

   0067 0072 0075 0308 006E
   (g, r, u, combining diaeresis, n)

and "grün" as:

   0067 0072 00FC 006E
   (g, r, u with diaeresis, n)

are the same string because they look the same, do you also believe that "foo" 
as:

   0066 006F 006F
   (f, o, o)

and "fоо" as:

   0066 043E 043E
   (f, cyrillic small letter o, cyrillic small letter o)

are also the same?

It seems to me that they are completely different strings that happen to look 
the same (maybe slightly different based on your font).  Normalization won't 
help with that.
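
To make the byte-level difference concrete, here is a minimal C++ sketch that
spells out the UTF-8 bytes of the four strings above and compares them.  The
names (grunDecomposed, sameBytes, and so on) are made up for illustration and
are not part of any OpenEXR API.

```cpp
#include <string>

// UTF-8 bytes for the four spellings discussed above, written with explicit
// escapes so that the encoding of the source file itself cannot matter.
// All names here are hypothetical, for illustration only.
const std::string grunDecomposed = "gru\xCC\x88n";       // u + U+0308
const std::string grunComposed   = "gr\xC3\xBCn";        // U+00FC
const std::string fooLatin       = "foo";                // U+006F U+006F
const std::string fooCyrillic    = "f\xD0\xBE\xD0\xBE";  // U+043E U+043E

// Byte-wise equality, which is all a plain std::string comparison can see.
bool sameBytes(const std::string& a, const std::string& b)
{
    return a == b;
}
```

Both pairs compare as unequal here.  Normalizing to NFC would make the first
pair equal, but the Latin and Cyrillic "foo" stay distinct under any
normalization form.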

There are lots of characters that look like other characters.  There are 4 
Unicode variations on ".", so is "foo.R" the red channel in layer "foo" or not? 
White space is also allowed in strings, so space, no-break space, en space, em 
space, etc. are all allowed.  Personally, I think that anyone who puts these 
characters in a channel or attribute name deserves whatever happens as a 
result, but I don't think it is possible to prevent them from doing so without 
limiting the character set to ASCII or some equally limited set.

If the application normalizes all the strings it provides to the library, 
everything will work.  And if the application doesn't normalize all the 
strings, everything will still "work".  A user may just find that "grün" and 
"grün" will be different channel names just like "foo" and "fоо" are.
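
A rough sketch of what that means for a byte-wise name table, assuming the
application does its own normalization before calling the library.  The
function toyCompose below is a hypothetical stand-in that composes exactly one
sequence; a real application would call a full NFC implementation (ICU, for
example) instead.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for a real NFC routine: it composes exactly one
// sequence, "u" followed by U+0308, into U+00FC.  Real normalization needs
// the Unicode composition tables and is not implemented here.
std::string toyCompose(std::string s)
{
    const std::string decomposed = "u\xCC\x88";
    const std::string composed   = "\xC3\xBC";
    for (std::string::size_type pos = s.find(decomposed);
         pos != std::string::npos;
         pos = s.find(decomposed, pos + composed.size()))
    {
        s.replace(pos, decomposed.size(), composed);
    }
    return s;
}

// Count how many distinct names a byte-wise table ends up holding when two
// names are inserted, with or without normalizing them first.
std::size_t distinctNames(const std::string& a, const std::string& b,
                          bool normalize)
{
    std::set<std::string> names;
    names.insert(normalize ? toyCompose(a) : a);
    names.insert(normalize ? toyCompose(b) : b);
    return names.size();
}
```

With normalize set to false, the two spellings of "grün.R" produce two
entries; with it set to true they collapse into one.  "foo" and its Cyrillic
lookalike remain two entries either way.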

How about adding normalization routines into ilmbase somewhere and recommending 
their use?

- Jim

On Nov 15, 2012, at 12:21 PM, Florian Kainz <address@hidden> wrote:

> After mulling this over a bit more, I think the rule that _all_ strings
> must be normalized is too restrictive.  Strings such as the value of the
> "comments" attribute are only stored in the file or retrieved from the file.
> The library performs no other processing, so there's no requirement for
> normalization.
> 
> However, attribute names and channel names are used by the library for table
> lookups.  In those cases normalization is necessary, or the lookup will not
> work correctly.  Comparison with strcmp() can fail when a string has more
> than one possible representation.
> 
> I don't see how the onus could be shifted to application code.  If a user
> types a channel name such as ??.R (see my earlier mail), must the application
> try all possible representations of this string in order to find out if the
> corresponding channel exists?  If the application fails to do this, should
> the library allow a channel list that contains multiple channels with the
> name ??.R, the only difference being that in one case ?? is represented
> as two Hangul syllable characters, in the next ? has been split into Jamo
> but ? is a syllable, and so on?
> 
> If a single application generates all the attribute and channel names in
> a file then we can reasonably assume that the application uses consistent
> rules for encoding strings throughout its code base.  However, applications
> must be able to handle channel names found in files that may have been
> generated by other applications, possibly with different internal conventions
> for text processing.  For example, an application that internally represents
> Korean texts using syllables should be able to handle OpenEXR files that
> were written by an application that uses Jamo, or any combination of Jamo
> and syllables.
> 
> 
> I propose a revised set of rules:
> 
> - All text strings are to be interpreted as Unicode, encoded as UTF-8.
>  This includes attribute names and strings contained in attributes,
>  for example, as channel names.
> 
> - Attribute names and channel names stored in files must be in Normalization
>  Form C (NFC, canonical decomposition followed by canonical composition).
> 
> - Where attribute names or channel names need to be collated, strcmp() is
>  used to compare the corresponding char sequences:  string A comes before
>  (or is less than) string B if
> 
>    strcmp(A,B) < 0
> 
>  (Note: this is not ambiguous; the C99 standard specifies that strcmp()
>  interprets the bytes that make up a string as unsigned.)
> 
> - Attribute names and channel names passed to the IlmImf library must be
>  encoded as UTF-8 and in Normalization Form C.
> 
>  (Note - this last rule could be changed to: Attribute names and channel
>  names must be encoded as UTF-8.  The library converts the names to
>  Normalization Form C before any further processing.)
> 
> Florian
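
For reference, the collation rule in the quoted proposal could be written
roughly as below.  The helper name collatesBefore is hypothetical.  Note that
strcmp() is only guaranteed to return a negative, zero, or positive value, so
the sketch tests the sign rather than comparing against a specific number.

```cpp
#include <cstring>

// True if name a collates before name b under byte-wise strcmp() ordering.
// strcmp() compares the bytes as unsigned char, so every byte of a multi-byte
// UTF-8 sequence (0x80 and above) sorts after every ASCII character.
bool collatesBefore(const char* a, const char* b)
{
    return std::strcmp(a, b) < 0;
}
```

Under this ordering "z" comes before "\xC3\xBC" (U+00FC), because 0x7A is less
than 0xC3 when both are treated as unsigned.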
