Re: [Openexr-devel] UTF-8
From: Jim Atkinson
Subject: Re: [Openexr-devel] UTF-8
Date: Thu, 15 Nov 2012 14:45:37 -0800
You're on a slippery slope once you decide that some strings should be
normalized and others should not. Should we keep a table of names that should
or shouldn't be normalized? Or change "comment" from a std::string attribute
to a new "unnormalized string" type?
And I don't think you can solve the problem of strings that look
typographically the same but are in fact different. If you assert that the
names "grün" as:
0067 0072 0075 0308 006E
(g, r, u, combining diaeresis, n)
and "grün" as:
0067 0072 00FC 006E
(g, r, u with diaeresis, n)
are the same string because they look the same, do you also believe that "foo"
as:
0066 006F 006F
(f, o, o)
and "fоо" as:
0066 043E 043E
(f, cyrillic small letter o, cyrillic small letter o)
are also the same?
It seems to me that they are completely different strings that happen to look
the same (maybe slightly different based on your font). Normalization won't
help with that.
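
To make the distinction concrete, here is a minimal sketch using Python's standard unicodedata module. The helper name nfc() and the variable names are only for illustration; nothing like this exists in the library today.

```python
import unicodedata

# "grün" spelled two ways: combining diaeresis vs. precomposed u with diaeresis.
decomposed = "gru\u0308n"   # 0067 0072 0075 0308 006E
precomposed = "gr\u00fcn"   # 0067 0072 00FC 006E

# "foo" in ASCII vs. with two Cyrillic small letter o characters.
latin_foo = "foo"               # 0066 006F 006F
cyrillic_foo = "f\u043e\u043e"  # 0066 043E 043E


def nfc(s: str) -> str:
    """Return s in Normalization Form C (illustrative helper)."""
    return unicodedata.normalize("NFC", s)


# Canonically equivalent spellings only compare equal after normalization.
print(decomposed == precomposed)            # False
print(nfc(decomposed) == nfc(precomposed))  # True

# Look-alike characters from different scripts stay different.
print(nfc(latin_foo) == nfc(cyrillic_foo))  # False
```

NFC folds the two canonically equivalent spellings of "grün" together, but it has no opinion about Latin versus Cyrillic "o".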
There are lots of characters that look like other characters. There are 4
Unicode variations on ".", so is "foo.R" the red channel in layer "foo" or not?
White space is also allowed in strings so space, no-break space, en space, em
space, etc. are all allowed. Personally, I think that anyone who puts these
characters in a channel or attribute name deserves whatever happens as a
result, but I don't think it is possible to prevent them from doing so without
limiting the character set to ASCII or some equally limited set.
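
As a rough illustration of the look-alike problem, the sketch below assumes a layer is split off at the last ASCII full stop. layer_and_channel() is a made-up helper, not the library's actual parsing code, and the two look-alike characters are just examples.

```python
import unicodedata


def layer_and_channel(name: str):
    """Split a name at the last ASCII full stop (U+002E), if there is one.

    Hypothetical helper for illustration only.
    """
    layer, sep, channel = name.rpartition(".")
    return (layer, channel) if sep else ("", name)


ascii_name = "foo.R"
fullwidth_name = "foo\uff0eR"   # FULLWIDTH FULL STOP instead of "."
space_name = "my layer.R"
nbsp_name = "my\u00a0layer.R"   # NO-BREAK SPACE instead of " "

# NFC leaves both look-alike characters exactly as they are.
print(unicodedata.normalize("NFC", fullwidth_name) == fullwidth_name)  # True
print(unicodedata.normalize("NFC", nbsp_name) == nbsp_name)            # True

# So only the ASCII spelling is seen as channel "R" in layer "foo".
print(layer_and_channel(ascii_name))            # ('foo', 'R')
print(layer_and_channel(fullwidth_name)[0])     # '' (no layer found)

# And the two "my layer" names remain different strings.
print(space_name == nbsp_name)                  # False
```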
If the application normalizes all the strings it provides to the library,
everything will work. And if the application doesn't normalize all the
strings, everything will still "work". A user may just find that "grün" and
"grün" will be different channel names just like "foo" and "fоо" are.
How about adding normalization routines into ilmbase somewhere and recommending
their use?
- Jim
On Nov 15, 2012, at 12:21 PM, Florian Kainz <address@hidden> wrote:
> After mulling this over a bit more, I think the rule that _all_ strings
> must be normalized is too restrictive. Strings such as the value of the
> "comments" attribute are only stored in the file or retrieved from the file.
> The library performs no other processing, so there's no requirement for
> normalization.
>
> However, attribute names and channel names are used by the library for table
> lookups. In those cases normalization is necessary, or the lookup will not
> work correctly. Comparison with strcmp() can fail when a string has more
> than one possible representation.
>
> I don't see how the onus could be shifted to application code. If a user
> types a channel name such as ??.R (see my earlier mail), must the application
> try all possible representations of this string in order to find out if the
> corresponding channel exists? If the application fails to do this, should
> the library allow a channel list that contains multiple channels with the
> name ??.R, the only difference being that in one case ?? is represented
> as two Hangul syllable characters, in the next ? has been split into Jamo
> but ? is a syllable, and so on?
>
> If a single application generates all the attribute and channel names in
> a file then we can reasonably assume that the application uses consistent
> rules for encoding strings throughout its code base. However, applications
> must be able to handle channel names found in files that may have been
> generated by other applications, possibly with different internal conventions
> for text processing. For example, an application that internally represents
> Korean texts using syllables should be able to handle OpenEXR files that
> were written by an application that uses Jamo, or any combination of Jamo
> and syllables.
>
>
> I propose a revised set of rules:
>
> - All text strings are to be interpreted as Unicode, encoded as UTF-8.
> This includes attribute names and strings contained in attributes,
> for example, channel names.
>
> - Attribute names and channel names stored in files must be in Normalization
> Form C (NFC, canonical decomposition followed by canonical composition).
>
> - Where attribute names or channel names need to be collated, strcmp() is
> used to compare the corresponding char sequences: string A comes before
> (or is less than) string B if
>
> strcmp(A,B) < 0
>
> (Note: this is not ambiguous; the C99 standard specifies that strcmp()
> interprets the bytes that make up a string as unsigned char.)
>
> - Attribute names and channel names passed to the IlmImf library must be
> encoded as UTF-8 and in Normalization Form C.
>
> (Note - this last rule could be changed to: Attribute names and channel
> names must be encoded as UTF-8. The library converts the names to
> Normalization Form C before any further processing.)
>
> Florian
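
The proposed rules, in the variant where the library normalizes names itself, could be sketched roughly as follows. This is Python rather than IlmImf code, and normalize_name(), compare_names() and ChannelTable are hypothetical names invented for the illustration. The Hangul syllable is an arbitrary example, not the name from the earlier mail.

```python
import unicodedata


def normalize_name(name: str) -> bytes:
    """NFC-normalize a name and encode it as UTF-8 (hypothetical helper)."""
    return unicodedata.normalize("NFC", name).encode("utf-8")


def compare_names(a: bytes, b: bytes) -> int:
    """strcmp()-style result: negative, zero, or positive.

    Python compares bytes objects by unsigned byte value, which matches
    strcmp()'s unsigned char comparison for names without NUL bytes.
    """
    return (a > b) - (a < b)


class ChannelTable:
    """Toy lookup table keyed on normalized UTF-8 names."""

    def __init__(self):
        self._channels = {}

    def insert(self, name: str, value) -> None:
        key = normalize_name(name)
        if key in self._channels:
            raise ValueError("duplicate channel name after normalization")
        self._channels[key] = value

    def find(self, name: str):
        return self._channels.get(normalize_name(name))

    def names(self):
        # Collation order: plain byte-wise comparison of the UTF-8 form.
        return sorted(self._channels)


table = ChannelTable()
syllable = "\ud55c.R"            # one precomposed Hangul syllable
jamo = "\u1112\u1161\u11ab.R"    # the same syllable written as three Jamo

table.insert(syllable, "half")
print(table.find(jamo))          # half

# Multi-byte UTF-8 sequences sort after ASCII because bytes are unsigned.
print(compare_names(normalize_name("z"), normalize_name("\u00e9")) < 0)  # True
```

With normalization done inside the table, a second insert of the Jamo spelling is rejected as a duplicate instead of silently creating a second channel with the same visible name.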