[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSeq
From: |
Richard Frith-Macdonald |
Subject: |
Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex: |
Date: |
Sun, 8 Apr 2018 10:42:45 +0100 |
> On 8 Apr 2018, at 09:38, Fred Kiefer <address@hidden> wrote:
>
>
>
>> Am 07.04.2018 um 20:51 schrieb David Chisnall <address@hidden>:
>>
>> I am testing out a new version of the compiler / runtime that is producing
>> NSConstantString instances with UTF-16 data. I have currently disabled a
>> lot of the NSConstantString optimisations, on the basis of ‘make it work
>> then make it fast’ and I’m still seeing quite a lot of test failures. The
>> most recent ones seem to come from the fact that GSUnicodeString’s
>> implementation of rangeOfComposedCharacterSequenceAtIndex: calls
>> rangeOfSequence_u(), which returns a different range to NSString’s
>> implementation.
>>
>> I have ls (an GSUnicodeString) and indianLong (a UTF-16 NSConstantString)
>> from the NSString/test00.m. If I call -getCharacters:range: on both, then I
>> get the same set of characters for [indianLong length] characters. This is
>> as expected. When searching for indianLong in ls, it is not found.
>> Sticking in a lot of debugging code, I eventually tracked it down to this
>> disagreement and when I comment out GSUnicodeString’s implementation of
>> rangeOfComposedCharacterSequenceAtIndex: so that it uses the superclass
>> implementation then this test passes.
>>
>> Please can someone who understands these bits of exciting unicode logic take
>> a look and see if there’s any reason for the disagreement?
>
> I am surely no expert here, but I had a quick look at the code and the two
> algorithms seem to be very similar. The only difference is the set of code
> points that the characters get compared to. NSString uses [NSCharacterSet
> nonBaseCharacterSet], which looks correct to me. On the other hand GSString
> uses uni_isnonsp(), which I would read as "non spacing“ but is never
> explained. The code here is as follows:
>
> BOOL
> uni_isnonsp(unichar u)
> {
> /*
> * Treating upper surrogates as non-spacing is a convenient solution
> * to a number of issues with UTF-16
> */
> if ((u >= 0xdc00) && (u <= 0xdfff))
> return YES;
>
> // FIXME check is uni_cop good for this
> if (GSPrivateUniCop(u))
> return YES;
> else
> return NO;
> }
>
> As a side effect this should handle the upper surrogates correctly, but not
> the lower and I have no idea what GSPrivateUniCop does, even after looking at
> the code various times. OK, it is a binary search on uni_cop_table, but what
> is in that table?
I don't think we have an expert on this (I didn't write the original unicode
stuff), but I was able to find out what GSPrivateUniCop() is ... it's checking
for combining characters (diacriticvals etc), so it's use makes sense here if
the idea is to detect runs of 16bit values which are part of the same
'user-perceived character' (important where the 16bit components of that
character cross a range boundary).
I suspect neither uni_isnonsp() nor nonBaseCharacterSet is the correct
mechanism though.
One thing that I do know from looking at cop.h is that the uni_isnonsp[() code,
depending on cop.h is using out of date information;
The NSCharacterSet info is periodically regenerated from the latest unicode
standard (I re-did that quite recently), but the headers in
Source/Additions/unicode are not. The cop.h file is definitely missing
information and I'd therefore consider them other headers untrustworthy. It
would be good to ensure that we machine-generate all unicode related
information rather than depending on header fiels which were (presumably) built
by hand at some point.
Also, reading the unicode documentation it's clear that, even of the cop.h data
was complete, it's only a subset of the characters in the nonBaseCharacterSet,
and those other categories should be included ... so we should not be using
GSPrivateUniCop(), and we should be using nonBaseCharacterSet to get all the
follow-on characters in a multipart user-perceived character.
On the other hand, the earlier comment also looks correct:
/*
* Treating upper surrogates as non-spacing is a convenient solution
* to a number of issues with UTF-16
*/
I checked, and nonBaseCharacterSet does not include the surrogate pair
characters.
So I think we actually need to fix *both* cases to
a. check for the second half of a surrogate pair and
b. check for a non-base character
Probably the best thing to do would be to fix uni_isnonsp() to use
nonBaseCharacterSet and have the NSString code call uni_insnonsp().
Incidentally, I agree that we want ChangeLog entries (git seems very
unfriendly/poor at presenting such information) ...
If it's trivial for David to generate them from git commit entries, I'd rather
have the time spent on writing a few comments be devoted to that trivial change
instead.