Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSeq

gnustep-dev

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSeq

From:	Richard Frith-Macdonald
Subject:	Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:
Date:	Sun, 8 Apr 2018 10:42:45 +0100

> On 8 Apr 2018, at 09:38, Fred Kiefer <address@hidden> wrote:
> 
> 
> 
>> Am 07.04.2018 um 20:51 schrieb David Chisnall <address@hidden>:
>> 
>> I am testing out a new version of the compiler / runtime that is producing 
>> NSConstantString instances with UTF-16 data.  I have currently disabled a 
>> lot of the NSConstantString optimisations, on the basis of ‘make it work 
>> then make it fast’ and I’m still seeing quite a lot of test failures.  The 
>> most recent ones seem to come from the fact that GSUnicodeString’s 
>> implementation of rangeOfComposedCharacterSequenceAtIndex: calls 
>> rangeOfSequence_u(), which returns a different range to NSString’s 
>> implementation.
>> 
>> I have ls (an GSUnicodeString) and indianLong (a UTF-16 NSConstantString) 
>> from the NSString/test00.m. If I call -getCharacters:range: on both, then I 
>> get the same set of characters for [indianLong length] characters.  This is 
>> as expected.  When searching for indianLong in ls, it is not found.  
>> Sticking in a lot of debugging code, I eventually tracked it down to this 
>> disagreement and when I comment out GSUnicodeString’s implementation of 
>> rangeOfComposedCharacterSequenceAtIndex: so that it uses the superclass 
>> implementation then this test passes.
>> 
>> Please can someone who understands these bits of exciting unicode logic take 
>> a look and see if there’s any reason for the disagreement?
> 
> I am surely no expert here, but I had a quick look at the code and the two 
> algorithms seem to be very similar. The only difference is the set of code 
> points that the characters get compared to. NSString uses [NSCharacterSet 
> nonBaseCharacterSet], which looks correct to me. On the other hand GSString 
> uses uni_isnonsp(), which I would read as "non spacing“ but is never 
> explained. The code here is as follows: 
> 
> BOOL
> uni_isnonsp(unichar u)
> {
>  /*
>   * Treating upper surrogates as non-spacing is a convenient solution
>   * to a number of issues with UTF-16
>   */
>  if ((u >= 0xdc00) && (u <= 0xdfff))
>    return YES;
> 
> // FIXME check is uni_cop good for this
>  if (GSPrivateUniCop(u))
>    return YES;
>  else
>    return NO;
> }
> 
> As a side effect this should handle the upper surrogates correctly, but not 
> the lower and I have no idea what GSPrivateUniCop does, even after looking at 
> the code various times. OK, it is a binary search on uni_cop_table, but what 
> is in that table?

I don't think we have an expert on this (I didn't write the original unicode 
stuff), but I was able to find out what GSPrivateUniCop() is ... it's checking 
for combining characters (diacriticvals etc), so it's use makes sense here if 
the idea is to detect runs of 16bit values which are part of the same 
'user-perceived character' (important where the 16bit components of that 
character cross a range boundary).

I suspect neither uni_isnonsp() nor nonBaseCharacterSet is the correct 
mechanism though.

One thing that I do know from looking at cop.h is that the uni_isnonsp[() code, 
depending on cop.h is using out of date information;
The NSCharacterSet info is periodically regenerated from the latest unicode 
standard (I re-did that quite recently), but the headers in 
Source/Additions/unicode are not.  The cop.h file is definitely missing 
information and I'd therefore consider them other headers untrustworthy.  It 
would be good to ensure that we machine-generate all unicode related 
information rather than depending on header fiels which were (presumably) built 
by hand at some point.

Also, reading the unicode documentation it's clear that, even of the cop.h data 
was complete, it's only a subset of the characters in the nonBaseCharacterSet, 
and those other categories should be included ... so we should not be using 
GSPrivateUniCop(), and we should be using nonBaseCharacterSet to get all the 
follow-on characters in a multipart user-perceived character.

On the other hand, the earlier comment also looks correct:
  /*
   * Treating upper surrogates as non-spacing is a convenient solution
   * to a number of issues with UTF-16
   */

I checked, and nonBaseCharacterSet does not include the surrogate pair 
characters.

So I think we actually need to fix *both*  cases to 
a. check for the second half of a surrogate pair and
b. check for a non-base character

Probably the best thing to do would be to fix uni_isnonsp() to use 
nonBaseCharacterSet and have the NSString code call uni_insnonsp().

Incidentally, I agree that we want ChangeLog entries (git seems very 
unfriendly/poor at presenting such information) ... 
If it's trivial for David to generate them from git commit entries, I'd rather 
have the time spent on writing a few comments be devoted to that trivial change 
instead.

[Prev in Thread]

Current Thread

[Next in Thread]

GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:, David Chisnall, 2018/04/07
- Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:, David Chisnall, 2018/04/07
- Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:, Fred Kiefer, 2018/04/08
  - Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:, Richard Frith-Macdonald <=
  - Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:, David Chisnall, 2018/04/08

Prev by Date: Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:
Next by Date: Re: New ABI NSConstantString
Previous by thread: Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:
Next by thread: Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:
Index(es):
- Date
- Thread