Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSeq

gnustep-dev

From:	Fred Kiefer
Subject:	Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:
Date:	Sun, 8 Apr 2018 10:38:55 +0200


> On 07.04.2018 at 20:51, David Chisnall <address@hidden> wrote:
> 
> I am testing out a new version of the compiler / runtime that is producing 
> NSConstantString instances with UTF-16 data.  I have currently disabled a lot 
> of the NSConstantString optimisations, on the basis of ‘make it work then 
> make it fast’ and I’m still seeing quite a lot of test failures.  The most 
> recent ones seem to come from the fact that GSUnicodeString’s implementation 
> of rangeOfComposedCharacterSequenceAtIndex: calls rangeOfSequence_u(), which 
> returns a different range to NSString’s implementation.
> 
> I have ls (a GSUnicodeString) and indianLong (a UTF-16 NSConstantString) 
> from NSString/test00.m. If I call -getCharacters:range: on both, then I 
> get the same set of characters for [indianLong length] characters.  This is 
> as expected.  When searching for indianLong in ls, it is not found.  After 
> adding a lot of debugging code, I eventually tracked it down to this 
> disagreement: when I comment out GSUnicodeString’s implementation of 
> rangeOfComposedCharacterSequenceAtIndex: so that it uses the superclass 
> implementation, this test passes.
> 
> Please can someone who understands these bits of exciting unicode logic take 
> a look and see if there’s any reason for the disagreement?

I am certainly no expert here, but I had a quick look at the code and the two 
algorithms seem to be very similar. The only difference is the set of code 
points that the characters are compared against. NSString uses [NSCharacterSet 
nonBaseCharacterSet], which looks correct to me. GSString, on the other hand, 
uses uni_isnonsp(), which I would read as "non-spacing", although the name is 
never explained. The code is as follows: 

BOOL
uni_isnonsp(unichar u)
{
  /*
   * Treating upper surrogates as non-spacing is a convenient solution
   * to a number of issues with UTF-16
   */
  if ((u >= 0xdc00) && (u <= 0xdfff))
    return YES;

// FIXME check is uni_cop good for this
  if (GSPrivateUniCop(u))
    return YES;
  else
    return NO;
}

As a side effect this should handle the upper surrogates (0xdc00 to 0xdfff) 
correctly, but not the lower ones, and I have no idea what GSPrivateUniCop 
does, even after reading the code several times. OK, it is a binary search on 
uni_cop_table, but what is in that table?
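
To make the comparison concrete, here is a minimal sketch in plain C of how 
two range computations with the same shape can return different ranges when 
only the "non-base" test differs. Everything in it is an assumption made for 
illustration: demo_table, in_demo_table(), nonbase_narrow(), nonbase_wide() 
and composed_range() are hypothetical names, the table entries are invented, 
and nothing here claims to reproduce the real uni_cop_table, 
rangeOfSequence_u() or nonBaseCharacterSet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unichar;   /* one UTF-16 code unit */
typedef struct { size_t location; size_t length; } Range;
typedef bool (*NonBaseTest)(unichar);

/* HYPOTHETICAL table, sorted ascending; NOT the real uni_cop_table. */
static const unichar demo_table[] = {
  0x0300, 0x0301, 0x0308, 0x093c, 0x094d
};

/* Binary search over the sorted demo table. */
static bool in_demo_table(unichar u)
{
  size_t lo = 0;
  size_t hi = sizeof(demo_table) / sizeof(demo_table[0]);

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (demo_table[mid] == u)
        return true;
      if (demo_table[mid] < u)
        lo = mid + 1;
      else
        hi = mid;
    }
  return false;
}

/* Same shape as uni_isnonsp(): 0xdc00..0xdfff plus a table lookup. */
static bool nonbase_narrow(unichar u)
{
  if (u >= 0xdc00 && u <= 0xdfff)
    return true;
  return in_demo_table(u);
}

/* HYPOTHETICAL wider set: additionally counts U+093E, a spacing mark. */
static bool nonbase_wide(unichar u)
{
  return u == 0x093e || nonbase_narrow(u);
}

/* Walk back to the base unit, then forward over the non-base units. */
static Range composed_range(const unichar *s, size_t len, size_t index,
  NonBaseTest nonbase)
{
  size_t start = index;
  size_t end;
  Range r;

  assert(index < len);
  while (start > 0 && nonbase(s[start]))
    start--;
  end = start + 1;
  while (end < len && nonbase(s[end]))
    end++;
  r.location = start;
  r.length = end - start;
  return r;
}
```

For the units 0x0915 0x093e 0x0915 0x094d and index 0, composed_range() with 
nonbase_narrow gives {0, 1}, while nonbase_wide gives {0, 2}. A search that 
steps through one string using the first answer and through the other using 
the second could then fail to line up, even though -getCharacters:range: 
returns identical data for both.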
