[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: New ABI NSConstantString
From: |
David Chisnall |
Subject: |
Re: New ABI NSConstantString |
Date: |
Thu, 5 Apr 2018 20:33:47 +0100 |
This might be slightly confusing, because your mail client doesn’t seem to do
anything sane for quoting:
On 5 Apr 2018, at 20:09, Stefan Bidigaray <address@hidden> wrote:
>
> On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall <address@hidden> wrote:
> On 5 Apr 2018, at 17:27, Stefan Bidigaray <address@hidden> wrote:
> >
> > Hi David,
> > I forgot to make a comment when you originally posted the idea, and I think
> > this would be a great time to add my 2 cents.
> >
> > Regarding the structure:
> > * Would it not be better to add the flags bit field immediately after the
> > isa pointer? My thought here is that it can be checked for if different
> > versions of the structure exist. This is important for CoreBase since it
> > does not have the luxury of real classes.
>
> I’m concerned with structure padding here. Even on a 64-bit platform, we
> either need an 8-byte flags field (which is wasteful) or end up with 4 bytes
> of padding. With 128-bit pointers (which are probably coming sooner than you
> expect) we will end up with 12 bytes of padding if we have a 32-bit flags
> field followed by a pointer.
>
> Well, I was hoping there is a way we can define this structure so that it can
> be used directly in CoreBase, without having to call the toll-free bridging
> mechanism. If a 32-bit hash is used, could it be combined with the "flags"
> variable (see the structure I included at the end of this email)? I'm hoping
> to be able to have use the same constant strings without having to call the
> bridging mechanism. It's pretty slow and cumbersome.
Can you explain why CoreBase needs to store the hash as anything other than a
32-bit value that it can zero extend when returning a 64-bit value? It the
CoreFoundation and Foundation implementations of hash are compatible, then it
will currently be returning a 28-bit value in a 64-bit register, so I don’t
understand the issue here.
>
> By the way, I noticed there was not uint32_t flags in your original
> structure, making it 24 bytes in 32-bit CPUs.
>
> > * Would it be possible to make the hash variable a NSUInterger? The output
> > of -hash is an NSUInterger, and that would allow the value to be expanded
> > in the future.
>
> We can, though that would again increase the size quite noticeably. I think
> I’m happy with a 32-bit hash, because as rfm points out with a decent hash
> algorithm that basically gives us unique hashes.
>
> Sounds reasonable.
>
> > * Why have both count and length? Would it not make more sense to keep a
> > single variable here called count and define it as, "The count/number of
> > code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16
> > it would be the # of 16-bit codes. The Apple documentation states "The
> > number of UTF-16 code units in the receiver", making at least the ASCII and
> > UTF-16 numbers correct. The way I understand the current implementation,
> > the value for length would return the UTF-32 # of characters, which is
> > inconsistent with the docs.
>
> If a UTF-8 string contains multi-byte sequences, then the length of the
> buffer and the number if UTF-16 code units will be different. If we know the
> number of bytes, then we can use more efficient C standard library functions
> for things like comparisons, though that may not be important.
>
> I guess I'm still a bit confused about the meaning and/or different of the
> variables count and length.
One tells you the logical number of characters, the other the length of the
buffer in bytes. A lot of bytes-scanning functions are far more efficient if
they know the length up front, because they can then process one word at a time
until the last word.
> I know this is probably going to be rejected, but how about making constant
> string either ASCII or UTF-16 only? Scratching UTF-8 altogether? I know this
> would increase the byte count for most European languages using Latin
> characters, but I don't see the point of maintaining both UTF-8 and UTF-16
> encoding. Everything that can be done with UTF-16 can be encoded in UTF-8
> (and vise-versa), so how would the compiler pick between the two?
> Additionally, wouldn't sticking to just 1 of the 2 encoding simplify the code
> significantly?
There’s also the issue that -UTF8String is one of the most commonly used
methods on NSString, so if we represent something as UTF-16 internally then it
needs converting and returning in an autoreleased buffer, whereas with a UTF-8
string it can just return the pointer. On non-Windows platforms, -UTF8String
is the way of getting a string that you pass to pretty much any OS function.
>
> > * I would also think that it makes more sense to have the length/count
> > variable before the data pointer. I don't have a strong opinion about this
> > one, but it just makes more sense in my head.
>
> Again, this gives us more padding in the structure.
>
> Would it? Isn't sizeof (long) == sizeof (void *) in all 32 and 64-bit
> architectures (except WIN64)? I thought a long would not be padded any more
> than a pointer for most applications.
Not Win64, not on anything with larger than 64-bit pointers.
> >
> > Regarding the hash function:
> > Why are we using Murmur3 hash? I know it is significantly more efficient
> > than our current one-at-a-time approach, but how much better is it to
> > competing hash functions? Is there a bench mark out there comparing some of
> > the major ones? For example, how does it compare with lookup3 or
> > SpookyHash. If we are storing the hash in the string structure, the speed
> > of calculating the hash is not as important as the spread. Additionally,
> > Murmur3 seems ill suited if NSUInteger is used to store the hash value
> > since, as far as I could tell, it only outputs 32-bit and 128-bit hashes.
> > Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words
> > in the case of lookup3), as well.
>
> The size of the type doesn’t necessarily give us the range. We are
> completely free to give only a 32-bit or even 28-bit range within an
> NSUInteger (which is what we do now) and if we have good coverage. A good
> hash function has even distribution of entropy across all bits, so taking a
> 32-bit or 128-bit hash and truncating it is fine. That said, I’m happy to
> make the hash value 8 bytes on 64-bit platforms if this seems like a good use
> of bits.
>
> I’m not wedded to the idea of Murmur3. We do need to use the same hash for
> constant and non-constant strings, so execution speed is important. I’m
> somewhat tempted to suggest SHA256, because it’s fairly easy to accelerate
> with SSE and newer CPUs have full hardware offload for it. That said, the
> goal is not to mandate the use of the compiler-generated hash for constant
> strings, it’s to provide a space to store one that the compiler initialises
> to something sensible.
>
> Given the analysis I’ve done in the reply to Ivan, I think it’s worth
> consuming space to improve performance.
>
> I agree.
>
> So how about a structure like:
>
> struct {
> id isa; /* Class pointer */
> uint64_t flags;
> /* Flags bitfield:
> Low 2 bits, enum with values:
> 0: ASCII string
> 1: UTF-16 string
> 2 and 3: Reserved for future encodings
> (1<<2) to (1<<3): 0 for one-at-a-time; 1 for murmur hash; 2 and 3
> reserved for future hashes
> (1<<4) to (1<<15): Reserved for future compiler-defined flags
> (1<<16) to (1<<31): Reserved for use by the constant string class
> (I'm hoping this could hold the CFTypeID of a constant string so it can be
> identified by corebase)
> (1<<32) to (1<<63): hash
> */
> const char *data; /* Pointer to the buffer. ro_data section, so
> immutable. NULL-terminated */
> long count; /* Number of UTF-16 code units, not including the null
> terminator */
> }
I don’t see why we’d use a single uint64_t rather than a pair of uint32_ts and
I don’t like the ordering (it will be annoying to have to order the fields
differently on 128-bit pointer platforms). I’m not convinced that it’s worth
omitting the length to save 8 bytes per string. It’s probably also not
actually worth using longs for the length on 64-bit platforms, so both of these
should probably be 32 bits. 4GB of string literal seems a bit excessive (for
one thing, I doubt the compiler will be entirely happy with it, and I don’t
know happy linkers are with 4GB symbols…).
David
- Re: New ABI NSConstantString, (continued)
- Re: New ABI NSConstantString, Fred Kiefer, 2018/04/01
- Re: New ABI NSConstantString, David Chisnall, 2018/04/01
- Re: New ABI NSConstantString, Richard Frith-Macdonald, 2018/04/01
- Re: New ABI NSConstantString, David Chisnall, 2018/04/05
- Re: New ABI NSConstantString, Ivan Vučica, 2018/04/05
- Re: New ABI NSConstantString, David Chisnall, 2018/04/05
- Re: New ABI NSConstantString, Ivan Vučica, 2018/04/05
- Re: New ABI NSConstantString, Stefan Bidigaray, 2018/04/05
- Re: New ABI NSConstantString, David Chisnall, 2018/04/05
- Re: New ABI NSConstantString, Stefan Bidigaray, 2018/04/05
- Re: New ABI NSConstantString,
David Chisnall <=
- Re: New ABI NSConstantString, Stefan Bidigaray, 2018/04/05
- Re: New ABI NSConstantString, David Chisnall, 2018/04/06
- Re: New ABI NSConstantString, Stefan Bidigaray, 2018/04/06
- Re: New ABI NSConstantString, David Chisnall, 2018/04/07
- Re: New ABI NSConstantString, Ivan Vučica, 2018/04/07
- Re: New ABI NSConstantString, David Chisnall, 2018/04/07
- Re: New ABI NSConstantString, Richard Frith-Macdonald, 2018/04/07
- Re: New ABI NSConstantString, Ivan Vučica, 2018/04/07
- Re: New ABI NSConstantString, Stefan Bidigaray, 2018/04/07
Re: New ABI NSConstantString, David Chisnall, 2018/04/06