bug#26058: utf16->string and utf32->string don't conform to R6RS

bug-guile

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#26058: utf16->string and utf32->string don't conform to R6RS

From:	Mark H Weaver
Subject:	bug#26058: utf16->string and utf32->string don't conform to R6RS
Date:	Mon, 15 Oct 2018 00:57:41 -0400
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)

Hi Taylan,

address@hidden (Taylan Ulrich "Bayırlı/Kammer") writes:

> Andy Wingo <address@hidden> writes:
>
>> Adopting the behavior is more or less fine.  If it can be done while
>> relying on the existing behavior, that is better than something ad-hoc
>> in a module.

In general, I agree with Andy's sentiment that it would be better to
avoid redundant BOM handling code, and moreover, I appreciate his
reluctance to apply a fix without careful consideration of our existing
BOM semantics.

However, as Taylan discovered, Guile does not provide a mechanism to
specify a default endianness of a UTF-16 or UTF-32 port in case a BOM is
not found.  I see no straightforward way to implement these R6RS
interfaces using ports.

We could certainly add such a mechanism if needed, but I see another
problem with this approach: the expense of creating and later collecting
a bytevector port object would be a very heavy burden to place on these
otherwise fairly lightweight operations.  Therefore, I would prefer to
avoid that implementation strategy for these operations.

Although BOM handling for ports is quite complex with many subtle points
to consider, detecting a BOM at the beginning of a bytevector is so
trivial that I personally have no objection to this tiny duplication of
logic.

Therefore, my preference would be to adopt code similar to that proposed
by Taylan, although I believe it can, and should, be further simplified:

> diff --git a/module/rnrs/bytevectors.scm b/module/rnrs/bytevectors.scm
> index 9744359f0..997a8c9cb 100644
> --- a/module/rnrs/bytevectors.scm
> +++ b/module/rnrs/bytevectors.scm
> @@ -69,7 +69,9 @@
>             bytevector-ieee-double-native-set!
>  
>             string->utf8 string->utf16 string->utf32
> -           utf8->string utf16->string utf32->string))
> +           utf8->string
> +           (r6rs-utf16->string . utf16->string)
> +           (r6rs-utf32->string . utf32->string)))
>  
>  
>  (load-extension (string-append "libguile-" (effective-version))
> @@ -80,4 +82,52 @@
>        `(quote ,sym)
>        (error "unsupported endianness" sym)))
>  
> +(define (read-bom16 bv)
> +  (let ((c0 (bytevector-u8-ref bv 0))
> +        (c1 (bytevector-u8-ref bv 1)))
> +    (cond
> +     ((and (= c0 #xFE) (= c1 #xFF))
> +      'big)
> +     ((and (= c0 #xFF) (= c1 #xFE))
> +      'little)
> +     (else
> +      #f))))

We should gracefully handle the case of an empty bytevector, returning
an empty string without error in that case.

Also, we should use a single 'bytevector-u16-ref' operation to check for
the BOM.  Pick an arbitrary endianness for the operation (big-endian?),
and compare the resulting integer with both #xFEFF and #xFFFE.  That
way, the code will be simpler and more efficient.  Note that our VM has
dedicated instructions for these multi-byte bytevector accessors, and
there will be fewer comparison operations as well.  Similarly for the
utf32 case.

What do you think?

> +(define r6rs-utf16->string
> +  (case-lambda
> +    ((bv default-endianness)
> +     (let ((bom-endianness (read-bom16 bv)))
> +       (if (not bom-endianness)
> +           (utf16->string bv default-endianness)
> +           (substring/shared (utf16->string bv bom-endianness) 1))))

Better to use plain 'substring' here, I think.  The machinery of shared
substrings is more expensive, and unnecessary in this case.

Otherwise, it looks good to me.  Would you like to propose a revised
patch?

Andy, what do you think?

       Mark

[Prev in Thread]

Current Thread

[Next in Thread]

bug#26058: utf16->string and utf32->string don't conform to R6RS, Mark H Weaver <=

Prev by Date: bug#32580: Setting variables %load-should-autocompile and GUILE_AUTO_COMPILE in ~/.guile doesn't prevent compiling
Next by Date: bug#21883: unnecessary bit shifting range limits
Previous by thread: bug#32580: Setting variables %load-should-autocompile and GUILE_AUTO_COMPILE in ~/.guile doesn't prevent compiling
Next by thread: bug#33044: Invalid read access of chars of wide string in scm_seed_to_random_state
Index(es):
- Date
- Thread