bug-guile

bug#20109: Incompatible API change in 2.0 series for string port encoding


From: David Kastrup
Subject: bug#20109: Incompatible API change in 2.0 series for string port encoding
Date: Wed, 18 Mar 2015 13:32:55 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.50 (gnu/linux)

Mark H Weaver <address@hidden> writes:

> David Kastrup <address@hidden> writes:
>
>> Mark H Weaver <address@hidden> writes:
>>
>>> This hack of giving Guile a buffer containing UTF-8, but claiming that
>>> it is Latin-1, is not good.  It will cause Guile to see non-ASCII
>>> characters as garbage.
>>
>> For one thing we are talking about an external file here that is
>> mainly parsed by LilyPond.  LilyPond provides sensible pinpointing of
>> UTF-8 encoding errors, something which GUILE cannot do with its UTF-8
>> representation since it has no transparent or reproducible
>> representation of bad bytes.  Emacs uses overlong encodings for 0-127
>> to represent badly encoded bytes (which includes any overlong
>> sequences) in the range 128-255, making 128-255 encode as patterns
>> 0xc0 0x80 to 0xc1 0xbf.
>
> I intend to add a similar mechanism to Guile, but it is not yet done.

I think it would be pretty important since it makes it possible to treat
problems at those points in processing where it makes most sense.
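
To make the Emacs-style mapping quoted above concrete, here is a minimal sketch in plain C.  It only follows the byte ranges given in the quoted text (raw bytes 128-255 become the pairs 0xc0 0x80 to 0xc1 0xbf); the function names are hypothetical and this is neither Emacs nor Guile code.

```c
#include <stddef.h>

/* Encode a raw (badly encoded) byte in the range 128..255 as a
   two-byte overlong pattern in 0xC0 0x80 .. 0xC1 0xBF.
   Returns 0 on success, -1 if BYTE is plain ASCII.
   Hypothetical helper, for illustration only.  */
int
raw_byte_encode (unsigned char byte, unsigned char out[2])
{
  if (byte < 0x80)
    return -1;
  out[0] = 0xC0 | ((byte >> 6) & 0x01);  /* 0xC0 or 0xC1 */
  out[1] = 0x80 | (byte & 0x3F);         /* 0x80 .. 0xBF */
  return 0;
}

/* Inverse mapping: recover the raw byte from such a pair.
   Returns the byte (128..255), or -1 if IN is not one of the
   reserved overlong patterns.  */
int
raw_byte_decode (const unsigned char in[2])
{
  if ((in[0] & 0xFE) != 0xC0 || (in[1] & 0xC0) != 0x80)
    return -1;
  return 0x80 | ((in[0] & 0x01) << 6) | (in[1] & 0x3F);
}
```

Because a strict UTF-8 decoder never produces lead bytes 0xc0 or 0xc1 for valid input, these pairs are free to carry the offending bytes through unchanged, so that the error can be reported later by whichever layer is best placed to do so.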

However, it would also seem important to have GUILE handle UTF-8
strings.  At present, its only native string representations are what
it calls "latin-1" and likely "UTF-32", which does not make much sense
given that its string ports are unconditionally UTF-8.

Concatenating a string from smaller pieces sequentially via string
operations is O(n^2), so string ports are a natural way to assemble
large strings.  They are also nice for reading from strings.  Not
requiring conversions for most of that would be nice.
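
The O(n^2) claim can be illustrated with a small sketch in plain C that counts copied bytes.  It is only a model under stated assumptions: the first function stands in for repeated string concatenation (a fresh string per step), the second for a growable buffer of the kind an output string port might use.  Neither is Guile's actual implementation, and both function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Build COUNT copies of PIECE by allocating a fresh string at every
   step, as repeated concatenation does.  Returns the number of bytes
   copied; *OUT receives the result (caller frees) or NULL on
   allocation failure.  */
size_t
build_by_concat (const char *piece, size_t count, char **out)
{
  size_t plen = strlen (piece), len = 0, copied = 0;
  char *acc = calloc (1, 1);

  for (size_t i = 0; acc != NULL && i < count; i++)
    {
      char *next = malloc (len + plen + 1);
      if (next != NULL)
        {
          memcpy (next, acc, len);               /* recopy everything so far */
          memcpy (next + len, piece, plen + 1);  /* new piece plus NUL */
          copied += len + plen;
          len += plen;
        }
      free (acc);
      acc = next;
    }
  *out = acc;
  return copied;
}

/* Same result, but appending into a buffer that doubles in size.  */
size_t
build_by_buffer (const char *piece, size_t count, char **out)
{
  size_t plen = strlen (piece), len = 0, cap = 16, copied = 0;
  char *buf = malloc (cap);

  for (size_t i = 0; buf != NULL && i < count; i++)
    {
      if (len + plen + 1 > cap)
        {
          while (len + plen + 1 > cap)
            cap *= 2;
          char *grown = realloc (buf, cap);
          if (grown == NULL)
            {
              free (buf);
              buf = NULL;
              break;
            }
          buf = grown;
          copied += len;  /* assume the worst case: realloc moved the data */
        }
      memcpy (buf + len, piece, plen);
      copied += plen;
      len += plen;
    }
  if (buf != NULL)
    buf[len] = '\0';
  *out = buf;
  return copied;
}
```

For 1000 pieces of two bytes each, the first function copies 1,001,000 bytes while the second stays within a small multiple of the 2000-byte result.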

>>> However, if you insist on doing this, I would
>>> suggest using a bytevector input port instead, like this: (untested)
>>>
>>>   char *buf = c_str ();
>>>   SCM bv = scm_c_make_bytevector (strlen (buf) + 1);
>>>   strcpy (SCM_BYTEVECTOR_CONTENTS (bv), buf);
>>>   str_port_ = scm_open_bytevector_input_port (bv, SCM_UNDEFINED);
>>
>> address@hidden:/usr/local/tmp/guile$ git grep scm_open_byte_vector_input_port v2.0.11
>> address@hidden:/usr/local/tmp/guile$ git grep scm_open_byte_vector_input_port origin/stable-2.0
>> address@hidden:/usr/local/tmp/guile$
>
> You have misspelled the name of the function.  The following (untested)
> code should work on Guile 2.0.5 or later:
>
>    char *buf = c_str ();
>    size_t len = strlen (buf);
>    SCM bv = scm_c_make_bytevector (len);
>    memcpy (SCM_BYTEVECTOR_CONTENTS (bv), buf, len);
>    str_port_ = scm_open_bytevector_input_port (bv, SCM_UNDEFINED);

One would expect that I'd be able to do a simple copy&paste of a
function name.  Sorry for messing this up.

Yes, this looks like it should indeed provide a better match of
"encoding intentions" to our original code.  I'll have to see whether
I can make this approach work with the rest of our code.

I somehow missed that r6rs ports were more than just a compatibility
wrapper written in Scheme.

-- 
David Kastrup




