emacs-devel

Re: creating unibyte strings


From: Stefan Monnier
Subject: Re: creating unibyte strings
Date: Fri, 22 Mar 2019 11:37:59 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)

[ Boy this discussion is really frustrating.  I should have just added
  the damn feature and moved on.  Now I'm stuck in this morass!  ]

>> But this has nothing to do with the modules API: it's not more tricky
>> than when doing it purely in Elisp.  Are you seriously suggesting we
>> deprecate unibyte strings altogether?
> We won't deprecate unibyte strings, but we decided long ago to
> minimize their use.

Minimizing their use doesn't mean that the places where they are used are
less important.  Sometimes what you need is a unibyte string and nothing
else will do.

It also doesn't explain why you want to make it extra cumbersome for
modules whereas Elisp can still do it conveniently.

>> Then I don't know what subtleties you're talking about.
>> Can you give some examples of the kinds of things you're thinking of?
> String concatenation, for one.  Regular expression search for another.
> And those are just the ones I thought about in the first 5 seconds.

I don't see in which way these are better hidden for multibyte strings
than they are for unibyte strings.

>> >> > Instead, how about doing that via vectors of byte values?
>> >> What's the advantage?  That seems even more convoluted: create a Lisp
>> >> vector of the right size (i.e. 8x the size of your string on a 64bit
>> >> system), loop over your string turning each byte into a Lisp integer
>> >> (with the reverted API, this involves allocation of an `emacs_value`
>> >> box), then pass that to `concat`?
>> > That's one way, but I'm sure I can come up with a simpler one. ;-)
>> I'm all ears.
> Provide an Emacs primitive for that, then at least some of the
> awkwardness is gone.

No matter the primitive you provide, it means that to build a unibyte
Elisp string out of a C char[], you're suggesting we go through an
extra copy that uses up 8x the memory.
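
To put a rough number on that 8x figure, here is a minimal sketch.  It
assumes one Lisp vector slot is one machine word, modeled as intptr_t,
and it ignores object headers.  The names lisp_word, widen_bytes,
vector_payload_bytes and unibyte_payload_bytes are hypothetical and are
not part of the module API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a Lisp vector slot is one machine word.  */
typedef intptr_t lisp_word;

/* Payload size when N byte values are stored one per vector slot.  */
size_t
vector_payload_bytes (size_t n)
{
  return n * sizeof (lisp_word);
}

/* Payload size when the same N bytes sit in a unibyte string.  */
size_t
unibyte_payload_bytes (size_t n)
{
  return n;
}

/* The extra copy: widen each byte of SRC into one word of DST.
   DST must have room for N words.  */
void
widen_bytes (const unsigned char *src, size_t n, lisp_word *dst)
{
  assert (n == 0 || (src != NULL && dst != NULL));
  for (size_t i = 0; i < n; i++)
    dst[i] = (lisp_word) src[i];
}
```

On a system where sizeof (lisp_word) is 8, the intermediate vector for
a 1 MB buffer needs about 8 MB of payload before the final string is
even allocated.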

With such inefficient interfaces, the whole idea of writing modules
becomes completely unattractive: better write a separate application and
communicate via pipes (then you can get unibyte strings in the natural
way).

> And/or use records.

I don't understand what you mean by "use records".

>> >> It's probably going to be even less efficient than going through utf-8
>> >> and back.
>> > I doubt that.  It's just an assignment.  And it's a rare situation
>> > anyway.
>> Why do you think it's rare?
> Because the number of Emacs features that require you to submit a
> unibyte string is very small.

Maybe rare in terms of number of lines of code that will want to do it.
But that doesn't mean rare in terms of number of times it'll be executed
for a specific user, so performance considerations should apply.

>> 2- the C side string contains text in latin-1, big5, younameit.
>>    The module API provides nothing convenient.  Should we force our
>>    module to link to C-side coding-system libraries to convert to utf-8
>>    before passing it on to the Elisp, even though Emacs already has all
>>    the needed facilities?  Really?
>
> Yes, really.  Why is that a problem?  libiconv exists on every
> platform we support, and is easy to use.  Moreover, if you just want
> to convert a native string into another native string, using Emacs
> built-in en/decoding machinery is inconvenient, because it involves
> more copying than necessary.

The idea is not to use Emacs as a C library for text conversion, but
that if you receive a latin-1 string and want to pass it to Emacs, it
makes a lot more sense to do:

    make_bytestring (s)

and later

    (decode-coding-string s 'latin-1)

than having to link with libiconv.
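
For comparison, here is roughly what the "convert on the C side first"
route looks like in the easiest possible case.  This is only a sketch:
latin1_to_utf8 is a hypothetical helper, and it works only because
Latin-1 code points map directly to U+0000..U+00FF.  Anything like big5
needs real conversion tables, which is where libiconv would come in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert LEN bytes of Latin-1 in SRC to UTF-8
   in DST.  DST must have room for 2 * LEN bytes.  Returns the number
   of bytes written.  */
size_t
latin1_to_utf8 (const unsigned char *src, size_t len, unsigned char *dst)
{
  assert (len == 0 || (src != NULL && dst != NULL));
  size_t out = 0;
  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = src[i];
      if (c < 0x80)
        /* ASCII is unchanged in UTF-8.  */
        dst[out++] = c;
      else
        {
          /* U+0080..U+00FF becomes a two-byte sequence.  */
          dst[out++] = (unsigned char) (0xC0 | (c >> 6));
          dst[out++] = (unsigned char) (0x80 | (c & 0x3F));
        }
    }
  return out;
}
```

Every module that receives non-UTF-8 text would have to carry code like
this (or a libiconv dependency) plus the extra buffer, only to hand the
result to an Emacs that already knows how to decode the original bytes.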

>> 3- The C side string contains binary data (say PNG images).
>>    What does "arrange for it to be UTF-8" even mean?
> Nothing, since in this case there's no meaning to "decoding".

My point exactly: what should be done instead?

The solution currently used for this existing case is to call make_string
on it (even though it's not a utf-8 string) and then pass it through
(encode-coding-string s 'utf-8) which is ridiculously inefficient
compared to what make_bytestring would do.


        Stefan



