From: Eli Zaretskii
Subject: bug#70007: [PATCH] native JSON encoder
Date: Fri, 29 Mar 2024 09:04:21 +0300

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 28 Mar 2024 21:59:38 +0100
> Cc: casouri@gmail.com,
>  70007@debbugs.gnu.org
> 
> 27 mars 2024 kl. 20.05 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> >>> This rejects unibyte non-ASCII strings, AFAU, in which case I suggest
> >>> to think whether we really want that.  E.g., why is it wrong to encode
> >>> a string to UTF-8, and then send it to JSON?
> >> 
> >> The way I see it, that would break the JSON abstraction: it transports 
> >> strings of Unicode characters, not strings of bytes.
> > 
> > What's the difference?  AFAIU, JSON expects UTF-8 encoded strings, and
> > whether that is used as a sequence of bytes or a sequence of
> > characters is in the eyes of the beholder: the bytestream is the same,
> > only the interpretation changes.
> 
> Well no -- JSON transports Unicode strings: the JSON serialiser takes a 
> Unicode string as input and outputs a byte sequence; the JSON parser takes a 
> byte sequence and returns a Unicode string (assuming we are just interested 
> in strings).
> 
> That the transport format uses UTF-8 is unrelated;

It is not unrelated.  A JSON stream is AFAIK supposed to have strings
represented in UTF-8 encoding.  When a Lisp program produces a JSON
stream, all that should matter to it is that any string there is a
valid UTF-8 sequence; where and how that sequence was obtained is of
secondary importance.

> if the user hands an encoded byte sequence to us then it seems more likely 
> that it's a mistake.

We don't know that.  Since Emacs lets Lisp programs produce unibyte
UTF-8 encoded strings very easily, a program could do just that, for
whatever reasons.  Unless we have very serious reasons not to allow
UTF-8 sequences produced by something other than the JSON serializer
itself (and I think we don't), we should not prohibit it.  The Emacs
spirit is to give bad Lisp programs enough rope to hang themselves if
that allows legitimate programs to do their job more easily and
flexibly.
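
For instance, something along these lines (just a sketch; whether the
last form is accepted for non-ASCII input is exactly what we are
discussing):

  ;; encode-coding-string returns a unibyte string:
  (multibyte-string-p (encode-coding-string "naïve" 'utf-8)) ; => nil
  ;; a program could plug that straight into an object to serialize:
  (json-serialize `(:text ,(encode-coding-string "naïve" 'utf-8)))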

> After all, it cannot have come from a received JSON message.

It could have, if it was encoded by the calling Lisp program.  It
could also have been received from another source, in unibyte form
that is nonetheless valid UTF-8.  If we force non-ASCII strings to be
multibyte, Lisp programs will be unable to take a unibyte UTF-8 string
received from an external source and plug it directly into an object
to be serialized into JSON; instead, they will have to decode the
string, then let the serializer encode it back -- a clear waste of CPU
cycles.
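
Something like this, where encode-coding-string merely stands in for
data that arrived already encoded from a process or the network:

  (let ((bytes (encode-coding-string "résumé" 'utf-8)))
    ;; forced to decode, only for the serializer to encode it back:
    (json-serialize `(:name ,(decode-coding-string bytes 'utf-8))))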

> I think it was just another artefact of the old implementation. That code 
> incorrectly used encode_string_utf_8 even on non-ASCII unibyte strings and 
> trusted Jansson to validate the result. That resulted in a lot of wasted work 
> and some strange strings getting accepted.

I'm not talking about the old implementation.  I was not completely
happy with it, either, and in particular with its insistence on
signaling errors due to encoding issues.  I think this is not our
business in this case: the responsibility for submitting a valid UTF-8
sequence, when we get a unibyte string, is on the caller.

> While it's theoretically possible that there are users with code relying on 
> this behaviour, I can't find any evidence for it in the packages that I've 
> looked at.

Once again, my concern is not about some code that expects us to
encode UTF-8 byte sequences -- doing that is definitely not TRT.  What
I would like to see is that unibyte strings are passed through
unchanged, so that valid UTF-8 strings will be okay, and invalid ones
will produce invalid JSON.  This is better than signaling errors,
IMNSHO, and in particular is more in line with how Emacs handles
unibyte strings elsewhere.

> > I didn't suggest to decode the input string, not at all.  I suggested
> > to allow unibyte strings, and process them just like you process
> > pure-ASCII strings, leaving it to the caller to make sure the string
> > has only valid UTF-8 sequences.
> 
> Users of this raw-bytes-input feature (if they exist at all) previously had 
> their input validated by Jansson. While mistakes would probably be detected 
> at the other end I'm not sure it's a good idea.

Why not?  Once again, if we get a unibyte string, the onus is on the
caller to verify it's valid UTF-8, or suffer the consequences.
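
A careful caller can do that check itself; a rough sketch (the name is
hypothetical, and there may well be cheaper ways):

  (defun my-valid-utf-8-p (bytes)
    ;; Invalid sequences decode into raw-byte characters, whose
    ;; codepoints lie above the Unicode range.
    (not (seq-some (lambda (c) (> c #x10FFFF))
                   (decode-coding-string bytes 'utf-8))))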

> >  Forcing callers to decode such
> > strings is IMO too harsh and largely unjustified.
> 
> We usually force them to do so in most other contexts. To take a random 
> example, `princ` doesn't work with encoded strings. But it's rarely a problem.

There are many examples to the contrary.  For example, primitives that
deal with file names can accept both multibyte and unibyte encoded
strings.
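
E.g., as I understand it, with a UTF-8 file-name coding system both of
these name the same file:

  (file-exists-p "/tmp/naïve.txt")
  (file-exists-p (encode-coding-string "/tmp/naïve.txt" 'utf-8))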

> Let's see how testing goes. We'll find a solution no matter what, 
> pass-through or separate slow-path validation, if it turns out that we really 
> need to after all.

OK.  FTR, I'm not in favor of validating unibyte strings; I just
suggest that we treat them like plain ASCII: pass them through without
any validation, leaving the validation to the callers.




