emacs-devel


Re: I created a faster JSON parser


From: Herman, Géza
Subject: Re: I created a faster JSON parser
Date: Sat, 09 Mar 2024 12:08:54 +0100


Eli Zaretskii <eliz@gnu.org> writes:

From: Herman, Géza <geza.herman@gmail.com>
Cc: Herman Géza <geza.herman@gmail.com>,
 emacs-devel@gnu.org
Date: Fri, 08 Mar 2024 21:22:13 +0100
Yes, it seems that EMACS_UINT is good for my purpose, thanks for
the suggestion.

Are you sure you need the unsigned variety? If EMACS_INT fits the bill, then it is a better candidate, since unsigned arithmetic has its quirks.

Yes, I think it's better to use unsigned: read the sign, then parse the digits as an unsigned number, and apply the sign once at the end. If the number is parsed with its sign, every character needs an additional step (the sign has to be applied to each digit).
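
To illustrate the approach described above, here is a minimal sketch. It is not the code from the patch: the function name parse_int64 is hypothetical, and int64_t/uint64_t stand in for EMACS_INT/EMACS_UINT so that the example is self-contained. It also ignores JSON details such as leading zeros, fractions and exponents.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not the code from the patch.  Parse an optional
   '-' followed by decimal digits from S and store the result in *OUT.
   Return false on empty input, a non-digit character, or overflow.  */
bool
parse_int64 (const char *s, int64_t *out)
{
  bool negative = (*s == '-');
  if (negative)
    s++;
  if (*s == '\0')
    return false;

  /* Largest magnitude allowed: 2^63 for negative numbers,
     2^63 - 1 otherwise.  */
  uint64_t limit = (uint64_t) INT64_MAX + (negative ? 1 : 0);
  uint64_t value = 0;

  for (; *s != '\0'; s++)
    {
      if (*s < '0' || *s > '9')
        return false;
      uint64_t digit = (uint64_t) (*s - '0');
      /* Reject the digit if value * 10 + digit would exceed LIMIT.  */
      if (value > (limit - digit) / 10)
        return false;
      /* The digit loop itself never looks at the sign.  */
      value = value * 10 + digit;
    }

  /* Apply the sign once, at the end.  INT64_MIN is handled separately
     because its magnitude does not fit in int64_t.  */
  if (negative)
    *out = (value == limit) ? INT64_MIN : -(int64_t) value;
  else
    *out = (int64_t) value;
  return true;
}
```

Because the accumulator is unsigned, the overflow check is a single comparison per digit, and the only signed operation is the final negation.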

Also, I see that json-parse-string calls some utf8 encoding related
function before parsing, but json-parse-buffer doesn't (and it
doesn't do anything encoding-related in the callback, it just
calls memcpy).

This is a part I was never happy about. But, as I say above, we can
get to handling these rare cases later.

I think this is an additional benefit of my parser: this feature can be added to it more easily than to jansson. I'm even tempted to say that we could just remove the UTF-8 checking from my code, and then Emacs's encoding method should work right out of the box.

Or we could say that the UTF-8 handling should stay as is. As far as I understand, without the check, a JSON input containing a sequence that is invalid UTF-8 but still valid in Emacs's character representation would go undetected. Seen that way, checking for UTF-8 encoding errors shouldn't be the job of the JSON parser at all; it belongs around the I/O handling, which is in a position to know that the JSON stream must contain only valid UTF-8.

Or, since the JSON specification explicitly says that the allowed character range is 0x20 .. 0x10ffff, the current solution is fine, because allowing anything outside this range is actually against the JSON rules.
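
For concreteness, a strict check of this kind could look like the sketch below. It is an illustration under assumptions, not the code from the patch: decode_json_char is a hypothetical name, and the choice to also reject overlong encodings and UTF-8-encoded surrogates follows the usual rules for strict UTF-8 decoding. The special handling a JSON parser gives to '"' and '\' inside strings is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the patch.  Decode one UTF-8
   sequence from the LEN bytes at P.  Return the number of bytes
   consumed and store the code point in *CP, or return 0 if the bytes
   are not strictly valid UTF-8 or the code point lies outside
   0x20 .. 0x10FFFF.  */
size_t
decode_json_char (const unsigned char *p, size_t len, uint32_t *cp)
{
  if (len == 0)
    return 0;

  uint32_t c = p[0];
  size_t n;      /* total length of the sequence in bytes */
  uint32_t min;  /* smallest code point that needs N bytes */

  if (c < 0x80)
    { n = 1; min = 0; }
  else if ((c & 0xE0) == 0xC0)
    { n = 2; min = 0x80; c &= 0x1F; }
  else if ((c & 0xF0) == 0xE0)
    { n = 3; min = 0x800; c &= 0x0F; }
  else if ((c & 0xF8) == 0xF0)
    { n = 4; min = 0x10000; c &= 0x07; }
  else
    return 0;  /* stray continuation byte or invalid lead byte */

  if (len < n)
    return 0;  /* truncated sequence */

  for (size_t i = 1; i < n; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return 0;  /* not a continuation byte */
      c = (c << 6) | (p[i] & 0x3F);
    }

  if (c < min)
    return 0;  /* overlong encoding */
  if (c >= 0xD800 && c <= 0xDFFF)
    return 0;  /* surrogates are not valid in UTF-8 */
  if (c < 0x20 || c > 0x10FFFF)
    return 0;  /* outside the range allowed in a JSON string */

  *cp = c;
  return n;
}
```

A parser that wanted to accept Emacs's extended character representation would relax the last three checks, which is where the trade-off discussed above shows up in the code.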

Once again, we can extend the parser for codepoints outside of the
Unicode range later.  For now, it's okay to reject them with a
suitable error.

OK, cool, I added Qjson_utf8_decode_error to indicate decoding errors.

How can we proceed further? This is the current state of the patch: https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch

I think I did everything that was asked for, except Po Lu's comment about parentheses, because I still don't know what to parenthesize and what not to. I saw a lot of "a + x * y" kinds of expressions in the Emacs codebase without any parentheses. Are the exact rules documented somewhere?


