From: Herman, Géza
Subject: Re: I created a faster JSON parser
Date: Sat, 09 Mar 2024 12:08:54 +0100
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Herman, Géza <geza.herman@gmail.com>
>> Cc: Herman Géza <geza.herman@gmail.com>, emacs-devel@gnu.org
>> Date: Fri, 08 Mar 2024 21:22:13 +0100
>>
>> Yes, it seems that EMACS_UINT is good for my purpose, thanks for
>> the suggestion.
>
> Are you sure you need the unsigned variety? If EMACS_INT fits the
> bill, then it is a better candidate, since unsigned arithmetics has
> its quirks.
Yes, I think it's better to use unsigned: read the sign, and then parse the number as unsigned, and then apply the sign at the end. If the number is parsed with its sign, it needs an additional step at each character (the sign needs to be applied to each digit).
>> Also, I see that json-parse-string calls some utf8 encoding related
>> function before parsing, but json-parse-buffer doesn't (and it
>> doesn't do anything encoding related thing in the callback, it just
>> calls memcpy).
>
> This is a part I was never happy about. But, as I say above, we can
> get to handling these rare cases later.
I think this is an additional benefit of my parser: this feature can be added to it more easily than to jansson. I'm even tempted to say that we could just remove the utf-8 checking from my code, and then Emacs's encoding method should work right out of the box.
Or we could say that utf-8 handling should stay as it is. As far as I understand, if the JSON contains an invalid utf-8 sequence which is nevertheless valid in Emacs's character representation, the problem won't be detected. So checking for utf-8 encoding errors shouldn't be the job of the JSON parser; it belongs around IO handling, which is in a position to know that the JSON stream itself must contain only valid utf-8.
Or, since the JSON specification explicitly says that the allowed character range is 0x20 .. 0x10ffff, the current solution is fine, because allowing anything outside of this range is actually against the JSON rules.
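For reference, the range rule mentioned above can be sketched as a small predicate. This is only an illustration, not code from the patch, and the name json_unescaped_char_p is hypothetical. It follows the "unescaped" production of the RFC 8259 string grammar, which covers 0x20 .. 0x10ffff except for the quotation mark and the backslash (those two must be escaped inside a string).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: return true if code point CP may appear
   unescaped inside a JSON string, per the RFC 8259 grammar
   (unescaped = %x20-21 / %x23-5B / %x5D-10FFFF).  */
bool
json_unescaped_char_p (uint32_t cp)
{
  if (cp < 0x20 || cp > 0x10FFFF)
    return false;		/* control characters, or beyond Unicode */
  return cp != '"' && cp != '\\';	/* these two must be escaped */
}
```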
> Once again, we can extend the parser for codepoints outside of the
> Unicode range later. For now, it's okay to reject them with a
> suitable error.
OK, cool, I added Qjson_utf8_decode_error to indicate decoding errors.
How can we proceed further? This is the current state of the patch: https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch
I think I did everything that was asked for, except Po Lu's parenthesis-related comment, because I still don't know what to parenthesize and what not to. I saw a lot of "a + x * y" kind of expressions in the Emacs codebase without any parentheses. Are the exact rules documented somewhere?