[Qemu-commits] [qemu/qemu] 2a9604: json: Fix lexer for lookahead charact


From: GitHub
Subject: [Qemu-commits] [qemu/qemu] 2a9604: json: Fix lexer for lookahead character beyond '\x...
Date: Tue, 25 Sep 2018 05:04:43 -0700

  Branch: refs/heads/master
  Home:   https://github.com/qemu/qemu
  Commit: 2a96042a8da60b625cc9dbbdab3b03cd7586e34f
      
https://github.com/qemu/qemu/commit/2a96042a8da60b625cc9dbbdab3b03cd7586e34f
  Author: Markus Armbruster <address@hidden>
  Date:   2018-09-24 (Mon, 24 Sep 2018)

  Changed paths:
    M qobject/json-lexer.c

  Log Message:
  -----------
  json: Fix lexer for lookahead character beyond '\x7F'

The lexer fails to end a valid token when the lookahead character is
beyond '\x7F'.  For instance, input

    true\xC2\xA2

produces the tokens

    JSON_ERROR     true\xC2
    JSON_ERROR     \xA2

This should be

    JSON_KEYWORD   true
    JSON_ERROR     \xC2
    JSON_ERROR     \xA2

instead.

The culprit is

    #define TERMINAL(state) [0 ... 0x7F] = (state)

It leaves [0x80..0xFF] zero, i.e. IN_ERROR.  Has always been broken.
Fix it to initialize the complete array.
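
The effect of the range bound can be reproduced in isolation.  The
following is a minimal sketch, not the code from qobject/json-lexer.c:
the names DEMO_ERROR, DEMO_KEYWORD, TERMINAL_7BIT, TERMINAL_8BIT and
demo_next_state are made up for illustration, and the `[a ... b]`
range designator is a GNU C extension (accepted by GCC and Clang).

```c
#include <stdint.h>

/* Hypothetical state numbers; 0 doubles as the error state, as in the text. */
enum { DEMO_ERROR = 0, DEMO_KEYWORD = 1 };

/* Old-style macro: only bytes 0x00..0x7F get an explicit entry. */
#define TERMINAL_7BIT(state) [0 ... 0x7F] = (state)
/* Fixed-style macro: every possible byte value gets an entry. */
#define TERMINAL_8BIT(state) [0 ... 0xFF] = (state)

/* Entries not named by an initializer are zero, i.e. DEMO_ERROR. */
static const uint8_t demo_buggy_row[256] = { TERMINAL_7BIT(DEMO_KEYWORD) };
static const uint8_t demo_fixed_row[256] = { TERMINAL_8BIT(DEMO_KEYWORD) };

/* Look up the next state for lookahead byte ch in one 256-entry row. */
static int demo_next_state(const uint8_t *row, unsigned char ch)
{
    return row[ch];
}
```

With the 7-bit row, a lookahead byte such as 0xC2 lands on the zero
entry and is treated as an error; with the 8-bit row it ends the token
like any other terminating character.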

Signed-off-by: Markus Armbruster <address@hidden>
Reviewed-by: Eric Blake <address@hidden>
Message-Id: <address@hidden>


  Commit: 852dfa76b85c5d23541377809aa4bcfb4fc037db
      
https://github.com/qemu/qemu/commit/852dfa76b85c5d23541377809aa4bcfb4fc037db
  Author: Markus Armbruster <address@hidden>
  Date:   2018-09-24 (Mon, 24 Sep 2018)

  Changed paths:
    M qobject/json-lexer.c

  Log Message:
  -----------
  json: Clean up how lexer consumes "end of input"

When the lexer isn't in its start state at the end of input, it's
working on a token.  To flush it out, it needs to transit to its start
state on "end of input" lookahead.

There are two ways to the start state, depending on the current state:

* If the lexer is in a TERMINAL(JSON_FOO) state, it can emit a
  JSON_FOO token.

* Else, it can go to IN_ERROR state, and emit a JSON_ERROR token.

There are complications, however:

* The transition to IN_ERROR state consumes the input character and
  adds it to the JSON_ERROR token.  The latter is inappropriate for
  the "end of input" character, so we suppress that.  See also recent
  commit a2ec6be72b8 "json: Fix lexer to include the bad character in
  JSON_ERROR token".

* The transition to a TERMINAL(JSON_FOO) state doesn't consume the
  input character.  In that case, the lexer normally loops until it is
  consumed.  We have to suppress that for the "end of input" input
  character.  If we didn't, the lexer would consume it by entering
  IN_ERROR state, emitting a bogus JSON_ERROR token.  We fixed that in
  commit bd3924a33a6.

However, simply breaking the loop this way assumes that the lexer
needs exactly one state transition to reach its start state.  That
assumption is correct now, but it's unclean, and I'll soon break it.
Clean up: instead of breaking the loop after one iteration, break it
after it reached the start state.
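
The loop condition described above can be sketched with a toy state
machine.  This is only an illustration under stated assumptions: the
states, struct demo_lexer, demo_step_eoi and demo_flush are
hypothetical, and the toy deliberately needs two transitions to get
back to its start state so that "stop after one iteration" and "stop
at the start state" differ.

```c
/* Hypothetical states: DEMO_WORD_DONE is an intermediate hop. */
enum demo_state { DEMO_START, DEMO_IN_WORD, DEMO_WORD_DONE };

struct demo_lexer {
    enum demo_state state;
    int tokens_emitted;
};

/* Perform one state transition on "end of input" lookahead. */
static enum demo_state demo_step_eoi(struct demo_lexer *lx)
{
    switch (lx->state) {
    case DEMO_IN_WORD:
        return DEMO_WORD_DONE;      /* first hop, nothing emitted yet */
    case DEMO_WORD_DONE:
        lx->tokens_emitted++;       /* second hop emits the token */
        return DEMO_START;
    default:
        return DEMO_START;
    }
}

/* Flush: keep stepping until the start state is reached,
 * instead of assuming exactly one transition is enough. */
static void demo_flush(struct demo_lexer *lx)
{
    while (lx->state != DEMO_START) {
        lx->state = demo_step_eoi(lx);
    }
}
```

A flush that stopped after a single iteration would leave this toy
lexer in DEMO_WORD_DONE with no token emitted.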

Signed-off-by: Markus Armbruster <address@hidden>
Reviewed-by: Eric Blake <address@hidden>
Message-Id: <address@hidden>


  Commit: c0ee3afa7fa2547b5766dd25e52ced292c204d4e
      
https://github.com/qemu/qemu/commit/c0ee3afa7fa2547b5766dd25e52ced292c204d4e
  Author: Markus Armbruster <address@hidden>
  Date:   2018-09-24 (Mon, 24 Sep 2018)

  Changed paths:
    M qobject/json-lexer.c
    M qobject/json-parser-int.h

  Log Message:
  -----------
  json: Make lexer's "character consumed" logic less confusing

The lexer uses macro TERMINAL_NEEDED_LOOKAHEAD() to decide whether a
state transition consumes the input character.  It returns true when
the state transition is defined with the TERMINAL() macro.  To detect
that, it checks whether input '\0' would have resulted in the same
state transition, and the new state is not IN_ERROR.

Why does that even work?  For all states, the new state on input '\0'
is either IN_ERROR or defined with TERMINAL().  If the state
transition equals the one we'd get for input '\0', it goes to IN_ERROR
or to the argument of TERMINAL().  We never use TERMINAL(IN_ERROR),
because it makes no sense.  Thus, if it doesn't go to IN_ERROR, it
must be defined with TERMINAL().

Since this isn't quite confusing enough, we negate the result to get
@char_consumed, and ignore it when @flush is true.

Instead of deriving the lookahead bit from the state transition, make
it explicit.  This is easier to understand, and a bit more flexible,
too.
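
One way to picture an explicit lookahead bit is a flag stored next to
the new state in each transition result.  The sketch below assumes
that encoding; DEMO_LOOKAHEAD, the state numbers and the helper
functions are hypothetical names, and a small function stands in for
the real transition table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical states for a lexer that only knows integers. */
enum { DEMO_START = 1, DEMO_IN_DIGITS = 2, DEMO_INTEGER = 3 };

/* Flag bit: set when the transition did NOT consume the input character. */
#define DEMO_LOOKAHEAD 0x80

/* Return the new state, possibly with DEMO_LOOKAHEAD set; 0 means "no rule". */
static uint8_t demo_transition(uint8_t state, unsigned char ch)
{
    bool digit = ch >= '0' && ch <= '9';

    if (state == DEMO_START && digit) {
        return DEMO_IN_DIGITS;                  /* consumes ch */
    }
    if (state == DEMO_IN_DIGITS) {
        if (digit) {
            return DEMO_IN_DIGITS;              /* consumes ch */
        }
        return DEMO_INTEGER | DEMO_LOOKAHEAD;   /* ch only ends the token */
    }
    return 0;
}

/* The "character consumed" decision is read off the flag directly. */
static bool demo_char_consumed(uint8_t entry)
{
    return !(entry & DEMO_LOOKAHEAD);
}

/* Strip the flag to recover the plain state number. */
static uint8_t demo_state_of(uint8_t entry)
{
    return (uint8_t)(entry & 0x7F);
}
```

The caller no longer has to compare against the transition for input
'\0'; it just tests one bit.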

Signed-off-by: Markus Armbruster <address@hidden>
Reviewed-by: Eric Blake <address@hidden>
Message-Id: <address@hidden>


  Commit: 0f07a5d5f1f484c9c334d52193617e89442da7c9
      
https://github.com/qemu/qemu/commit/0f07a5d5f1f484c9c334d52193617e89442da7c9
  Author: Markus Armbruster <address@hidden>
  Date:   2018-09-24 (Mon, 24 Sep 2018)

  Changed paths:
    M qobject/json-lexer.c
    M tests/qmp-test.c

  Log Message:
  -----------
  json: Nicer recovery from lexical errors

When the lexer chokes on an input character, it consumes the
character, emits a JSON_ERROR token, and enters its start state.  This
can lead to suboptimal error recovery.  For instance, input

    0123 ,

produces the tokens

    JSON_ERROR    01
    JSON_INTEGER  23
    JSON_COMMA    ,

Make the lexer skip characters after a lexical error until a
structural character ('[', ']', '{', '}', ':', ','), an ASCII control
character, '\xFE', or '\xFF'.

Note that we must not skip ASCII control characters, '\xFE', or '\xFF',
because docs/interop/qmp-spec.txt documents them as forcing the JSON
parser into a known-good state.

The lexer now produces

    JSON_ERROR    01
    JSON_COMMA    ,

Update qmp-test for the nicer error recovery: QMP now reports just one
error for input %p instead of two.  Also drop the newline after %p; it
was needed to tease out the second error.
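
The skipping rule can be sketched as a predicate plus a scan loop.
This is an illustration only: demo_ends_recovery and
demo_skip_after_error are hypothetical names, the real lexer is
table-driven rather than written like this, and "ASCII control
character" is assumed here to mean bytes below 0x20.

```c
#include <stdbool.h>
#include <stddef.h>

/* Does byte ch end error recovery, i.e. must it NOT be skipped? */
static bool demo_ends_recovery(unsigned char ch)
{
    switch (ch) {
    case '[': case ']': case '{': case '}': case ':': case ',':
        return true;                /* structural characters */
    case 0xFE: case 0xFF:
        return true;                /* bytes reserved for resynchronizing */
    default:
        return ch < 0x20;           /* assumption: control means below 0x20 */
    }
}

/* Starting at pos (just past the error token), return the index of the
 * first byte that ends recovery, or len if there is none. */
static size_t demo_skip_after_error(const unsigned char *buf, size_t len,
                                    size_t pos)
{
    while (pos < len && !demo_ends_recovery(buf[pos])) {
        pos++;
    }
    return pos;
}
```

For the input "0123 ," with the error token covering "01", scanning
from index 2 skips "23 " and stops at the comma.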

Signed-off-by: Markus Armbruster <address@hidden>
Reviewed-by: Eric Blake <address@hidden>
Message-Id: <address@hidden>
[Conflict with commit ebb4d82d888 resolved]


  Commit: 2ce4ee64c4fe0463c53a99955a3acdaa8a451136
      
https://github.com/qemu/qemu/commit/2ce4ee64c4fe0463c53a99955a3acdaa8a451136
  Author: Markus Armbruster <address@hidden>
  Date:   2018-09-24 (Mon, 24 Sep 2018)

  Changed paths:
    M qobject/json-lexer.c
    M qobject/json-parser-int.h

  Log Message:
  -----------
  json: Eliminate lexer state IN_ERROR

Signed-off-by: Markus Armbruster <address@hidden>
Reviewed-by: Eric Blake <address@hidden>
Message-Id: <address@hidden>


  Commit: 1e960b46024d468e76d2f42ddcfa5a9d521db492
      
https://github.com/qemu/qemu/commit/1e960b46024d468e76d2f42ddcfa5a9d521db492
  Author: Markus Armbruster <address@hidden>
  Date:   2018-09-24 (Mon, 24 Sep 2018)

  Changed paths:
    M qobject/json-lexer.c
    M qobject/json-parser-int.h

  Log Message:
  -----------
  json: Eliminate lexer state IN_WHITESPACE, pseudo-token JSON_SKIP

The lexer ignores whitespace like this:

         on whitespace      on non-ws   spontaneously
    IN_START --> IN_WHITESPACE --> JSON_SKIP --> IN_START
                  ^    |
                   \__/  on whitespace

This accumulates a whitespace token in state IN_WHITESPACE, only to
throw it away on the transition via JSON_SKIP to the start state.
Wasteful.  Go from IN_START to IN_START on whitespace directly,
dropping the whitespace character.
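
A toy version of the direct transition, as a rough sketch: struct
demo_lexer, demo_is_ws and demo_feed are hypothetical, the whitespace
set is an assumption, and ending a token is left out to keep the
example short.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_state { DEMO_START, DEMO_IN_TOKEN };

struct demo_lexer {
    enum demo_state state;
    char token[32];     /* accumulated token text, NUL-terminated */
    size_t len;
};

/* Assumed whitespace set: space, tab, carriage return, newline. */
static bool demo_is_ws(char ch)
{
    return ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n';
}

/* Feed one character to the toy lexer. */
static void demo_feed(struct demo_lexer *lx, char ch)
{
    if (lx->state == DEMO_START && demo_is_ws(ch)) {
        return;         /* start state -> start state, character dropped */
    }
    lx->state = DEMO_IN_TOKEN;
    if (lx->len + 1 < sizeof(lx->token)) {
        lx->token[lx->len++] = ch;      /* only token characters are stored */
        lx->token[lx->len] = '\0';
    }
}
```

Leading whitespace never reaches the token buffer, so there is nothing
to accumulate and discard.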

Signed-off-by: Markus Armbruster <address@hidden>
Reviewed-by: Eric Blake <address@hidden>
Message-Id: <address@hidden>


  Commit: f69d20fa8badbd6b515cc3d9e0a95b36f0410a46
      
https://github.com/qemu/qemu/commit/f69d20fa8badbd6b515cc3d9e0a95b36f0410a46
  Author: Peter Maydell <address@hidden>
  Date:   2018-09-25 (Tue, 25 Sep 2018)

  Changed paths:
    M qobject/json-lexer.c
    M qobject/json-parser-int.h
    M tests/qmp-test.c

  Log Message:
  -----------
  Merge remote-tracking branch 'remotes/armbru/tags/pull-qobject-2018-09-24' into staging

QObject patches for 2018-09-24

# gpg: Signature made Mon 24 Sep 2018 17:09:58 BST
# gpg:                using RSA key 3870B400EB918653
# gpg: Good signature from "Markus Armbruster <address@hidden>"
# gpg:                 aka "Markus Armbruster <address@hidden>"
# Primary key fingerprint: 354B C8B3 D7EB 2A6B 6867  4E5F 3870 B400 EB91 8653

* remotes/armbru/tags/pull-qobject-2018-09-24:
  json: Eliminate lexer state IN_WHITESPACE, pseudo-token JSON_SKIP
  json: Eliminate lexer state IN_ERROR
  json: Nicer recovery from lexical errors
  json: Make lexer's "character consumed" logic less confusing
  json: Clean up how lexer consumes "end of input"
  json: Fix lexer for lookahead character beyond '\x7F'

Signed-off-by: Peter Maydell <address@hidden>


Compare: https://github.com/qemu/qemu/compare/2f831d04985f...f69d20fa8bad
