
Re: input.cpp, read_size() error reporting


From: G. Branden Robinson
Subject: Re: input.cpp, read_size() error reporting
Date: Thu, 2 Apr 2020 08:31:09 +1100
User-agent: NeoMutt/20180716

Hi Ingo,

tl;dr: I'm still here and find your dissection of the issue cogent and
reasonable.  I just didn't and don't have the cycles to invest in the
refactoring of read_size() that I was beginning to intuit was necessary.
I'm more concerned with the other patch I proposed, regarding the living
fossil of \sN where N is sometimes, but not always, 2 digits.  I invite
your commentary on it.

Very long response follows.

At 2020-04-01T18:13:50+0200, Ingo Schwarze wrote:
> > I reverted my recent changes to this file because Ingo hates them
> > and emailed me to tell me so.
> 
> Please note that i'm not trying to bash Branden.  Quite to the
> contrary, i very much appreciate that he is working on improved
> error reporting: in my presentations at international BSD conferences,
> i often said that error reporting is important and needs to be
> accurate and as clear as possible without being excessively noisy,
> for the benefit of users.

Yes, I think we share similar values here.

> Also note that i'm not blaming him for committing a patch that had
> a number of problems.  It does sometimes happen that something looks
> like an obvious improvement to a developer so they just commit it,
> and it only turns out later that it isn't quite right yet.  That
> can easily be fixed by reverting, then reconsidering, which is
> exactly what Branden did here: no lasting damage done.  Of course,
> if you are unsure whether something is right, asking for code review
> before committing is better; but in particular in a small project
> lacking manpower, sometimes being bold and just moving ahead with
> improvements may be the right thing to do.

groff mailing list participation, like its development, seems to proceed
in fits and starts.  It is difficult to predict the quantity of latent
resources available for code review.  That's just where the project is
right now, as far as I can tell.

> > IMO read_size() needs a refactor
> 
> I just scrutinized the code and i must say i find that particular
> function quite clear and straightforward.  To me, it feels like
> James Clark did a decent job here in 1991, which also stood the
> test of time and needed few fixes over the years.  The few that
> were needed came from Werner Lemberg, as far as i looked.

Yes, and I don't mean to denigrate James or Werner, who I intuit are
both better programmers than I am; what makes me itch about read_size()
and about the parser generally is that it is not obvious how many input
tokens it is acceptable to discard when a parse goes wrong because the
input is malformed.  This is of course not a novel problem nor unique to
groff; parser state recovery in error conditions is an old and notorious
problem.  I'm pretty sure I first learned about it from Kernighan & Pike
1984, _The Unix Programming Environment_.

Part of my frustration arises from a fear that (1) this area is
underspecified and therefore that (2) _any_ proposed modification is
going to meet with resistance due to conservatism and habit.  My
frustration is compounded by the fact that (2) isn't always an
unreasonable stance, especially with "legacy" systems.  If *roff systems
are to have a future, they must cope with a corpus of existing
documents.  Perhaps not perfectly, but well enough.

Now is a good time to point to an IETF draft I encountered recently by
M. Thomson, "The Harmful Consequences of Postel's Maxim"[1].  *roff
interpreters had a design principle that input streams should be handled
on a best-effort basis, rather than "fail fast, fail hard".

The problem with a best-effort basis is that it can end up turning
implementation details, dubious decisions, and even bugs, into de facto
standards.  I think the recent \s39 vs. \s40 debacle is a fantastically
illustrative example.

At this point I'm going to snip your pseudocode summary and many of your
test case analyses not because I think they are bad (I don't), but
because I think they're fine and I don't feel I can supply further
worthwhile commentary, and this mail is already quite long.

> Then again, if there are reasonable ideas for making it even simpler
> and even more readable, no objection to that, but so far, i don't
> see how, or in which way the design of the function might be
> defective.

I have no criticism at all of how it handles well-formed input.  I don't
have any ideas at all about how I could possibly improve it there.

My concern is with error handling, and doing that right, in my view,
means deciding a lot of tedious edge cases which as far as I know have
never been documented anywhere.  As noted above, my ideal is to fail
fast and fail hard.  In this specific context, that means that as soon
as an EOF or invalid byte is encountered in an \s escape
sequence, throw a diagnostic (as helpfully as possible, including the
offending token), and get the hell out of size-escape-parsing state,
re-interpreting the illegal token in the enclosing context.  Frequently,
we won't be in copy mode, so a newline would be interpreted as a word
break (if filling), an alphanumeric will be interpreted as text to be
rendered at the current point size, and so forth.
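
To make that policy concrete, here is a toy model of what I mean;
every name in it is invented for illustration, and groff's real
tokenizer is of course far hairier:

#include <ctype.h>
#include <stdio.h>
#include <string>

// Toy model of "fail fast, fail hard"; everything here is invented
// for illustration and bears no relation to input.cpp's internals.
struct Tokenizer {
  std::string input;
  size_t pos = 0;
  int peek() const {
    return pos < input.size() ? (unsigned char) input[pos] : EOF;
  }
  void advance() {
    if (pos < input.size())
      pos++;
  }
};

// Parse a decimal point size.  On malformed input, diagnose (naming
// the offending token) and return 0 WITHOUT consuming it, so the
// caller re-interprets it in the enclosing context.
static int read_size_toy(Tokenizer &t, int *x)
{
  int c = t.peek();
  if (c == EOF || !isdigit(c)) {
    if (c == EOF)
      fprintf(stderr, "error: expected point size, got end of input\n");
    else
      fprintf(stderr, "error: expected point size, got '%c'\n", c);
    return 0;  // the bad token stays on the input stream
  }
  *x = 0;
  while (isdigit(t.peek())) {
    *x = *x * 10 + (t.peek() - '0');
    t.advance();  // consume only what we accept
  }
  return 1;
}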

That sounds simple and easy.  As I understand read_size(), just not
calling tok.next() again suffices to leave the current token on the
input stack, and returning 0 will get us out of
point-size-escape-parsing state.

However...

$ nroff
.pl 1v
\sABC
troff: <standard input>:2: warning: numeric expression expected (got 'B')
C

...that's not where we are today.  We lost not just one but two tokens!

And someone is going to want us to keep it this way forever because of
some damned legacy document.

If I _didn't_ care about groff's ability to render legacy documents,
especially ones personally important to me like "Volume 2" Unix manuals,
CSTRs, and CACM journal articles written before the waters of TeX
inundated the earth, I would have no problem rejecting such protests.

But I do care, so my values are in tension and I am frustrated.  Even if
we were able to shunt all the irritating special parses into
compatibility mode, as I have proposed to do with \s39/\s40, it will
still make the parser code more complex and therefore harder to
understand and less maintainable, which are almost synonyms for "ugly"
to me.

I am grateful that my \s39/\s40 special case is about as minimally
complexity-adding as C code gets.

But I feel certain that worse cases are in store.

Another failure of software engineering compounds the pain of this
situation: we lack comprehensive test cases and an in-tree corpus of
model (not necessarily ideal!) real-world documents--not written by our
own implementors--against which to regression-test.  Remedies to the
latter are largely foreclosed by effectively infinite copyright
durations and proprietary licensing schemes.

So if my commit message sounded piqued, please understand that in this
little exercise I managed to encounter a confluence of the things
that irritate me most about software development.

[...]
> In particular, note that when encountering a token that is invalid in
> the given context, as far as i can see, the function consistently
> discards that token

Yeah, I'm not sure that's the right thing to do, given *roff languages'
tradition of "best-effort" rendering.  But without a formal spec, what
are we to do?  Well, I do have a concrete idea, as it turns out.

> and consistently reports the error.

...not _perfectly_.  These three cases have silent token discards
without diagnostic:

\sA
\s[+
\s[-

[some test cases]
> Results in "cannot use newline as a starting delimiter".
> This might profit from saying that we are talking about \s.

I'm not certain, and I have not yet steeled my nerves to look again, but
I don't think the function that throws this diagnostic knows its calling
context with that much specificity.  It would have to be passed in.
That means extending the interface.
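
Something like the following sketch is what I have in mind; the
function, its signature, and the caller are all hypothetical, not
anything input.cpp has today:

#include <stdio.h>

// Hypothetical sketch: pass the calling context down so the shared
// delimiter check can name the construct being parsed.
static bool start_delimiter_ok(int c, const char *context)
{
  if (c == '\n') {
    fprintf(stderr,
            "error: cannot use newline as a starting delimiter in %s\n",
            context);
    return false;
  }
  return true;
}

// A caller parsing a delimited \s argument might then write:
//   if (!start_delimiter_ok(c, "point size escape sequence"))
//     return 0;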

> Results in "missing number".
> This might profit from saying that the invalid token is 'A', '+',
> '-', newline, or newline, respectively.

Yeah, that's the direction I was trying to head with my patches.

> Results in "bad digit in point size".
> This might profit from saying that the bad digit is newline,
> newline, '+', '-', newline, or 'x', respectively.

I think that's the specific case I actually caught, albeit
imperfectly--you found '\%', which I didn't think to check.

> >     \s[++]
> >     \s[--]
> 
> Results in "numeric expression expected (got ']')"
> That sounds quite good already, i think.

The form and content are great.  Failing to reject "++" and "--" as
nonsense is not.
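
If we were specifying this afresh, I would expect the sign handling to
go roughly like this; illustrative only, and nothing like the actual
code path through the numeric expression parser:

#include <stdio.h>

// Illustrative only: accept at most one leading sign, diagnosing and
// rejecting nonsense like "++" or "--".  On success, *inc is +1, -1,
// or 0 (no sign seen).
static bool read_sign(const char **s, int *inc)
{
  *inc = 0;
  while (**s == '+' || **s == '-') {
    if (*inc != 0) {
      fprintf(stderr, "error: repeated sign in point size expression\n");
      return false;  // fail fast; the second sign stays unconsumed
    }
    *inc = (**s == '+') ? 1 : -1;
    ++*s;
  }
  return true;
}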

>       \s#1
> 
> Results in "missing closing delimiter".
> This might profit from saying that the missing delimiter is '#'.

Where are (almost) arbitrary delimiters on this form of escape
documented as accepted?  I'm only accustomed to them in, say, .tl
requests.  I don't remember this as a general syntactical feature from
either CSTR #54 or groff(7).

> I'm not sure i follow: the current code - after your patch got
> reverted - appears to report all errors, as far as i can see.
> What exactly do you consider "missed"?  I admit though that in
> some cases, the error messages could probably provide more detail.

I think I provided a much clearer statement of what I view as the
defects above.

> > and sometimes characters get eaten (instead of sent to output)
> > because of the lack of error-checking in read_size().
> 
> If a token is invalid at some point, it (almost?) always gets eaten,
> not just in this function.  I think that is a basic design principle
> of how roff(7) parsing is supposed to work.

I think that's a guess on your part.  And I don't blame you for
guessing.  I think the fault lies with "best-effort parsing" and "being
liberal in what you accept".  In my view those principles amount to the
implementor rejecting only input that is "too hard" for their experience
level or (always undocumented) cost/benefit analysis, which means many
accidental aspects of an implementation get de facto standardized, when
that implementation is successful.

> Also, even after scrutinizing the code and doing some testing,
> i fail to see any lack of error-checking.  Can you say which
> error exactly isn't checked for, i.e. which exact input results
> in which faulty output or deficient error reporting, and what
> exactly you would like to see instead?

I believe I did that above, but I acknowledge that I may be proceeding
from a different design philosophy than you are.

> Below is an example of a patch showing how error messages might be
> improved, in one specific respect.  My testing is limited so far:
> 
>    $ printf '\\s(\x02\n' | ./test-groff -T ascii       
>   troff: <standard input>:1: error: bad digit in point size: character code 2
>    $ printf '\\s(x\n' | ./test-groff -T ascii
>   troff: <standard input>:1: error: bad digit in point size: 'x'
>    $ printf '\\s(\\%%\n' | ./test-groff -T ascii 
>   troff: <standard input>:1: error: bad digit in point size
>    $ printf '\\s(\n' | ./test-groff -T ascii     
>   troff: <standard input>:1: error: bad digit in point size

This all looks fine to me.  I suppose we don't need to mention that it's
a point size _escape_ rather than a request, but I would lean toward
doing that if the location throwing the diagnostic is on a code path
that only escapes take.

Another personal commandment I have is to never key the same diagnostic
string literal into the source code twice.
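
Concretely, I mean something like the following; the constant name is
mine, and error() and input_char_description() are used just as in
your patch below:

// One definition of the diagnostic text, so the wording can never
// drift between report sites.
static const char bad_digit_msg[] = "bad digit in point size";

// At a site that knows the offending character:
//   error("%1: %2", bad_digit_msg, input_char_description(c));
// At a site that does not:
//   error("%1", bad_digit_msg);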

> Of course, it would be possible to add more clauses like if (tok.eof())
> else if (tok.newline()) and so on, but given the wide variety of
> tokens that exist, i have a feeling this could easily be overdone.

Yes, I started to feel that way too, and the idea of defining a local
preprocessor macro to achieve this flitted through my head.  I intuited,
perhaps wrongly, that this would meet resistance.

I mean something like (please disregard C/C++ style issues):

#define FETCH_AND_VALIDATE_TOKEN(expected_stuff, error_template) do { \
   tok.next(); \
   /* yadda yadda yadda */ \
} while(0)

static int read_size(...)
{
...
   /* many uses of FETCH_AND_VALIDATE_TOKEN() */
...
}

#undef FETCH_AND_VALIDATE_TOKEN

where "expected_stuff" is a C++ expression to match on desirable values
of the token, and "error_template" is a string so we have maximal
context to tell the user what went wrong if the expectation fails.

The nice thing about embedding a macro rather than calling to another
function is that we don't have to pass any state around.
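
Fleshed out a little more, it might look like this; tok.description()
is my guess at a human-readable token name, and csdigit() is the
character-class test I believe the parser already uses:

#define FETCH_AND_VALIDATE_TOKEN(expected_stuff, error_template) do { \
  tok.next(); \
  if (!(expected_stuff)) { \
    /* report with context, then fail fast: leave parsing state */ \
    /* without consuming anything further */ \
    error((error_template), tok.description()); \
    return 0; \
  } \
} while (0)

// Each fetch inside read_size() then reads, for example:
//   FETCH_AND_VALIDATE_TOKEN(csdigit(tok.ch()) || tok.ch() == '(',
//                            "expected point size, got %1");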

Was I wrong to conjecture that folks on this list would find the above
gross?

> Oh, is there a better function to use here than
> input_char_description(), that is maybe able to describe more tokens
> in a human-readable way?  I don't know the groff parsing code well
> enough to judge that.

Nor I.  Maybe one of the veterans can speak to this point.

Or I'll come back to it at some point maybe.  I still have a half-done
tbl(1) (the page, not the tool) rewrite to kick out of the nest--I've
been stalled on it since February.

> P.S.
> If you like this patch, if it survives full testing, and if it gets
> committed, one could reduce the indentation of the final code in
> the function, afterwards.  In general, it feels clearer to do error
> handling up front, then let normal processing proceed once errors
> have been dealt with, rather than handling errors in an else clause
> as if they were an afterthought, indenting all the normal processing
> one level for each kind of error that needs to be handled.

Yes, I generally endorse that approach.  Get a person used to it and
you can seduce them into using languages that support function
preconditions and postconditions as syntactic features.  :P
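
For the record, the contrast you describe looks schematically like
this (not read_size() itself):

// Errors handled in else clauses: the normal path sinks one
// indentation level per error case.
int nested_style(bool ok1, bool ok2)
{
  if (ok1) {
    if (ok2)
      return 1;  // normal processing, two levels deep
    else
      return 0;  // handle error B
  }
  else
    return 0;    // handle error A
}

// Errors dealt with up front: each guard returns early and normal
// processing proceeds unindented, as you describe.
int guarded_style(bool ok1, bool ok2)
{
  if (!ok1)
    return 0;  // error A, out of the way immediately
  if (!ok2)
    return 0;  // error B, likewise
  return 1;    // normal processing at top level
}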

> diff --git a/src/roff/troff/input.cpp b/src/roff/troff/input.cpp
> index fb990bdf..6ecd2d52 100644
> --- a/src/roff/troff/input.cpp
> +++ b/src/roff/troff/input.cpp
> @@ -5106,7 +5106,13 @@ static int read_size(int *x)
>        return 0;
>      }
>    }
> -  if (!bad) {
> +  if (bad) {
> +    if (c)
> +      error("bad digit in point size: %1", input_char_description(c));
> +    else
> +      error("bad digit in point size");
> +    return 0;
> +  }
>      switch (inc) {
>      case 0:
>        if (val == 0) {
> @@ -5131,11 +5137,6 @@ static int read_size(int *x)
>        *x = 1;
>      }
>      return 1;
> -  }
> -  else {
> -    error("bad digit in point size");
> -    return 0;
> -  }
>  }
>  
>  static symbol get_delim_name()

The above looks like low churn for good return.  LGTM.

We should still work out the test cases, especially if you expect me to
write them.  ;-)  Did you notice I didn't use any bashisms in the last
one?  :-P

Regards,
Branden

[1] https://tools.ietf.org/html/draft-thomson-postel-was-wrong-00


