lynx-dev

Re: LYNX-DEV error recovery for form parsing


From: Foteos Macrides
Subject: Re: LYNX-DEV error recovery for form parsing
Date: Mon, 07 Apr 1997 16:54:28 -0500 (EST)

Klaus Weide <address@hidden> wrote:
>On Sat, 5 Apr 1997, Foteos Macrides wrote:
>>[...]
>>      The current Lynx API uses *TWO* stack-based parsers, one in
>> SGML.c, and another in HTML.c.  The one in HTML.c stacks "container"
>> HTML elements (ones not declared SGML_EMPTY in HTMLDTD.c), and depends
>> on the SGML.c parser to enforce valid (*strictly* embedded and *never*
>> interdigitated) nesting of them.  That is why the SGML.c functions
>> substitute the "expected" end tags for "container" HTML elements before
>> invoking HTML.c functions.  If you break that, as in your patch, in
>> Laura's original patch, and in her more recent "BETTER SOLUTION"
>> patch, you throw the HTML.c stack out of whack.  
>
>Although I think this was not the case with Hynek's patch - if it had
>worked the way he intended.
>
>An example he gave was
>   <B><A HREF="something"></B>something</A>
>
>Regular Lynx SGML.c processing would treat that as (== pass it down to HTML.c
>as if it were)
>   <B><A HREF="something"></A>something</B>
>giving a link that cannot be selected.
>
>With Hynek's patch instead:
>   <B><A HREF="something">something</A>
>The </B> is ignored (the SGML.c parser's stack is not changed when
></B> is encountered), and when the </A> is detected B is still on the
>stack (possibly until the end of the document).  But at least this 
>doesn't create out-of-order calls to HTML_start_element/HTML_end_element.

        HTML.c will have pushed B then A onto its stack, and so
expects SGML_character() to call HTML_end_element() for A before
B.  In that particular case (depending, though, on what preceded the
snippet of markup), skipping the B end tag does what's expected, with
the innocuous (but ugly) consequence of leaving the subsequent text
underlined (which will be dealt with, in the worst case, by the
SGML_free() wind-down, or the HTML_free() "insurance check", to
make sure you don't bleed underlining into the next document).  But if
you are dealing with FORM markup and you skip a SELECT or TEXTAREA
end tag, you're courting disaster.
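
        To make the two-stack point concrete, here is a rough Python
model (not the actual SGML.c/HTML.c code; the names substitute_expected
and is_strictly_nested are made up for illustration).  It assumes the
simplified rule that any end tag closes whatever element is on top of
the stack, and then checks whether the resulting start/end calls are
strictly nested, as the second stack requires.

```python
# Hypothetical model: tokens are (kind, name) pairs, kind in
# {"start", "end", "text"}.  This is a sketch, not Lynx's real logic.
TOKENS = [
    ("start", "B"),
    ("start", "A"),
    ("end", "B"),          # misnested end tag from the example
    ("text", "something"),
    ("end", "A"),
]

def substitute_expected(tokens):
    """First stack: every end tag is replaced by the end tag of the
    element currently on top of the stack (assumed simplification)."""
    stack, calls = [], []
    for kind, name in tokens:
        if kind == "start":
            stack.append(name)
            calls.append(("start", name))
        elif kind == "end":
            if stack:  # an end tag with nothing open is dropped
                calls.append(("end", stack.pop()))
        else:
            calls.append(("text", name))
    return calls

def is_strictly_nested(calls):
    """Second stack: every end call must match the most recent
    still-open start call."""
    stack = []
    for kind, name in calls:
        if kind == "start":
            stack.append(name)
        elif kind == "end":
            if not stack or stack.pop() != name:
                return False
    return True

if __name__ == "__main__":
    print(substitute_expected(TOKENS))
    print(is_strictly_nested(substitute_expected(TOKENS)))
    print(is_strictly_nested(TOKENS))  # raw tokens passed straight through
```

In this model the substituted sequence closes A before B, so the
second stack stays consistent, while passing the raw end tags straight
through does not.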


>With Laura's "BETTER SOLUTION" patch (the first one was specific to FORM,
>but I think the principle was the same):
>   <B><A HREF="something"></B>something</A>
>I.e. generating calls to HTML_start_element/HTML_end_element in invalid
>order.  (Changing the order of stack elements, by using anything other than
>push or pop operations, of course makes the whole idea of having a
>stack structure pointless.)
>
>From Laura's description:
>"Strategy of fix:  If an end tag </xxx> is found that doesn't match the top
> element of the stack, search down the stack until you find a match.  If
> there's no match, ignore the end tag;"...
>
>Isn't this *first* part reasonable?  (just ignoring end tags that
>cannot possibly be right.)  It doesn't mess up the stacks (or so it seems
>to me).

        Guard against the trap of thinking that you can know reliably
what was intended by bad HTML, and of making mods based on particular
cases encountered, which could in fact set you off going 'round in
circles.  The seemingly spurious end tag is a classic case.  It might
in fact be that, or it might be a typo, e.g.:

        <H1>I am typo prone!</H2>

The current code would, fortuitously, correct that typo, whereas if you
search the SGML parser's stack for an H2 start tag, fail to find it, and
conclude that the end tag is spurious (should be ignored), you'd end up
making the header extend well into the document, perhaps to its end.
For that example, you end up with innocuous ugliness.  But if it were:

        <TEXTAREA>...</TEXTARA>

the rest of the document would be treated as the TEXTAREA content.
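
        The difference between the two recovery strategies can be
sketched in a few lines of Python.  This is only an illustrative model
under stated assumptions, not the Lynx code: the function name
open_elements_at_text is made up, "substitute" is modelled as "any end
tag closes the top of the stack", and only the "no match, ignore the
end tag" part of the other strategy is modelled (an end tag that does
match something deeper in the stack simply falls back to closing the
top element here).

```python
# Hypothetical model: tokens are (kind, name) pairs, kind in
# {"start", "end", "text"}.  Returns, for each text run, the tuple of
# elements that were still open when that text was seen.
def open_elements_at_text(tokens, ignore_unmatched):
    stack, seen = [], []
    for kind, name in tokens:
        if kind == "start":
            stack.append(name)
        elif kind == "end":
            if ignore_unmatched and name not in stack:
                continue       # end tag judged spurious and skipped
            if stack:
                stack.pop()    # close the element that was expected
        else:
            seen.append((name, tuple(stack)))
    return seen

HEADER_TYPO = [
    ("start", "H1"), ("text", "I am typo prone!"), ("end", "H2"),
    ("text", "body text"),
]
TEXTAREA_TYPO = [
    ("start", "TEXTAREA"), ("text", "..."), ("end", "TEXTARA"),
    ("text", "rest of document"),
]

if __name__ == "__main__":
    for tokens in (HEADER_TYPO, TEXTAREA_TYPO):
        print(open_elements_at_text(tokens, ignore_unmatched=False))
        print(open_elements_at_text(tokens, ignore_unmatched=True))
```

With ignore_unmatched=True the text following the mistyped end tag is
still inside H1 (or TEXTAREA) in this model; with substitution it is
not.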


>>[...]
>>      Be that as it may, appended is a patch set for v2.7.1 which
>> achieves what you and Laura are attempting, and without throwing
>> the HTML.c stack out of whack.  It is also available (as a
>> formhack.patch text file and in a formhack.zip) in:
>> 
>>         http://www.slcc.edu/lynx/fote/patches/
>> or:      ftp://www.slcc.edu/pub/lynx/fote/patches
>> 
>
>Another step in making Lynx's parsing more like that of the abovementioned 
>vendor's(s') products, unfortunately.

        This is directed to people who posted "We want reverse-engineering
for Netscape with freedom to stay ignorant about valid HTML and URL
syntax." retorts to Klaus' last sentence.

        Klaus is the only currently active developer who is not
only a skilled programmer extensively knowledgeable about the Lynx
code, but *ALSO* someone who is knowledgeable and keeps informed about
the relevant RFCs and IDs out of an inherent interest in quality
development of the Web as a whole.  Also, despite his initial
statement of non-intention, he has largely accepted the mantle
of de-facto coordinator.  He got involved with Lynx initially out
of a "general" interest in content negotiation, and became heavily
involved in large part out of an interest in "sound" charset/language
handling.

        Everyone involved in "active" Lynx development (actually
working on the code) is doing so as a "spare time hobby".  People
pursue hobbies when they are fun, and stop if they become more
a chore than fun.

        I got heavily involved with Lynx back when HTML 3.0 was
viable, because it was fun implementing its advanced features,
with the combined challenge of adapting them to a character cell
client, ahead of the GUI pack.  It became progressively less fun,
and more just a chore, as the name of the game progressively
became reverse-engineering for Lynx users who want freedom to
remain ignorant.

        Think twice about creating the same situation for Klaus.
        
                                Fote

=========================================================================
 Foteos Macrides            Worcester Foundation for Biomedical Research
 address@hidden         222 Maple Avenue, Shrewsbury, MA 01545
=========================================================================
