Re: LYNX-DEV pre-announcing a new Lynx SGML.c parser


From: Foteos Macrides
Subject: Re: LYNX-DEV pre-announcing a new Lynx SGML.c parser
Date: Mon, 21 Apr 1997 20:59:55 -0500 (EST)

Klaus Weide <address@hidden> wrote:
>Exciting news (well, you may disagree...),
>
> I have finished modifying the first stage "SGML" parsing in lynx
>to be somewhat closer to a real SGML parser.  Essentially, I have
>extended the per-tag information (given in HTMLDTD.c) to include
>more of the content model info of a real DTD, and done away with the
>special treatment of some tags in SGML.c.  end_element and also
>start_element in SGML.c now do partial stack wind-downs, depending
>on whether an element is "allowed" to close another one (and, in some
>cases, whether the other element's end tag can be legally omitted).
>
> This delivers to the next stage (HTML.c) a series of
>HTML_{start,end}_element which are always correctly ordered, for all
>elements which are not declared as SGML_EMPTY, and I have removed the 
>SGML_EMPTY flags from a number of tags that were specially treated
>before, including P and (recently) FORM.  Note that I haven't made
>*any* changes to HTML.c to accommodate the changes in SGML.c and
>HTMLDTD.c.  It works with the unchanged HTML.c, which is great and
>shows that these modules have remained reasonably independent of
>each other; it does however not always give identical results
>(screen appearance) even for valid HTML, which shows that sometimes
>HTML.c is relying on specific hacks for specific elements in SGML.c
>and the old "DTD" (for example, declaring P as SGML_EMPTY *and*
>converting </P> to <P>).
>
> I would like to have HTML.c in a form that it could deal equally with
>being called from the modified SGML.c parser, as well as from the
>old-style parser (with a, possibly increasing, number of hacks).  This
>would allow testing of recovery heuristics with the new parser and
>comparison with the old way, without each time having to modify HTML.c.
>
> Fote, I would appreciate your help here :).  It would help if you at
>least did not make changes to HTML.c that depend on new hacks introduced
>in SGML.c and the HTMLDTD.  (I am not saying that you *did* make such 
>changes recently; this is just a just-in-case request.  I still haven't
>checked whether the recent me->inUnderline changes fall in this
>category.  Your clarification sounded a bit like it, but I am not sure
>so will have to check the code.)
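
        The partial stack wind-down described above can be sketched
roughly as follows.  This is a minimal illustration only, not the code
in SGML.c or HTMLDTD.c: the tag set, the F_EMPTY and F_END_OMIT flags,
the closes_on_start mask, and the log buffer (standing in for the calls
delivered to HTML.c) are all hypothetical names chosen for the example.
It also winds down only from the top of the stack, which is simpler
than what a real recovery heuristic would need.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical tag set; not the real HTMLDTD.c table. */
enum tag { T_UL, T_LI, T_P, T_B, T_BR, T_COUNT };

#define F_EMPTY    1u  /* no content and no end tag (in the spirit of SGML_EMPTY) */
#define F_END_OMIT 2u  /* end tag may legally be omitted */

struct tag_info {
    const char *name;
    unsigned flags;
    unsigned closes_on_start;  /* bitmask of open elements this start tag may close */
};

static const struct tag_info dtd[T_COUNT] = {
    [T_UL] = { "UL", 0,          1u << T_P },
    [T_LI] = { "LI", F_END_OMIT, (1u << T_LI) | (1u << T_P) },
    [T_P]  = { "P",  F_END_OMIT, 1u << T_P },
    [T_B]  = { "B",  0,          0 },
    [T_BR] = { "BR", F_EMPTY,    0 },
};

struct parser {
    enum tag stack[32];  /* currently open elements */
    int depth;
    char log[512];       /* ordered trace of start (+) and end (-) events */
};

/* Append one event to the trace. */
static void emit(struct parser *p, const char *prefix, enum tag t)
{
    size_t used = strlen(p->log);
    snprintf(p->log + used, sizeof p->log - used, "%s%s ", prefix, dtd[t].name);
}

/* Close the innermost open element. */
static void pop(struct parser *p)
{
    emit(p, "-", p->stack[--p->depth]);
}

void start_element(struct parser *p, enum tag t)
{
    /* Wind down: close open elements, innermost first, for as long as
       this start tag is allowed to close the one on top of the stack. */
    while (p->depth > 0 &&
           (dtd[t].closes_on_start & (1u << p->stack[p->depth - 1])))
        pop(p);
    emit(p, "+", t);
    if (!(dtd[t].flags & F_EMPTY)) {   /* empty elements are never pushed */
        assert(p->depth < 32);
        p->stack[p->depth++] = t;
    }
}

void end_element(struct parser *p, enum tag t)
{
    int i = p->depth - 1;
    /* Search downward for the matching open element, stepping only over
       elements whose own end tag may be omitted. */
    while (i >= 0 && p->stack[i] != t &&
           (dtd[p->stack[i]].flags & F_END_OMIT))
        i--;
    if (i < 0 || p->stack[i] != t)
        return;                        /* stray or blocked end tag: ignore it */
    while (p->depth > i)               /* close everything down to the match */
        pop(p);
}
```

        With a zero-initialized struct parser, feeding the events for
<UL><LI><P>x<LI>y</UL> leaves "+UL +LI +P -P -LI +LI -LI -UL " in the
log, so every start has a correctly nested end even though the source
omitted </P> and </LI>.  For <P><B></P>, the end tag is ignored in this
sketch because B's end tag may not be omitted.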

        I don't know the date of the FOTEMODS code to which you are
referring.  The last time I changed HTMLDTD.c with corresponding
mods in HTML.c was on the 18th, when I added tags for soft
hyphenation, and support for the WBR Netscapism.  Those are simple
mods (within the context of the current code, which already implements
truly soft hyphenation), and the tags are, and should be, SGML_EMPTY.


> (Also, I know and accept that you don't want to be considered an
>"active developer" at this point.  However, as long as you are often
>the first to make required and/or useful changes, and make them 
>available, you'll have to accept that your mods continue to be at least
>an important source of input for our development code :).  Given that,
>my request above could help cut down on my [not your] time.) 

        I can just do bug fixes, if I encounter any or people report
them, for the lynx2-7-1+FOTEMODS as it presently stands, until you've
assessed your parsing strategy.  It's not obvious to me how well
it can deal with the "tag and attribute soup" non-HTML found
on the Web as it's become, at least based on the strategy as you've
described it, but you may as well enjoy yourself trying it, and decide
for yourself whether it's a promising approach.


> I will make the code available as a more-experimental-than-usual update
>to the devel code, as soon as I have considered some other misc. unrelated
>changes.  Still without adapting HTML.c, 'cause I want this to get out
>the door now, and would like people to test it... The first goal then is
>to reproduce Lynx's current behavior (as far as it is correct :) ) for
>valid HTML; tweaking the recovery heuristics for invalid HTML will come
>later.  I am not sure whether there is any screwed-up HTML out there where
>my approach *already* gives better results, or whether it finally can be
>made sophisticated enough to generally improve treatment of bad HTML (over
>that already done by Fote's latest hacks).  Maybe a combination of 
>approaches will finally give best results.

        Well, for example, that recently posted bad HTML with both
explicitly and functionally interdigitated "container" elements is
rendered and displayed by lynx2-7-1+FOTEMODS exactly as Christian
intended.  How is it handled with your mods?  How about that awful
businesswire page?  But even if, perchance, such non-HTML should
work better with my hacks, I must admit I have very mixed feelings
about making Lynx that much of a tag and attribute soup non-HTML
handler.  So, again, feel free to ignore my mods in any devel code
intended for an eventual "formal" release.

                                Fote

=========================================================================
 Foteos Macrides            Worcester Foundation for Biomedical Research
 address@hidden         222 Maple Avenue, Shrewsbury, MA 01545
=========================================================================