GNU recode 3.6: invalid HTML entity references (was: recode html..utf-8

bug-gnu-utils

From:	Kevin Rodgers
Subject:	GNU recode 3.6: invalid HTML entity references (was: recode html..utf-8 fails to convert "' " and other XML issues)
Date:	Wed, 14 Jan 2004 11:09:23 -0700

I wrote:
> Also, I think there should be a way to preserve "&" in text when it does
> not introduce an entity reference -- in particular, when a character
> reference is not terminated by a ";".

I've researched this further, and here's what the specs for SGML (plus
Annexes K and L); HTML 2.0, HTML 3.2, and HTML 4.01 (plus their
corresponding RFCs); and XML 1.0 (and thus XHTML 1.0) say:

1. HTML 2.0 - 4.01 are all defined as SGML applications, and in SGML
   documents "&" is not recognized as an entity reference delimiter
   unless it is immediately followed by a name start character.
   Similarly, "&#" is not recognized as a character reference delimiter
   unless it is followed by a name start character or a digit.

2. SGML defines name start characters as lowercase or uppercase letters,
   but XML (and thus XHTML) adds underscore and colon.  The
   XML additions aren't relevant, though, because that spec also
   requires "&" to be interpreted as a markup delimiter (except within
   comments, processing instructions, and CDATA sections).

3. HTML 4.01 is based on SGML Technical Corrigendum 2 (Annexes K and L),
   which allows hexadecimal character references.  But as with SGML,
   "&#x" is only recognized as a delimiter if it is immediately followed
   by a hexadecimal digit (a decimal digit, lowercase letters "a-f", or
   uppercase letters "A-F").  In XML generally, the "x" must be
   lowercase (from SGML Annex L.2 and its own spec), but the HTML 4.01
   SGML declaration allows "X" as well, so it's not clear whether XHTML
   allows "X" or just "x".

4. Besides numeric character references, SGML defines named function
   character references: "&#RE;", "&#RS;", and "&#SPACE;" for record
   end (ASCII CR), record start (ASCII LF), and space (ASCII SP)
   respectively; plus any functions defined in the SGML declaration.
   The HTML 2.0 - 4.01 SGML declarations define the TAB function for
   ASCII HT; but for some reason the HTML 3.2 working draft on lexical
   analysis deprecates all named (function) character references and
   none of the HTML specs mention them.  XML does not define them.

5. SGML allows a (numeric) character reference and a (named) entity
   reference to be terminated by either a ";" or a record end (ASCII CR)
   which is thereby suppressed (i.e. consumed by the parser).  But
   either delimiter can be omitted if the reference is not followed by a
   character that could occur in the reference or be interpreted as the
   omitted delimiter.  Basically, that means that anything except a name
   character (letters, digits, period, and hyphen) terminates an SGML
   entity reference (which applies to HTML 2.0 - 4.01).  However,
   character and entity references must be terminated by ";" in XML (and
   thus XHTML).

6. All of the HTML and XML specs specify the document character set to
   be ISO 10646.  However, the specs for HTML 2.0 and 3.2 contradict
   their SGML declaration, which only defines code points 0 - 255 as
   valid characters, by allowing (numeric) character references to ISO
   10646 code points.  The SGML declaration for HTML 4.01 allows 17 *
   65,536 = 1,114,112 ISO 10646 code points, which is the subset of
   Unicode characters that can be represented in SGML, which restricts
   numeric character references to 8 digits.  XML (and thus XHTML) allows
   all ISO 10646 code points in a document.

7. HTML 2.0 - 4.01 prohibit all ASCII control characters, plus the
   undefined ISO 8859-* control characters now known as the Unicode C1
   Controls block, except for return, newline, and tab (ASCII CR, LF,
   and HT).  There seems to be some question as to whether form feed
   (ASCII FF) is allowed as well.  XML also prohibits the surrogate
   blocks, FFFE, and FFFF.
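
To make rules 1, 3, 5, 6, and 7 concrete, here is a minimal Python
sketch of how references could be recognized under the SGML (HTML 2.0 -
4.01) rules versus the XML rules.  It is not recode code: the names
SGML_REF, XML_REF, find_references, and html401_allows are hypothetical,
named function references ("&#SPACE;" etc.) are omitted, XML names are
restricted to ASCII, and a numeric reference that runs straight into a
name character (e.g. "&#65a") is simply left as literal text, which is
one possible reading rather than what the specs mandate.

```python
import re

# SGML/HTML 2.0 - 4.01 rules as summarized above.  "&" opens an entity
# reference only before a letter, "&#" a character reference only before
# a digit, "&#x"/"&#X" only before a hex digit.  The reference ends at
# ";" or a record end (CR), both consumed, or before any character that
# is not a name character.  Named function references are left out.
SGML_REF = re.compile(
    r"&(?:#[xX](?P<hex>[0-9A-Fa-f]+)"
    r"|#(?P<dec>[0-9]+)"
    r"|(?P<name>[A-Za-z][A-Za-z0-9.-]*))"
    r"(?:;|\r|(?![A-Za-z0-9.-]))"
)

# XML/XHTML rules: ";" is mandatory and only lowercase "x" is accepted.
# Name characters are restricted to ASCII here for brevity (assumption);
# real XML names allow many more characters.
XML_REF = re.compile(
    r"&(?:#x(?P<hex>[0-9A-Fa-f]+)"
    r"|#(?P<dec>[0-9]+)"
    r"|(?P<name>[A-Za-z_:][A-Za-z0-9._:-]*));"
)


def find_references(text, xml=False):
    """Return (matched_text, kind, value) for each recognized reference."""
    pattern = XML_REF if xml else SGML_REF
    refs = []
    for m in pattern.finditer(text):
        if m.group("hex") is not None:
            refs.append((m.group(0), "char", int(m.group("hex"), 16)))
        elif m.group("dec") is not None:
            refs.append((m.group(0), "char", int(m.group("dec"))))
        else:
            refs.append((m.group(0), "entity", m.group("name")))
    return refs


def html401_allows(code_point):
    """Rules 6 and 7 as summarized above, for HTML 4.01 only."""
    if not 0 <= code_point < 17 * 65536:
        return False
    if code_point in (0x09, 0x0A, 0x0D):  # HT, LF, CR
        return True
    # ASCII controls (including DEL) and the C1 controls are prohibited.
    return not (code_point < 0x20 or 0x7F <= code_point <= 0x9F)
```

Given the text 'a & b &amp; c &lt d &#65; &#X42;', the SGML pattern
recognizes "&amp;", "&lt", "&#65;", and "&#X42;", and leaves the bare
"&" alone; with xml=True only "&amp;" and "&#65;" are recognized, and
everything else would have to be reported as an error by the caller.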

The hard part about implementing these rules in recode is figuring out
what to do when they are violated, under the 4 combinations of the
--strict and --force command line options.  Also, it's not obvious how
to report errors, nor even how to detect the command line settings
(after all, --strict and --force unconditionally set
task_option.fail_level and task_option.abort_level to conflicting
values).

-- 
Kevin
