From: Kevin Rodgers
Subject: GNU recode 3.6: invalid HTML entity references (was: recode html..utf-8 fails to convert "' " and other XML issues)
Date: Wed, 14 Jan 2004 11:09:23 -0700
I wrote:
> Also, I think there should be a way to preserve "&" in text when it does
> not introduce an entity reference -- in particular, when a character
> reference is not terminated by a ";".
I've researched this further, and here's what the specs for SGML (plus
Annexes K and L); HTML 2.0, HTML 3.2, and HTML 4.01 (plus their
corresponding RFCs); and XML 1.0 (and thus XHTML 1.0) say:
1. HTML 2.0 - 4.01 are all defined as SGML applications, and in SGML
documents "&" is not recognized as an entity reference delimiter
unless it is immediately followed by a name start character.
Similarly, "&#" is not recognized as a character reference delimiter
unless it's followed by a name start character or a digit.
2. SGML defines name start characters as lowercase or uppercase letters,
but XML (and thus XHTML) adds underscore and colon. The
XML additions aren't relevant, though, because that spec also
requires "&" to be interpreted as a markup delimiter (except within
comments, processing instructions, and CDATA sections).
3. HTML 4.01 is based on SGML Technical Corrigendum 2 (Annexes K and L),
which allows hexadecimal character references. But as with SGML,
"&#x" is only recognized as a delimiter if it is immediately followed
by a hexadecimal digit (a decimal digit, lowercase letters "a-f", or
uppercase letters "A-F"). In XML generally, the "x" must be
lowercase (from SGML Annex L.2 and its own spec), but the HTML 4.01
SGML declaration allows "X" as well, so it's not clear whether XHTML
allows "X" or just "x".
4. Besides numeric character references, SGML defines named function
character references: "&#RE;", "&#RS;", and "&#SPACE;" for record
end (ASCII CR), record start (ASCII LF), and space (ASCII SP)
respectively; plus any functions defined in the SGML declaration.
The HTML 2.0 - 4.01 SGML declarations define the TAB function for
ASCII HT; but for some reason the HTML 3.2 working draft on lexical
analysis deprecates all named (function) character references and
none of the HTML specs mention them. XML does not define them.
5. SGML allows a (numeric) character reference and a (named) entity
reference to be terminated by either a ";" or a record end (ASCII CR)
which is thereby suppressed (i.e. consumed by the parser). But
either delimiter can be omitted if the reference is not followed by a
character that could occur in the reference or be interpreted as the
omitted delimiter. Basically, that means that anything except a name
character (letters, digits, period, and hyphen) terminates an SGML
entity reference (which applies to HTML 2.0 - 4.01). However,
character and entity references must be terminated by ";" in XML (and
thus XHTML).
6. All of the HTML and XML specs specify the document character set to
be ISO 10646. However, the specs for HTML 2.0 and 3.2 contradict
their SGML declaration, which only defines code points 0 - 255 as
valid characters, by allowing (numeric) character references to ISO
10646 code points. The SGML declaration for HTML 4.01 allows 17 *
65,536 = 1,114,112 ISO 10646 code points, which is the subset of
Unicode characters that can be represented in SGML, which restricts
numeric character references to 8 digits. XML (and thus XHTML) allows
all ISO 10646 code points in a document.
7. HTML 2.0 - 4.01 prohibit all ASCII control characters, plus the
undefined ISO 8859-* control characters now known as the Unicode C1
Controls block, except for return, newline, and tab (ASCII CR, LF,
and HT). There seems to be some question as to whether form feed
(ASCII FF) is allowed as well. XML also prohibits the surrogate
blocks, FFFE, and FFFF.
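
To make points 1, 3, 5, 6, and 7 concrete, here is a minimal sketch in
Python (recode itself is written in C, so this is illustration only, not
a proposed patch). All of the names in it (SGML_REF, XML_REF,
find_references, is_allowed_html4_code_point) are hypothetical. It
assumes ASCII-only names, ignores the named function references from
point 4 and the record-end terminator from point 5, and simply leaves
text alone when a reference without ";" runs straight into another name
character.

```python
import re

NAME_CHAR = r"A-Za-z0-9.\-"  # SGML name characters: letters, digits, ".", "-"

# SGML-style references (HTML 2.0 - 4.01), per points 1, 3 and 5:
# "&" counts only before a letter, "&#" only before a digit, and
# "&#x" / "&#X" only before a hex digit. The ";" is optional as long as
# the next character is not a name character.
SGML_REF = re.compile(
    r"&(?:#[xX](?P<hex>[0-9A-Fa-f]+)"
    r"|#(?P<dec>[0-9]+)"
    r"|(?P<name>[A-Za-z][" + NAME_CHAR + r"]*))"
    r"(?:;|(?![" + NAME_CHAR + r"]))"
)

# XML-style references: lowercase "x" only, ";" mandatory, and "_" / ":"
# accepted in names (ASCII subset only in this sketch).
XML_REF = re.compile(
    r"&(?:#x(?P<hex>[0-9A-Fa-f]+)"
    r"|#(?P<dec>[0-9]+)"
    r"|(?P<name>[A-Za-z_:][A-Za-z0-9._:\-]*));"
)


def find_references(text, xml=False):
    """Return (matched_text, kind, value) for each recognized reference."""
    pattern = XML_REF if xml else SGML_REF
    found = []
    for m in pattern.finditer(text):
        if m.group("hex") is not None:
            found.append((m.group(0), "char", int(m.group("hex"), 16)))
        elif m.group("dec") is not None:
            found.append((m.group(0), "char", int(m.group("dec"))))
        else:
            found.append((m.group(0), "entity", m.group("name")))
    return found


def is_allowed_html4_code_point(cp):
    """Points 6 and 7 as described above, for HTML 4.01 only."""
    if not 0 <= cp <= 0x10FFFF:      # 17 * 65,536 code points
        return False
    if cp in (0x09, 0x0A, 0x0D):     # HT, LF, CR are the exceptions
        return True
    # ASCII controls (including DEL) and the C1 block are prohibited.
    # Form feed is treated as prohibited here; that is an assumption.
    return not (cp < 0x20 or 0x7F <= cp <= 0x9F)


if __name__ == "__main__":
    sample = "R & D, &amp; &lt &#60; &#x3C &#; &#xg"
    print(find_references(sample))            # SGML/HTML rules
    print(find_references(sample, xml=True))  # XML rules
```

With the SGML pattern, the bare "&" in "R & D" and the sequences "&#;"
and "&#xg" are left as plain text, while "&lt" and "&#x3C" are accepted
without a ";". With the XML pattern, only the two ";"-terminated
references in the sample are recognized.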
The hard part about implementing these rules in recode is figuring out
what to do when they are violated, under the 4 combinations of the
--strict and --force command line options. Also, it's not obvious how
to report errors, nor even how to detect the command line settings
(after all, --strict and --force unconditionally set
task_option.fail_level and task_option.abort_level to conflicting
values).
--
Kevin