bug-gnu-utils

Re: GNU recode 3.6: invalid HTML entity references


From: Kevin Rodgers
Subject: Re: GNU recode 3.6: invalid HTML entity references
Date: Wed, 14 Jan 2004 16:27:58 -0700

> 1. HTML 2.0 - 4.01 are all defined as SGML applications, and in SGML
>    documents "&" is not recognized as an entity reference delimiter
>    unless it is immediately followed by a name start character.
>    Similarly, "&#" is not recognized as a character reference delimiter
>    unless it is followed by a name start character or a digit.
> 
> 2. SGML defines name start characters as lowercase or uppercase letters,
>    but XML (and thus XHTML) adds underscore and colon.  The
>    XML additions aren't relevant, though, because that spec also
>    requires "&" to be interpreted as a markup delimiter (except within
>    comments, processing instructions, and CDATA sections).

recode-3.6/src/html.c:transform_html_ucs2() contains this code to check
the character following '&':

        else if ((input_char >= 'A' && input_char <= 'Z')
                 || (input_char >= 'a' && input_char <= 'z'))

which isn't correct on systems whose execution character set doesn't
assign consecutive integers to the letters, e.g. EBCDIC (see section
2.1.3, "Character Encoding", in C: A Reference Manual).  The usual way
around that is to use isalpha() and the related predicates declared in
<ctype.h>, but POSIX defines those functions to depend on the current
locale (the LC_CTYPE category).  So should recode keep the code above or
switch to isalpha(), and if the latter, should it call
setlocale (LC_CTYPE, "C") right at startup to make sure non-ASCII
characters aren't considered to be letters?
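
To make the two options concrete, here is a rough sketch, not a patch
against recode.  All three function names are hypothetical and do not
exist in html.c.  The first helper is a third possibility not mentioned
above: it tests membership in a literal string of the 52 basic letters,
so it should depend neither on the execution character set nor on the
locale.  The other two show the setlocale()/isalpha() route, assuming it
is acceptable for the program to pin LC_CTYPE to "C".

```c
#include <ctype.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero only for the 52 basic Latin letters,
   regardless of execution character set or current locale.  */
static int
is_sgml_name_start (int c)
{
  static const char letters[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

  /* strchr() converts its argument to char and also matches the
     terminating NUL, so reject 0 and anything outside one byte.  */
  if (c <= 0 || c > UCHAR_MAX)
    return 0;
  return strchr (letters, c) != NULL;
}

/* Hypothetical helper: pin character classification to the "C" locale.
   Returns nonzero on success.  */
static int
pin_ctype_to_c_locale (void)
{
  return setlocale (LC_CTYPE, "C") != NULL;
}

/* Hypothetical helper: isalpha() guarded against values that are not
   representable as unsigned char, for which its behavior is undefined.
   Only meaningful as a "basic letter" test once LC_CTYPE is "C".  */
static int
is_letter_via_ctype (int c)
{
  return c >= 0 && c <= UCHAR_MAX && isalpha (c) != 0;
}
```

With something like the first helper, the test in transform_html_ucs2()
could become "else if (is_sgml_name_start (input_char))" without
touching the locale at all, which may matter if other parts of the
program rely on the user's LC_CTYPE.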

-- 
Kevin




