Re: GNU recode 3.6: invalid HTML entity references

bug-gnu-utils

From:	Kevin Rodgers
Subject:	Re: GNU recode 3.6: invalid HTML entity references
Date:	Wed, 14 Jan 2004 16:27:58 -0700

> 1. HTML 2.0 - 4.01 are all defined as SGML applications, and in SGML
>    documents "&" is not recognized as an entity reference delimiter
>    unless it is immediately followed by a name start character.
>    Similarly, "&#" is not recognized as a character reference delimiter
>    unless it's followed by a name start character or a digit.
> 
> 2. SGML defines name start characters as lowercase or uppercase letters,
>    but XML (and thus XHTML) adds underscore and colon.  The
>    XML additions aren't relevant, though, because that spec also
>    requires "&" to be interpreted as a markup delimiter (except within
>    comments, processing instructions, and CDATA sections).

recode-3.6/src/html.c:transform_html_ucs2() contains this code to check
the character following '&':

        else if ((input_char >= 'A' && input_char <= 'Z')
                 || (input_char >= 'a' && input_char <= 'z'))

which isn't correct on systems whose execution character set doesn't
assign consecutive integers to the letters, e.g. EBCDIC (see 2.1.3
Character Encoding, C: A Reference Manual).  The usual way around that
is to use isalpha() and the related predicates declared in <ctype.h>,
but POSIX defines those functions to be dependent on the locale.  So
should recode keep the code above or use isalpha(), and if the latter,
should it call setlocale (LC_CTYPE, "C") right off the bat to make sure
non-ASCII characters aren't considered to be letters?
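
A third possibility, offered only as a sketch and not as a patch against
recode: look the character up in a literal string of the 52 letters.
That depends neither on the letters being contiguous nor on the current
locale.  The names is_sgml_name_start() and opens_reference() are
hypothetical and do not exist in recode; the second one just applies the
SGML delimiter rules quoted at the top, assuming a NUL-terminated buffer.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of recode: nonzero if c is one of the
   52 unaccented Latin letters, whatever the locale or character set.  */
static int
is_sgml_name_start (int c)
{
  static const char letters[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "abcdefghijklmnopqrstuvwxyz";

  /* Reject 0 explicitly: strchr would otherwise match the terminator.  */
  if (c <= 0 || c > UCHAR_MAX)
    return 0;
  return strchr (letters, c) != NULL;
}

/* Hypothetical helper: nonzero if the '&' at s[0] opens an entity or
   character reference under the SGML rules quoted above.  */
static int
opens_reference (const char *s)
{
  if (s[0] != '&')
    return 0;
  if (s[1] == '#')
    /* C guarantees '0'..'9' are consecutive, so a range test is safe.  */
    return is_sgml_name_start ((unsigned char) s[2])
           || (s[2] >= '0' && s[2] <= '9');
  return is_sgml_name_start ((unsigned char) s[1]);
}
```

With this, opens_reference ("&amp;") and opens_reference ("&#38;") are
nonzero, while "& " and "&1" are left alone as ordinary data.  The
lookup is linear in the length of the letter string, which may or may
not matter in the inner loop of transform_html_ucs2().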

-- 
Kevin
