Re: [lmi] Latin-9 coding rules test


From: Greg Chicares
Subject: Re: [lmi] Latin-9 coding rules test
Date: Tue, 13 Nov 2018 23:34:00 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.0

On 2018-11-13 21:30, Vadim Zeitlin wrote:
> 
>  I'm looking at the assay_non_latin() function in test_coding_rules.cpp
> because the regex used there doesn't work with the libstdc++
> implementation of std::regex, which converts \x9f to -97 and then
> throws std::regex_error because that's not less than \x7f (127). I
> have my doubts about the correctness of this code, and it makes no
> sense to me to handle char as signed in this context; but independently
> of whether libstdc++ is correct or not, I wonder if this whole test
> makes much sense.
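
(For concreteness, the pitfall is sign extension, assuming plain char
is signed as it is by default with gcc on x86_64:

  char c = '\x9f';
  int i = c;                             // -97: sign extended
  int j = static_cast<unsigned char>(c); // 159: intended byte value

so anything that hands raw chars to an interface expecting non-negative
values needs a cast through unsigned char first.)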

Maybe it's time to forget ISO-8859-15 and just use UTF-8, replacing
assay_non_latin() with a new assay_non_utf_8(), which perhaps you
could write because I don't know how.
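
Roughly, I imagine the check amounts to something like this untested
sketch (function name and style merely illustrative, not lmi code):

  #include <string>

  // True iff 's' is well-formed UTF-8: rejects bad lead bytes, stray
  // or truncated continuation bytes, overlong encodings, surrogates,
  // and code points past U+10FFFF.
  bool is_well_formed_utf8(std::string const& s)
  {
      auto const* p   = reinterpret_cast<unsigned char const*>(s.data());
      auto const* end = p + s.size();
      while(p != end)
          {
          unsigned char lead = *p++;
          int n = 0;        // number of continuation bytes expected
          unsigned cp = 0;  // accumulated code point
          if     (lead < 0x80) {continue;}
          else if(lead < 0xc2) {return false;}
          else if(lead < 0xe0) {n = 1; cp = lead & 0x1f;}
          else if(lead < 0xf0) {n = 2; cp = lead & 0x0f;}
          else if(lead < 0xf5) {n = 3; cp = lead & 0x07;}
          else                 {return false;}
          for(int i = 0; i < n; ++i)
              {
              if(end == p || 0x80 != (*p & 0xc0)) {return false;}
              cp = (cp << 6) | (*p++ & 0x3f);
              }
          if(2 == n && cp < 0x800)                      {return false;}
          if(3 == n && (cp < 0x10000 || 0x10ffff < cp)) {return false;}
          if(0xd800 <= cp && cp <= 0xdfff)              {return false;}
          }
      return true;
  }

An actual assay_non_utf_8() would presumably wrap this in whatever
error reporting the other assay functions use.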

>  The reason for this question is that lmi source files use UTF-8 encoding
> and not ISO-8859-15, so why do we bother checking for the latter? I think

We made up some such rule ages ago. It seemed to matter for wxHtmlHelp:
  https://lists.nongnu.org/archive/html/lmi/2008-03/msg00019.html
where using the "windows-1252" charset solved some problem; but nowadays
we just open the user manual in a browser, so that no longer matters:
  https://lists.nongnu.org/archive/html/lmi/2014-03/msg00001.html

It also seemed to matter for filenames:
  https://lists.nongnu.org/archive/html/lmi/2010-05/msg00030.html
  https://lists.nongnu.org/archive/html/lmi/2010-05/msg00032.html
but perhaps that was because of apache fop.

As for "windows-1252", numerous html files still contain this:
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
and I suppose we ought to change that to
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
as we usher in the twenty-first century, right? At the same time, I
suppose we should change 'test_coding_rules.cpp' thus:

-    // This IANA-approved charset is still useful for html.
-    if(!f.is_of_phylum(e_html))
-        {
        taboo(f, "windows-1252");
-        }

and outdent the non-deleted line.

A restricted charset may still be useful for proprietary files into
which data from corporate users are pasted. Not too long ago, we
had some awful problem because data we pasted from some spreadsheet
contained exotic invisible characters. However, that needn't be a
general lmi restriction: the proprietary file in question says:

// Imperatively run
//   iconv -f UTF-8 -t UTF-8 my_fund.cpp &>/dev/null
// after pasting from that spreadsheet. In 2016, it contained a non-
// breaking space encoded as ISO8859-1, which was invisible without
// special tools but spoiled the system test.

and pasting that 'iconv' command into a terminal once a year is an
adequate safeguard--we don't need a global lmi rule for this.
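
(Incidentally, that 2016 problem is exactly what a UTF-8 validity
check would catch automatically: a non-breaking space pasted as a
single ISO8859-1 byte is a bare 0xa0, which cannot occur outside a
multibyte sequence in well-formed UTF-8. With the sketch above:

  is_well_formed_utf8("caf\xa0");     // false: stray continuation byte
  is_well_formed_utf8("caf\xc2\xa0"); // true:  UTF-8 NBSP is 0xc2 0xa0

so a new assay_non_utf_8() would flag such a paste immediately.)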

> that we ought to check that the file contains valid UTF-8, as this is
> actually useful,

Okay.

> and maybe that it doesn't contain any control characters
> except "\n" and, possibly, "\r" (why "\f" or "\v" should be allowed is
> not very clear to me).

'\v' appears not to occur (and 'test_coding_rules.cpp' seems to
forbid it everywhere):
  /opt/lmi/src/lmi[0]$grep --recursive $'\v' * |sed -e'/^Binary.*matches$/d' |less -S
but '\f' does occur:
  /opt/lmi/src/lmi[0]$grep --recursive $'\f' * |sed -e'/^Binary.*matches$/d' |less -S
and I'd rather not alter the original GPL text and, worse, modify
old CGI stuff just to remove it. We want to allow '\n' in general,
and '\t' in specific cases, so it should cost nothing to retain
the existing assay_whitespace(), right?
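
That is, the policy amounts to roughly this (an illustrative
restatement, not the actual assay_whitespace()):

  // Illustrative only. Bytes >= 0x80 pass here because they belong
  // to the separate UTF-8 validity check.
  bool byte_allowed(unsigned char c, bool tabs_ok, bool grandfathered)
  {
      if('\n' == c) return true;
      if('\t' == c) return tabs_ok;        // e.g., makefiles
      if('\f' == c) return grandfathered;  // GPL text, old CGI files
      return !(c < 0x20 || 0x7f == c);     // no other control characters
  }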


