bug#31526: Fwd: bug#31526: Range [a-z] does not follow collate order from locale.


From: Bize Ma
Subject: bug#31526: Fwd: bug#31526: Range [a-z] does not follow collate order from locale.
Date: Wed, 23 May 2018 19:13:55 -0400

Following your request:

>  From: Assaf Gordon
>  (adding debbugs mailing list, please use "reply all" to
>  ensure the thread is public and archived).

I am sending the message that you have just answered to the debbugs
mailing list. Sorry for my mistake.



---------- Forwarded message ----------
From: Bize Ma <address@hidden>
Date: 2018-05-22 21:48 GMT-04:00
Subject: Re: bug#31526: Range [a-z] does not follow collate order from locale.
To: Assaf Gordon <address@hidden>


> 2018-05-19 22:13 GMT-04:00 Assaf Gordon <address@hidden>:
> Hello,

Hi! Thanks for your answer, time and detailed references.

In range definitions I believe that there are two goals in conflict:

    - A stable, simple range description for programmers.
    - A clear description (even if long) for multi-language users.

For a programmer:
    The old wisdom is that [a-d] should match only `abcd` (in C locale).
    The usual recommendation is: "do not use other locales".
    That makes the use of any other locale almost invalid.
    However, [a-z] may also match many accented (Latin) characters.

For a multi-language user:
    But if other locales are used, as they must be to support most of the
    languages of this world, the range has never been clearly defined, much
    less the order in which a range will match. There are some clues about
    "collation order" in GNU sed, but it remains unclear which collation
    sort order applies to that.

    Using a range in another locale does not follow ASCII numeric order:

        printf '%b' "$(printf '\\U%x\\n' {32..255})" |
            LC_ALL=C sort |
            tr -d '\n' |
            sed 's/[^a-ä]//g'; echo

        abcdªàáâãäåæç

    The result above should have ended in a `d`, but `d` falls in the middle.
    Nor does it follow the locale collation order in effect (it should end
    in ä):

        printf '%b' "$(printf '\\U%x\\n' {32..255})" |
            LC_ALL=en_CA.utf8 sort |
            tr -d '\n' |
            sed 's/[^a-ä]//g'; echo

        aáàâäãª

Then, the real question is: What order does sed follow?



> On Fri, May 18, 2018 at 05:58:05PM -0400, Bize Ma wrote:
> >
> >     $ printf '%b' $(printf '\\U%x\\n' {32..127}) | sort | tr -d '\n'
> >     `^~<=>| _-,;:!?/.'"()address@hidden&#%+0123456789aAbBcCdDeEfFgGhHiIjJ
> > kKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

> While in practice this is correct on all GNU/linux systems which
> use glibc, there is no officially documented collation order for
> punctuation marks - it might differ on other systems. Please see here:
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=23677#14

**********************************************************************
1.- About ASCII character numeric ranges:

Yes, I agree that it may be conceptually unnecessary to give a collation
order to "punctuation marks".
However, being "conceptually unnecessary" does not mean that
such an order is "invalid". A practical implementation may define some
such order.
Please understand that the goal of the code above is to show the practical
result of using some (locale-defined) collation order equivalent to what
is given by the C function strcoll().

The range may be more limited to only letters and numbers:
{48..57} {65..90} {97..122} (in hex: 0x30-0x39 0x41-0x5a 0x61-0x7a).

Let us define and use a function that should work on bash 4.2+:

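collorder(){
    # $1: sed pattern to delete; then pairs of first/last code points.
    a=$1; shift 1
    until (($#<2)); do
        # Emit every character in the range, one per line, for sort.
        printf '%b' $(printf '\\U%x\\n' $(seq "$1" "$2"))
        shift 2
    done | sort | tr -d '\n' | sed 's/'"$a"'//g'
    echo
}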

That function will allow us to do:

    $ LC_ALL=en_CA.utf8   collorder   ' '    48 57   65 90   97 122
    0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz

And (in the C locale the sort is identical to the ASCII numeric sort):

    $ LC_ALL=C            collorder   ' '    48 57   65 90   97 122
    0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

And filtering by a bracket range:

    $ LC_ALL=C           collorder  '[^a-z]' 48 57   65 90   97 122
    abcdefghijklmnopqrstuvwxyz

But those ranges avoid the character that you use later (`[`).
Including the characters between uppercase and lowercase ASCII:

    $ LC_ALL=C   collorder   '[^Y-d]'  48 57   65 122
    YZ[\]^_`abcd

That was the reason to include all 95 (126-32+1) ASCII characters that are
not control characters: one simple range. Including such characters allows
(perfectly valid) mixed bracket ranges:

    $ LC_ALL=C   collorder   '[^+-d]'  32 126
    +,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcd

It was not because I wanted to divert the discussion to "punctuation
marks", just because it was one simple numeric character range.
That is all: the bash function defined here, collorder, is a tool to reveal
the (practical) collation order valid for the applied locale.
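
Where bash 4.2+ is not available, the same idea can be sketched with POSIX sh
features only. This is a minimal sketch restricted to printable ASCII;
`collorder_ascii` is a hypothetical name (it is not the function defined
above), it builds characters from octal escapes instead of `\U`, and the
pattern must not contain a `/`.

```shell
# collorder_ascii: hypothetical ASCII-only variant of collorder.
# Usage: collorder_ascii PATTERN FIRST LAST [FIRST LAST ...]
# FIRST and LAST are decimal code points in the range 32..126.
collorder_ascii() {
    pattern=$1
    shift
    while [ "$#" -ge 2 ]; do
        i=$1
        while [ "$i" -le "$2" ]; do
            # %b expands \0ooo (octal) into the character with that code;
            # one character per line so that sort can order them.
            printf '%b\n' "\\0$(printf '%03o' "$i")"
            i=$((i + 1))
        done
        shift 2
    done | sort | tr -d '\n' | sed "s/$pattern//g"
    echo
}

# The subshell keeps the locale change local to this one call.
( LC_ALL=C; export LC_ALL; collorder_ascii '[^Y-d]' 48 57 65 122 )
```

In the C locale this call should reproduce the `[^Y-d]` result shown earlier
in this section; under any other locale the output depends on the installed
locale data.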


**********************************************************************
2.- About using collating order.

> > It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
> > upper letters.
> > But it isn't:
>
> It should not be "expected". I don't think it is documented to be
> so anywhere in GNU programs.

Well, yes: 'info sed', in section `5 Regular Expressions: selecting text`,
sub-section `5.5 Character Classes and Bracket Expressions`, includes:

    Within a bracket expression, a "range expression" consists of two
    characters separated by a hyphen.  It matches any single character
    that sorts between the two characters, inclusive.  In the default
    C locale, the sorting sequence is the native character order; for
    example, '[a-d]' is equivalent to '[abcd]'.

From 'info sed' (not man sed), sub-section `5.9 Locale Considerations`:

    In other locales, the sorting sequence is not specified, and '[a-d]'
    might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail
    to match any character, or the set of characters that it matches
    might even be erratic.

So, the `[a-d]` expression matches characters that sort between `a` and `d`.
That is defined above for the C locale. In other locales the sorting is
"undefined".
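
The difference between the two manual passages can be checked directly. A
minimal sketch, assuming a POSIX shell: `matched_by` is a hypothetical helper
(not part of sed) that takes a locale name, the inside of a bracket
expression and a sample string, and prints the characters of the sample that
the bracket expression matches under that locale. The locale en_CA.utf8 is
the one used in this message and is assumed to be installed.

```shell
# matched_by LOCALE RANGE SAMPLE   (hypothetical helper)
# Deletes every character of SAMPLE that [RANGE] does not match.
matched_by() {
    printf '%s\n' "$3" | LC_ALL=$1 sed "s/[^$2]//g"
}

matched_by C 'a-d' 'aAbBcCdDeE'            # C locale: documented as [abcd]
matched_by en_CA.utf8 'a-d' 'aAbBcCdDeE'   # result depends on the locale data
```

Only the first result is pinned down by the manual text quoted above; the
second is exactly the case the manual leaves unspecified.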


> … Both sed's and grep's manuals contain
> the following text:
>
>     In other locales, the sorting sequence is not specified, and ‘[a-d]’
>     might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to
>     match any character, or the set of characters that it matches might
>     even be erratic.

Yes, it is the exact same text that I also quoted above. But all it
clearly defines is that the order is based on the definition of each
locale "in some unspecified way". When the locale changes, the order
may also change.

> https://www.gnu.org/software/sed/manual/sed.html#Multibyte-regexp-character-classes

Yes. On the same page, but at Reporting-Bugs, under the heading
     [a-z] is case insensitive

  https://www.gnu.org/software/sed/manual/sed.html#Reporting-Bugs

We can read:

    [a-z] is case insensitive
    You are encountering problems with locales. POSIX mandates that [a-z]
    uses the current locale’s collation order – in C parlance, that means
    using strcoll(3) instead of strcmp(3). Some locales have a case-
    insensitive collation order, others don’t.

It seems to say: "current locale's collation order" !!


> https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html
>
> Furthermore, in POSIX 2008 standard range expressions are
> undefined for locales other than "C/POSIX", see this comment by Eric Blake
> (also the entire bug report might be of interest to this topic):
> https://bugzilla.redhat.com/show_bug.cgi?id=583011#c24

Yes, however: does undefined also mean invalid, forbidden, banned or
illegal?

At the moment, it is not illegal to use a bracket range in some other
locale. Such use does not raise any error (or even a warning). As it is
not illegal, the only aspect that remains to be clearly defined is the
range order that we should expect in every locale other than C.

Also, we rely every day on "not specified" behavior (for some spec):

The -E option is not (yet) defined in current POSIX (The Open Group
Base Specifications Issue 7, 2018 edition) for sed.
Yes, it is believed that it will be accepted in the next POSIX version.

    http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html

But it is defined (and used) in GNU sed.

Some elements are undefined in POSIX just to allow implementations to be
diverse:

    http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xcu_chap02.html

    The results of giving <tilde> with an unknown login name are undefined
    because the KornShell "~+" and "~-" constructs make use of this
    condition …

Read carefully: undefined because it is used!
That is, it is undefined in the spec to allow implementations to resolve it
in practical ways that might be different from the specification (or from
other implementations).



In the same "comment by Eric Blake" we can read this:

    The behavior of [A-z] in en_US.UTF-8 is "unspecified", but _not_
    "undefined". A compliant app cannot guarantee what the behavior will
    be, but the behavior should at least be explainable, and as a QoI
    point, glibc should document and define this behavior as an extension
    to POSIX, so that apps relying on glibc can take advantage of this
    extension for known behavior.

That is exactly what I meant: "unspecified", but _not_ "invalid".

And it is exactly what I am asking for: "glibc should document and define
this behavior"

>
> > However, the range [a-Z] does match all letters, lower or upper:
> >
> >     $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
> >     ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
>
> I would recommend avoiding mixing upper-lower case in regex
> ranges, as the result might be unexpected. Compare the following:

In the "comment by Eric Blake" we can also read:

    That is, [A-z] is well-defined in the POSIX locale, and in all other
    locales where A collates before z (which includes en_US.UTF-8)

Again: "[A-z] is well-defined … "

Frankly, if I were to follow both main recommendations:

    - Any other locale than C is unspecified: do not use them.
    - Any range that does not match the previously known ranges:
      "recommend avoiding mixing upper-lower case in regex ranges"

The usefulness of a bracket range is reduced to almost nothing.
Only C and only either [a-z] or [A-Z].

Is it not possible to declare and document what the collation
order is/should be for other locales?

**********************************************************************
3.- Correct exactly how.

> > If this is the correct way in which sed should work, then, if you
> > please:
>
> Yes, it is.

Thanks, but: what does it mean exactly? My opinion is on the right.

  - That [a-z] will always mean 'abcdefghijklmnopqrstuvwxyz'
    in the C locale?                                              (Yes)
  - That the order in the C locale follows ASCII numeric order?   (Yes)
  - That no other locale should be used?                          (No?)
  - That the order in any other locale is secret?                 (Yes)
  - That ranges like [A-z] (valid in C) cannot be used in
    other locales?                                                (No?)
  - That other ranges like [*-d] (valid in C) are a crazy idea?   (No?)
  - That references to collation order in the manuals must be
    struck out?                                                   (No?)

And we have not even started on the many more characters that are possible
in Unicode.

   - Is this valid:
   $ LC_ALL=en_CA.utf8 ./collorder '[^a-z]' 32 255
   abcdefghijklmnopqrstuvwxyzªºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ

   Does it mean that [a-z] is closer to [[:lower:]] than ASCII a-z?

   - Is this expected? (phonetic symbols)
   $ LC_ALL=en_CA.utf8 ./collorder '[^a-z]' 0x250 0x2af
   ɓɔɖɗəɛɠɵ

   - Should this work? In what order? (phonetic symbols)
   $ LC_ALL=en_CA.utf8 ./collorder '[^ɖ-ɛ]' 0x250 0x2af
   ɖɗəɛ

   - Why are all Latin characters being included? (Latin extended)
   $ LC_ALL=en_CA.utf8 ./collorder '[^a-z]' 0x1e00 0x1fff
   ḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉ
   ẋẍẏẖẗẘẙẚẛạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹ
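
One way to look into the [[:lower:]] question is to run both bracket
expressions over the same sample and compare the results. A minimal sketch,
assuming a POSIX shell; `compare_lower` is a hypothetical helper and the
sample text is arbitrary.

```shell
# compare_lower LOCALE SAMPLE   (hypothetical helper)
# Prints what survives of SAMPLE after deleting everything not matched
# by [a-z], and then by [[:lower:]], under LOCALE.
compare_lower() {
    by_range=$(printf '%s\n' "$2" | LC_ALL=$1 sed 's/[^a-z]//g')
    by_class=$(printf '%s\n' "$2" | LC_ALL=$1 sed 's/[^[:lower:]]//g')
    printf 'range=%s\nclass=%s\n' "$by_range" "$by_class"
}

compare_lower C 'aAbB0zZ'
```

With an installed UTF-8 locale, a sample containing accented letters can be
passed instead; if the two printed lines differ, [a-z] is not behaving like
[[:lower:]] in that locale.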

> >     - What is the rationale leading to such decision?.
>
> The bug reports linked above contain long discussions about it.

Yes, there are discussions about what was relevant at the time.
But none explain in clear simple words what order the characters
in a bracket range will follow in a locale that is NOT C. (see
some simple examples above).

> Please also see the following thread, which promoted the restriction
> of "sane regex ranges" - meaning ASCII order alone (and applies to gawk,
> grep, sed and other programs using gnulib's regex engine):
>
> https://lists.gnu.org/archive/html/bug-gnulib/2011-06/msg00200.html

ASCII order alone? Only for characters in numeric range 0x00-0x7f ????

    - How come an á gets included in the very limited [a-b]?
    $ LC_ALL=en_CA.utf8 ./collorder '[^a-b]' 0x00 0xff
    abªàáâãäåæ

> >     - Where is it documented?.
>
> The links above to the sed and grep manuals.

None of the linked documents explain the above result for [^a-b].

> >     - Where is it implemented in the code?.
>
> I think a good place to start is gnulib's DFA regex engine,
> here:
> https://opengrok.housegordon.com/source/xref/gnulib/lib/dfa.c
> or here:
> http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/dfa.c

I have to admit that I am unable to understand any of those
4000 lines of code without some detailed help on how they work.
I am really sorry.

> Search for the comment 'build range characters' for a starting point.
>
> Both gnu grep and sed use this code.
>
> >     - Why does the manual document otherwise?.
>
> Errors in the manual are always a possibility.
> If you spot such an error, or an example showing incorrect
> usage/output - please let us know where it is (e.g. a link
> to a manual page  / section).

I have provided a couple of points where "collating order" is used.
But I suspect that those are not mistakes from your point of view and
that what is missing is a more detailed description of which collating
order is being used. I may be perfectly wrong, of course.

> As such, I'm marking this as "not a bug" and closing the ticket,
> but discussion can continue by replying to this thread.

I still remain in doubt, at the very minimum.

> regards,
>  - assaf

Many thanks and regards
- Bize

