
Re: GNU grep,awk,sed: support \u and \U for unicode


From: Assaf Gordon
Subject: Re: GNU grep,awk,sed: support \u and \U for unicode
Date: Thu, 19 Jan 2017 00:46:47 -0500

Hello all,

Thank you for your comments and feedback.

Attached are improved patches, one for each program.
This time, they all use a common, identical module (unicode-escape.{c,h}),
and only the glue code is different.
I hope this will help in testing and debugging.

Addressing some issues:

> On Jan 11, 2017, at 04:00, address@hidden wrote:
> FSF paperwork for gawk, you should do so, even if I don't adopt \u / \U.
done

> I've been (successfully) avoiding something like this for many years and
> was hoping to continue to use the "somebody else's problem field" on it
> for a while longer yet. :-)

I can see how this can be considered 'bloat', and how sticking to ASCII
when dealing with text files is saner (or at least, sticking to ASCII in
awk/sed scripts).
On the coreutils front, I see more interest in unicode support, which is
what prompted me to investigate this.
It would be nice if all GNU programs handled unicode in the same manner,
but I certainly understand if people prefer to keep the code smaller.

> 1. What should gawk(/sed/grep) do upon encountering \u/\U in a non-UTF locale?

The current patch always assumes a UTF-8 locale, for simplicity.

If we go forward, I will implement the same mechanism as in coreutils'
printf: in the C locale, no conversion is done; in other locales, the
escape is converted to the appropriate code if possible.
In the current patch, the change will be confined to one function,
unicode-escape.c:store_unicode(), which at the moment calls gnulib's
u8_uctomb().
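
For illustration, a minimal sketch of what that function might look like
(the body and buffer-handling convention here are my assumptions for this
message; only the u8_uctomb() call reflects what the patch does today):

    #include <stdint.h>
    #include "unistr.h"   /* gnulib: u8_uctomb(), ucs4_t */

    /* Append the UTF-8 encoding of code point UC to BUF (which must
       have at least 6 bytes free); return the number of bytes written,
       or a negative value if UC cannot be encoded.  */
    static int
    store_unicode (uint8_t *buf, ucs4_t uc)
    {
      /* Current behaviour: assume a UTF-8 locale and emit UTF-8
         unconditionally.  The locale-dependent dispatch described
         above would go here.  */
      return u8_uctomb (buf, uc, 6);
    }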


> 2. Do we even have a fool-proof way to know that we're in a UTF locale?

Yes, I believe a reliable method is already implemented in other parts
of gnulib.
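
For reference, a common way to do this check is via gnulib's
localcharset module (a sketch; locale_is_utf8 is a hypothetical helper
name, not something in the patches):

    #include <string.h>
    #include "localcharset.h"   /* gnulib: locale_charset() */

    /* True if the current locale's codeset is UTF-8.
       locale_charset() never returns NULL and canonicalizes spellings
       such as "utf8" to "UTF-8" across platforms.  */
    static int
    locale_is_utf8 (void)
    {
      return strcmp (locale_charset (), "UTF-8") == 0;
    }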

> 3. This feature further distances GNU tools from standard practice,
> decreasing portability of programs that depend upon it.

Indeed.
It is a judgement call, which is why I think this discussion is useful
even if we decide not to implement it.

\uHHHH and \UHHHHHHHH were added in C99 (in C source code), so it is not
completely unreasonable to accept them elsewhere.

coreutils' printf(1) already has them.

On the flip side, it could be argued that if need be, one can always use
  printf '\uXXXX' | od -to1
and then put the multibyte octets into awk/sed/grep etc. with the
already-supported octal escape sequences (for example, U+2013, EN DASH,
encodes in UTF-8 as the octets \342\200\223).


> 4. I think that it's not hard to use current dfa/regex - just convert
> the hex to a wchar_t string and then from there back to multibyte characters,
> but maybe I'm wrong about that. Paul? Jim?

awk and sed were straightforward, as they already contain
backslash-escaping code.

For grep, I took a different approach: just before adding a pattern,
I modify the pattern buffer in place and replace the escapes with their
multibyte equivalents - as if the user had entered them. I'm happy to
hear comments about this approach.
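
To make the idea concrete, here is a simplified sketch of such an
in-place rewrite, handling only \uHHHH (expand_u_escapes is a
hypothetical name, not the patch's actual code; error handling
omitted). It relies on the 6-character escape never being shorter than
the UTF-8 encoding of a BMP code point (at most 3 bytes), so the
pattern can only shrink:

    #include <ctype.h>
    #include <stddef.h>
    #include <stdint.h>
    #include "unistr.h"   /* gnulib: u8_uctomb(), ucs4_t */

    /* Rewrite \uHHHH escapes in PAT (length LEN) to UTF-8, in place;
       return the new, possibly shorter, length.  */
    static size_t
    expand_u_escapes (char *pat, size_t len)
    {
      char *out = pat;
      const char *in = pat;
      const char *end = pat + len;

      while (in < end)
        {
          if (end - in >= 6 && in[0] == '\\' && in[1] == 'u'
              && isxdigit ((unsigned char) in[2])
              && isxdigit ((unsigned char) in[3])
              && isxdigit ((unsigned char) in[4])
              && isxdigit ((unsigned char) in[5]))
            {
              ucs4_t uc = 0;
              for (int i = 2; i < 6; i++)
                {
                  int c = tolower ((unsigned char) in[i]);
                  uc = 16 * uc + (c >= 'a' ? c - 'a' + 10 : c - '0');
                }
              int n = u8_uctomb ((uint8_t *) out, uc, end - out);
              if (n > 0)
                {
                  out += n;
                  in += 6;
                  continue;
                }
            }
          *out++ = *in++;  /* ordinary byte: copy through */
        }
      return out - pat;
    }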

In all programs, after escape expansion it is as if the user had entered
a multibyte sequence directly. The dfa/regex code will see multibyte
octets, and there was no need to modify gnulib.

> 5. How do we handle MinGW and Cygwin where wchar_t is 16 bits, vs. 32
> bits just about everywhere else?

The parsing is the same (i.e. "\uHHHH" to an internal 'unsigned int' or
gnulib's 'ucs4_t').

The conversion to multibyte will use gnulib's (or the system's native)
widechar-to-multibyte functions.

In the case of cygwin/mingw, an extra step is needed: the 'uint32' code
point must be split into a UTF-16 surrogate pair (two 'uint16' values),
followed by two calls to wctomb().

I haven't implemented it yet, but if we decide to continue - I will.
I have some familiarity with this setup from the ongoing unicode work
in coreutils, and I will also add sufficient tests to ensure it works.

> For gawk, assuming you can convince me to go with this (:-) I will also
> need documentation updates.

Of course - the same goes for grep and sed.

But before I get there, I'd like to get consensus on the implementation.


> On Jan 11, 2017, at 01:19, Paul Eggert <address@hidden> wrote:
> 
> It should work for grep. Though, to be honest I don't find \u and \U escapes 
> to be all that useful except for East Asian languages. A better syntax is the 
> Emacs \N escape, e.g., \N{EN DASH} which you can also write as \N{U+2013} if 
> you know the codes by heart. Admittedly \N requires more runtime support.

\N support has two parts: the parsing and the actual conversion.

I've implemented the parsing in the attached patches.
That is, the following work:
    gawk 'BEGIN { printf "\N{U+1234}\n"}'
    grep '\N{U+1234}'

If you use a named character, it is parsed, but then an error message
is displayed:

    ./src/grep '\N{EN DASH}'
    grep: Named conversion not implemented yet. Please use \N{U+XXXX}

gnulib has a module to perform this conversion, but it adds ~300KB to
the binary.
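
(The module in question is, I believe, gnulib's 'uniname'; if we did
enable it, the lookup itself would be a one-liner - a sketch, with a
hypothetical wrapper name:)

    #include "uniname.h"   /* gnulib: unicode_name_character(), ucs4_t */

    /* Map a Unicode character name to its code point;
       returns UNINAME_INVALID if the name is unknown.  */
    static ucs4_t
    lookup_named_char (const char *name)
    {
      return unicode_name_character (name);  /* "EN DASH" -> U+2013 */
    }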

So if this feature is desired, it can be relatively easily added.

Or, we can think of more complicated schemes (like lazy-loading the
conversion table from a file, etc.). But this is even more bloat...

Alternatively, it would be easy to add \N to printf, and then everyone
could use printf for all their unicode needs.


In any case,
thanks for reading this far,
comments very welcome,
  -assaf


Attachment: 0001-sed-add-support-for-unicode-escapes-sequences-u-U-N.patch.xz
Description: Binary data

Attachment: 0001-awk-add-support-for-unicode-escape-sequences-u-U-N.patch.xz
Description: Binary data

Attachment: 0001-grep-add-support-for-unicode-escape-sequences-u-U-N.patch.xz
Description: Binary data




