bug-gnu-utils

Re: gawk: Wrong behavior in binary mode


From: Manuel Collado
Subject: Re: gawk: Wrong behavior in binary mode
Date: Tue, 16 Dec 2008 10:27:27 +0100
User-agent: Thunderbird 2.0.0.14 (Windows/20080421)

John Cowan wrote:
Eli Zaretskii wrote:

I think you are missing the fact that LC_ALL=C has broad effects other
than just disabling multibyte characters.  For example, it also causes
Gawk to speak US English when displaying messages, and use US format
for dates and currency.  What do I do if I want my error messages in
Hebrew, but need to work with raw binary data that is not a character
string?

Quite so.  There should be some way to specify the encoding of Gawk's
input and output files independent of the locale. (IMHO, encoding the
character encoding into the locale identifier was just a botch.)


I strongly agree! In today's worldwide environment, text-processing utilities should be able to cope with files from different sources in different encodings, and to combine them in a single run. This implies an independent encoding for:

- each source file
- each input data file
- each output data file

SGML/XML utilities already do that. For AWK a possible approach could be:
- Use a fixed, implementation-chosen encoding for internal processing (one that covers Unicode).
- Convert each source file or input data file on the fly to the internal encoding before processing.
- Convert output data on the fly to the external encoding before printing.
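
To make the three steps concrete, here is a minimal sketch in Python rather than AWK, since gawk has no such facility today. The function names (decode_input, encode_output, combine) are hypothetical and only illustrate the idea: Python's str plays the role of the fixed internal encoding, and each input carries its own external encoding.

```python
def decode_input(data: bytes, encoding: str) -> str:
    """External bytes -> internal representation (a Unicode str)."""
    return data.decode(encoding)


def encode_output(text: str, encoding: str) -> bytes:
    """Internal representation -> external bytes, done just before output."""
    return text.encode(encoding)


def combine(inputs, out_encoding: str) -> bytes:
    """Merge several inputs, each given as a (bytes, encoding) pair.

    All processing happens on the internal form, so inputs in different
    encodings can be mixed freely in a single run.
    """
    internal = "".join(decode_input(data, enc) for data, enc in inputs)
    return encode_output(internal, out_encoding)


# Two inputs from different sources, one output in a third encoding.
latin1_input = "año\n".encode("latin-1")
utf8_input = "שלום\n".encode("utf-8")
combined = combine([(latin1_input, "latin-1"), (utf8_input, "utf-8")], "utf-16")
```

The point of the design is that only the two conversion functions ever see an external encoding; everything in between works on one representation.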

This approach requires a method for specifying file encodings. Examples:
- Source files: an explicit @encoding directive, with the locale used only as the default.
- Input and output data: the value of a predefined ENCODING variable at open time, with the locale as the default.
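
The two rules above could behave roughly as in the following Python sketch. Neither @encoding nor ENCODING exists in gawk; both are only the proposals made in this message, and the directive syntax, the source_encoding function and the Runtime class are assumptions invented for illustration.

```python
import locale
import re

# Assumed directive syntax: @encoding "NAME" or @encoding NAME on a line of its own.
_DIRECTIVE = re.compile(r'^\s*@encoding\s+"?([A-Za-z0-9._-]+)"?\s*$')


def source_encoding(first_line: bytes, default=None) -> str:
    """Return the encoding named by a leading @encoding directive.

    Falls back to the given default, or to the locale's preferred
    encoding when no default is supplied.
    """
    if default is None:
        default = locale.getpreferredencoding(False)
    match = _DIRECTIVE.match(first_line.decode("ascii", errors="replace"))
    return match.group(1) if match else default


class Runtime:
    """Hypothetical interpreter state holding the proposed ENCODING variable."""

    def __init__(self):
        self.ENCODING = None  # unset means "use the locale"

    def open(self, path, mode="r"):
        # The encoding is read once, at open time. Changing ENCODING later
        # affects only files opened afterwards.
        encoding = self.ENCODING or locale.getpreferredencoding(False)
        return open(path, mode, encoding=encoding)
```

Because the value is sampled at open time, a program could set ENCODING to one value before reading an input file and to another before writing its output, which gives each data file its own encoding.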

Is it OK to discuss this topic in this forum?
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

