bug#29606: Command 'fold' dangerous with utf-8 input
From: Pádraig Brady
Subject: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Sat, 9 Dec 2017 15:50:36 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 08/12/17 19:15, Assaf Gordon wrote:
> Hello Mark,
>
> First,
> thank you for taking the time and effort
> to test our development snapshot, and reporting results back.
> This kind of feedback is critical in getting multibyte support ready.
>
>
> Second,
> I can confirm the behavior you are observing, reproduced here
> with 'od' for easier output:
>
> ## POSIX single-byte locale:
>
> $ echo "ß" | LC_ALL=C src/fold --bytes --width 1 | od -tc -An
> 303 \n 237 \n
> $ echo "ß" | LC_ALL=C src/fold --width 1 | od -tc -An
> 303 \n 237 \n
>
> ## UTF8 locale:
>
> $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --bytes --width 1 | od -tc -An
> 303 237 \n
>
> $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --width 1 | od -tc -An
> 303 237 \n
>
>
> On 2017-12-08 05:04 AM, Mark Roberts wrote:
>> When --bytes is not specified, the program treats '\b', '\r' and '\t'
>> specially. It assumes a tab width of eight (compile-time #define) and
>> attempts to keep track of what the output will look like.
>>
>> This is absolutely not what I expected.
>
> That is correct, and I share your sentiment: it also took me some time
> to try and track down why it behaves this way, and whether it's by
> design or a bug.
>
>> But of course, when the program
>> was first written, the words byte and character meant the same thing for
>> printable characters. Printable bytes.
>
> The reasoning for this behavior is explained in the OpenGroup's POSIX
> standard page for fold, in the "RATIONALE" section:
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18
>
> There, it is made clear:
> "Historical versions of the fold utility assumed 1 byte was one
> character and occupied one column position when written out. This is
> no longer always true.
> [....]
> Note that although the width for the -b option is in bytes, a line is
> never split in the middle of a character."
>
> Therefore, the current implementation (of the development version) is
> correct.
>
>> I will attempt to suggest an improved text for the man-page so that
>> others will not be surprised.
>
> I agree that once multibyte support is added to fold(1), the man pages,
> the help screen and texi manual must be updated to clearly
> indicate the "-b/--bytes" only applies to \b \t \r and never to
> multibyte characters.
>
> If you find the time to send such a patch - great!
> If not, I will add it sooner or later (hopefully sooner).
>
> As such I'm closing this bug report, but further discussion (and
> patches) are welcomed by replying to this thread.
Note that while splitting in the middle of a character is incorrect,
that doesn't preclude approximate byte counting with --bytes.
This is the approach the current i18n patch takes:
$ export LC_ALL=en_CA.UTF-8
$ echo "ßß" | fold-i18n --bytes --width 1 | od -tc -An
303 237 \n 303 237 \n \n
$ echo "ßß" | fold-i18n --bytes --width 2 | od -tc -An
303 237 \n 303 237 \n \n
$ echo "ßß" | fold-assaf --bytes --width 2 | od -tc -An
303 237 303 237 \n
The i18n version of fold also has a --characters option
to operate the way fold-assaf currently does.
I'm not convinced we want to differ from the i18n patch,
in this regard at least.
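
To make the difference between the two counting modes concrete, here is a
minimal Python sketch. It is not code from either patch: the function names
fold_bytes and fold_chars are hypothetical, and it ignores \b, \r, \t and
display column widths. It assumes "approximate counting" means a line is
broken before the first character that would push it past the byte budget,
and that a character wider than the budget still gets a line to itself.

```python
def fold_bytes(line: str, width: int) -> list:
    """Fold on a byte budget without ever splitting a character.

    Assumed model of "approximate" --bytes counting: a character that
    would exceed the budget starts a new line, and a character larger
    than the budget is emitted whole on its own line.
    """
    out, cur, used = [], "", 0
    for ch in line:
        n = len(ch.encode("utf-8"))  # size of this character in bytes
        if cur and used + n > width:
            out.append(cur)          # flush the line before overflowing
            cur, used = "", 0
        cur += ch
        used += n
    if cur:
        out.append(cur)
    return out


def fold_chars(line: str, width: int) -> list:
    """Fold on a character count, regardless of encoded byte length."""
    return [line[i:i + width] for i in range(0, len(line), width)]


if __name__ == "__main__":
    print(fold_bytes("ßß", 1))  # each 2-byte character on its own line
    print(fold_bytes("ßß", 2))  # budget of 2 bytes holds one character
    print(fold_chars("ßß", 2))  # two characters fit on one line
```

Under these assumptions the line breaks match the transcripts above: byte
budgets of 1 and 2 both put each "ß" on its own line, while counting
characters keeps "ßß" together at width 2. The sketch does not try to
reproduce the extra trailing newline seen in the fold-i18n output.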
cheers,
Pádraig.