From: ah
Subject: wget prints out information in unicode characters where ASCII could suffice
Date: Sat, 21 Mar 2020 14:40:40 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.1.1

Hello,
When wget gets a page successfully (consider for example: wget
www.gnu.org), it reports something like this:
...output omitted...
2020-03-21 14:00:41 (1.43 MB/s) - ‘index.html’ saved [1114171/1114171]
Please notice that the two apostrophes enclosing the fetched filename are
Unicode characters (U+2018 and U+2019, I guess?) whereas the ASCII
apostrophe character ' is completely sufficient.
What implications does that have, apart from polluting the terminal?
For one, when a user tries to copy+paste the fetched filename (e.g.
index.html) from wget's output, either the apostrophes are copied into
the buffer and mess up further commands, or the apostrophes are not
copied and the user needs to add them manually when pasting. For
example, try
ls ‘index.html’
it fails with
ls: cannot access '‘index.html’': No such file or directory
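
The command fails because the shell gives U+2018 and U+2019 no special meaning, so they end up as part of the filename argument. As a minimal sketch of a workaround, assuming a POSIX shell with sed, the helper below (the name ascii_quotes is hypothetical, not part of wget) maps both characters to the ASCII apostrophe before the text is reused:

```shell
#!/bin/sh
# Hypothetical helper, not part of wget: replace U+2018 and U+2019
# with the ASCII apostrophe (U+0027) in the given string.
# Two separate substitutions are used instead of a bracket expression
# so that the result is the same in byte-oriented and UTF-8 locales.
ascii_quotes() {
    printf '%s\n' "$1" | sed -e "s/‘/'/g" -e "s/’/'/g"
}

# The pasted command from above, with the quotes normalised.
ascii_quotes "ls ‘index.html’"
```

Note that this would also rewrite a filename that genuinely contains one of the two Unicode quotes, so it is only a stopgap.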
However, the single (ASCII) quotes are very important for a lot of users
in the case where filenames contain spaces or other characters that the
shell does not like and need escaping. So it's a good idea to have them,
but who would have thought that the devil is idle and decided to replace
all apostrophes in GNU software with unicode!
So, ideally (AFAIC) wget, on successful completion, should have printed
this:
2020-03-21 14:00:41 (1.43 MB/s) - 'index.html' saved [1114171/1114171]
(notice the single ASCII apostrophe for opening AND closing the filename)
and then the user could just copy that string, apostrophes included, and
paste it into further commands.
I understand that there is danger in copy+paste-ing information from a
program's output. But this is not relevant here as it is none of wget's
business to deter users from copy-pasting its output. If that's a real
concern then consider printing the filename in hex or as an image or
call the copy-paste police and snitch the user when he/she attempts to
use it.
But copy-paste is not the real issue here. There is another issue, far
more important: shell scripts processing wget's output.
That brings us to yet another case in point where this behaviour of wget
makes our lives more difficult: using wget's output in a shell script in
order to find out the name of the fetched file. Now, all of a sudden,
our shell scripts must deal with Unicode characters too. This is a no-go
scenario in many industrial places. A shell script may be classified as
sub-standard if it has to deal with Unicode because of the can of worms
that opens.
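
To make the scripting burden concrete, here is a rough sketch of what such a script ends up doing. It assumes a POSIX shell with sed and that the status line has exactly the shape shown earlier in this message; the function name saved_name is hypothetical. The script first has to normalise the two Unicode quotes and only then can it extract the name between the apostrophes:

```shell
#!/bin/sh
# Hypothetical helper: print the filename from a wget "saved" status line.
# Assumes the line format quoted in this message; accepts either the
# Unicode quotes (U+2018/U+2019) or plain ASCII apostrophes.
saved_name() {
    printf '%s\n' "$1" | sed -n \
        -e "s/‘/'/g" \
        -e "s/’/'/g" \
        -e "s/^.* - '\(.*\)' saved \[.*\$/\1/p"
}

line="2020-03-21 14:00:41 (1.43 MB/s) - ‘index.html’ saved [1114171/1114171]"
saved_name "$line"
```

With ASCII apostrophes in the output, the first two substitutions would be unnecessary and the script would never need to contain a non-ASCII byte.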
In conclusion, my opinion is that this bug is one of the most unpleasant
and dangerous bugs in wget, as it pollutes the terminal with Unicode
characters when ASCII characters are more than enough to convey the
information to the user. It opens not one but a tonne of cans of worms
and can have serious side effects on script processing in industry.
I would therefore URGE you to reconsider the use of Unicode characters
for mere aesthetic reasons, especially when ASCII characters can be used
for the same purpose. Aesthetics is a very subjective criterion, as you know.
There must be serious reasons to give the KISS principle the capital
punishment. Is this what GNU has come to?
On a parallel note, please accept my congratulations for the otherwise
very good software that wget is. I am using it daily and I thank you (I
too have contributed to public domain software and to software under GNU
licencing, spreading the karma of GNU).
bw,