Re: [Bug-wget] bad filename
From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filename
Date: Wed, 23 Apr 2014 14:43:21 +0200
User-agent: KMail/4.11.5 (Linux/3.13-1-amd64; KDE/4.11.5; x86_64; ; )
On Wednesday 23 April 2014 13:57:15 Andries E. Brouwer wrote:
> On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote:
> > On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote:
> >> If I ask wget to download the wikipedia page
> >>
> >> http://he.wikipedia.org/wiki/ש._שפרה
> >>
> >> then I hope for a resulting file ש._שפרה.
> >> Instead, wget gives me ש._שפר\327%94, where the \327
> >> is an unpronounceable byte that cannot be typed
> >> (This is an UTF-8 system and the filename
> >> that wget produces is not valid UTF-8.)
> >>
> >> Maybe it would be better if wget by default used the original filename.
> >> This name mangling is a vestige of old times, it seems to me.
> >
> > This is a commonly reported grievance and as you correctly mention a
> > vestige of old times. With UTF-8 supported filesystems, Wget should
> > simply write the correct characters.
> >
> > I sincerely hope this issue is resolved as fast as possible, but I
> > know not how to. Those who understand i18n should work on this.
>
> It is very easy to resolve the issue, but I don't know how backwards
> compatible the wget developers want to be.
I guess this is the #1 question ;-)
> The easiest solution is to change the line (in init.c:defaults())
> opt.restrict_files_ctrl = true;
> into
> opt.restrict_files_ctrl = false;
>
> That is what I would like to see:
> the default should be to preserve the name as-is,
> and there should be options "escape_control" or so
> to force the current default behaviour.
You know that you can override the default behaviour in ~/.wgetrc (or globally
in /etc/wgetrc)? Normally, the distributions' package maintainers should take
care of reasonable defaults in /etc/wgetrc. E.g. they could set
restrictfilenames=nocontrol for UTF-8 environments.
But I understand them being conservative with changes.
See also 'man wget':
"If you specify nocontrol, then the escaping of the control characters is also
switched off. This option may make sense when you are downloading URLs whose
names contain UTF-8 characters, on a system which can save and display
filenames in UTF-8 (some possible byte values used in UTF-8 byte sequences
fall in the range of values designated by Wget as "controls")."
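To illustrate the man page text, here is a minimal sketch. It assumes the
"control" ranges are bytes 0-31 and 128-159, as the Wget manual documents them.
The helper names is_control_byte and escape_controls are hypothetical, and this
is not Wget's actual code. It shows why the last letter of the Hebrew name
(UTF-8 bytes 0xD7 0x94) comes out as \327%94: the second byte falls into the
128-159 range and is percent-escaped, while the first byte is kept as-is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Wget's code: treat bytes 0x00-0x1F and
   0x80-0x9F as "controls" (the ranges named in the Wget manual). */
static int is_control_byte(unsigned char c)
{
    return c < 0x20 || (c >= 0x80 && c <= 0x9f);
}

/* Copy src into dst, replacing every "control" byte with %XX.
   dst must have room for at least 3 * strlen(src) + 1 bytes.
   Returns dst. */
static char *escape_controls(const char *src, char *dst)
{
    static const char hex[] = "0123456789ABCDEF";
    char *out = dst;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (is_control_byte(c)) {
            *out++ = '%';
            *out++ = hex[c >> 4];
            *out++ = hex[c & 0x0f];
        } else {
            *out++ = (char)c;
        }
    }
    *out = '\0';
    return dst;
}
```

Calling escape_controls("\xd7\x94", buf) leaves buf holding the byte 0xD7
followed by the three characters %94, which is no longer valid UTF-8. The
other letters of the name encode to second bytes above 0x9F and pass through
unchanged.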
> There are also more complicated solutions.
> One can ask for LC_CTYPE or LANG or some such thing,
> and try to find out whether the current system is UTF-8,
> and only in that case set restrict_files_ctrl to false.
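The locale check suggested in the quoted text could be sketched as follows,
assuming a POSIX system that provides nl_langinfo(). The helper names
codeset_is_utf8, locale_is_utf8 and default_restrict_files_ctrl are
hypothetical and not existing Wget functions; only the restrict_files_ctrl
setting itself appears in this thread.

```c
#define _POSIX_C_SOURCE 200809L  /* for nl_langinfo() and strcasecmp() */

#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: true if a codeset name denotes UTF-8.
   Accepts the common spellings "UTF-8" and "UTF8", ignoring case. */
static bool codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0
        || strcasecmp(codeset, "UTF8") == 0;
}

/* Hypothetical helper: true if the LC_CTYPE locale taken from the
   environment (LC_ALL, LC_CTYPE, LANG) uses a UTF-8 codeset. */
static bool locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return false;  /* locale not available: assume not UTF-8 */
    return codeset_is_utf8(nl_langinfo(CODESET));
}

/* Assumed policy from the quoted proposal: keep escaping "control"
   bytes unless the environment is UTF-8. */
static bool default_restrict_files_ctrl(void)
{
    return !locale_is_utf8();
}
```

This only looks at the local environment. It says nothing about the encoding
of the URL itself or of the target filesystem.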
You can also use
--local-encoding
and
--remote-encoding
for more control over the character encoding.
But what if you have a UTF-8 environment and want to use --input-file to
read URLs in some ISO-whatever encoding?
--remote-encoding is not the right option for that...
Yes, we would need an --input-encoding=... option.
OT:
Talking about i18n, another point is the punycode representation. By now there
are IDNA2003, IDNA2008 and the newest, TR46 (which mainly addresses some
incompatibilities between IDNA2003 and IDNA2008). Wget only supports IDNA2003.
> I don't know anything about the Windows environment.
That is a damn good argument not to change the default behaviour... who knows
all the environments where Wget is installed, and who is able to code a
"works everywhere" routine for i18n?
Back to the top... let the user and/or maintainer configure it - they know
best what they want.
Can you live with that answer?
BTW:
We have been talking about Wget2 for years... having a second tool named
'wget2' would allow us to change defaults or correct historically imposed
glitches. I would like to transfer lots of code from my project Mget
(https://github.com/rockdaboot/mget) into it (I am tired of maintaining ;-)
Tim