[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Bug-wget] [PATCH] Bug 40426 follow-up
From: |
Darshit Shah |
Subject: |
Re: [Bug-wget] [PATCH] Bug 40426 follow-up |
Date: |
Sun, 15 Mar 2015 01:30:59 +0530 |
User-agent: |
Mutt/1.5.23 (2014-03-12) |
Hi Ander,
This is my first attempt to fix 40426. I'm expecting to continue to work on it
until it's fixed, unless someone has objections.
Source of the problem:
When -r and -O - are specified together, Wget hangs. Well, it really doesn't
hang, it's actually waiting for input. When Wget downloads a file it saves it
to the disk first and then reads it again from there to parse it and get more
URLs, if any. But when '-O -' was specified, Wget reads from stdin. This is
done in wget_read_file(). As a funny game, when this happens, type some HTML,
and then hit Ctrl-D or any other key sequence equivalent to EOF, and you'll
effectively trick Wget into inserting arbitrary HTML.
That's an interesting observation about getting Wget to insert arbitrary HTML. I
would wary about this because it could even cause Wget to download an arbitrary
file.
Solution:
A patch was already proposed that simply avoided '-r' and '-O -' to be set together. But that would
void a clause in the documentation that states that "wget -O file http://foo" is intended to
work like "wget -O - http://foo > file". So, my approach is to maintain that.
Initially I thought of redirecting stdout to stdin, but that would be an ugly
hack that would probably make things difficult in the future. What's more,
there are options that rely on stdin, such as '--input-file=-'.
So, what I did is to write to a regular file in /tmp/.wget.stdout instead of in
stdout when '-O -' is passed, and dump everything in the end. This, apart from
fixing the bug, maintains the documented behavior.
I think this approach opens a can of worms. Multiple Wget processes, support
with other existing options and worst of all, people exercising this option in
ways I can't currently think of.
In such a scenario, I don't mind editing the documentation to add an exception.
But the solution presented above would cause too much pain in the long run to
maintain.
Although this is a good workaround for the short term, IMO the best approach is
to keep everything in memory and write stuff out in the end. Something similar
was already reported at 20714.
If someone is willing to work on a streaming model of the HTML Parser, this
approach might actually make some sense.
I don't expect this patch to be definite. This is just a follow-up to my work.
I'm looking forward to your feedback as I sure haven't taken into account all
the consequences.
So far, the following are still to be done:
- /tmp/.wget.stdout won't work on Windows, so port it.
- FOPEN_OPT_ARGS is not taken into account.
- I only took into account HTTP. The same workaround should be applied to FTP
too.
As you've already realized, the approach you propose creates a whole bunch of
corner cases that need to covered. At this moment, it looks like implementing
this approach for -r -O- would be large enough for an entire GSoC Project.
Instead I think it is better to stick to the basics and simply reject such
combinations of options
--
Thanking You,
Darshit Shah
pgpaOh1DHrHVD.pgp
Description: PGP signature