[Bug-wget] Combining --output-document with --recursive
From: Gijs van Tulder
Subject: [Bug-wget] Combining --output-document with --recursive
Date: Thu, 24 May 2012 23:45:20 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

Hi,
There's a problem if you combine --output-document with --recursive or
--page-requisites. --output-document breaks the recursion.
First you get a warning:
WARNING: combining -O with -r or -p will mean that all downloaded
content will be placed in the single file you specified.
That is what you'd expect, no problem there.
However, there is a problem with the recursion. Because Wget *appends*
all downloaded content to the same file, the HTML and CSS parsers get
confused. After each download the whole file is parsed again from the
start, each time with a different URL context.
Example:
1. You run wget -O out.tmp -r http://example.com/index.html
2. http://example.com/index.html is written to out.tmp.
   URLs are extracted from out.tmp relative to
   http://example.com/index.html. Suppose that there is a link to a
   subdirectory test/index.html, which is added to the download queue
   as http://example.com/test/index.html (correct).
3. http://example.com/test/index.html is appended to out.tmp.
   Now, again, Wget extracts URLs from out.tmp. It parses the whole
   file, so it first finds the contents of /index.html, with the link
   to test/index.html. Because Wget thinks it is now parsing
   http://example.com/test/index.html, it will enqueue this as
   http://example.com/test/test/index.html (wrong).
One obvious solution, which I've added to this email, is to clear the
output document before downloading the next file. This breaks the
current behaviour, so maybe it's not a good idea. Is there a better
solution?
Regards,
Gijs
--
index 8d4edba..502b68f 100644
--- a/src/http.c
+++ b/src/http.c
@@ -2888,7 +2888,18 @@ read_header:
}
}
else
- fp = output_stream;
+ {
+ fp = output_stream;
+ rewind (fp);
+ if (ftruncate (fileno (fp), 0) == -1)
+ {
+ logprintf (LOG_NOTQUIET, "Could not truncate output file: %s\n", strerror (errno));
+ CLOSE_INVALIDATE (sock);
+ xfree (head);
+ xfree_null (type);
+ return FOPENERR;
+ }
+ }
/* Print fetch message, if opt.verbose. */
if (opt.verbose)