[Bug-wget] Recursive retrieval
From: Dale R. Worley
Subject: [Bug-wget] Recursive retrieval
Date: Wed, 02 Nov 2016 12:24:03 -0400
Regarding my difficulties with recursively retrieving
http://www.iana.org/assignments/index.html: I discovered that one page
(http://www.iana.org/assignments/forces/forces.xhtml) is reachable
through no fewer than three different URLs:
http://www.iana.org/assignments/forces/forces.xhtml
http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml
http://www.iana.org/assignments/forces
The first is the proper URL for it, and the other two redirect to the
first URL.
There are several other occurrences of this situation.
I also discovered that if I specify --trust-server-names, wget
recognizes that the redirection target needs to be retrieved only once,
and that links to the other two URLs can be directed to that one file.
Without --trust-server-names, wget treats all three URLs as different,
even though they redirect to the same URL, and dutifully stores
essentially the same content three times. With --trust-server-names,
wget treats all three URLs as the same resource.
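
The difference can be illustrated with a minimal Python sketch. This is
a simplified model, not wget's actual code: the REDIRECTS table is a
hypothetical stand-in for the server's redirect responses, and the
names final_url and stored_copies are made up for illustration.

```python
# Hypothetical redirect table modeled on the three URLs listed above.
# A real client would learn these from the server's 3xx responses.
REDIRECTS = {
    "http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml":
        "http://www.iana.org/assignments/forces/forces.xhtml",
    "http://www.iana.org/assignments/forces":
        "http://www.iana.org/assignments/forces/forces.xhtml",
}


def final_url(url):
    """Follow entries in REDIRECTS until reaching a URL that is not redirected."""
    seen = set()
    while url in REDIRECTS and url not in seen:  # 'seen' guards against loops
        seen.add(url)
        url = REDIRECTS[url]
    return url


def stored_copies(links, trust_server_names):
    """Return the set of URLs that would each get their own stored file."""
    if trust_server_names:
        # Name files after the redirection target, so aliases collapse.
        return {final_url(u) for u in links}
    # Name files after the URL as originally requested.
    return set(links)


LINKS = [
    "http://www.iana.org/assignments/forces/forces.xhtml",
    "http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml",
    "http://www.iana.org/assignments/forces",
]

print(len(stored_copies(LINKS, trust_server_names=False)))  # 3
print(len(stored_copies(LINKS, trust_server_names=True)))   # 1
```

In this model, keying on the redirection target is what collapses the
three links into one stored file.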
It turns out that this provides me with a much better mirror of the web
site.
I've attached a patch that improves the documentation of
--trust-server-names, to clarify that if -nd is not in effect, then the
file name is constructed from the entire redirection URL, not just the
last component.
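
As a rough illustration of that naming rule, here is a small Python
sketch. It is an assumption-laden model rather than wget's real logic:
the function local_name is hypothetical, and it ignores other options
that affect file naming (for example -nH and --cut-dirs).

```python
import posixpath
from urllib.parse import urlsplit


def local_name(url, no_directories):
    """Simplified model of the local file name chosen for a URL.

    no_directories=True  mimics -nd: keep only the last path component.
    no_directories=False mimics the recursive default: host plus full path.
    """
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    if no_directories:
        # Fallback name for directory-style URLs (simplifying assumption).
        return posixpath.basename(path) or "index.html"
    return posixpath.join(parts.netloc, path)


redirect_target = "http://www.iana.org/assignments/forces/forces.xhtml"

print(local_name(redirect_target, no_directories=False))
# www.iana.org/assignments/forces/forces.xhtml
print(local_name(redirect_target, no_directories=True))
# forces.xhtml
```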
(--trust-server-names is also mentioned in doc/metalink-standard.txt,
but that text does not seem to me to have the problem the patch
corrects.)
Dale
0001-Improve-documentation-of-trust-server-names.patch
Description: Text Data