Re: [Bug-wget] Bug-wget Digest, Vol 99, Issue 10: regarding wget not con
From: Tim Ruehsen
Subject: Re: [Bug-wget] Bug-wget Digest, Vol 99, Issue 10: regarding wget not converting links correctly
Date: Tue, 31 Jan 2017 16:16:10 +0100
User-agent: KMail/5.2.3 (Linux/4.9.0-1-amd64; KDE/5.28.0; x86_64; ; )
On Tuesday, January 31, 2017 2:28:46 AM CET Kun Zhou wrote:
> I am replying to this mailing list regarding the second issue: wget not
> converting links correctly. I installed the alpha release of wget,
> 1.18.109-4734, on Arch Linux. When I run `wget -H -r -k -l 1
> econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB/TheoryF16.html`, an excerpt
> of the output from wget to the terminal is
>
> --2017-01-30 21:19:03--
> http://econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB/Bernoulli.pdf
> Reusing existing connection to www.econ.ucsb.edu:80.
> HTTP request sent, awaiting response... 200 OK
> Cookie coming from econ.ucsb.edu attempted to set domain to
> faculty.econ.ucsb.edu
Just a side note: the server is not configured correctly. One site tries to
set a cookie for a different site.
> Length: 479295 (468K) [application/pdf]
> www.econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB: Not a directory
> www.econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB/Bernoulli.pdf: Not a directory
>
> Cannot write to
> ‘www.econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB/Bernoulli.pdf’ (Not a
> directory).
Your page references 'www.econ.ucsb.edu/~tedb' as a link. It is downloaded
(HTML content) and saved under the file name 'www.econ.ucsb.edu/~tedb'. Since
that name is now a regular file and not a directory, any further attempt to
create files as 'www.econ.ucsb.edu/~tedb/*' will show this error.
You can circumvent this in some cases using the -E option. This will save the
file as 'www.econ.ucsb.edu/~tedb.html', which doesn't block further downloads.
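
The name clash can be reproduced without wget or any network access. The following is only a sketch: the host name www.example.com, the path '~user' and the scratch directory are made up for illustration, and the rename at the end merely imitates the effect -E has on the saved file name.

```shell
# Work in a throwaway directory so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

mkdir -p www.example.com
# Pretend wget saved the HTML page behind the link as a regular file.
printf '<html></html>\n' > 'www.example.com/~user'

# A later download needs '~user' to be a directory; this fails
# because a regular file with that name already exists.
if mkdir -p 'www.example.com/~user/Courses' 2>/dev/null; then
    conflict=no
else
    conflict=yes
fi
echo "name clash without .html suffix: $conflict"

# With the .html suffix the page no longer occupies the directory name.
mv 'www.example.com/~user' 'www.example.com/~user.html'
if mkdir -p 'www.example.com/~user/Courses'; then
    adjusted=ok
else
    adjusted=failed
fi
echo "directory creation after rename: $adjusted"
```

Whether the clash shows up in a real run depends on the order in which wget fetches the page and the files below it, which is why -E only helps "in some cases".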
> I can confirm that `Bernoulli.pdf` is still not downloaded and a number of
> links are not converted.
Works with -E.
> Another relevant issue is that the host names `econ.ucsb.edu` and
> `www.econ.ucsb.edu` resolve to the same IP address, verified by the `dig`
> command on Linux. However, wget fails to detect this fact and lists the
> two host names separately. Maybe this is a bug, or maybe just a feature. I
> have attached the complete wget output as a text file in case it is useful.
Wget (like any other web client I know of) will not make assumptions about
site relationships based on dig or DNS. Such assumptions would often be wrong
and turn out to be a huge security issue. There are no rules in the DNS about
trust relationships between two sites, and the IP being the same for two sites
doesn't tell you anything.
This gave me good results:
wget -d -olog -H -E -r -k -l 1 -D 'www.econ.ucsb.edu,econ.ucsb.edu' http://econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB/TheoryF16.html
The -D option restricts the host spanning enabled by -H to the sites/domains
given.
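
To illustrate the idea of such a domain filter, here is a rough sketch. The helper host_allowed is hypothetical and is not wget's implementation; it assumes a host is followed when it equals a listed domain or is a subdomain of one, and www.example.org is a made-up host name.

```shell
# Hypothetical helper, not part of wget: succeed if the host equals,
# or is a subdomain of, any entry in a comma-separated domain list.
host_allowed() {
    host=$1
    domains=$2
    old_ifs=$IFS
    IFS=,
    for d in $domains; do
        case $host in
            "$d" | *".$d")
                IFS=$old_ifs
                return 0
                ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

allowed='www.econ.ucsb.edu,econ.ucsb.edu'
for h in econ.ucsb.edu www.econ.ucsb.edu www.example.org; do
    if host_allowed "$h" "$allowed"; then
        echo "$h: follow"
    else
        echo "$h: skip"
    fi
done
```

With -H alone every host linked from the page would be followed; a list like the one above keeps the recursion on the two names that belong to the same site.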
Tim