Re: [Bug-wget] --page-requisites and robot exclusion issue
From: markk
Subject: Re: [Bug-wget] --page-requisites and robot exclusion issue
Date: Mon, 5 Dec 2011 13:28:45 -0000
User-agent: SquirrelMail/1.4.21
Hi,
Paul Wratt wrote:
> If it does not obey robots.txt, server admins will ban it.
>
> The workaround:
> 1) get the single HTML file first, edit out the meta tag, then re-get
> with --no-clobber (usually only needed on landing pages)
> 2) use an empty robots.txt (or one that allows everything; search the net)
>
> Possible solutions:
> A) a command-line option
> B) ./configure --disable-robots-check
>
> Paul
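In concrete terms, workaround 1 might look something like this (the URL and
filename are placeholders, and the sed command assumes GNU sed; treat it as
a sketch rather than a recipe):

    # Fetch the landing page on its own first.
    wget http://example.com/index.html

    # Strip the robots meta tag from the saved copy (GNU sed, case-insensitive).
    sed -i 's/<meta name="robots"[^>]*>//gI' index.html

    # Re-fetch with page requisites. With --no-clobber, wget parses the
    # existing (edited) local .html file instead of downloading it again.
    wget --page-requisites --no-clobber http://example.com/index.html

That works, but it's a lot of manual fiddling for every landing page.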
The best solution is surely for wget, when fetching page requisites, to
always ignore robots.txt (and <META NAME="ROBOTS"... in the HTML). (It
would still by default obey robots.txt when downloading anything other
than page requisites.)
After all, if you go to the URL using a web browser, the browser fetches
all page requisites. So wget wouldn't be downloading any more than the web
site owner expects.
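As far as I know, the closest thing available today is the all-or-nothing
robots switch in wgetrc, which can also be passed on the command line, e.g.
(placeholder URL):

    # Disables robot exclusion entirely (robots.txt and the nofollow meta
    # handling) for everything wget fetches, not just page requisites.
    wget -e robots=off --page-requisites http://example.com/page.html

but that is much blunter than only relaxing the check for page requisites.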
-- Mark