Re: [Bug-wget] wget not stop when using -e robots=off option
From: Tim Ruehsen
Subject: Re: [Bug-wget] wget not stop when using -e robots=off option
Date: Wed, 30 Nov 2016 10:11:15 +0100
User-agent: KMail/5.2.3 (Linux/4.8.0-1-amd64; KDE/5.28.0; x86_64; ; )
On Sunday, November 27, 2016 5:40:09 PM CET Sethi Badhan wrote:
> Hello
>
> When I simply run wget in a for loop it works fine, but when I run it
> with -e robots=off it does not stop: it keeps downloading pages
> recursively, even though I have set a limit on the for loop. Here is
> my code:
>
> #!/bin/bash
>
> lynx --dump https://en.wikipedia.org/wiki/Cloud_computing \
>     | awk '/http/{print $2}' \
>     | grep 'https://en\.' \
>     | grep -v '\.svg\|\.png\|\.jpg\|\.pdf\|\.JPG\|\.php' > Pages.txt
> grep -vwE "(http://www.enterprisecioforum.com/en/blogs/gabriellowy/value-data-platform-service-dpaas)" Pages.txt > newpage.txt
> rm Pages.txt
> egrep -v "#|^$" newpage.txt > try.txt
> awk '!a[$0]++' try.txt > new.txt
> rm newpage.txt
> rm try.txt
> mkdir -p htmlpagesnew
> cd htmlpagesnew
> j=0
> for i in $(cat ../new.txt); do
>     if [ $j -lt 10 ]; then
>         let j=j+1
>         echo $j
>         wget -N -nd -r "$i" -e robots=off --wait=.25
>     fi
> done
Maybe you don't want '-r'?

robots=off circumvents the robots.txt exclusion list... so it might download
much more (and thus perhaps 'never' stop).
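
For what it's worth, a minimal sketch of the loop body, assuming the intent
is to fetch each listed page once rather than crawl outward from it (the
-N/-nd/--wait options are kept from the original script; -p and -l are
wget's page-requisites and recursion-depth options):

    # Fetch only the page itself, plus the images/CSS it needs to render:
    wget -N -nd -p -e robots=off --wait=.25 "$i"

    # Or, if one hop of recursion really is wanted, cap the depth explicitly:
    wget -N -nd -r -l 1 -e robots=off --wait=.25 "$i"

Either way the download per URL is bounded, so the for loop's counter limit
actually takes effect.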
Tim