[Bug-wget] [bug #20398] Save a list of the links that were not followed
From: Jookia
Subject: [Bug-wget] [bug #20398] Save a list of the links that were not followed
Date: Thu, 07 May 2015 15:58:53 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0
Follow-up Comment #5, bug #20398 (project wget):
I've found myself in need of this feature. I'm trying to download a website
recursively without pulling in every single ad and its HTML. I'd like to be
able to find out which URLs were rejected, why they were rejected, and details
about their domains (host, port, etc.).
I've patched my copy of Wget to dump all of this into a CSV file, which I can
then process with standard tools to get the results I want:
% grep "DOMAIN" rejected.csv | head -1
DOMAIN,http://c0059637.cdn1.cloudfiles.rackspacecloud.com/flowplayer-3.2.6.min.js,SCHEME_HTTP,c0059637.cdn1.cloudfiles.rackspacecloud.com,80,flowplayer-3.2.6.min.js,(null),(null),(null),http://redated/,SCHEME_HTTP,redacted,80,,(null),(null),(null)
% grep "DOMAIN" rejected.csv | cut -d"," -f4 | sort | uniq
0.gravatar.com
1.gravatar.com
c0059637.cdn1.cloudfiles.rackspacecloud.com
lh3.googleusercontent.com
lh4.googleusercontent.com
lh5.googleusercontent.com
lh6.googleusercontent.com
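Going one step further, the same file can be used to see which hosts account for the most rejections. This is only a rough sketch: the function name count_rejected_hosts is made up for illustration, and it assumes the column layout of the sample row above (rejection reason in field 1, host in field 4) and that no field contains an embedded comma, which real URLs may violate.

```shell
# Hypothetical helper (not part of the patch): count rejected URLs per host,
# most frequent first.
# Assumes the layout of the sample row: reason in field 1, host in field 4,
# and no commas inside any field.
count_rejected_hosts() {
    awk -F',' '$1 == "DOMAIN" { print $4 }' "$1" | sort | uniq -c | sort -rn
}
```

Calling `count_rejected_hosts rejected.csv` prints one line per host, with the number of rejected URLs in front of the host name.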
I've attached a patch, written in a few hours, that does this.
(file #33955)
_______________________________________________________
Additional Item Attachment:
File name: 0001-rejected-log-Add-option-to-dump-URL-rejections-to-a-.patch
Size:14 KB
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?20398>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/