From: wesley
Subject: [Bug-wget] Filter question: Downloading only L2 and deeper?
Date: Fri, 16 Dec 2011 03:46:47 +0000

I'm trying to figure out if there is any way to set up directory 
includes/excludes or filters to recursively download files that are at 
level 2 (L2) or deeper from the base URL, while dropping any non-HTML L1 links.
In other words, if I pass wget http://example.com/stuff/, I only want to 
download files for which --include-directories=/stuff/*/* holds true.
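For reference, the kind of invocation I have in mind looks roughly like 
this (example.com and /stuff/ are just stand-ins for the real site):

    # Recursive crawl, restricted (in theory) to the second level and
    # deeper under /stuff/; -np keeps wget from walking back up.
    wget -r -np --include-directories='/stuff/*/*' http://example.com/stuff/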
The problem I run into when using --include-directories=/stuff/*/* is 
that when wget fetches the index at example.com/stuff/ it dequeues it 
and thus never recurses into the subdirectories I'm interested in. The 
second issue is that the index pages at '/' boundaries are all 
auto/dynamically generated; there is no "index.html" or file extension I 
can add as a filter rule (unless there is a syntax for doing so I'm not 
aware of).
And while I'm on the topic, just to be clear: --accept="/stuff/*.html" is 
not valid syntax, correct? As I understand it, accept filters don't take 
path components; they only operate on the filename.
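In other words (assuming my reading of the manual is right), the -A 
pattern can only describe the file name, and any directory restriction 
has to go through -I/-X separately, e.g.:

    # -A/--accept matches the file name only, so this keeps *.html
    # regardless of which directory it lives in:
    wget -r -np -A '*.html' http://example.com/stuff/

    # The directory part has to be stated separately:
    wget -r -np -A '*.html' -I '/stuff/*/*' http://example.com/stuff/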
What I'm trying to accomplish could easily be solved if there were a way 
to combine path + filename filters into atomic groupings (or with full 
URL regex parsing :). In the meantime, however, if there is any hackish 
way to accomplish what I'm trying to do, I would appreciate any pointers 
in the right direction. This basically came about because I already did 
a very large crawl at L1 and would now like to continue the crawl from 
the L2 links and deeper. I don't want to wait on tens of thousands of 
HEAD requests for files I already know are up to date just to be able to 
get to the L2+ links.
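The closest hack I've come up with so far is a two-pass approach. It is 
untested and assumes the auto-generated index at /stuff/ emits simple 
relative links like "subdir/":

    # Pass 1: grab only the top-level index and scrape the subdirectory
    # links out of it (crude, but enough for an auto-generated listing).
    wget -q -O stuff-index.html http://example.com/stuff/
    grep -Eo 'href="[^"]+/"' stuff-index.html \
        | sed -e 's/^href="//' -e 's/"$//' \
              -e 's#^#http://example.com/stuff/#' > subdirs.txt

    # Pass 2: recurse from each subdirectory index instead of from
    # /stuff/ itself, so the L1 files are never touched again.
    wget -r -np -i subdirs.txt

That would at least keep the crawl away from the L1 files entirely, so 
there would be no revalidation round trips to sit through, but it feels 
like working around wget rather than with it.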
Thanks :)


