Re: [Bug-wget] mirroring a Blogger blog without the comments
From: Gisle Vanem
Subject: Re: [Bug-wget] mirroring a Blogger blog without the comments
Date: Fri, 25 Apr 2014 10:55:43 +0200
<address@hidden> wrote:
> Even more general would be something like --next-urls-cmd=<CMD>, where you
> could supply a command that accepts an HTTP response on stdin, and then
> writes the set of URLs to stdout which should be crawled based on it.
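As a rough sketch, a filter for such a hypothetical --next-urls-cmd hook could be
built around lynx (assuming the hook passes the HTML body on stdin and reads one
URL per line back on stdout; 'showComment=' is just a placeholder for whatever
marks the comment links):

  #!/bin/sh
  # Hypothetical --next-urls-cmd filter: HTML in on stdin, URLs to crawl on stdout.
  # Let lynx list every link (via its -stdin option), strip the "  1. " numbering,
  # then drop anything that looks like a comment URL.
  lynx -dump -listonly -stdin |
    sed -n 's/^ *[0-9][0-9]*\. //p' |
    grep -v 'showComment='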
You could use Lynx to extract all the links with:
lynx -dump --listonly URL > urls.file
Edit / grep the 'urls.file' and use the 'wget -i' option to download what
you want. From 'man wget':
-i file
--input-file=file
Read URLs from a local or external file. If - is specified as
file, URLs are read from the standard input. (Use ./- to read from
a file literally named -.)
If this function is used, no URLs need be present on the command
line. If there are URLs both on the command line and in an input
file, those on the command lines will be the first ones to be
retrieved. If --force-html is not specified, then file should
consist of a series of URLs, one per line.
However, if you specify --force-html, the document will be regarded
as html. In that case you may have problems with relative links,
which you can solve either by adding "<base href="url">" to the
documents or by specifying --base=url on the command line.
If the file is an external one, the document will be automatically
treated as html if the Content-Type matches text/html. Furthermore,
the file's location will be implicitly used as base href if
none was specified.
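Putting the pieces together, a rough sketch of the whole workflow (the blog
address and the 'showComment=' pattern are only placeholders; adjust the grep
to whatever marks the comment links on the blog you are mirroring):

  # 1. Let lynx collect every link on the page.
  lynx -dump -listonly http://example.blogspot.com/ > urls.file

  # 2. Strip lynx's numbering and drop the comment links.
  sed -n 's/^ *[0-9][0-9]*\. //p' urls.file | grep -v 'showComment=' > wanted.urls

  # 3. Feed the remaining URLs to wget.
  wget --input-file=wanted.urls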
--gv