On 2008-12-12 09:03 +0100, Morten Lemvigh wrote:
Links on a page with a missing Last-modified header are not
scanned if the page is already on disk. If I run:
wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
--08:51:24--
http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
=> `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
Resolving eur-lex.europa.eu... 147.67.136.2, 147.67.136.102,
147.67.119.2, ...
Connecting to eur-lex.europa.eu|147.67.136.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9.709 (9.5K) [text/html]
Last-modified header missing -- time-stamps turned off.
08:51:24 (82.42 KB/s) -
`eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML' saved
[9709/9709]
[....]
wget will retrieve the page and continue recursively getting all the
linked pages, as I would expect.
OK. This is normal.
If I issue this command a second time, all I get is this:
wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
--08:53:18--
http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
=> `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
Resolving eur-lex.europa.eu... 147.67.119.2, 147.67.119.102,
147.67.136.2, ...
Connecting to eur-lex.europa.eu|147.67.119.2|:80... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
08:53:18 ERROR 500: Internal Server Error.
FINISHED --08:53:18--
Downloaded: 0 bytes in 0 files
So all the pages linked from this page are ignored too. It's fine
if wget skips the problematic document, but I would prefer wget
to continue the recursive scan.
The first time, the local file doesn't exist so Wget issues a GET
request, which succeeds (200).
The second time, the local file exists so Wget must first check
whether the resource has changed. To that end, it issues a HEAD
request. The server apparently doesn't know when the document was
last modified. It could fulfill the HEAD request without a
Last-modified header. Instead, it rejects it with a 500.
It's not that the missing Last-modified header causes Wget to
"ignore the links". It's that there is no document to scan for
links because, when queried about it, the server replied 500.
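
To make the two runs easier to compare, here is a simplified model of the sequence described above. It is only a sketch of the behaviour as explained in this message, not Wget's actual code; the function names `first_request` and `links_are_scanned` are made up for illustration.

```python
def first_request(local_file_exists: bool) -> str:
    """Pick the first request for a URL when time-stamping (-N) is on.

    Hypothetical helper modelling the explanation in this thread.
    """
    # No local copy: there is nothing to compare against, so fetch the document.
    # Local copy present: ask for the headers only, to see whether it changed.
    return "HEAD" if local_file_exists else "GET"


def links_are_scanned(status: int) -> bool:
    """In this model, links are followed only if the first request succeeds."""
    # An error status (such as 500) leaves no document to scan for links.
    return 200 <= status < 300


# First run: no local file, GET answered with 200, recursion proceeds.
print(first_request(False), links_are_scanned(200))
# Second run: local file exists, HEAD answered with 500, recursion stops.
print(first_request(True), links_are_scanned(500))
```

The point of the model is that the Last-modified header never enters the decision; only the status of the first request does.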
To work around that kind of brokenness, Wget would have to ignore
the 500 error and fall back on parsing the local file. That should
probably not be made the default behaviour, though.
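
For what it's worth, the fallback could look roughly like the following. This is a hypothetical sketch written with Python's standard `html.parser`, not a patch against Wget and not an existing Wget option; `LinkCollector` and `fallback_links` are names invented here, and treating every 5xx status as "use the local copy" is an assumption.

```python
from html.parser import HTMLParser
from typing import List, Optional


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def fallback_links(head_status: int, local_html: str) -> Optional[List[str]]:
    """Return links from the local copy if the server answered 5xx.

    Returns None for any other status, meaning "handle as usual".
    """
    if not 500 <= head_status < 600:
        return None
    collector = LinkCollector()
    collector.feed(local_html)
    return collector.links


# Example with a made-up page body standing in for the file on disk.
page = '<html><body><a href="a.html">A</a> <a href="b.html">B</a></body></html>'
print(fallback_links(500, page))
print(fallback_links(200, page))
```

Making this opt-in rather than automatic would keep a genuine server error from being silently masked by a stale local copy.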