
[Wget-dev] wget2 | Incompatible Behavior: -p (--page-requisites) and -np


From: Tsukasa OI
Subject: [Wget-dev] wget2 | Incompatible Behavior: -p (--page-requisites) and -np (--no-parent) (#379)
Date: Thu, 03 May 2018 10:13:27 +0000

New Issue was created.

Issue 379: https://gitlab.com/gnuwget/wget2/issues/379
Author:    Tsukasa OI
Assignee:  

Wget1 and Wget2 behave differently when all of the following hold:

1. Both `-p` and `-np` are given
2. A page requisite (image, CSS, etc.) and the original page are on the same 
host (they share the same domain)
3. The page requisite lies outside the directory that contains the original 
page (HTML file)

In particular, this behavior affects recursive downloading.  For instance, take 
a website (`http://example.com/`) with the following files:

* `/style.css`: global style for the website
* `/category/index.html`: local page index (refers to `/style.css` and links to 
`/category/page.html`)
* `/category/page.html`:  local page (also refers to `/style.css`)

`wget -r -l 0 -p -np http://example.com/category/index.html` downloads all 
three files, but `wget2 -p -r -l 0 -p -np 
http://example.com/category/index.html` doesn't download the global `style.css`. 
This is a simple example; the website I want to crawl is far more complex 
(which makes `--accept-regex` and `--reject-regex` nearly unusable).

This behavior is consistent in _some_ way (it works just like `-H` 
[`--span-hosts`]), but not being able to retrieve page requisites during a 
recursive download is undesirable for me (and in general).
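
The difference can be modeled with a small sketch. This is not code from Wget1 or Wget2; it only mimics the outcome reported above for the example site. The names `parent_dir`, `allowed`, `crawl_set`, and the `exempt_requisites` switch are hypothetical and exist only for illustration.

```python
from urllib.parse import urlsplit

def parent_dir(url):
    """Return the directory part of the URL path, ending in '/'."""
    path = urlsplit(url).path or "/"
    return path[: path.rfind("/") + 1]

def allowed(url, start_url, is_requisite, exempt_requisites):
    """Decide whether a discovered URL would be fetched under -np.

    exempt_requisites=True models the Wget1 outcome described above
    (page requisites bypass the parent-directory check);
    exempt_requisites=False models the reported Wget2 outcome.
    """
    u, s = urlsplit(url), urlsplit(start_url)
    if u.netloc != s.netloc:
        return False  # this sketch never spans hosts
    if is_requisite and exempt_requisites:
        return True
    return (u.path or "/").startswith(parent_dir(start_url))

START = "http://example.com/category/index.html"

# (URL, is it a page requisite?) pairs found while crawling the example site
LINKS = [
    ("http://example.com/style.css", True),
    ("http://example.com/category/page.html", False),
]

def crawl_set(exempt_requisites):
    """List the URLs that end up downloaded for the example site."""
    return [START] + [
        url for url, req in LINKS
        if allowed(url, START, req, exempt_requisites)
    ]
```

With `crawl_set(True)` all three files are kept, while `crawl_set(False)` drops `/style.css` because it sits outside `/category/`.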


I think it can be resolved by using `link_inline` somehow, but I'm not sure:

1. Whether using `link_inline` can fix the issue
2. Whether changing Wget2 to behave like Wget1 is a good idea (is there a 
better behavior than that of Wget1 [and current Wget2]? Could we have a 
command-line option?)

...partly because I first saw the source code of wget (1 and 2) today.


