From: Florian Rosenauer
Subject: Re: wget skips some local files for link updates in recursive mode if there are many files to be downloaded
Date: Sun, 28 Feb 2021 11:51:23 +0100
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.7.1
Hello,
I was able to narrow the problem down a bit further. I excluded the
image downloads, which make up the biggest part, and experimented with
downloading different sets of pages.
The following command downloads pages A-b (about 140 files in total) and
works fine:
wget --recursive --no-clobber --page-requisites --html-extension \
  --convert-links --restrict-file-names=windows --span-hosts \
  --domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com \
  --reject-regex \
  ".*xwing-miniatures.fandom.com/f/.*|.*xwing-miniatures.fandom.com/wiki/[C-Zc-z].*|.*static.wikia.nocookie.net/xwing-miniatures/images/.*|.*Template.*|.*action=edit.*|.*action=history.*|.*oldid=.*" \
  https://xwing-miniatures.fandom.com
The following command downloads pages A-c (about 540 files in total) and
fails to rewrite the href of the link rel="stylesheet" element to the
locally downloaded file:
wget --recursive --no-clobber --page-requisites --html-extension \
  --convert-links --restrict-file-names=windows --span-hosts \
  --domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com \
  --reject-regex \
  ".*xwing-miniatures.fandom.com/f/.*|.*xwing-miniatures.fandom.com/wiki/[D-Zd-z].*|.*static.wikia.nocookie.net/xwing-miniatures/images/.*|.*Template.*|.*action=edit.*|.*action=history.*|.*oldid=.*" \
  https://xwing-miniatures.fandom.com
I thought that maybe a "C" page triggers a bug in wget, but downloading
only pages F-z also fails to update the link, so to me it looks like the
problem really is related to the number of pages being downloaded. The
more pages are downloaded, the more likely it is that wget doesn't
update the CSS stylesheet links.
I think I will write a small hardcoded sed script to update the links in
my files to the local ones.
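Such a workaround could look roughly like the sketch below. This is not
how wget itself converts links. The function name relink_stylesheet is
made up, and the local file name has to be supplied by hand (copied from
the download directory, because wget escapes and shortens it). The sketch
rewrites every absolute load.php href to the same local target, so it
only fits pages that contain a single such link.

```shell
#!/bin/sh
# Hypothetical helper, not part of wget.
# $1 = local href to put into the pages, remaining args = HTML files to patch.
relink_stylesheet() {
    local_href=$1
    shift
    # Escape characters that are special in a sed replacement:
    # backslash, ampersand, and the "|" used as delimiter below.
    escaped=$(printf '%s' "$local_href" | sed 's/[\\&|]/\\&/g')
    for f in "$@"; do
        # Replace any absolute load.php href with the local one.
        # Assumption: all such hrefs in the file should point to the same file.
        sed "s|href=\"https://xwing-miniatures\.fandom\.com/load\.php?[^\"]*\"|href=\"$escaped\"|g" \
            "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

Pages stored below wiki/ would need the local href prefixed with "../",
so the function would be called once per directory level.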
Maybe someone has an idea what causes this, or suggestions for narrowing
down the original problem further.
Thanks
Florian
On 21.02.2021 17:49, Florian Rosenauer wrote:
Hello!
I have the following problem: the page
https://xwing-miniatures.fandom.com/wiki/X-Wing_Miniatures_Wiki contains
a link element referencing a stylesheet:
<link rel="stylesheet"
href="/load.php?lang=en&modules=ext.categorySelect.runtimeStyles%7Cext.fandom.ArticleInterlang.css%7Cext.fandom.CreatePage.css%7Cext.fandom.DesignSystem.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.UserPreferencesV2.runtime.css%7Cext.fandom.bannerNotifications.css%7Cext.fandom.coreRuntimeStyles%2CwikiaBarRuntimeStyles%7Cext.fandom.mainPageTag.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskin.oasis.css%7Cskin.oasis.fanFeed.css%2CdiscussionsRuntimeStyles%7Cskin.oasis.pageheader.Share.css&only=styles&skin=oasis"/>
During a single-page download the link is rewritten to point to the
local file. During a recursive download the file is downloaded, but the
link is only rewritten if the download is limited to a few pages.
To reproduce:
1a. Run wget to load a single page using --page-requisites:
wget --no-clobber --page-requisites --html-extension \
  --convert-links --restrict-file-names=windows --span-hosts \
  https://xwing-miniatures.fandom.com -d -o singlepage.log
1b. Result: a fully working local version in
xwing-miniatures.fandom.com\index.html; the link element is rewritten to
point to the local file:
<link rel="stylesheet"
href="load.php@lang=en&modules=ext.categorySelect.runtimeStyles%257Cext.fandom.ArticleInterlang.css%257Cext.fandom.CreatePage.css%257Cext.fandom.DesignSystem.css%257Cext.fandom.Thumbnails.css%257Cext.fandom.UserPreferencesV2.css"/>
2a. Run wget recursively, but limit it to very few pages with
--reject-regex:
wget --recursive --no-clobber --page-requisites --html-extension \
  --convert-links --restrict-file-names=windows --span-hosts \
  --domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com \
  --reject-regex ".*xwing-miniatures.fandom.com/.*/.*|.*action=edit.*" \
  https://xwing-miniatures.fandom.com
2b. Result: a fully working local version in
xwing-miniatures.fandom.com\index.html; the link element is rewritten to
point to the local file:
<link rel="stylesheet"
href="load.php@lang=en&modules=ext.categorySelect.runtimeStyles%257Cext.fandom.ArticleInterlang.css%257Cext.fandom.CreatePage.css%257Cext.fandom.DesignSystem.css%257Cext.fandom.Thumbnails.css%257Cext.fandom.UserPreferencesV2.css"/>
3a. Attention: this downloads about 3300 pages / 8500 files!
Run wget recursively with fewer restrictions in --reject-regex
(reject only the forum and the wiki edit/history pages):
wget --recursive --no-clobber --page-requisites --html-extension \
  --convert-links --restrict-file-names=windows --span-hosts \
  --domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com \
  --reject-regex \
  "https://xwing-miniatures.fandom.com/f/.*|.*action=edit.*|.*action=history.*|.*@oldid=.*" \
  https://xwing-miniatures.fandom.com
3b. Result: after everything has finished, the link has been changed to
refer to the online page (!), although the file was downloaded locally
and is needed for a working local page:
<link rel="stylesheet"
href="https://xwing-miniatures.fandom.com/load.php?lang=en&modules=ext.categorySelect.runtimeStyles%7Cext.fandom.ArticleInterlang.css%7Cext.fandom.CreatePage.css%7Cext.fandom.DesignSystem.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.UserPreferencesV2.runtime.css%7Cext.fandom.bannerNotifications.css%7Cext.fandom.coreRuntimeStyles%2CwikiaBarRuntimeStyles%7Cext.fandom.mainPageTag.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskin.oasis.css%7Cskin.oasis.fanFeed.css%2CdiscussionsRuntimeStyles%7Cskin.oasis.pageheader.Share.css&only=styles&skin=oasis"/>
If you open the page offline (without anything cached!), it is
completely unusable because the main CSS is missing.
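To see how many pages are affected after a run like 3a, the downloaded
tree can be searched for stylesheet links that still point online. A
minimal sketch, assuming GNU grep (for -r and --include); the function
name list_unconverted is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: list downloaded HTML files in which a load.php
# href still points to the online host instead of a local file.
# $1 = root directory of the download
list_unconverted() {
    # -r: recurse, -l: print only file names, --include: HTML files only (GNU grep)
    grep -rl --include='*.html' \
        'href="https://xwing-miniatures\.fandom\.com/load\.php?' "$1"
}
```

Piping the output through "wc -l" gives a count that can be compared
between runs with different --reject-regex settings.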
Note 1: the --domains list was built by looking at the result of command
1a, in order to limit the downloads.
Note 2: on Windows, make sure to download into a directory with a short
path: the long file names soon push the full path past 256 chars, and
neither Firefox nor Chrome can open file paths > 256 chars on Windows xD
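A quick way to spot such files is to list every path longer than a given
limit. This is only a sketch: the function name long_paths is made up,
and the length is measured on the path as the shell sees it (for example
a Cygwin path), which can differ from the Windows path, so a lower limit
may be the safer choice.

```shell
#!/bin/sh
# Hypothetical helper: print all files whose path is longer than a limit.
# $1 = root directory of the download
# $2 = maximum allowed path length (defaults to 256)
long_paths() {
    find "$1" -type f | awk -v max="${2:-256}" 'length($0) > max'
}
```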
Version Information:
$ wget -V
GNU Wget 1.21.1 built on cygwin.
+cares +digest +gpgme +https +ipv6 +iri +large-file +metalink +nls
+ntlm +opie +psl +ssl/gnutls
Wgetrc:
/etc/wgetrc (system)
Locale:
/usr/share/locale
Compile:
gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
-DLOCALEDIR="/usr/share/locale" -I.
-I/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1/src
-I../lib
-I/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1/lib
-I/usr/include/uuid -DNDEBUG -ggdb -O2 -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fstack-protector-strong --param=ssp-buffer-size=4
-fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/build=/usr/src/debug/wget-1.21.1-1
-fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1=/usr/src/debug/wget-1.21.1-1
Link:
gcc -I/usr/include/uuid -DNDEBUG -ggdb -O2 -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fstack-protector-strong --param=ssp-buffer-size=4
-fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/build=/usr/src/debug/wget-1.21.1-1
-fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1=/usr/src/debug/wget-1.21.1-1
-lmetalink -lcares -lpcre2-8 -luuid -lidn2 -lnettle -lgnutls -lz
-lpsl -lgpgme ftp-opie.o gnutls.o http-ntlm.o ../lib/libgnu.a
-liconv -lintl -lunistring
Should I submit a bug report? Am I missing something?
Thank you
Florian