From: Florian Rosenauer
Subject: wget skips some local files for link updates in recursive mode if there are many files to be downloaded
Date: Sun, 21 Feb 2021 17:49:26 +0100
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.7.1

Hello!
I have the following problem: the page
https://xwing-miniatures.fandom.com/wiki/X-Wing_Miniatures_Wiki contains
a link element referencing a stylesheet:
<link rel="stylesheet"
href="/load.php?lang=en&modules=ext.categorySelect.runtimeStyles%7Cext.fandom.ArticleInterlang.css%7Cext.fandom.CreatePage.css%7Cext.fandom.DesignSystem.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.UserPreferencesV2.runtime.css%7Cext.fandom.bannerNotifications.css%7Cext.fandom.coreRuntimeStyles%2CwikiaBarRuntimeStyles%7Cext.fandom.mainPageTag.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskin.oasis.css%7Cskin.oasis.fanFeed.css%2CdiscussionsRuntimeStyles%7Cskin.oasis.pageheader.Share.css&only=styles&skin=oasis"/>
During a single-page download, the link is rewritten to point to the local
file. During a recursive download the file is still downloaded, but the link
is only rewritten if the download is limited to a few pages.
To reproduce:
1a. Run wget to load a single page using --page-requisites:
wget --no-clobber --page-requisites --html-extension
--convert-links --restrict-file-names=windows --span-hosts
https://xwing-miniatures.fandom.com -d -o singlepage.log
1b. Result: a fully working local version in
xwing-miniatures.fandom.com\index.html; the link element is rewritten
to point to the local file:
<link rel="stylesheet"
href="load.php@lang=en&modules=ext.categorySelect.runtimeStyles%257Cext.fandom.ArticleInterlang.css%257Cext.fandom.CreatePage.css%257Cext.fandom.DesignSystem.css%257Cext.fandom.Thumbnails.css%257Cext.fandom.UserPreferencesV2.css"/>
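The relation between the original href and the rewritten one can be sketched in a few lines of Python. This is only inferred from the output shown in 1b, not taken from wget's source: the function name local_href is hypothetical, and the sketch assumes that --restrict-file-names=windows replaces the "?" before the query with "@" and that --convert-links escapes each literal "%" in the file name as "%25". It does not model the shortening of the very long file name that is also visible in 1b.

```python
def local_href(url_path: str) -> str:
    """Approximate the rewritten href from step 1b (inferred from the output)."""
    # The local file lives next to index.html, so the leading "/" is dropped.
    name = url_path.lstrip("/")
    # Assumption: with --restrict-file-names=windows the "?" that starts the
    # query is written as "@", because "?" is not allowed in Windows file names.
    name = name.replace("?", "@", 1)
    # Assumption: --convert-links escapes a literal "%" as "%25", so a browser
    # does not decode "%7C" back to "|" when it resolves the local link.
    return name.replace("%", "%25")


original = ("/load.php?lang=en&modules=ext.categorySelect.runtimeStyles"
            "%7Cext.fandom.ArticleInterlang.css")
print(local_href(original))
```

The printed value starts the same way as the href in 1b ("load.php@lang=en&modules=ext.categorySelect.runtimeStyles%257C..."), which is why "%7C" shows up as "%257C" in the converted page.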
2a. Run wget recursively, but limit it with --reject-regex to very
few pages:
wget --recursive --no-clobber --page-requisites --html-extension
--convert-links --restrict-file-names=windows --span-hosts
--domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com
--reject-regex ".*xwing-miniatures.fandom.com/.*/.*|.*action=edit.*"
https://xwing-miniatures.fandom.com
2b. Result: a fully working local version in
xwing-miniatures.fandom.com\index.html; the link element is rewritten
to point to the local file:
<link rel="stylesheet"
href="load.php@lang=en&modules=ext.categorySelect.runtimeStyles%257Cext.fandom.ArticleInterlang.css%257Cext.fandom.CreatePage.css%257Cext.fandom.DesignSystem.css%257Cext.fandom.Thumbnails.css%257Cext.fandom.UserPreferencesV2.css"/>
3a. Caution: this downloads about 3300 pages / 8500 files!
Run wget recursively again, but with fewer restrictions in --reject-regex
(reject only the forum and the wiki edit/history pages):
wget --recursive --no-clobber --page-requisites --html-extension
--convert-links --restrict-file-names=windows --span-hosts
--domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com
--reject-regex
"https://xwing-miniatures.fandom.com/f/.*|.*action=edit.*|.*action=history.*|.*@oldid=.*"
https://xwing-miniatures.fandom.com
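To make the difference between the two --reject-regex values from 2a and 3a easier to see, here is a rough Python sketch that checks a few URLs against both patterns. It is an approximation: wget uses its own regex engine (POSIX by default), while this uses Python's re module, which should behave the same for patterns built only from ".", "*" and "|". The helper name rejected and the last two sample URLs are made up for illustration.

```python
import re

# Patterns copied from steps 2a and 3a.
NARROW = r".*xwing-miniatures.fandom.com/.*/.*|.*action=edit.*"
WIDE = (r"https://xwing-miniatures.fandom.com/f/.*|.*action=edit.*"
        r"|.*action=history.*|.*@oldid=.*")


def rejected(pattern: str, url: str) -> bool:
    """True if the pattern matches somewhere in the URL (assumed semantics)."""
    return re.search(pattern, url) is not None


BASE = "https://xwing-miniatures.fandom.com"
SAMPLES = [
    BASE + "/wiki/X-Wing_Miniatures_Wiki",
    BASE + "/load.php?lang=en&only=styles&skin=oasis",
    BASE + "/f/",                          # hypothetical forum URL
    BASE + "/wiki/Some_Page?action=edit",  # hypothetical edit URL
]

for url in SAMPLES:
    print(f"2a rejects: {rejected(NARROW, url)!s:5}  "
          f"3a rejects: {rejected(WIDE, url)!s:5}  {url}")
```

Under these assumptions the 2a pattern rejects everything below a sub-path such as /wiki/, while the 3a pattern lets the wiki pages through. Neither pattern rejects the load.php stylesheet URL, which fits the observation that the file is downloaded in both runs.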
3b. Result: after everything has finished, the link has been rewritten to
refer to the online page (!), although the file was downloaded locally and
is needed for a working local page:
<link rel="stylesheet"
href="https://xwing-miniatures.fandom.com/load.php?lang=en&modules=ext.categorySelect.runtimeStyles%7Cext.fandom.ArticleInterlang.css%7Cext.fandom.CreatePage.css%7Cext.fandom.DesignSystem.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.UserPreferencesV2.runtime.css%7Cext.fandom.bannerNotifications.css%7Cext.fandom.coreRuntimeStyles%2CwikiaBarRuntimeStyles%7Cext.fandom.mainPageTag.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskin.oasis.css%7Cskin.oasis.fanFeed.css%2CdiscussionsRuntimeStyles%7Cskin.oasis.pageheader.Share.css&only=styles&skin=oasis"/>
If you open the page offline (with nothing cached!), it is rendered
totally unusable because the main CSS is missing.
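For anyone who wants to check a finished mirror without opening a browser, the following Python sketch lists the stylesheet links in a page that still point to a remote host. It is not part of wget; the class and function names are hypothetical, and it uses only the standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class StylesheetLinks(HTMLParser):
    """Collect the href of every <link rel="stylesheet"> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if tag == "link" and "stylesheet" in rel and href:
            self.hrefs.append(href)


def remote_stylesheets(html: str) -> list:
    """Return the stylesheet hrefs that were not rewritten to local files."""
    parser = StylesheetLinks()
    parser.feed(html)
    return [
        href for href in parser.hrefs
        # An http(s) scheme or a leading "//" means the link leaves the mirror.
        if urlsplit(href).scheme in ("http", "https") or href.startswith("//")
    ]


sample = (
    '<link rel="stylesheet" href="load.php@lang=en.css"/>'
    '<link rel="stylesheet" '
    'href="https://xwing-miniatures.fandom.com/load.php?lang=en"/>'
)
print(remote_stylesheets(sample))
```

To run it against the mirror, read xwing-miniatures.fandom.com\index.html into a string and pass that to remote_stylesheets. A non-empty result after run 3a and an empty one after runs 1a and 2a would match the behaviour described above.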
Note 1: the --domains list was built from the result of command 1a, to
limit the downloads.
Note 2: on Windows, make sure to download into a directory with a short
path: the long file names soon push the full path past 256 characters, and
neither Firefox nor Chrome can open file paths longer than 256 characters
on Windows xD
Version Information:
$ wget -V
GNU Wget 1.21.1 built on cygwin.
+cares +digest +gpgme +https +ipv6 +iri +large-file +metalink +nls
+ntlm +opie +psl +ssl/gnutls
Wgetrc:
/etc/wgetrc (system)
Locale:
/usr/share/locale
Compile:
gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
-DLOCALEDIR="/usr/share/locale" -I.
-I/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1/src
-I../lib
-I/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1/lib
-I/usr/include/uuid -DNDEBUG -ggdb -O2 -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fstack-protector-strong --param=ssp-buffer-size=4
-fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/build=/usr/src/debug/wget-1.21.1-1
-fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1=/usr/src/debug/wget-1.21.1-1
Link:
gcc -I/usr/include/uuid -DNDEBUG -ggdb -O2 -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fstack-protector-strong --param=ssp-buffer-size=4
-fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/build=/usr/src/debug/wget-1.21.1-1
-fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1=/usr/src/debug/wget-1.21.1-1
-lmetalink -lcares -lpcre2-8 -luuid -lidn2 -lnettle -lgnutls -lz
-lpsl -lgpgme ftp-opie.o gnutls.o http-ntlm.o ../lib/libgnu.a
-liconv -lintl -lunistring
Should I file a bug report, or am I missing something?
Thank you
Florian