[bug #64203] wget --warc-dedup has bogus behavior against duplicated digest
From: anonymous
Subject: [bug #64203] wget --warc-dedup has bogus behavior against duplicated digest
Date: Tue, 16 May 2023 21:03:49 -0400 (EDT)
URL:
<https://savannah.gnu.org/bugs/?64203>
Summary: wget --warc-dedup has bogus behavior against
duplicated digest
Group: GNU Wget
Submitter: None
Submitted: Wed 17 May 2023 01:03:48 AM UTC
Category: Program Logic
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name: plcp
Originator Email: contact@plcp.me
Open/Closed: Open
Release: trunk
Discussion Lock: Any
Operating System: GNU/Linux
Reproducibility: Every Time
Fixed Release: None
Planned Release: None
Regression: None
Work Required: None
Patch Included: No
_______________________________________________________
Follow-up Comments:
-------------------------------------------------------
Date: Wed 17 May 2023 01:03:48 AM UTC By: Anonymous
When wget encounters several URLs urlA, urlB, urlC sharing the same digest
hash1 in the input .cdx file given to --warc-dedup, it only registers the last
(urlC, hash1) pair as a potential candidate for deduplication.
When crawling, it then rejects (urlA, hash1) and (urlB, hash1) as candidates
for a revisit record, even when they are found as-is during the crawl.
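For illustration only (this is not wget's actual code), here is a minimal
self-contained C sketch of the effect, with the in-memory dedup table modeled
as a toy array keyed by the digest alone; all names are hypothetical:

/* Toy model of the reported behavior: a dedup table keyed by the
 * digest alone.  Inserting a second record with the same digest
 * overwrites the first, so only the last (url, digest) pair from
 * the CDX file survives as a deduplication candidate. */
#include <stdio.h>
#include <string.h>

struct cdx_record { const char *url; const char *digest; };

static struct cdx_record table[16];
static int table_len = 0;

static void put_by_digest(struct cdx_record rec)
{
    for (int i = 0; i < table_len; i++)
        if (strcmp(table[i].digest, rec.digest) == 0) {
            table[i] = rec;  /* overwrite: the earlier URL is lost */
            return;
        }
    table[table_len++] = rec;
}

int main(void)
{
    put_by_digest((struct cdx_record){ "http://perdu.com", "hash1" });
    put_by_digest((struct cdx_record){ "https://perdu.com", "hash1" });

    /* Prints only "https://perdu.com hash1": a later crawl of
     * http://perdu.com finds no candidate for a revisit record. */
    for (int i = 0; i < table_len; i++)
        printf("%s %s\n", table[i].url, table[i].digest);
    return 0;
}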
Some steps to reproduce against a toy example. First, we build a CDX file:
% wget 'http://perdu.com' 'https://perdu.com' --delete-after --warc-file=testA \
    --warc-cdx=on
We can visualize the two (url,digest) pairs sharing the same digest:
% cat testA.cdx
Then we can verify that deduplication works for the second pair:
% wget 'https://perdu.com' --delete-after --warc-file=testB \
    --warc-dedup=testA.cdx
And we see that deduplication fails for the first pair:
% wget 'http://perdu.com' --delete-after --warc-file=testC \
    --warc-dedup=testA.cdx
We can confirm that only testB got its revisit record:
% zgrep revisit *.warc.gz
In practice, this means that the most frequently duplicated files are the ones
most commonly NOT deduplicated. This is noticeable during repeated crawls of
websites that mostly don't change: the first crawl downloads the whole
website, and all subsequent crawls re-download exactly those files for which
deduplication failed (typically fonts, http/https siblings, and resources
present at several URLs of the website...).
AFAIK the current implementation uses the digest alone as the key of an
in-memory hash table, instead of using the (url, digest) pair.
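Under the same toy model as above (again a hypothetical sketch, not a patch
against wget's sources), keying on the (url, digest) pair keeps both
http/https siblings registered:

/* Same toy model, but keyed on the (url, digest) pair: both
 * http/https siblings stay registered, so either URL can be
 * matched for a revisit record during a later crawl. */
#include <stdio.h>
#include <string.h>

struct cdx_record { const char *url; const char *digest; };

static struct cdx_record table[16];
static int table_len = 0;

static void put_by_url_and_digest(struct cdx_record rec)
{
    for (int i = 0; i < table_len; i++)
        if (strcmp(table[i].url, rec.url) == 0
            && strcmp(table[i].digest, rec.digest) == 0)
            return;  /* exact pair already present; nothing is lost */
    table[table_len++] = rec;
}

int main(void)
{
    put_by_url_and_digest((struct cdx_record){ "http://perdu.com", "hash1" });
    put_by_url_and_digest((struct cdx_record){ "https://perdu.com", "hash1" });

    /* Prints both pairs: each URL remains a dedup candidate. */
    for (int i = 0; i < table_len; i++)
        printf("%s %s\n", table[i].url, table[i].digest);
    return 0;
}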
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?64203>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/