From: anonymous
Subject: [bug #64203] wget --warc-dedup has bogus behavior against duplicated digest
Date: Tue, 16 May 2023 21:03:49 -0400 (EDT)

URL:
  <https://savannah.gnu.org/bugs/?64203>

                 Summary: wget --warc-dedup has bogus behavior against duplicated digest
                   Group: GNU Wget
               Submitter: None
               Submitted: Wed 17 May 2023 01:03:48 AM UTC
                Category: Program Logic
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: plcp
        Originator Email: contact@plcp.me
             Open/Closed: Open
                 Release: trunk
         Discussion Lock: Any
        Operating System: GNU/Linux
         Reproducibility: Every Time
           Fixed Release: None
         Planned Release: None
              Regression: None
           Work Required: None
          Patch Included: No


    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: Wed 17 May 2023 01:03:48 AM UTC By: Anonymous
When wget encounters several URLs urlA, urlB, urlC sharing the same digest hash1
in the input .cdx file given to --warc-dedup, it only registers the last
(urlC, hash1) pair as a potential candidate for deduplication.

When crawling, it then rejects (urlA, hash1) and (urlB, hash1) as candidates for
a revisit record, even when they are found as-is during the crawl.
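
To make the observed behavior concrete, here is a toy model in C of the lookup
just described (not wget's actual code; names and structure are made up for
illustration): the table keeps one slot per digest, so the last (url, digest)
pair read from the CDX file overwrites the earlier ones, and only that exact
URL later qualifies for a revisit record.

#include <stdio.h>
#include <string.h>

/* Toy model only: one slot per digest, mimicking the behavior above. */
struct cdx_entry { char digest[64]; char url[256]; };

static struct cdx_entry table[16];
static int table_len;

/* Register a (url, digest) pair keyed by digest alone: a digest that is
   already present gets its URL overwritten, so the last pair wins. */
static void put(const char *digest, const char *url)
{
    for (int i = 0; i < table_len; i++)
        if (strcmp(table[i].digest, digest) == 0) {
            strcpy(table[i].url, url);
            return;
        }
    strcpy(table[table_len].digest, digest);
    strcpy(table[table_len].url, url);
    table_len++;
}

/* Dedup check as described above: the digest must be known AND the stored
   URL must match the URL being crawled. */
static int revisit_candidate(const char *digest, const char *url)
{
    for (int i = 0; i < table_len; i++)
        if (strcmp(table[i].digest, digest) == 0)
            return strcmp(table[i].url, url) == 0;
    return 0;
}

int main(void)
{
    put("hash1", "urlA");
    put("hash1", "urlB");
    put("hash1", "urlC");   /* only (urlC, hash1) survives */

    printf("urlA: %s\n", revisit_candidate("hash1", "urlA") ? "revisit" : "re-download");
    printf("urlB: %s\n", revisit_candidate("hash1", "urlB") ? "revisit" : "re-download");
    printf("urlC: %s\n", revisit_candidate("hash1", "urlC") ? "revisit" : "re-download");
    return 0;
}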

Steps to reproduce with a toy example. First, build a CDX file:

% wget 'http://perdu.com' 'https://perdu.com' --delete-after --warc-file=testA --warc-cdx=on

We can then look at the two (url, digest) pairs sharing the same digest:

% cat testA.cdx

Then we can verify that deduplication works for the second pair:

% wget 'https://perdu.com' --delete-after --warc-file=testB --warc-dedup=testA.cdx

And we see that deduplication fails for the first pair:

% wget 'http://perdu.com' --delete-after --warc-file=testC --warc-dedup=testA.cdx

We can confirm that only testB got its revisit record:

% zgrep revisit *.warc.gz

In practice, this means that the files occurring most frequently are the ones
most commonly NOT deduplicated. This is noticeable during repeated crawls of
websites that mostly don't change: the first crawl downloads the whole website,
and all subsequent crawls re-download exactly the files for which deduplication
failed (typically fonts, http/https siblings, resources present at several URLs
of the website, and so on).

AFAIK the current implementation uses the digest as the key of an in-memory
hash table, instead of using the (url, digest) pair.
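
If that is indeed the case, the sketch below shows the suggested change on the
same toy table as above (again, not wget's real hash-table code): keying the
entries on the (url, digest) pair keeps every pair read from the CDX file as a
revisit candidate.

#include <stdio.h>
#include <string.h>

/* Toy model only: one entry per (url, digest) pair. */
struct cdx_entry { char url[256]; char digest[64]; };

static struct cdx_entry table[16];
static int table_len;

/* Match on both fields, so http/https siblings and other URLs serving the
   same payload no longer overwrite one another. */
static int find(const char *url, const char *digest)
{
    for (int i = 0; i < table_len; i++)
        if (strcmp(table[i].url, url) == 0 && strcmp(table[i].digest, digest) == 0)
            return i;
    return -1;
}

static void put(const char *url, const char *digest)
{
    if (find(url, digest) < 0) {
        strcpy(table[table_len].url, url);
        strcpy(table[table_len].digest, digest);
        table_len++;
    }
}

int main(void)
{
    put("urlA", "hash1");
    put("urlB", "hash1");
    put("urlC", "hash1");

    /* All three (url, digest) pairs are now candidates for a revisit record. */
    printf("urlA: %s\n", find("urlA", "hash1") >= 0 ? "revisit" : "re-download");
    printf("urlB: %s\n", find("urlB", "hash1") >= 0 ? "revisit" : "re-download");
    printf("urlC: %s\n", find("urlC", "hash1") >= 0 ? "revisit" : "re-download");
    return 0;
}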







    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?64203>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/



