wget2 | Possibility to use several accept-regex or reject-regex (#557)
From: |
Ailothaen (@Ailothaen) |
Subject: |
wget2 | Possibility to use several accept-regex or reject-regex (#557) |
Date: |
Tue, 07 Sep 2021 13:42:02 +0000 |
Ailothaen created an issue: https://gitlab.com/gnuwget/wget2/-/issues/557
Hello,
**TL;DR**: I think it would be nice to be able to specify several
`--accept-regex` or `--reject-regex` options (so that a URL is tested against
each regex), to allow complex website mirroring without writing one very
complicated regex.
---
One of the uses I have with wget/wget2 is to download whole websites off the
Internet, so I can have them offline in case the site goes down.
If the site consists only of static pages, this is an easy task; however, that
is rarely the case, since websites/blogs usually have backend code.
Take phpBB as an example: when I archive a phpBB website, I have to prevent
the software from fetching pages like `posting.php` and `ucp.php`, and any URL
containing `&p=` or an anchor (`#`), since these may confuse the software and
sometimes even cause infinite loops.
Therefore, it would be great if it were possible to specify several
`--accept-regex` or `--reject-regex` options, so that we could write one regex
per "condition" instead of a single regex, which can be very cumbersome when
many conditions are involved (my phpBB filters would include 26 conditions,
for example...) and even impossible in some cases.
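
To illustrate the single-regex workaround described above, here is a minimal sketch that joins independent conditions into one alternation. It uses Python's `re` module rather than wget2's regex engine, and the `CONDITIONS` list and `combine` helper are hypothetical names based on the phpBB examples in this issue, not part of wget2.

```python
import re

# Hypothetical per-condition patterns taken from the phpBB examples above.
CONDITIONS = [
    r"posting\.php",
    r"ucp\.php",
    r"&p=",
    r"#",
]

def combine(patterns):
    """Join independent patterns into one alternation.

    Each pattern is wrapped in a non-capturing group so that the
    alternation does not change what the individual pattern matches.
    """
    return "|".join(f"(?:{p})" for p in patterns)

combined = combine(CONDITIONS)
print(combined)
```

With only four conditions the combined pattern is already harder to read than the list it came from, and it has to be passed as one long option value; with 26 conditions it becomes difficult to maintain by hand.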
Then, when wget2 tests a URL, it would check it against all the regexes to
see whether, for example, any `reject-regex` matches. If one matches, the URL
is dismissed.
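
The requested behaviour for `reject-regex` could be sketched roughly as follows. This is only an illustration in Python, not wget2 code: the function names `is_rejected` and `filter_urls` are hypothetical, and the use of an unanchored search (`re.search`) is an assumption about how each regex would be applied to the URL.

```python
import re

def is_rejected(url, reject_patterns):
    """Return True if at least one reject pattern matches the URL.

    Assumption: each pattern is applied as an unanchored search.
    """
    return any(re.search(pattern, url) for pattern in reject_patterns)

def filter_urls(urls, reject_patterns):
    """Keep only the URLs that no reject pattern matches."""
    return [url for url in urls if not is_rejected(url, reject_patterns)]

# Hypothetical patterns and URLs, modelled on the phpBB example above.
reject_patterns = [r"posting\.php", r"ucp\.php", r"&p=", r"#"]
urls = [
    "https://example.org/viewtopic.php?t=12",
    "https://example.org/posting.php?mode=reply&t=12",
    "https://example.org/viewtopic.php?t=12&p=34",
    "https://example.org/viewtopic.php?t=12#top",
]

kept = filter_urls(urls, reject_patterns)
print(kept)
```

Each condition stays a separate, short pattern, so adding or removing one does not require touching the others.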
Thank you!
PS/FYI: In the past, I used HTTrack to mirror websites for offline use, but I
stopped because it has had issues for years, [like this
one](https://forum.httrack.com/readmsg/25040/25037/index.html), that prevent me
from doing a proper mirror and are unlikely to be addressed (since development
looks inactive as of now).