wget-dev

wget2 | Possibility to use several accept-regex or reject-regex (#557)


From: Ailothaen (@Ailothaen)
Subject: wget2 | Possibility to use several accept-regex or reject-regex (#557)
Date: Tue, 07 Sep 2021 13:42:02 +0000


Ailothaen created an issue: https://gitlab.com/gnuwget/wget2/-/issues/557



Hello,

**TL;DR**: It would be nice to be able to specify several `--accept-regex` or 
`--reject-regex` options (so that a URL is tested against each regex). This 
would allow complex website mirroring without writing one very complicated 
regex.

---

One of the uses I have with wget/wget2 is to download whole websites off the 
Internet, so I can have them offline in case the site goes down.

If the site consists only of static pages, this is an easy task; however, that 
is rarely the case, and websites/blogs usually have backend code.  
Take phpBB as an example: when I archive a phpBB website, I have to prevent 
the software from fetching pages like `posting.php` and `ucp.php`, as well as 
every URL containing `&p=` or an anchor (`#`), since they may confuse the 
software and sometimes even cause infinite loops.
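
To illustrate the workaround that a single option forces today, here is a minimal Python sketch (not wget2 code) that folds the four conditions named above into one alternation. The names `PHPBB_REJECT_CONDITIONS`, `combine` and `is_rejected_single` are hypothetical, the example URLs are made up, and Python's `re` flavor may differ from the regex engine wget2 is built with.

```python
import re

# Hypothetical list: only the four conditions mentioned in the text above.
PHPBB_REJECT_CONDITIONS = [
    r"posting\.php",
    r"ucp\.php",
    r"&p=",
    r"#",
]

def combine(conditions):
    """Join separate conditions into one alternation regex."""
    return "|".join(f"({c})" for c in conditions)

# This single string is what one reject option would have to carry.
COMBINED = combine(PHPBB_REJECT_CONDITIONS)

def is_rejected_single(url):
    """Return True if the combined regex matches anywhere in the URL."""
    return re.search(COMBINED, url) is not None

if __name__ == "__main__":
    print(COMBINED)
    print(is_rejected_single("https://forum.example/posting.php?mode=reply"))
    print(is_rejected_single("https://forum.example/viewtopic.php?f=1&t=2"))
```

With four conditions the combined pattern is still readable; with a couple of dozen it quickly stops being so, which is the motivation for this request.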

Therefore, it would be great if it were possible to specify several 
`--accept-regex` or `--reject-regex` options, so that we could write one regex 
per "condition" instead of a single regex, which becomes very cumbersome when 
many conditions are involved (my phpBB filters would include 26 conditions, 
for example...) and is even impossible in some cases.  
Then, when wget2 tests a URL, it would check it against all regexes to see 
whether, for example, any `reject-regex` matches. If one matches, the URL is 
dismissed.
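
The proposed behaviour could be sketched roughly as follows. This is Python for illustration only, not wget2's implementation. The function name `should_download` and its parameters are hypothetical. The reject rule follows the description above; the accept rule (at least one accept regex must match when any are given) is my assumption.

```python
import re

def should_download(url, accept_regexes=(), reject_regexes=()):
    """Decide whether a URL passes a set of accept/reject regexes."""
    # As described above: one matching reject regex is enough to dismiss.
    if any(re.search(pattern, url) for pattern in reject_regexes):
        return False
    # Assumption: if accept regexes are given, at least one must match.
    if accept_regexes and not any(
        re.search(pattern, url) for pattern in accept_regexes
    ):
        return False
    return True

if __name__ == "__main__":
    rejects = [r"posting\.php", r"ucp\.php", r"&p=", r"#"]
    for url in (
        "https://forum.example/viewtopic.php?f=1&t=2",
        "https://forum.example/ucp.php?mode=login",
    ):
        print(url, should_download(url, reject_regexes=rejects))
```

Each condition stays a short, separate pattern, so adding or removing one does not require touching the others.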

Thank you!

PS/FYI: In the past, I used HTTrack to mirror websites for offline use, but I 
stopped because it has had issues for years, [like this 
one](https://forum.httrack.com/readmsg/25040/25037/index.html), that prevent me 
from making a proper mirror and are unlikely to be addressed (development 
looks inactive as of now).




