wget-dev

Re: wget2 | Add support for pre/post download scripts (#80)


From: Andrew White (@awhite27)
Subject: Re: wget2 | Add support for pre/post download scripts (#80)
Date: Mon, 06 May 2024 12:47:16 +0000



Andrew White commented: 
https://gitlab.com/gnuwget/wget2/-/issues/80#note_1894095767


I wrote a wget2 plugin that runs a Python script to do the filtering. I've 
attached it here in case it is useful to anyone else. There are likely to be a 
few bugs in it, but it is good enough for what I want to do.

[wget2-python-plugin.tgz](/uploads/98344c996c00b75ea7a1f7bd9c13a456/wget2-python-plugin.tgz)

Run wget2 with `--local-plugin=libwget-python-plugin.so`. At startup the plugin 
runs a Python script from the current directory. After startup, the callbacks 
the script registered are called for each event. The script name is currently 
hardcoded as `wget_plugin.py` due to a problem with the plugin options (see 
below). I've included a Python script as an example. The plugin creates a 
Python module named "wget", which the script imports. This module provides the 
following functions:

```
log_info()
log_error()
log_debug()

register_exit_callback()
register_url_filter_callback()
register_post_processor_callback()
```

The url filter callback is called with a Filter class object. This object has 
the methods:

```
get_url()
get_local_filename()
accept()
reject()
set_alt_url()
set_local_filename()
```
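
A URL filter callback built on these methods might look like the sketch below, 
which rejects URLs by file suffix. It assumes that `accept()` and `reject()` 
take no arguments and that calling neither leaves the decision to wget2's 
normal rules. `FakeFilter` and `BLOCKED_SUFFIXES` are hypothetical names used 
only so the example runs on its own.

```python
from urllib.parse import urlsplit


class FakeFilter:
    """Hypothetical stand-in for the plugin's Filter object."""

    def __init__(self, url, local_filename):
        self._url = url
        self._local_filename = local_filename
        self.verdict = None  # records which of accept()/reject() was called

    def get_url(self):
        return self._url

    def get_local_filename(self):
        return self._local_filename

    def accept(self):
        self.verdict = "accept"

    def reject(self):
        self.verdict = "reject"


BLOCKED_SUFFIXES = (".exe", ".iso")  # hypothetical policy


def url_filter(filt):
    """Reject URLs whose path ends in a blocked suffix; leave the rest alone."""
    path = urlsplit(filt.get_url()).path.lower()
    if path.endswith(BLOCKED_SUFFIXES):
        filt.reject()
    # Otherwise neither accept() nor reject() is called (assumed to mean
    # "use wget2's default decision").
```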

The post processor callback is called with a PostProcess class object. This 
object has the methods:

```
get_url()
get_local_filename()
get_data()
get_recurse()
add_recurse_url()
```
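
For illustration, a post processor callback could scan the downloaded data for 
extra links and queue them. The sketch assumes `get_data()` returns `bytes`, 
`get_recurse()` returns a boolean, and `add_recurse_url()` takes an absolute 
URL string; none of that is confirmed by the plugin source. `FakePostProcess` 
and `PDF_LINK` are hypothetical.

```python
import re
from urllib.parse import urljoin


class FakePostProcess:
    """Hypothetical stand-in for the plugin's PostProcess object."""

    def __init__(self, url, local_filename, data, recurse):
        self._url = url
        self._local_filename = local_filename
        self._data = data
        self._recurse = recurse
        self.added = []  # URLs passed to add_recurse_url()

    def get_url(self):
        return self._url

    def get_local_filename(self):
        return self._local_filename

    def get_data(self):
        return self._data

    def get_recurse(self):
        return self._recurse

    def add_recurse_url(self, url):
        self.added.append(url)


# Deliberately naive: matches double-quoted href values ending in ".pdf".
PDF_LINK = re.compile(rb'href="([^"]+\.pdf)"')


def post_process(post):
    """Queue every linked PDF found in the downloaded page."""
    if not post.get_recurse():
        return  # added URLs would have no effect in this case
    for match in PDF_LINK.finditer(post.get_data()):
        link = match.group(1).decode("utf-8", "replace")
        post.add_recurse_url(urljoin(post.get_url(), link))
```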

While working on this I found the following additional problems with the wget2 
plugin API.

The options callback is useless. After the callback is registered in the 
initializer, it is called once for each option. The problem is that most 
plugins are going to need the options during initialization. The only 
workaround is to defer initialization until the first call to url_filter, but 
that will cause other issues since the initialization and finalization may 
then happen on different threads. Also, if the plugin name does not start with 
"lib", the options are ignored.
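
The deferred-initialization workaround can be modelled roughly as follows. The 
real version would live in the C plugin; Python is used here only to show the 
control flow. `LazyPlugin`, its method names, and the `script` option are all 
hypothetical. Note that whichever thread happens to make the first 
`url_filter` call ends up doing the initialization, which is exactly the 
thread mismatch described above.

```python
import threading


class LazyPlugin:
    """Hypothetical model of a plugin that initializes on first use."""

    def __init__(self):
        self.options = {}
        self.script = None
        self.init_count = 0
        self._ready = False
        self._lock = threading.Lock()

    def on_option(self, name, value):
        # Options arrive one at a time, after the initializer has returned.
        self.options[name] = value

    def _ensure_initialized(self):
        # The lock guarantees a single initialization even when several
        # download threads reach the filter at the same time.
        with self._lock:
            if not self._ready:
                self.script = self.options.get("script", "wget_plugin.py")
                self.init_count += 1
                self._ready = True

    def url_filter(self, url):
        self._ensure_initialized()
        return self.script  # stands in for the real filtering work
```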

The filter local filename is of the form `dir/file`. The post processor 
filename is of the form `example.com/dir/file`. The filter local filename 
should also include the hostname directory. It would also be helpful if the API 
provided a function to return the download directory so the full pathname of 
the local file can be obtained.
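
Until the API offers that, a script can only approximate the full path itself. 
The sketch below assumes the script already knows the download directory by 
some other means and that wget2 is using the host-directory layout described 
above; `full_local_path` and its parameters are hypothetical.

```python
import os
from urllib.parse import urlsplit


def full_local_path(download_dir, url, filter_filename):
    """Join download dir, hostname directory, and the filter's `dir/file`."""
    host = urlsplit(url).hostname or ""
    return os.path.join(download_dir, host, filter_filename)
```

For example, `full_local_path("/tmp/mirror", "https://example.com/dir/file", 
"dir/file")` would give `/tmp/mirror/example.com/dir/file` on a POSIX system.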

It would be useful if the post processor also provided functions that returned 
the content type and charset. There is a function that indicates if the 
download will be recursed. It would be useful to also provide a function that 
can disable recursion of the downloaded object. Also, the docs for the 
add_recurse_url() function state that it has no effect if get_recurse() 
returns false. I'm not sure why. It would be useful to be able to add URLs 
regardless of whether the current download will be recursed.

The plugin API is multi-threaded. This increases the complexity of plugins and 
makes them less portable, as they need to know what threading API wget2 is 
using. I put a mutex in my Python plugin to make the Python script 
single-threaded. I wanted to keep the script simple, and I'm not even sure 
what would be involved in making the plugin fully multi-threaded.
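
The mutex mentioned above sits in the plugin itself. The same idea expressed 
at the script level might look like the sketch below, where a hypothetical 
`serialized` decorator makes every wrapped callback take one shared lock, so 
only one callback body runs at a time.

```python
import functools
import threading

_callback_lock = threading.Lock()


def serialized(func):
    """Run the wrapped callback while holding a single shared lock."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _callback_lock:
            return func(*args, **kwargs)

    return wrapper


seen_urls = []


@serialized
def url_filter(url):
    # Only one thread at a time executes this body.
    seen_urls.append(url)
```

The cost is that downloads on other threads wait whenever a callback is slow, 
which is the trade-off for keeping the script simple.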
