Re: [Bug-wget] Why does -A not work?
From: Tim Rühsen
Subject: Re: [Bug-wget] Why does -A not work?
Date: Wed, 20 Jun 2018 16:58:28 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0
Hi Niels,
please always reply to the mailing list (CCing me is fine, but not
needed).
That was just an example of a POSIX regex - working out the details is
up to you ;-) Or maybe there is a volunteer reading this.
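The mechanical wildcard-to-regex conversion from the earlier mail can be
sketched in plain shell (the glob is taken from the original -A list; this
rough sketch only handles '.' and '*', not '?' or bracket expressions):

```shell
# Convert a shell wildcard into a POSIX regex by escaping '.'
# and turning '*' into '.*' (rough sketch, not a full glob translator)
glob='little-nemo*s.jpeg'
regex=$(printf '%s' "$glob" | sed -e 's/\./\\./g' -e 's/\*/.*/g')
echo "$regex"
```

The result, little-nemo.*s\.jpeg, can then be passed to --accept-regex.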
The implicitly downloaded HTML pages should be removed after parsing
when you use --accept-regex - except for the 'starting' page given
explicitly on your command line.
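Since --accept-regex is matched against the complete URL, the pattern can be
sanity-checked offline with grep -E before re-running the crawl (the sample
URL below is made up for illustration, not taken from the site):

```shell
# The regex suggested in this thread, tested against a hypothetical
# full URL - --accept-regex sees the whole URL, not just the filename
url='http://comicstriplibrary.org/images/large/little-nemo-19051015-n.jpeg'
if printf '%s\n' "$url" | grep -Eq '.*little-nemo.*n\.jpeg'; then
  echo 'accepted'
else
  echo 'rejected'
fi
```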
Regards, Tim
On 06/20/2018 04:28 PM, Nils Gerlach wrote:
> Hi Tim,
>
> I am sorry, but your command does not work. It only downloads the
> thumbnails from the first page and follows none of the links. Open the
> link in a browser and click on the pictures to get a larger picture.
> There is a link "high quality picture"; the pictures behind those links
> are the ones I want to download. The regex being
> ".*little-nemo.*n\l.jpeg". And not only from the first page, but from
> the other search result pages, too.
> Can you work that one out? Does this work with wget? The best result
> would be if the visited HTML pages were deleted by wget. But if they
> stay, I can delete them afterwards. Automating it would be better,
> though - that's why I am trying to use wget ;)
>
> Thanks for the information on the filename and path, though.
>
> Greetings
>
> 2018-06-20 16:13 GMT+02:00 Tim Rühsen <address@hidden>:
>
>> Hi Nils,
>>
>> On 06/20/2018 06:16 AM, Nils Gerlach wrote:
>>> Hi there,
>>>
>>> in #wget on freenode I was suggested to write this to you:
>>> I tried using wget to get some images:
>>> wget -nd -rH -Dcomicstriplibrary.org -A
>>> "little-nemo*s.jpeg","*html*","*.html.*","*.tmp","*page*","*display*"
>> -p -e
>>> robots=off 'http://comicstriplibrary.org/search?search=little+nemo'
>>> I wanted to download the images only but wget was not following any of
>> the
>>> links so I got that much more into -A. But it still does not follow the
>>> links.
>>> Page numbers of the search result contain "page" in the link, links to
>> the
>>> big pictures i want wget to download contain "display". Both are given in
>>> -A and are seen in the html-document wget gets. Neither is followed by
>> wget.
>>>
>>> Why does this not work at all? Website is public, anybody is free to
>> test.
>>> But this is not my website!
>>
>> -A / -R works only on the filename, not on the path. The docs (man
>> page) are not very explicit about this.
>>
>> Instead, try --accept-regex / --reject-regex, which act on the
>> complete URL - but shell wildcards won't work there.
>>
>> For your example this means replacing '.' with '\.' and '*' with '.*'.
>>
>> To download those nemo jpegs:
>> wget -d -rH -Dcomicstriplibrary.org --accept-regex
>> ".*little-nemo.*n\.jpeg" -p -e robots=off
>> 'http://comicstriplibrary.org/search?search=little+nemo'
>> --regex-type=posix
>>
>> Regards, Tim
>>
>>
>
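The filename-only matching of -A/-R described above can be illustrated in
plain shell: wget compares the pattern against the last path component of
the URL, roughly like this (a sketch, not wget's actual code; the URL is
hypothetical):

```shell
url='http://comicstriplibrary.org/display/little-nemo-19051015'
file="${url##*/}"   # the "filename" part that -A/-R compare against
echo "filename part: $file"
case "$file" in
  little-nemo*) echo "-A 'little-nemo*' would accept this" ;;
  *)            echo "no match" ;;
esac
```

This is why patterns like "*display*" in -A fail when "display" only appears
earlier in the path: the leading directories never take part in the match.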