Re: Not getting the wildcards to work in wget

bug-wget

From:	Felix Dietrich
Subject:	Re: Not getting the wildcards to work in wget
Date:	Fri, 05 Feb 2021 06:25:37 +0100
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)

Hello,

Cherise Haywood <Cherise.Haywood@metoffice.gov.tt> writes:

> I am trying to download specific .zip files from this website:
> https://www2.census.gov/geo/tiger/TIGER2012/ROADS/
>
> I have used several iterations of wget to yield only the folders (
> directories) being formed, but absolutely no data being downloaded.
>
> Here are copies of the code I have used:
>
> OPTION 1: wget --no-parent --relative --recursive --level=2
> --accept=zip --mirror -A .zip
> https://www2.census.gov/geo/tiger/TIGER2012/ROADS/
>
> Can you assist?

It seems that wget has problems with parsing the /robots.txt correctly:
the empty record for “User-Agent: *” appears to cause it to consider all
paths disallowed.  To work around the issue you may disable honouring
the /robots.txt by adding “--execute robots=off” to your command-line.

> OPTION 2: wget --no-parent --relative --recursive --level=2
> --accept=zip --mirror -A *_72*.zip --time-stamps
> https://www2.census.gov/geo/tiger/TIGER2012/ROADS/

--time-stamps should probably have been --timestamping.

--mirror sets an infinite recursion depth (--level=inf).  You may limit
the depth when using --mirror by specifying --level after --mirror (I
believe).

> OPTION 3: wget --no-parent --relative --recursive --level=2
> --accept=zip --mirror -A _72
> https://www2.census.gov/geo/tiger/TIGER2012/ROADS

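When multiple patterns are given to -A, --accept, whether as separate
arguments or as a comma-separated list, a file is accepted if *any one*
of the patterns matches.  Combined with --accept=zip, the extra -A _72
therefore does not narrow the selection: every .zip file still matches.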
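
To illustrate the "any one pattern" rule, here is a small shell sketch.
It is a simplified model, not wget's actual code: the function name
"accepted" and the file names are made up for the example, and it
assumes what the wget manual describes, namely that a list element
containing a wildcard is treated as a pattern and anything else as a
suffix.

```shell
# accepted NAME PATTERN...
# Returns 0 if NAME matches at least one PATTERN, 1 otherwise.
# Simplified model of an accept list (hypothetical helper).
accepted() {
    name=$1
    shift
    for pat in "$@"; do
        case $pat in
            *\**|*\?*|*\[*)
                # Element contains a wildcard: match it as a glob pattern.
                case $name in
                    $pat) return 0 ;;
                esac
                ;;
            *)
                # No wildcard: match it as a file name suffix.
                case $name in
                    *"$pat") return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

# "zip" alone already accepts both (hypothetical) files, so adding
# "_72" changes nothing:
accepted tl_2012_72001_roads.zip zip _72 && echo "72001 accepted"
accepted tl_2012_01001_roads.zip zip _72 && echo "01001 accepted too"

# A single wildcard pattern accepts only the first file:
accepted tl_2012_01001_roads.zip '*_72*.zip' || echo "01001 rejected"
```

With only the one wildcard pattern in the list, just the names
containing "_72" get through, which is why the invocation further down
uses a single --accept.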

> I only want the files with *_72*.zip to be downloaded to a copy of the
> directories it comes from on my system.

This is the invocation I have come up with (backslash used as line
continuation marker):

  wget --execute robots=off --timestamping \
       --no-parent --recursive --level=1 \
       --accept '*_72*.zip' \
       'https://www2.census.gov/geo/tiger/TIGER2012/ROADS/'

Make sure to quote strings containing characters with special meaning to
your shell (like the ‘*’ often used for globbing).  --level=1 seems to be
enough to get the .zip files: they are all in the directory your URL
points to – but you should check that.
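
To show why the quoting matters, here is a minimal sketch for a
POSIX-like shell.  The scratch directory, the file name and the helper
"show" are only there for the demonstration.  If the current directory
happens to contain a file matching the pattern, the shell replaces the
unquoted pattern with that file name before the program ever sees it.

```shell
# Work in a scratch directory holding one file that matches the pattern.
dir=$(mktemp -d)
cd "$dir" || exit 1
: > local_72_copy.zip

# show prints the first argument exactly as the program receives it.
show() { printf '%s\n' "$1"; }

unquoted=$(show *_72*.zip)    # the shell expands the glob first
quoted=$(show '*_72*.zip')    # the pattern is passed through untouched

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

In the unquoted case wget would be handed a local file name instead of
the pattern, so the accept list would no longer do what you intended.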

> I have attached error imgs, I captured!

A log in text form would have been more useful than images.  Wget
can be instructed to write to a log file using --output-file or
--append-output; if you still want to see the progress bar, also add
--show-progress.  You may also use the Windows command prompt's
redirection operator; since wget writes its messages to standard error,
that would be “2> /path/to/file” rather than a plain “>” (I believe).
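
One caveat on redirection: as far as I know, wget reports its messages
on standard error, so which stream you redirect matters.  The sketch
below uses a stand-in function, "noisy", in place of wget (it is
hypothetical and only mimics a program that writes to standard error);
the redirection syntax shown is that of a POSIX-like shell.

```shell
# Stand-in for a program that, like wget, reports on standard error.
noisy() { echo 'progress message' >&2; }

log=$(mktemp)

# Redirecting standard output only: the message still goes to the
# terminal and the log file stays empty.
noisy > "$log"
stdout_only=$(cat "$log")

# Redirecting standard error: the message ends up in the log file.
noisy 2> "$log"
stderr_too=$(cat "$log")

echo "after '>':  [$stdout_only]"
echo "after '2>': [$stderr_too]"
```

Using --output-file avoids the question entirely, which is why I would
prefer it.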

Happy data analysing, I presume.

-- 
Felix Dietrich

Current Thread

Not getting the wildcards to work in wget, Cherise Haywood, 2021/02/03
- Re: Not getting the wildcards to work in wget, Felix Dietrich <=
