Re: [Bug-wget] really no "wget --list http://..." ?
From: Ben Smith
Subject: Re: [Bug-wget] really no "wget --list http://..." ?
Date: Sun, 22 Mar 2009 13:46:34 -0700 (PDT)
You can run the downloaded file through the following command (replacing
index.html with the appropriate name if necessary).
cat index.html | sed 's/<a href="/\n<a href="/g' | sed '/^<a href="/!d' | sed 's/<a href="//' | sed 's/".*//'
All on one line. It works for www.google.com.
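If your grep supports -o (GNU grep does), a shorter variant along the same lines is possible; untested beyond the same simple case, and it still only catches lowercase, double-quoted href attributes:
grep -o '<a href="[^"]*"' index.html | sed -e 's/^<a href="//' -e 's/"$//'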
----- Original Message ----
> From: Micah Cowan <address@hidden>
> To: Denis <address@hidden>
> Cc: address@hidden
> Sent: Friday, March 20, 2009 1:14:44 PM
> Subject: Re: [Bug-wget] really no "wget --list http://..." ?
>
> Denis wrote:
> > Micah,
> > not to be dense, but is there really no way to "wget --list http://..."
> > a directory without downloading all its files?
> > To browse any file system, local or remote, I want to be able to LIST it
> > first.
> > I gather that there's no www variant of a Unix-like file system
> > (tree structure independent of file contents => very fast ls -R),
> > but a WFS, a web file system, would sure simplify life.
>
> HTTP has no concept of a directory, and provides no way to list it, so
> no. The WebDAV extensions _do_ provide such a thing, but they're not
> commonly implemented on web servers (especially without authentication),
> so there'd be little point in making Wget use that.
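(For illustration, and assuming a server that actually implements WebDAV: a directory listing can be requested with a PROPFIND request. The URL below is a placeholder, and curl is used rather than wget since wget has no WebDAV support.)
curl -i -X PROPFIND -H "Depth: 1" http://example.com/dav/some-dir/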
>
> It _could_ be useful for wget to download a given URL, parse out its
> links, and spit them out as a list, but wget doesn't currently do that
> either. Even if it did, there would be no way to guarantee that that
> list represents the complete contents of the "directory", as all wget
> will see is whatever links happen to be on that one single page, so if
> it's not an automatically-generated index page, it's unlikely to be a
> very good representation of directory contents. But implementing that
> would not be a high priority for me at this time (patch, anyone?).
>
> In the meantime, the usual suggestion is to have wget download the
> single HTML page, and then parse out the links yourself with a suitable
> perl/awk/sed script.
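A rough sketch along those lines (placeholder URL; assumes a page whose links use plain double-quoted href attributes):
wget -qO- http://example.com/somedir/ | perl -nle 'print $1 while /<a\s+href="([^"]+)"/gi'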
>
> --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> Maintainer of GNU Wget and GNU Teseq
> http://micah.cowan.name/