Re: [Bug-wget] really no "wget --list http://..." ?
From: Ben Smith
Subject: Re: [Bug-wget] really no "wget --list http://..." ?
Date: Sun, 22 Mar 2009 13:46:34 -0700 (PDT)
You can run the downloaded file through the following command (replacing
index.html with the appropriate name if necessary).
cat index.html | sed 's/<a href="/\n<a href="/g' | sed '/^<a href="/!d' | sed 's/<a href="//' | sed 's/".*//'
All on one line. It works for www.google.com.
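If your grep supports -o (GNU grep does), a shorter variant along the same lines is possible; untested beyond the same simple case, and it still only catches lowercase, double-quoted href attributes:
grep -o '<a href="[^"]*"' index.html | sed -e 's/^<a href="//' -e 's/"$//'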
----- Original Message ----
> From: Micah Cowan <address@hidden>
> To: Denis <address@hidden>
> Cc: address@hidden
> Sent: Friday, March 20, 2009 1:14:44 PM
> Subject: Re: [Bug-wget] really no "wget --list http://..." ?
>
> Denis wrote:
> > Micah,
> > not to be dense, but is there really no way to "wget --list http://..."
> > a directory without downloading all its files?
> > To browse any file system, local or remote, I want to be able to LIST it
> > first.
> > I gather that there's no www variant of a Unix-like file system
> > (tree structure independent of file contents => very fast ls -R),
> > but a WFS, a web file system, would sure simplify life.
>
> HTTP has no concept of a directory, and provides no way to list it, so
> no. The WebDAV extensions _do_ provide such a thing, but they're not
> commonly implemented on web servers (especially without authentication),
> so there'd be little point in making Wget use that.
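(For illustration, and assuming a server that actually implements WebDAV: a directory listing can be requested with a PROPFIND request. The URL below is a placeholder, and curl is used rather than wget since wget has no WebDAV support.)
curl -i -X PROPFIND -H "Depth: 1" http://example.com/dav/some-dir/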
>
> It _could_ be useful for wget to download a given URL, parse out its
> links, and spit them out as a list, but wget doesn't currently do that
> either. Even if it did, there would be no way to guarantee that that
> list represents the complete contents of the "directory", as all wget
> will see is whatever links happen to be on that one single page, so if
> it's not an automatically-generated index page, it's unlikely to be a
> very good representation of directory contents. But implementing that
> would not be a high priority for me at this time (patch, anyone?).
>
> In the meantime, the usual suggestion is to have wget download the
> single HTML page, and then parse out the links yourself with a suitable
> perl/awk/sed script.
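A rough sketch along those lines (placeholder URL; assumes a page whose links use plain double-quoted href attributes):
wget -qO- http://example.com/somedir/ | perl -nle 'print $1 while /<a\s+href="([^"]+)"/gi'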
>
> --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> Maintainer of GNU Wget and GNU Teseq
> http://micah.cowan.name/