cat index.html | sed 's/<a href=/\n<a href=/g' | sed -e '/^[^<a href]/d' | sed 's/<html>.*//' | sed 's/<a href="//' | sed 's/".*//'
The first part prints the HTML file to the standard output.
The second takes that output and puts a line feed in front of each hyperlink.
The third deletes any line that doesn't start with <a href (i.e., isn't a hyperlink).
The fourth gets rid of the initial <html> line (the third doesn't catch it because [^<a href] is just a character class, so the <html> line, which starts with <, slips through).
The fifth removes the initial part of the hyperlink (i.e., everything before the URL starts).
The sixth removes everything on each line after the " that closes the URL.
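To see it in action, here's a quick check against a throwaway one-line HTML file (the file name and URLs are made up; this also assumes GNU sed, where \n in the replacement inserts a newline):

  printf '<html><head></head><body><a href="http://a.example/">A</a><a href="http://b.example/">B</a></body></html>\n' > test.html
  cat test.html | sed 's/<a href=/\n<a href=/g' | sed -e '/^[^<a href]/d' | sed 's/<html>.*//' | sed 's/<a href="//' | sed 's/".*//'

That prints a blank line (what's left of the <html> line after the fourth step) followed by:

  http://a.example/
  http://b.example/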
I couldn't find a way to get wget to pipe its output straight into sed, but it's still only two steps.
First, wget http://www.foo.com
Then run the above command, assuming index.html is the downloaded file.
It works for www.google.com; I haven't tested anything else.
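That said, wget's -O - option writes the downloaded page to standard output, so the whole thing could probably be collapsed into a single pipeline. A rough, untested sketch with the same placeholder URL (and, again, GNU sed):

  wget -q -O - http://www.foo.com | sed 's/<a href=/\n<a href=/g' | sed -e '/^[^<a href]/d' | sed 's/<html>.*//' | sed 's/<a href="//' | sed 's/".*//'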
From: Mark - augustine.com <address@hidden>
To: address@hidden
Sent: Thursday, February 26, 2009 10:25:01 AM
Subject: [Bug-wget] Simple web page listing
Hello,
I'm looking for a way to use wget to report a list of the URLs of all the
web pages on a given web site. I'm not interested in the actual code or
content, just the names of the web pages. Also, I would prefer it in a very
simple format: just the URLs, separated by return characters.
e.g.
-------------
-------------
Ideas? Or is there another program/service that offers this already?
Please CC your reply to me. Thank you!
Best Regards,
Mark Mahon