Re: [Bug-wget] WARC output
From: Patrick Steil
Subject: Re: [Bug-wget] WARC output
Date: Tue, 9 Aug 2011 17:42:10 -0500
That sounds awesome! You have my vote... :)
On Tue, Aug 9, 2011 at 4:49 AM, Gijs van Tulder <address@hidden> wrote:
> Hi,
>
> I'd like to propose a new feature that allows Wget to make WARC files.
>
> Perhaps you're already familiar with it, but in short: WARC is a file
> format for web archives. In a single WARC file, you can store every file of
> the website, plus the HTTP request and response headers and other metadata.
> This makes it a very useful format for web archivists: you keep everything
> together, in the most detailed and original form.
>
> The WARC format (an ISO standard, ISO 28500) has been developed by the
> International Internet Preservation Consortium, which includes the Internet
> Archive and many national libraries. It is supposed to become *the* standard
> file format for web archives. For example, it is used in the Internet
> Archive's Wayback Machine and its Heritrix crawler. There are several
> projects building tools to work with WARC files.
>
>
> It would be cool if Wget could become one of these tools. Already the Swiss
> army knife for mirroring websites, the one thing that Wget is missing is a
> good way to store these mirrors. The current output of --mirror is not
> sufficient for archival purposes:
>
> - it throws away the HTTP headers (of the request and response);
> - it doesn't keep 404 pages and redirects;
> - it doesn't store the original URLs but mangles the filenames;
> - and, if you're not careful, it even rewrites the links inside
> the documents that it has downloaded.
>
> The WARC format supports these things.
>
>
> With some help from others, I've added WARC functions to Wget. With the
> --warc-file option you can specify that the mirror should also be written to
> a WARC archive. Wget will then keep everything, including the HTTP request
> and response headers, redirects and 404 pages.
>
> Do you think this is something that could be included in the main Wget
> version? If that's the case, what should be the next step?
>
> Description, links to more information about WARC:
>
> http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
>
> Code:
> https://github.com/alard/wget-warc/
> https://github.com/downloads/alard/wget-warc/wget-warc-20110809.tar.bz2
>
> The implementation makes use of the open source WARC Tools library
> (Apache License 2.0):
> http://code.google.com/p/warc-tools/
>
>
> I look forward to your response.
>
> Kind regards,
>
> Gijs van Tulder
>
>
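For anyone following along who hasn't seen the format: below is a rough sketch of what one WARC record looks like, to illustrate the point above about keeping the response headers, 404 pages and original URLs together. It is not taken from the wget-warc patch or from WARC Tools. The function name make_warc_response_record and the example.com URL are made up for illustration, and the header fields follow my reading of WARC 1.0 (ISO 28500), so check the standard before relying on it.

```python
import uuid
from datetime import datetime, timezone


def make_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap a raw HTTP response (status line, headers, body) in one WARC/1.0 record.

    Hypothetical helper for illustration only; not part of Wget or WARC Tools.
    """
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: {record_id}",
        f"WARC-Date: {warc_date}",
        # The original URL is stored as-is, not turned into a filename.
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # A blank line ends the WARC header; two CRLFs end the record.
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return header + http_response + b"\r\n\r\n"


# A 404 response is kept in full, status line and headers included.
raw_404 = (
    b"HTTP/1.1 404 Not Found\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"Not found"
)
record = make_warc_response_record("http://example.com/missing", raw_404)
```

A real archive would also carry request records and a warcinfo record, and is usually compressed; this only shows the basic layout of a single response record.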
--
*Patrick Steil | ChurchBuzz.org*
Church Website Optimization <http://www.churchbuzz.org/>
Like us on Facebook <http://facebook.com/churchbuzz>!
Mobile: 940-391-9250