Re: [Bug-wget] wget 1.14 possibly writing off-spec warc.gz files
From: Gijs van Tulder
Subject: Re: [Bug-wget] wget 1.14 possibly writing off-spec warc.gz files
Date: Sun, 31 Mar 2013 00:46:00 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130308 Thunderbird/17.0.4

Hi,
> It appears wget may be creating slightly malformed GZIP skip-length
> fields
I think that's correct: Wget doesn't write the subfield length in the
"extra field" section of the header. After the subfield ID "sl" it
should write the length LEN (see RFC 1952 [1]), but it doesn't.
Luckily, it does write the correct total length of the extra field (XLEN
in RFC 1952), so gzip implementations that simply ignore the extra field
can skip over it without problems. This is the case for the GNU Gzip utility.
But it should be fixed. I've attached a patch.
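
To make the layout concrete, here is a minimal Python sketch of the extra field as RFC 1952 describes it: XLEN for the whole field, then for each subfield a two-byte ID, a two-byte little-endian LEN, and LEN bytes of data. The function names are made up for illustration, and the 8-byte "sl" payload (two 32-bit sizes) is an assumption about the skip-length data, not something taken from the patch.

```python
import struct

FEXTRA = 0x04  # FLG bit 2 in RFC 1952: an extra field is present


def build_subfield(si: bytes, data: bytes) -> bytes:
    """One subfield: SI1 SI2, LEN (2 bytes, little-endian), then LEN data bytes."""
    if len(si) != 2:
        raise ValueError("subfield ID must be exactly two bytes")
    return si + struct.pack("<H", len(data)) + data


def build_header(extra: bytes) -> bytes:
    """Fixed 10-byte gzip header with FEXTRA set, followed by XLEN and the field."""
    # ID1, ID2, CM=8 (deflate), FLG, MTIME=0, XFL=0, OS=255 (unknown)
    fixed = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    return fixed + struct.pack("<H", len(extra)) + extra


def parse_subfields(header: bytes) -> dict:
    """Strictly walk the subfields; raise ValueError if they do not fit XLEN."""
    if not header[3] & FEXTRA:
        return {}
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    fields = {}
    pos = 0
    while pos + 4 <= len(extra):
        si = extra[pos:pos + 2]
        (length,) = struct.unpack_from("<H", extra, pos + 2)
        pos += 4
        if pos + length > len(extra):
            raise ValueError("subfield LEN runs past XLEN")
        fields[si] = extra[pos:pos + length]
        pos += length
    if pos != len(extra):
        raise ValueError("trailing bytes after last subfield")
    return fields


# Hypothetical payload: two 32-bit little-endian sizes (an assumption).
payload = struct.pack("<II", 1234, 5678)
well_formed = build_header(build_subfield(b"sl", payload))
off_spec = build_header(b"sl" + payload)  # LEN missing, XLEN still correct
```

A reader that only looks at XLEN can skip either header, but the strict parser above accepts `well_formed` and rejects `off_spec`, because it reads the first two payload bytes as LEN.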
> It's likely that we'll need to make the warc.gz parsers a bit more
> robust, but I thought I'd mention it here in case this is
> actually a bug in wget.
When I wrote the code for the extra field I used the old Hanzo
warc-tools [2] as an example. That implementation has the same problem:
it doesn't write the field length [3]. This means there's at least one
other tool that writes these off-spec warc.gz files, so it's probably
useful to make the parser a bit more robust.
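
One way a parser could tolerate both variants is sketched below in Python. It assumes that "sl" is the only subfield in the extra field and that the caller has already cut out exactly XLEN bytes; `read_skip_length` is a hypothetical helper, not part of any existing tool.

```python
import struct


def read_skip_length(extra: bytes) -> bytes:
    """Return the 'sl' payload from a gzip extra field of XLEN bytes.

    Tries the RFC 1952 layout first (ID, LEN, data) and falls back to the
    off-spec layout (ID, data) when LEN does not match the remaining bytes.
    Assumes 'sl' is the only subfield present.
    """
    if extra[:2] != b"sl":
        raise ValueError("no 'sl' subfield at start of extra field")
    if len(extra) >= 4:
        (length,) = struct.unpack_from("<H", extra, 2)
        if 4 + length == len(extra):
            return extra[4:]  # well-formed: LEN accounts for the rest
    return extra[2:]  # off-spec: everything after the ID is payload
```

This heuristic is not airtight: an off-spec payload whose first two bytes happen to equal the remaining length would be misread as well-formed. If the payload size is known in advance, checking against it would remove that ambiguity.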
Thanks,
Gijs
[1] http://www.gzip.org/zlib/rfc-gzip.html
[2] https://code.google.com/p/warc-tools/
[3] https://code.google.com/p/warc-tools/source/browse/trunk/lib/private/wgzip.c#314
warc-gzip-write-length-of-extra-field.patch
Description: Text Data