From: Ángel González
Subject: Re: [Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illegail utf-8 sequence
Date: Sun, 10 Jun 2012 00:32:20 +0200
User-agent: Thunderbird
On 09/06/12 21:03, Micah Cowan wrote:
> Could you attach an example of the broken file contents? ...the full
> file itself is perhaps a bit large to attach in a mailing list (~85k?),
> but perhaps you could use a pastebin, or otherwise throw it up on a
> server, or just post a snippet that illustrates exactly what sort of
> corruption is taking place in your setup.
>
> Good luck,
> -mjc
That Wikipedia page hasn't been edited since April 7th, so we are all
probably working with the same content.
These are the md5sums of the files I worked with:
6d887f5796a00a24e8fb284d6f78791c Without-k
341611e10271ffa117f873a56a467960 With-k
Hitoshi, if the md5 of your corrupted file is 3416... then I missed the
corruption. A plain wget (without -k) should give 6d88... though.
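For anyone who wants to compare their own downloads against the two sums above, here is a minimal Python sketch. The helper names (md5_of_bytes, md5_of_file, matches_prefix) are mine, and it assumes the files are saved locally under whatever names you chose:

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of an in-memory byte string."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(digest: str, prefix: str) -> bool:
    """Check a digest against an abbreviated sum such as '3416'."""
    return digest.lower().startswith(prefix.lower())
```

Calling md5_of_file on the file fetched with -k and passing the result to matches_prefix with "3416" would tell you whether you have the same file I do.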
A fragment of the relevant bytes (e.g. from hexdump -C) from both the
original and the transformed (broken) file could be enough to find the cause.
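If finding the relevant bytes by hand is tedious, something along these lines could locate them. It is only a sketch: the function names are my own, and the output merely imitates the layout of hexdump -C rather than reproducing it exactly.

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def hex_context(data: bytes, offset: int, before: int = 16, after: int = 16) -> str:
    """Render the bytes around `offset` in rows of 16, hexdump style."""
    start = max(0, offset - before)
    start -= start % 16  # align rows to 16-byte boundaries
    end = min(len(data), offset + after)
    rows = []
    for pos in range(start, end, 16):
        row = data[pos:min(pos + 16, end)]
        hex_part = " ".join(f"{b:02x}" for b in row)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        rows.append(f"{pos:08x}  {hex_part:<47}  |{text}|")
    return "\n".join(rows)
```

Reading the broken file in binary mode, passing its contents to first_invalid_utf8, and printing hex_context at the returned offset should give a snippet small enough to post to the list.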
The latest big change to convert.c was the CSS wonder-patch of 2008,
which is available in 1.12, so there shouldn't be any difference in the
conversion between 1.12 and the latest version.
Still, I built and tried with ftp://ftp.gnu.org/gnu/wget/wget-1.12.tar.bz2
and I did find an interesting issue:
Where the file converted with current wget shows:
<!-- logo -->
<div id="p-logo"><a style="background-image:
url(http://upload.wikimedia.org/...
<!-- /logo -->
<!-- navigation -->
<div class="portal" id='p-navigation'>
<h5>...
<div class="body">
<ul>
<li id="n-mainpage">...
<li id="n-portal">...
<li id="n-currentevents">...
<li id="n-newpages">...
<li id="n-recentchanges">...
The one converted with 1.12 shows:
<!-- panel -->
<div id="mw-panel" class="noprint">
<!-- logo -->
<div id="p-logo"><a style="background-image:
url(//upload.wikimedia.org/....
<!-- /logo -->
<!-- navigation -->
<div class="portal" id='p-navigation'>
<h5>...
<div class="body">
<ul>
<li id="n-mainpage">...
<li id="n-portal">...
<li id="n-currentevents"><a
href="/wiki/Portal:http://upload.wikimedia...
<!-- /logo -->
<!-- navigation -->
<div class="portal" id='p-navigation'>
<h5>...
<div class="body">
<ul>
<li id="n-mainpage"><a href="http://ja.wikipedia.org/wiki/...
<li id="n-portal"><a href="http://ja.wikipedia.org/wiki/....
<li id="n-currentevents"><a
href="http://ja.wikipedia.org/wiki/Portal:%E6%9C%80%E8...
<li id="n-newpages"><a href="http://ja.wikipedia.org/...
<li id="n-recentchanges"><a href="http://ja.wikipedia.org/...
In summary, the protocol-relative link inside the inline CSS is not
converted (not a big bug); then the following 9 lines of the unconverted
file are copied, followed by the rest of the converted file, including
those 9 lines again.
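To illustrate what converting the protocol-relative link should produce, here is a small Python sketch using the standard urljoin. This is not how wget does it internally; the page URL and image path are hypothetical placeholders, and absolutize is just a name I picked.

```python
from urllib.parse import urljoin


def absolutize(base_url: str, link: str) -> str:
    """Resolve a link found in a page against that page's URL."""
    return urljoin(base_url, link)


# Hypothetical page URL and links, for illustration only.
base = "http://ja.wikipedia.org/wiki/Example"
logo = absolutize(base, "//upload.wikimedia.org/logo.png")
portal = absolutize(base, "/wiki/Portal:Example")
```

A protocol-relative link inherits the scheme of the page it appears in, so logo ends up with an http:// prefix, as in the output of current wget shown above.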
On a different fetch, I get a slightly differently corrupted file along
the same lines. It is likely that, depending on how the pieces happened
to be copied, the UTF-8 byte sequences became invalid.
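As a rough illustration of how that could happen (an assumption on my part, not something I traced in the 1.12 code), the Python sketch below splices two byte strings at different offsets. The sample text and the helper names are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def splice(head_src: bytes, tail_src: bytes, cut: int) -> bytes:
    """Join the first `cut` bytes of one buffer with another buffer."""
    return head_src[:cut] + tail_src


# Two 3-byte characters; the first one encodes as e6 9c 80.
head = "最近".encode("utf-8")
tail = b'<li id="n-newpages">'

on_boundary = splice(head, tail, 3)   # cut between the two characters
mid_character = splice(head, tail, 4)  # cut after the lead byte of the second
```

Only the second splice leaves a dangling lead byte, which would explain why the damage varies from one fetch to the next.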
So there was indeed a bug in the 1.12 link conversion, which seems to have
been fixed in the meantime.