Re: [Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illeg
From: Ángel González
Subject: Re: [Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illegail utf-8 sequence
Date: Sat, 09 Jun 2012 20:39:12 +0200
User-agent: Thunderbird
On 08/06/12 18:26, address@hidden wrote:
> Hi,
>
> I have a problem when using --convert-links (-k) on a utf-8 encoded web page.
>
> How to reproduce is:
>
> wget -k --restrict-file-names=nocontrol http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84
> (This is a Japanese wiki page.)
>
> The file name is in UTF-8. To check the UTF-8 sequence:
>
> iconv -f utf-8 -t utf-8 [downloadedfile(replaced for non-utf-8 env)] > /dev/null
> iconv: illegal input sequence at position 77822
> (Opening the file with gedit also shows the corruption.)
>
> Without the -k option, the file is not broken. The corruption usually
> appears near the end of the file, typically as only one or two bytes of
> illegal UTF-8. Near the illegal characters, some of the data is missing.
> The added illegal bytes are typically 0xe3 or 0xe383, but are not limited
> to those. Whether the problem occurs depends on the input file; around 20%
> of Japanese wiki pages show it.
>
> I have not yet tried wget 1.13, and I could not find any related
> information on the web. I looked at convert.c, but I am not familiar
> with the code.
I'm not seeing that error (wget 1.13.4).
I ran:
> wget http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84 -O Without-k
> wget -k http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84 -O With-k
Comparing the two files, the differences seem to be the expected ones.
(I found it is converting <a href="#cite_ref-0"> to <a
href="With-k#cite_ref-0">, which is unneeded, but that'd be a different
bugfix)
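For anyone who wants to repeat the comparison without eyeballing a full
diff, here is a minimal sketch in Python that lists only the lines that
differ between the two downloads. The helper name changed_lines is made up
for illustration; it decodes with errors="replace" so that a corrupted file
can still be compared.

```python
import difflib


def changed_lines(before: bytes, after: bytes) -> list:
    """Return the added/removed lines between two downloaded pages."""
    # Decode leniently: a corrupted file must not abort the comparison.
    a = before.decode("utf-8", errors="replace").splitlines()
    b = after.decode("utf-8", errors="replace").splitlines()
    diff = list(difflib.unified_diff(a, b, lineterm="", n=0))
    # The first two entries are the '---' / '+++' file headers; skip them,
    # then keep only '+' and '-' lines (dropping the '@@' hunk markers).
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

It would be called as changed_lines(open("Without-k", "rb").read(),
open("With-k", "rb").read()); with a healthy -k run, every reported pair
should be a rewritten link and nothing else.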
Iconv conversion doesn't show any error either:
> iconv -f utf-8 -t utf-8 < With-k > /dev/null
> iconv -f utf-8 -t utf-8 < Without-k > /dev/null
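If iconv is not handy, or you want to see the bytes around the failure
rather than just the position, roughly the same check can be sketched in
Python. This is only an illustration, not anything wget or iconv provides:
the function name first_invalid_utf8 and the 8-byte context window are
arbitrary choices, and the offset comes from Python's own decoder.

```python
def first_invalid_utf8(data: bytes, context: int = 8):
    """Return (offset, surrounding bytes) for the first invalid UTF-8
    sequence in data, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset where decoding failed.
        start = max(err.start - context, 0)
        return err.start, data[start:err.end + context]
    return None
```

Calling first_invalid_utf8(open("With-k", "rb").read()) should return None
for a clean file; on a file like the one in the original report it would
return the byte offset together with a few neighbouring bytes, which makes
a stray 0xe3 easy to spot.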