Re: [Bug-wget] Save 3 byte utf8 url
From: Ángel González
Subject: Re: [Bug-wget] Save 3 byte utf8 url
Date: Sat, 16 Feb 2013 23:23:36 +0100
User-agent: Thunderbird
On 16/02/13 02:50, L Walsh wrote:
> Ángel González wrote:
>> On 07/02/13 15:06, bes wrote:
>>> Hi,
>>>
>>> i found some bug in wget with interpreting and save percent-encoding
>>> 3 byte
>>> utf8 url
>>>
>>> example:
>>> 1. Create url with "—". This is U+2014 (EM DASH). Percent-encoding
>>> UTF-8 is
>>> "%E2%80%94"
>>> 2. Try wget it: wget "http://example.com/abc—d" or wget "
>>> http://example.com/abc%E2%80%94d" directly
>>> 3. Wget save this URL to file "abc\342%80%94d". Expected is
>>> "abc%E2%80%94d". This is a bug.
>>
>> The problem is that it checks if it's a printable character in latin1.
> Do you mean printable character in the current locale?
No, I mean in latin1 (ISO-8859-1). If it finds a ‘character’ like bell
(0x07), wget doesn't try to put that in the filename but leaves it as
%07.
The codepoints 7F-9F are control codes in ISO-8859 (DEL plus the C1
set: Start of Selected Area, Partial Line Forward...), so wget leaves
those escaped too. However, this implicitly assumes that the url is in
the iso-8859 family, which used to be common, but utf-8 urls are quite
usual nowadays.
BUT utf-8 multi-byte characters can contain bytes in the 80-9F part of
that range, so wget leaves those bytes as %xy while writing the other
bytes of the same character raw. That breaks the utf-8 encoding even on
systems whose filenames are utf-8.
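
To make the effect concrete, here is a minimal Python sketch of the rule
described above. It is only a model of the behaviour, not wget's actual
code; the function names are made up for illustration, and the input is
assumed to be an already percent-encoded ASCII path.

```python
import re

def is_latin1_control(byte: int) -> bool:
    # C0 controls (0x00-0x1F), DEL (0x7F) and the C1 set (0x80-0x9F).
    return byte <= 0x1F or 0x7F <= byte <= 0x9F

def unescape_latin1_style(path: str) -> bytes:
    """Turn %XY into a raw byte unless that byte is a latin1 control code."""
    def repl(match):
        byte = int(match.group(1), 16)
        if is_latin1_control(byte):
            return match.group(0)   # keep the escape, e.g. %07 or %80
        return bytes([byte])        # write the byte itself
    return re.sub(rb"%([0-9A-Fa-f]{2})", repl, path.encode("ascii"))

# EM DASH is E2 80 94 in utf-8: only the first byte gets unescaped.
print(unescape_latin1_style("abc%E2%80%94d"))   # b'abc\xe2%80%94d'
```

The result matches the "abc\342%80%94d" from the report (\342 is 0xE2
in octal): the lead byte is written raw and the two continuation bytes
stay escaped.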
> Or can it not do UTF-8 at all?
>
> latin1 is going the way of the dodo...most sites still use it, but
> HTML5 is supposed to be UTF8..
http://www.whatwg.org/specs/web-apps/current-work/#urls refers to
http://url.spec.whatwg.org/ and it does set the encoding by default to
utf-8. But I think it refers to /encoding/ a character, not to figuring
out which encoding was used in a url.
We could assume it's the same charset as the document, but what to do
with documents that have no charset (because of a wrong configuration,
or because they are scripts, images...)?
It seems easier to treat the url as utf-8 if it contains utf-8
sequences. That still needs a transformation of filenames, though.
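
A rough sketch of that heuristic, in Python for brevity. This is not
something wget does; utf8_name_or_none is a hypothetical helper, and
handling of control characters is left out.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes

def utf8_name_or_none(path: str) -> Optional[str]:
    """Return the decoded name if the escaped bytes form valid utf-8."""
    raw = unquote_to_bytes(path)
    if all(b < 0x80 for b in raw):
        return None                 # pure ASCII, nothing to detect
    try:
        return raw.decode("utf-8")  # strict: invalid sequences raise
    except UnicodeDecodeError:
        return None                 # caller falls back to the old rule

print(utf8_name_or_none("abc%E2%80%94d") is not None)   # True
print(utf8_name_or_none("Gonz%E1lez"))                  # None (latin1 byte)
```

The returned text would still have to be converted to whatever charset
the local filenames use, and that conversion can fail for characters
the target charset does not have.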
> If it found "González" on a file would it be able to save it correctly?
wget is always able to download the urls; the only difference is
whether they "look nice" on your system.
A url like http://example.org/González in utf-8 would be encoded as
http://example.org/Gonz%c3%a1lez, so wget would think those are the
characters Ã (0xC3) and ¡ (0xA1), saving the bytes "as is". So if my
filenames are utf-8 (eg. Linux) I will see it as González; if they are
latin1 (eg. Windows, using windows-1252) I will see it as GonzÃ¡lez.
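
The same two bytes read under both charsets, as a small Python check
(the variable names are just for illustration):

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("Gonz%c3%a1lez")   # b'Gonz\xc3\xa1lez'

as_utf8 = raw.decode("utf-8")      # how a utf-8 filesystem shows the name
as_cp1252 = raw.decode("cp1252")   # how a windows-1252 system shows it

print(as_utf8)     # González
print(as_cp1252)   # GonzÃ¡lez
```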