wget-dev

Re: wget2 | "utf8" charset breaks urls with invalid utf-8 (#523)


From: Tim Rühsen
Subject: Re: wget2 | "utf8" charset breaks urls with invalid utf-8 (#523)
Date: Wed, 22 Apr 2020 10:41:17 +0000



Tim Rühsen commented:


Wget simply ignores <meta charset=...> while wget2 takes it (correctly) into 
account.

"test%E4.jpg" decodes to the byte 0xE4, which is invalid UTF-8 here: it 
starts a multi-byte sequence, but no continuation bytes follow.
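
To see why this URL is a problem, here is a minimal Python sketch (an 
illustration only, not wget2 code; `is_valid_utf8` is a hypothetical helper) 
that percent-decodes the name and checks whether the bytes are valid UTF-8:

```python
from urllib.parse import unquote_to_bytes

# Percent-decode the file name into raw bytes.
raw = unquote_to_bytes("test%E4.jpg")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(raw)                 # the raw bytes, including the lone 0xE4
print(is_valid_utf8(raw))  # the lone 0xE4 byte is rejected
```

The same bytes decode fine as ISO-8859-1 (where 0xE4 is "ä"), which suggests 
the page was probably authored in a legacy 8-bit charset.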
Wget2 has a short circuit: if the source and destination charsets are the 
same, in this case "utf-8", the conversion is skipped. That's why 
'charset=utf-8' continues without a conversion error.

But "utf8" differs from "utf-8", so a conversion is applied, which fails 
with errno 84 (EILSEQ: Invalid or incomplete multibyte or wide character).
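
The behaviour described above can be modelled roughly as follows. This is a 
sketch under assumptions, not wget2's implementation: `convert` is a 
hypothetical function, the name comparison is assumed to be a plain 
case-insensitive string match, and Python reports the failure as 
`UnicodeDecodeError` where iconv reports EILSEQ.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Convert data from charset src to dst (hypothetical model)."""
    # Assumed short circuit: identical charset names skip the conversion,
    # so invalid input passes through unchecked.
    if src.lower() == dst.lower():
        return data
    # Otherwise a real conversion runs and validates the input.
    return data.decode(src).encode(dst)


url = b"test\xe4.jpg"

# "utf-8" -> "utf-8": skipped, the invalid byte goes unnoticed.
print(convert(url, "utf-8", "utf-8"))

# "utf8" -> "utf-8": the names differ, the conversion runs and fails.
try:
    convert(url, "utf8", "utf-8")
except UnicodeDecodeError as exc:
    print("conversion failed:", exc.reason)
```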

#### What can we do?

Of course we can treat "utf8" as "utf-8". That would help in some situations, 
but that disguises the real problem.
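
As a sketch of the alias idea, charset names could be canonicalized before 
they are compared. The example uses Python's codec registry as a stand-in 
alias table; `canonical_charset` is a hypothetical helper, and wget2 would 
need its own equivalent.

```python
import codecs


def canonical_charset(name: str) -> str:
    """Map a charset label to a canonical name (hypothetical helper)."""
    try:
        # The codec registry resolves aliases such as "utf8" and "UTF-8".
        return codecs.lookup(name).name
    except LookupError:
        # Unknown label: fall back to a lowercase comparison key.
        return name.lower()


# With canonical names, "utf8" and "utf-8" hit the same short circuit.
print(canonical_charset("utf8") == canonical_charset("utf-8"))
```

Note that this only makes "utf8" behave like "utf-8"; the invalid byte in the 
URL is still there.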

We can use the URL as-is whenever a conversion error occurs (maybe converting 
to percent-encoded ASCII, if needed). That is a 'best effort' strategy and is 
not guaranteed to succeed.
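
A minimal sketch of such a fallback, assuming the URL is available as raw 
bytes. `url_or_fallback` and the choice of safe characters are hypothetical 
and not taken from wget2:

```python
from urllib.parse import quote

# Characters left untouched by the fallback, including "%" so that
# existing percent escapes are not encoded a second time (an assumption).
SAFE = "/:?#[]@!$&'()*+,;=%~"


def url_or_fallback(raw: bytes, src_charset: str) -> str:
    """Decode raw using src_charset, or fall back to the bytes as-is."""
    try:
        return raw.decode(src_charset)
    except (UnicodeDecodeError, LookupError):
        # Conversion failed: keep the original bytes and percent-encode
        # everything that is not plain ASCII.
        return quote(raw, safe=SAFE)


print(url_or_fallback(b"http://example.com/test\xe4.jpg", "utf8"))
```

The server may still reject the resulting URL, which is why this remains a 
best-effort approach.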

The real issue is broken page content, but we can never fix that for the 
whole web.

-- 
Reply to this email directly or view it on GitLab: 
https://gitlab.com/gnuwget/wget2/-/issues/523#note_328897404



