Re: wget2 | "utf8" charset breaks urls with invalid utf-8 (#523)

From: Tim Rühsen
Subject: Re: wget2 | "utf8" charset breaks urls with invalid utf-8 (#523)
Date: Wed, 22 Apr 2020 10:41:17 +0000

Tim Rühsen commented:

Wget simply ignores <meta charset=...>, while wget2 (correctly) takes it into account.

"test%E4.jpg" contains a character that is invalid utf-8.
Wget2 has a short circuit: if the source and destination charsets are the same 
(in this case "utf-8"), the conversion is skipped. That's why 'charset=utf-8' 
continues without a conversion error.

But "utf8" differs from "utf-8" and thus a conversion is applied, which fails 
with errno 84 (EILSEQ 84 Invalid or incomplete multibyte or wide character).
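
The behaviour described above can be imitated with a minimal Python sketch. This is not wget2's C code: the function name `convert` and the case-insensitive name comparison are assumptions made for illustration, and Python's `UnicodeDecodeError` stands in for the EILSEQ error.

```python
from urllib.parse import unquote_to_bytes

def convert(raw: bytes, src_charset: str, dst_charset: str) -> bytes:
    """Hypothetical stand-in for a charset conversion step."""
    # Short circuit: identical charset names mean no conversion,
    # and therefore no validation of the bytes either.
    if src_charset.lower() == dst_charset.lower():
        return raw
    # Names differ, so a real conversion runs and invalid input is detected.
    return raw.decode(src_charset).encode(dst_charset)

# "%E4" unescapes to the single byte 0xE4, which is not valid UTF-8 here.
raw = unquote_to_bytes("test%E4.jpg")

# Same names: the bytes pass through untouched.
same = convert(raw, "utf-8", "utf-8")

# "utf8" != "utf-8" as strings, so the conversion is attempted and fails.
try:
    convert(raw, "utf8", "utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
```

In this sketch `same` is the unchanged byte string and `failed` ends up `True`, mirroring the difference between 'charset=utf-8' and 'charset=utf8'.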

#### What can we do?

Of course we can treat "utf8" as "utf-8". That would help in some situations, 
but that disguises the real problem.
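
A rough sketch of such alias handling, in Python rather than wget2's C: the alias table and the helper names `normalize_charset` and `needs_conversion` are hypothetical and only illustrate the idea.

```python
# Hypothetical alias table; a real one would cover many more spellings.
CHARSET_ALIASES = {"utf8": "utf-8"}

def normalize_charset(name: str) -> str:
    """Map a charset label to one canonical spelling."""
    key = name.strip().lower()
    return CHARSET_ALIASES.get(key, key)

def needs_conversion(src_charset: str, dst_charset: str) -> bool:
    """Compare canonical names instead of the raw labels."""
    return normalize_charset(src_charset) != normalize_charset(dst_charset)

# With the alias in place, "utf8" -> "utf-8" hits the short circuit.
skipped = not needs_conversion("utf8", "utf-8")
```

Once the names compare equal, the conversion (and with it any validation) is skipped, so the invalid byte in "test%E4.jpg" simply goes unnoticed. That is the sense in which this approach hides the problem rather than solving it.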

We can use the URL as-is whenever a conversion error occurs (maybe converting 
it to percent-encoded ASCII, if needed). That is a best-effort strategy and is 
not guaranteed to succeed.
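
One way this fallback could look, sketched in Python under stated assumptions: the function name `to_ascii_url_path` is hypothetical, the input is taken to be the already-unescaped bytes of a URL path, and the target charset is assumed to be UTF-8.

```python
from urllib.parse import quote

def to_ascii_url_path(raw: bytes, src_charset: str) -> str:
    """Convert a URL path to UTF-8 if possible, else keep the bytes as-is.

    Either way the result is percent-encoded ASCII.
    """
    try:
        data = raw.decode(src_charset).encode("utf-8")
    except (UnicodeDecodeError, LookupError):
        # Conversion failed (invalid bytes or unknown charset):
        # fall back to the original bytes instead of dropping the URL.
        data = raw
    return quote(data, safe="/")

# Invalid UTF-8 declared as "utf8": the bytes are kept and re-escaped.
fallback = to_ascii_url_path(b"test\xe4.jpg", "utf8")

# Correctly declared charset: the conversion succeeds.
converted = to_ascii_url_path(b"test\xe4.jpg", "iso-8859-1")
```

Here `fallback` reproduces the original "test%E4.jpg", while `converted` becomes "test%C3%A4.jpg". Whether the server accepts the fallback form depends on what it expects, which is why this remains a best-effort approach.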

The real issue is broken page content, but we can never fix that for the 
whole web.

