bug-gnu-emacs

bug#64420: string-width of … is 2 in CJK environments


From: Dmitry Gutov
Subject: bug#64420: string-width of … is 2 in CJK environments
Date: Thu, 27 Jul 2023 04:52:57 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 13/07/2023 08:23, Eli Zaretskii wrote:
From: Yuan Fu<casouri@gmail.com>
Date: Wed, 12 Jul 2023 14:11:14 -0700
Cc: Eli Zaretskii<eliz@gnu.org>,
  64420@debbugs.gnu.org

Here’s what I know: In a CJK “context”, “…” is supposed to be one ideograph 
wide (like all CJK punctuation), i.e., width=2.

However, it’s not as simple as “they used the wrong font”, because both Latin 
and CJK use the same Unicode code point for “…”, but expect different glyphs. 
In publication, this is solved by manually marking the text with style or font, 
so the software uses the desired glyph. Terminals and editors don’t have this 
luxury.

BTW, it’s not just ellipses: CJK and Latin share the same code points for 
quotes, em dash, and middle dot while expecting different glyphs for them.

Since most terminals and editors (especially terminals) query ASCII/Latin fonts 
before falling back to CJK fonts, I expect most terminals and editors to show the 
Latin glyph for “…” (width=1) most of the time.

So practically, it would be correct most of the time if we assume the following 
code points have a width of 1, regardless of locale:

– HORIZONTAL ELLIPSIS …
– LEFT/RIGHT DOUBLE QUOTATION MARK “”
– LEFT/RIGHT SINGLE QUOTATION MARK ‘’
– EM DASH —
– MIDDLE DOT ·

But obviously if someone configures their terminal or editor to use a CJK font 
first, these characters MIGHT have width = 2. I said MIGHT because there are 
plenty of CJK fonts that use the 1-width Latin glyph for these characters by 
default.

It might be helpful to have a wrapper around string-width that applies heuristics 
like this, while string-width itself goes strictly by Unicode and locale.

Thanks.  My conclusion from the above is a bit different: we should
introduce a user option to modify the behavior of
use-cjk-char-width-table, such that users who have fonts where these
characters are not double-width could have the width of these
characters left at their Unicode values.
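
To make the two proposals above concrete, here is a rough sketch in Python (not Emacs code, and not the eventual patch). It assumes the characters listed earlier are all classified as East Asian Width "Ambiguous", which is what Python's unicodedata reports for them. The names display_width, char_width, and the flag ambiguous_exceptions_narrow are hypothetical and only stand in for whatever wrapper or user option might be chosen.

```python
import unicodedata

# The code points singled out in the quoted message:
# U+2026 ellipsis, U+201C/U+201D double quotes, U+2018/U+2019 single quotes,
# U+2014 em dash, U+00B7 middle dot.
NARROW_EXCEPTIONS = frozenset("\u2026\u201c\u201d\u2018\u2019\u2014\u00b7")


def char_width(ch, cjk=False, ambiguous_exceptions_narrow=True):
    """Return an estimated column width (1 or 2) for a single character.

    Simplification: combining and zero-width characters are not handled.
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        # Wide and Fullwidth characters take two columns everywhere.
        return 2
    if eaw == "A" and cjk:
        # Ambiguous characters are normally treated as wide in a CJK
        # environment; the hypothetical option keeps the listed ones narrow.
        if ambiguous_exceptions_narrow and ch in NARROW_EXCEPTIONS:
            return 1
        return 2
    return 1


def display_width(text, cjk=False, ambiguous_exceptions_narrow=True):
    """Sum the per-character widths of text."""
    return sum(char_width(ch, cjk, ambiguous_exceptions_narrow) for ch in text)
```

With the flag off, display_width("\u2026", cjk=True, ambiguous_exceptions_narrow=False) gives 2, which matches the behavior reported in this bug; with the flag on it gives 1. Other ambiguous characters, such as Greek letters, stay at width 2 in the CJK case either way.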

We could add an option and set its default to whatever turns out to be the common opinion here.

Anyway, it doesn't seem like anybody else in this discussion is better equipped to choose that user option's name, or write the rest of the patch.




