From: Daphne Preston-Kendal
Subject: bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation
Date: Mon, 3 May 2021 16:37:51 +0200
forward-word, backward-word etc. have inconsistent behaviour when
applied to text containing ASCII straight quotation marks vs. Unicode
quotation marks. The word
don't
with a straight quote (U+0027) counts as a single word, and forward-word
and backward-word will move over the whole thing. Meanwhile,
don’t
with a curly quote (U+2019) counts as two words, and the cursor will
stop at ‘don’ and ‘t’ separately. (Fundamental mode, Emacs 27.2.)
This also means count-words/count-words-region give surprising results
when applied to text containing Unicode curly apostrophes, since they
work by counting the number of times the cursor can move
forward-word-strictly between the given start and end points. (Since
they use forward-word-strictly and not forward-word, the problem can’t
be solved by customizing find-word-boundary-function-table.)
The Right Thing in my view would be for Emacs to use the Unicode TR29
word boundary rules to work out where to put the cursor when
forward-word and backward-word are invoked. They handle punctuation
characters correctly, and the rules are not too complicated.
<http://www.unicode.org/reports/tr29/#Word_Boundaries>
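To illustrate the kind of behaviour the TR29 rules would give for the two
examples above, here is a rough sketch in Python rather than Emacs Lisp. It
is a heavy simplification and not a TR29 implementation: it only models the
idea behind rules WB6/WB7 (no break around a mid-word punctuation character
that sits between two letters), it only knows about U+0027 and U+2019, and
it ignores digits and everything else. The names MID_WORD, is_letter and
split_words are made up for this sketch.

```python
import unicodedata

# Assumption for this sketch: only the two apostrophes from this report
# count as mid-word punctuation. TR29 defines larger character classes.
MID_WORD = {"\u0027", "\u2019"}


def is_letter(ch):
    """True if CH is in a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def split_words(text):
    """Split TEXT into words, keeping an apostrophe that sits between
    two letters inside the word (the idea behind TR29 WB6/WB7)."""
    words = []
    current = ""
    for i, ch in enumerate(text):
        if is_letter(ch):
            current += ch
        elif (ch in MID_WORD and current
              and i + 1 < len(text) and is_letter(text[i + 1])):
            # Letter, apostrophe, letter: do not break here.
            current += ch
        else:
            # Anything else ends the current word, if there is one.
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


print(split_words("don't"))       # straight quote, U+0027
print(split_words("don\u2019t"))  # curly quote, U+2019
```

With this rule both spellings come out as a single word, while quotation
marks that merely surround a word are still treated as boundaries.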
However, how this would interact with the existing
find-word-boundary-function-table customization method, I don’t know.
CLDR makes customizations of the rules for specific (human) languages;
perhaps they could be ported into Emacs somehow.
As a temporary workaround to get correct-ish word counts for my
documents, I’ve hacked up a function that uses how-many instead of
forward-word to count the number of words in a region.
<https://gitlab.com/dpk/dotfiles/-/blob/master/.emacs.d/lisp/wc-mode.el>
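The general idea of counting pattern matches instead of counting cursor
movements can be sketched as follows, again in Python rather than Emacs
Lisp. This is not the code from wc-mode.el linked above, and the regular
expression is an assumption chosen for illustration: a word is taken to be
a run of letters, optionally continued by a straight or curly apostrophe
followed by more letters. WORD_RE and count_words are hypothetical names.

```python
import re

# Assumed definition of a word for this sketch: one or more letters,
# optionally followed by groups of (apostrophe + letters).
# [^\W\d_] matches any Unicode letter ("word character" minus digits
# and underscore).
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")


def count_words(text):
    """Count non-overlapping matches of WORD_RE in TEXT."""
    return len(WORD_RE.findall(text))


print(count_words("I don't know"))
print(count_words("I don\u2019t know"))
```

Because the count depends only on the pattern, both apostrophe characters
are treated the same way regardless of how word motion is configured.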