'wc -m' and combining characters
From: Nick
Subject: 'wc -m' and combining characters
Date: Sun, 10 Mar 2024 12:16:06 -0300
User-agent: Mutt/2.2.12 (2023-09-09)
I'm attempting to learn about UTF-8. My question is about how wc
counts "combining characters", as discussed here
<https://www.cl.cam.ac.uk/~mgk25/unicode.html#comb>.
I made two files. The first, p1.txt, contains "LATIN CAPITAL LETTER A
WITH DIAERESIS". The second, p2.txt, contains "LATIN CAPITAL LETTER A"
followed by "COMBINING DIAERESIS". Neither file contains a newline or
any other bytes.
$ od --format=x1 p1.txt
0000000 c3 84
0000002
$ od --format=x1 p2.txt
0000000 41 cc 88
0000003
My question is: why does wc say that p2.txt contains two characters?
$ wc -m -c p?.txt
1 2 p1.txt
2 3 p2.txt
3 5 total
I'd naively expected the second line of output to start with 1,
i.e. to say that p2.txt has one character. Markus Kuhn's FAQ says
"A combining character is not a full character by itself" but wc is
saying that it is, no?
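To check that these numbers are not a quirk of wc, the same bytes can be
counted outside it. The following is only a minimal sketch in Python,
assuming that "character" in the wc output means one Unicode code point
rather than one user-perceived character. The function name count_units
is made up for illustration and is not part of any tool.

```python
import unicodedata


def count_units(data: bytes) -> tuple[int, int, int]:
    """Return (bytes, code points, code points after NFC) for UTF-8 data.

    count_units is a hypothetical helper, used here only for illustration.
    """
    text = data.decode("utf-8")
    # len() of a Python str counts Unicode code points.
    code_points = len(text)
    # NFC applies canonical composition, which can merge a base letter
    # and a following combining mark into one precomposed code point.
    composed = len(unicodedata.normalize("NFC", text))
    return len(data), code_points, composed


# The exact byte sequences shown by od above.
p1 = bytes([0xC3, 0x84])        # U+00C4
p2 = bytes([0x41, 0xCC, 0x88])  # U+0041 followed by U+0308

for name, data in (("p1.txt", p1), ("p2.txt", p2)):
    print(name, *count_units(data))
```

Under that assumption p2.txt comes out as 3 bytes and 2 code points,
matching the wc output, and only drops to 1 code point once the text is
normalized to NFC.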
Sorry if this has already been done to death. My search of the archives
failed to find a previous discussion, but perhaps I missed it.
Thanks
--
Nick
Asunción 12:04 PYST ► 37°C ◆ nubes ◆ 3Km/h NE ◆ 52% HR