'wc -m' and combining characters

coreutils

From:	Nick
Subject:	'wc -m' and combining characters
Date:	Sun, 10 Mar 2024 12:16:06 -0300
User-agent:	Mutt/2.2.12 (2023-09-09)

I'm attempting to learn about UTF-8.  My question is about how wc
counts "combining characters", as discussed here
<https://www.cl.cam.ac.uk/~mgk25/unicode.html#comb>.

I made two files, one with "LATIN CAPITAL LETTER A WITH DIAERESIS"
called p1.txt.  The other with "LATIN CAPITAL LETTER A" followed by
"COMBINING DIAERESIS", called p2.txt.  Neither file contained a
newline or any other bytes.

   $ od --format=x1 p1.txt
   0000000 c3 84
   0000002
   $ od --format=x1 p2.txt
   0000000 41 cc 88
   0000003
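
For anyone wanting to reproduce the two files, here is a minimal Python
sketch.  The file names match those above; the helper name write_samples
is made up for illustration.  It writes each string as UTF-8 with no
trailing newline and reads the raw bytes back.

```python
import os
import tempfile

# U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS (one precomposed code point)
PRECOMPOSED = "\u00c4"
# U+0041 LATIN CAPITAL LETTER A followed by U+0308 COMBINING DIAERESIS
DECOMPOSED = "A\u0308"


def write_samples(directory):
    """Write p1.txt and p2.txt without a newline; return their raw bytes."""
    contents = {"p1.txt": PRECOMPOSED, "p2.txt": DECOMPOSED}
    result = {}
    for name, text in contents.items():
        path = os.path.join(directory, name)
        # Binary mode so that nothing is added or translated on write.
        with open(path, "wb") as f:
            f.write(text.encode("utf-8"))
        with open(path, "rb") as f:
            result[name] = f.read()
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for name, data in write_samples(d).items():
            # Prints "p1.txt c3 84" and "p2.txt 41 cc 88"
            print(name, data.hex(" "))
```

The hex dump should agree with the od output above: two bytes for the
precomposed form, three for the letter plus combining mark.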

My question is: why does wc say that p2.txt contains two characters?

   $ wc -m -c p?.txt
   1 2 p1.txt
   2 3 p2.txt
   3 5 total

I'd naively expected that second line of output to start with 1,
i.e. saying the file p2.txt has one character.  Markus Kuhn's FAQ says
"A combining character is not a full character by itself" but wc is
saying that it is, no?
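
To make the two possible meanings of "character" concrete, here is a
small Python sketch.  It is not how wc is implemented; it only shows
that the figures above match a count of decoded code points, whereas
skipping combining marks gives the 1 I expected.  The function names
are invented for illustration, and skipping code points with a non-zero
combining class is only a rough stand-in for counting user-perceived
characters.

```python
import unicodedata


def count_code_points(data: bytes) -> int:
    """Number of Unicode code points after decoding the bytes as UTF-8."""
    return len(data.decode("utf-8"))


def count_without_combining(data: bytes) -> int:
    """Code points whose canonical combining class is 0 (non-combining)."""
    text = data.decode("utf-8")
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


p1 = b"\xc3\x84"      # contents of p1.txt
p2 = b"\x41\xcc\x88"  # contents of p2.txt

for name, data in (("p1.txt", p1), ("p2.txt", p2)):
    # Columns: bytes, code points, code points excluding combining marks
    print(name, len(data), count_code_points(data),
          count_without_combining(data))

# NFC normalization composes A + COMBINING DIAERESIS into the single
# precomposed code point, so the two files are canonically equivalent.
assert unicodedata.normalize("NFC", p2.decode("utf-8")) == p1.decode("utf-8")
```

This prints "p1.txt 2 1 1" and "p2.txt 3 2 1": the middle column matches
the wc -m output, the last column is the count I had naively expected.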

Sorry if this has already been done to death.  My search of the archives
failed to find a previous discussion, but perhaps I missed it.

Thanks
-- 
Nick
Asunción 12:04 PYST ►  37°C  ◆  nubes  ◆  3Km/h NE  ◆  52% HR
