coreutils

Re: wc: expand help of '-L' (and a question)


From: Stephane Chazelas
Subject: Re: wc: expand help of '-L' (and a question)
Date: Wed, 13 May 2015 13:01:12 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

2015-05-13 03:00:48 +0100, Pádraig Brady:
[...]
> Yes. You could filter with sed to adjust:
> 
>          sed 's/././g' | wc -L    # count chars
> LC_ALL=C sed 's/././g' | wc -L    # count bytes
[...]

Note that Unicode code points D800 to DFFF (reserved for UTF-16
encoding) and 110000 to 7FFFFFFF (now that they've given up on
ever having anything above 10FFFF) are not characters.

Still, GNU sed considers their UTF-8 encodings (as per the
original UTF-8 encoding, before it got limited to 4 bytes)
as characters.

$ printf '\ud800\udfff\U110000\U7fffffff\n' | sed s/././g | wc -L
4

(I'm not sure I'd object to that though).
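
For reference, the ranges mentioned above can be checked with plain
shell arithmetic. This is only a rough sketch: the helper names
cp_class and utf8_len are made up for illustration and are not part of
coreutils or sed, and the byte lengths follow the original UTF-8
scheme of up to 6 bytes, not the 4-byte limit from RFC 3629.

```shell
# Hypothetical helper: classify a code point given as hex digits.
cp_class() {
  cp=$((0x$1))
  if [ "$cp" -ge $((0xD800)) ] && [ "$cp" -le $((0xDFFF)) ]; then
    echo surrogate        # reserved for UTF-16, not a character
  elif [ "$cp" -gt $((0x10FFFF)) ]; then
    echo out-of-range     # above the current Unicode limit
  else
    echo scalar           # a valid Unicode scalar value
  fi
}

# Hypothetical helper: encoded length in bytes under the original
# UTF-8 scheme, which covered values up to 7FFFFFFF.
utf8_len() {
  cp=$((0x$1))
  if   [ "$cp" -lt $((0x80)) ];      then echo 1
  elif [ "$cp" -lt $((0x800)) ];     then echo 2
  elif [ "$cp" -lt $((0x10000)) ];   then echo 3
  elif [ "$cp" -lt $((0x200000)) ];  then echo 4
  elif [ "$cp" -lt $((0x4000000)) ]; then echo 5
  else                                    echo 6
  fi
}

# The four code points used in the printf example above.
for cp in d800 dfff 110000 7fffffff; do
  printf 'U+%s: %s, %s byte(s)\n' "$cp" "$(cp_class "$cp")" "$(utf8_len "$cp")"
done
```

None of the four code points from the example comes out as "scalar",
even though sed counts each of them as one character.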

Other byte sequences that don't form valid characters are not
treated as characters:

$ printf '\x80\xff' | sed s/././g | wc -L
0

-- 
Stephane



