Re: horrible utf-8 performace in wc
From: Pádraig Brady
Subject: Re: horrible utf-8 performace in wc
Date: Wed, 7 May 2008 12:11:34 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)
Jan Engelhardt wrote:
>
> https://bugzilla.novell.com/show_bug.cgi?id=381873
>
> Forwarding this because it is a GNU issue, not specifically a Novell one.
> I reproduced this myself with the latest coreutils from git
> (BTW: You might want to repack that repo, "counting objects" during the
> clone was rather slow in the initial counting.)
>
> Could it be a libiconv problem?
Accounting for multibyte characters is what's taking the time:
~/git/coreutils/src$ time ./wc -m long_lines.txt
13357046 long_lines.txt
real 0m1.860s
~/git/coreutils/src$ time ./wc -c long_lines.txt
13538735 long_lines.txt
real 0m0.002s
Now that is a _lot_ of extra time. libiconv could probably be
made more efficient. I've never actually looked at it.
However wc calls mbrtowc() for each multibyte character.
It would probably be a lot more efficient to use mbstowcs()
to convert the whole read buffer.
Note mbstowcs doesn't handle embedded NULs so one would
need to find these first, and iterate over each substring,
as I did in my version of uniq previously mentioned.
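To make the idea concrete, here is a rough sketch (not the actual wc code) of counting characters by handing each NUL-delimited segment of a buffer to mbstowcs(). The function name count_chars_bulk and the convention of a writable sentinel byte at buf[len] are assumptions for illustration. It relies on the POSIX behaviour that mbstowcs() with a NULL destination only returns the character count, it counts each embedded NUL as one character, and it ignores the case of a multibyte sequence split across two read buffers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count the characters in buf[0..len) by converting whole NUL-delimited
   segments at once instead of calling mbrtowc() per character.
   Assumption: buf has room for one extra byte at buf[len], which is
   overwritten with '\0' so the final segment is terminated.
   Returns (size_t) -1 if a segment contains an invalid sequence.  */
size_t
count_chars_bulk (char *buf, size_t len)
{
  assert (buf != NULL);
  buf[len] = '\0';                 /* sentinel terminating the last segment */

  size_t chars = 0;
  size_t pos = 0;
  while (pos < len)
    {
      size_t seg_bytes = strlen (buf + pos);   /* bytes up to the next NUL */

      /* POSIX: with a NULL destination, mbstowcs() returns the number
         of wide characters the segment would convert to.  */
      size_t n = mbstowcs (NULL, buf + pos, 0);
      if (n == (size_t) -1)
        return (size_t) -1;

      chars += n;
      pos += seg_bytes;
      if (pos < len)
        {
          chars++;                 /* the embedded NUL itself */
          pos++;
        }
    }
  return chars;
}
```

The caller is expected to have called setlocale (LC_ALL, "") beforehand. Note that each segment is scanned twice here (strlen, then mbstowcs), so whether this beats the per-character loop would need measuring.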
Also, mbstowcs doesn't canonicalize equivalent multibyte sequences,
so in this regard it behaves the same as our current
processing of each wide character separately.
This could actually be considered a bug, i.e. should -m give
the number of wide chars, or the number of multibyte chars?
With the attached patch, `wc -m` gives 23 chars for both these lines.
canonically équivalent
canonically équivalent
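For illustration, the effect of the patch can be sketched outside of wc as a small counting function. This is only a sketch under assumptions: the name count_chars and the skip_zero_width flag are made up here, and it uses the same mbrtowc()/wcwidth() calls that wc relies on. In a UTF-8 locale, a combining accent would normally have width 0 and so would be skipped when the flag is set.

```c
#ifndef _XOPEN_SOURCE
# define _XOPEN_SOURCE 700   /* needed for the wcwidth() declaration */
#endif
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in the NUL-terminated multibyte string S.
   If SKIP_ZERO_WIDTH is true, characters for which wcwidth() returns 0
   (such as combining accents) are not counted, mirroring the patch.
   Returns (size_t) -1 on an invalid or incomplete sequence.  */
size_t
count_chars (const char *s, bool skip_zero_width)
{
  assert (s != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);

  size_t remaining = strlen (s);
  size_t chars = 0;
  while (remaining > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, s, remaining, &state);
      if (n == (size_t) -1 || n == (size_t) -2)
        return (size_t) -1;
      if (n == 0)
        n = 1;               /* defensive: a NUL occupies one byte */

      s += n;
      remaining -= n;

      if (!(skip_zero_width && wcwidth (wc) == 0))
        chars++;
    }
  return chars;
}
```

As with wc itself, the result depends on the locale in effect, so the caller needs setlocale (LC_ALL, "") first. With the flag set, the precomposed and the combining spelling of "é" should each count as one character, provided the locale's width tables report the combining accent as zero-width.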
Pádraig.
P.S. I notice that gnome-terminal still doesn't handle
combining characters correctly, and my mail client, Thunderbird,
is putting the accent on the q rather than the e. Sigh.
diff --git a/src/wc.c b/src/wc.c
index 61ab485..f7f7109 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -368,6 +368,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
                         linepos += width;
                       if (iswspace (wide_char))
                         goto mb_word_separator;
+                      else if (width == 0)
+                        chars--; /* don't count combining chars */
                       in_word = true;
                     }
                   break;