From: arnold
Subject: Re: inconsistency with counting characters vs bytes for multi-byte characters
Date: Tue, 12 Sep 2023 15:38:33 -0600
User-agent: Heirloom mailx 12.5 7/5/10

OOOPS! That diff had debugging stuff in it. Please ignore. Attached
is the correct diff.
Arnold
arnold@skeeve.com wrote:
> Hi Ed.
>
> Thank you for reporting this. It is most definitely a bug. This is Yet
> Another Interesting Corner Case. I guess as UTF-8 becomes more and more
> common, these bugs will get shaken out.
>
> I have attached a fix below, which passes the test suite and seems to
> fix the problem. I'm going to let it stew for a day or two before pushing
> it out to the Git repo.
>
> Thanks!
>
> Arnold
>
> Ed Morton <mortoneccc@comcast.net> wrote:
>
> > Arnold et al - someone on a forum just pointed out this:
> >
> > $ awk 'BEGIN{str="abc"; n=gsub(//,"X",str); print n, str }'
> > 4 XaXbXcX
> >
> > $ awk 'BEGIN{str="\342\200\257"; n=gsub(//,"X",str); print n, str }'
> > 4 X▒X▒X▒X
> >
> > i.e. gsub() with an empty regexp matches around each byte in that 3-byte
> > character. I don't recall ever having wanted to match an empty regexp
> > and can't find a reference to that in the documentation, so I don't know
> > whether that's expected behavior, undefined behavior, or a similar issue
> > to the match() issue below. I thought it best to just pass it along so
> > you can decide what, if anything, to do about it.
> >
> > In case some background would be useful, there's a discussion on this at
> > the bottom of https://stackoverflow.com/a/77010950/1745001 - the person
> > whose login there is "RARE Kpop Manifesto" advocating for not changing
> > match() is the same Jason Kwan you've interacted with previously in this
> > mailing list, e.g. at
> > https://lists.gnu.org/archive/html/bug-gawk/2021-09/msg00073.html.
> >
> > Ed.
Attachment: gsub-fix.diff (text document)
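
For readers following the byte-versus-character distinction in Ed's report, here is a minimal sketch in Python (not awk, and not how gawk implements gsub()). It only illustrates why an empty pattern matches 4 times when the input is treated as 3 bytes, but 2 times when the same input is treated as 1 character. The helper name gsub_empty is hypothetical and exists only for this illustration.

```python
import re


def gsub_empty(text, repl):
    """Replace every empty-pattern match in text (str or bytes).

    Hypothetical helper for illustration only; returns (count, result),
    mirroring the "n, str" pair printed in the awk examples.
    """
    pattern = "" if isinstance(text, str) else b""
    result, count = re.subn(pattern, repl, text)
    return count, result


# "\342\200\257" in octal is the UTF-8 encoding of U+202F.
char = "\u202f"
raw = char.encode("utf-8")  # three bytes: e2 80 af

print(gsub_empty("abc", "X"))  # 3 characters: 4 empty matches
print(gsub_empty(char, "X"))   # 1 character: 2 empty matches
print(gsub_empty(raw, b"X"))   # 3 bytes: 4 empty matches, splitting the character
```

The third call corresponds to the count of 4 in the report: an empty match falls between every pair of bytes, so the replacement text lands inside the multi-byte character.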