Re: inconsistency with counting characters vs bytes for multi-byte chara

bug-gawk

From:	arnold
Subject:	Re: inconsistency with counting characters vs bytes for multi-byte characters
Date:	Tue, 12 Sep 2023 15:38:33 -0600
User-agent:	Heirloom mailx 12.5 7/5/10

OOOPS! That diff had debugging stuff in it. Please ignore. Attached
is the correct diff.

Arnold

arnold@skeeve.com wrote:

> Hi Ed.
>
> Thank you for reporting this. It is most definitely a bug. This is Yet
> Another Interesting Corner Case.  I guess as UTF-8 becomes more and more
> common, these bugs will get shaken out.
>
> I have attached a fix below, which passes the test suite and seems to
> fix the problem. I'm going to let it stew for a day or two before pushing
> it out to the Git repo.
>
> Thanks!
>
> Arnold
>
> Ed Morton <mortoneccc@comcast.net> wrote:
>
> > Arnold et al - someone on a forum just pointed out this:
> >
> >      $ awk 'BEGIN{str="abc"; n=gsub(//,"X",str); print n, str }'
> >      4 XaXbXcX
> >
> >      $ awk 'BEGIN{str="\342\200\257"; n=gsub(//,"X",str); print n, str }'
> >      4 X▒X▒X▒X
> >
> > i.e. gsub() with an empty regexp matches around each byte in that 3-byte
> > character. I don't recall ever having wanted to match an empty regexp
> > and can't find a reference to that in the documentation, so I don't know
> > whether that's expected behavior, undefined behavior, or a similar issue
> > to the match() issue below. I thought it best to just pass it along so
> > you can decide what, if anything, to do about it.
> >
> > In case some background would be useful, there's a discussion on this at 
> > the bottom of https://stackoverflow.com/a/77010950/1745001 - the person 
> > whose login there is "RARE Kpop Manifesto" advocating for not changing 
> > match() is the same Jason Kwan you've interacted with previously in this 
> > mailing list, e.g. at 
> > https://lists.gnu.org/archive/html/bug-gawk/2021-09/msg00073.html.
> >
> >      Ed.
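
For readers following along, the byte-versus-character distinction in Ed's report can be sketched outside of awk. The snippet below is only an illustration in Python, not gawk's implementation and not the attached fix: the helper name gsub_empty is hypothetical, and it assumes Python's re.subn treats empty matches the way gsub(//, "X") is being asked to. Run on the decoded character U+202F (the same three bytes \342\200\257 from the report), an empty pattern matches twice; run on the raw UTF-8 bytes, it matches four times, which is the count the report shows.

```python
import re


def gsub_empty(text):
    """Replace every empty match in text with X; return (count, result).

    Hypothetical helper for illustration only. Accepts either str
    (character-oriented) or bytes (byte-oriented) input.
    """
    if isinstance(text, str):
        pattern, marker = "", "X"
    else:
        pattern, marker = b"", b"X"
    result, count = re.subn(pattern, marker, text)
    return count, result


# U+202F NARROW NO-BREAK SPACE; its UTF-8 encoding is the byte
# sequence E2 80 AF, i.e. octal \342\200\257 as in the report.
nnbsp = "\u202f"

ascii_count, ascii_result = gsub_empty("abc")
char_count, _ = gsub_empty(nnbsp)
byte_count, _ = gsub_empty(nnbsp.encode("utf-8"))

print(ascii_count, ascii_result)  # empty matches around 3 ASCII characters
print(char_count)                 # matches around 1 character
print(byte_count)                 # matches around 3 bytes
```

The character-oriented count is what one would presumably expect from gsub() in a UTF-8 locale once the bug is fixed, though the thread itself does not state the corrected output.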

Attachment: gsub-fix.diff (text document)
