sed-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] sed: fix minor multibyte parsing bug


From: Jim Meyering
Subject: Re: [PATCH] sed: fix minor multibyte parsing bug
Date: Sun, 3 Jul 2016 22:24:45 -0700

On Sun, Jul 3, 2016 at 8:53 PM, Assaf Gordon <address@hidden> wrote:
> Hello Jim,
>
> Attached is an updated patch with improved wording, let me know if you spot 
> other issues.

Nice work.
Typo: s/octect/octet:

+# TODO: Find a locale in which ':.=' can be part of a valid multibyte octect.
+#
+# snarf_char_class specifically tests for five characters: ':.=[]'
+# Characters [] are tested above, yet '.:=' is not valid as part of a multibyte
+# shift-is sequence.

Should the above be "shift-jis"?
Also, perhaps s/five characters/five bytes/?

> Regarding your check for false-positive: sadly I don't think it works:
> Even coreutils' wc returns "1" on this invalid sequence, so there's no easy 
> way to detect it:
>   $ printf '\203\060' | LC_ALL=ja_JP.sjis ltrace -e mbrtowc wc -m
>   wc->mbrtowc(0x7fff906a513c, 0x7fff906a5150, 2, 0x7fff906a5140) = -1
>   wc->mbrtowc(0x7fff906a513c, 0x7fff906a5151, 1, 0x7fff906a5140) = 1
>   1
>   +++ exited (status 0) +++
>
>   $ printf '\203\060' | iconv -f SHIFT-JIS -t UTF-8
>   iconv: illegal input sequence at position 0
>
>   $ wc --version
>   wc (GNU coreutils) 8.25.7-f91f2
>
> And I suspect the reason is this conditional in 'wc':
> http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/wc.c#n378
> with the following explanation:
>                       /* Remember that we read a byte, but don't complain
>                          about the error.  Because of the decoding error,
>                          this is a considered to be byte but not a
>                          character (that is, chars is not incremented).  */
>
> Perhaps a more robust way to detect this is to add an auxiliary test program 
> in C
> (like testsuite/get-mb-cur-max.c) that will detect and report the exact 
> results
> as we need them.
> I'm happy to write such a patch if you think this is the way to go.
> This should also solve the false-positive of 'invalid-mb-seq-UMR.sh' test.

That sounds like the right approach, indeed.
Thanks for volunteering.

> Lastly,
> You've mentioned the coding-style (and space-before-parens specifically),
> and I noticed that sed's "make syntax-check" does not check for them.
> After adding 'sc_space_before_open_paren' to cfg.mk (copied from coreutils),
> I saw that many lines do not adhere to the style.
> I'll add that to my TODO list...

Yes, there are numerous style issues that make it hard to provide
feedback: the existing code is not self-consistent.
Please do not introduce new instances of space-before-paren. This
patch still does introduce three.

Eventually, once we've made a few releases and have reduced the number
of patches distributions choose to apply, I would like to mechanically
format all .c and .h files, so that we can trivially/mechanically
verify that proposed patches adhere to the desired formatting style.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]