sed-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] sed: fix minor multibyte parsing bug


From: Assaf Gordon
Subject: Re: [PATCH] sed: fix minor multibyte parsing bug
Date: Sun, 3 Jul 2016 23:53:27 -0400

Hello Jim,

Attached is an updated patch with improved wording, let me know if you spot 
other issues.

Regarding your check for false-positive: sadly I don't think it works:
Even coreutils' wc returns "1" on this invalid sequence, so there's no easy way 
to detect it:
  $ printf '\203\060' | LC_ALL=ja_JP.sjis ltrace -e mbrtowc wc -m
  wc->mbrtowc(0x7fff906a513c, 0x7fff906a5150, 2, 0x7fff906a5140) = -1
  wc->mbrtowc(0x7fff906a513c, 0x7fff906a5151, 1, 0x7fff906a5140) = 1
  1
  +++ exited (status 0) +++

  $ printf '\203\060' | iconv -f SHIFT-JIS -t UTF-8
  iconv: illegal input sequence at position 0

  $ wc --version
  wc (GNU coreutils) 8.25.7-f91f2

And I suspect the reason is this conditional in 'wc':
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/wc.c#n378
with the following explanation:
                      /* Remember that we read a byte, but don't complain
                         about the error.  Because of the decoding error,
                         this is a considered to be byte but not a
                         character (that is, chars is not incremented).  */

Perhaps a more robust way to detect this is to add an auxiliary test program in 
C
(like testsuite/get-mb-cur-max.c) that will detect and report the exact results
as we need them.
I'm happy to write such a patch if you think this is the way to go.
This should also solve the false-positive of 'invalid-mb-seq-UMR.sh' test.

Lastly,
You've mentioned the coding-style (and space-before-parens specifically),
and I noticed that sed's "make syntax-check" does not check for them.
After adding 'sc_space_before_open_paren' to cfg.mk (copied from coreutils),
I saw that many lines do not adhere to the style.
I'll add that to my TODO list...


regards,
 - assaf


Attachment: 0001-sed-fix-minor-multibyte-parsing-bug.patch.xz
Description: Binary data




reply via email to

[Prev in Thread] Current Thread [Next in Thread]