bug-texinfo

Re: Texinfo 7.0.93 pretest available


From: Gavin Smith
Subject: Re: Texinfo 7.0.93 pretest available
Date: Sun, 8 Oct 2023 20:21:44 +0100

On Sun, Oct 08, 2023 at 08:45:11PM +0300, Eli Zaretskii wrote:
> > From: Gavin Smith <gavinsmith0123@gmail.com>
> > Date: Sun, 8 Oct 2023 18:29:23 +0100
> > Cc: bug-texinfo@gnu.org
> > 
> > On Sun, Oct 08, 2023 at 07:31:12PM +0300, Eli Zaretskii wrote:
> > > I see a very large diff, full of non-ASCII characters.  A typical hunk
> > > is below:
> > > 
> > >   -(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ
> > >   -(ȷ) ‘@H{a}’ a̋ ‘@dotaccent{a}’ ȧ (ȧ) ‘@ringaccent{a}’ å (å)
> > >   -‘@tieaccent{a}’ a͡ ‘@u{a}’ ă (ă) ‘@ubaraccent{a}’ a̲ ‘@udotaccent{a}’ ạ
> > >   -(ạ) ‘@v{a}’ ǎ (ǎ) @,c ç (ç) ‘@,{c}’ ç (ç) ‘@ogonek{a}’ ą (ą)
> > >   +(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ (ȷ)
> > >   +‘@H{a}’ a̋ ‘@dotaccent{a}’ ȧ (ȧ) ‘@ringaccent{a}’ å (å) ‘@tieaccent{a}’ a͡
> > >   +‘@u{a}’ ă (ă) ‘@ubaraccent{a}’ a̲ ‘@udotaccent{a}’ ạ (ạ) ‘@v{a}’ ǎ (ǎ)
> > >   +@,c ç (ç) ‘@,{c}’ ç (ç) ‘@ogonek{a}’ ą (ą)
> > > 
> > > It looks like a filling problem to me, perhaps because something
> > > counts bytes instead of characters?
> > 
> > It's almost certainly a problem with filling as you say.  In the C (XS)
> > code, the return value of wcwidth is used for each character to get
> > the width of each line.  The pure Perl code doesn't use the wcwidth
> > function as far as I know but keeps a count for each line based on
> > regex character classes.  The relevant code is in
> > Texinfo/Convert/Unicode.pm, in the 'string_width' function.
> 
> So perhaps the wcwidth function is the culprit.  I'm guessing that it
> returns 1 for every printable character in my case.

Just comparing the first line in the hunk:

-(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ
+(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ (ȷ)

the line you are getting is longer than the reference results.  

I wonder if wcwidth is returning 0 or -1 for some of the non-ASCII
characters.  Those characters would then not be counted towards the width
of the line, so more text would fit on it before it is broken.
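
To illustrate the effect, here is a minimal sketch.  It is not the actual
XS code: line_width, ascii_only_width and one_column_width are hypothetical
names, and it simply assumes that the line width is the sum of the
per-character values.  The two stand-in functions take the place of wcwidth
so that the result does not depend on the locale.

```c
#include <stddef.h>

/* Hypothetical helper, not the Texinfo XS code: measure a line by
   summing the per-character widths reported by FN. */
typedef int (*width_fn) (wchar_t);

static int
line_width (const wchar_t *s, width_fn fn)
{
  int total = 0;
  for (; *s; s++)
    total += fn (*s);   /* a 0 or -1 here undercounts the line */
  return total;
}

/* Stand-in for a wcwidth that does not recognise non-ASCII characters. */
static int
ascii_only_width (wchar_t c)
{
  return c < 0x80 ? 1 : -1;
}

/* Stand-in for a wcwidth that reports one column for every character. */
static int
one_column_width (wchar_t c)
{
  (void) c;
  return 1;
}
```

With one_column_width the string "é (é)" measures 5 columns; with
ascii_only_width it measures only 1, so a filling routine working from that
number would keep adding words to the line.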

It's also possible that other codepoints have inconsistent wcwidth results,
especially for combining accents.

Do you know if it is the gnulib implementation of wcwidth that is being
used or a MinGW one?
