Re: index sorting in texi2any in C issue with spaces

bug-texinfo

From:	Patrice Dumas
Subject:	Re: index sorting in texi2any in C issue with spaces
Date:	Sun, 4 Feb 2024 22:27:00 +0100

On Sun, Feb 04, 2024 at 08:38:28PM +0000, Gavin Smith wrote:
> > 
> > strcmp is always used as a transformation on the string is done with
> > strxfrm_l for the collation in C.  If USE_UNICODE_COLLATION=0 the string
> > is not transformed, which amounts to using strcmp on the original
> > string.  Therefore it is already implemented that way in C, as can be
> > seen in tp/Texinfo/XS/main/manipulate_indices.c.
> 
> Does this always happen with "texi2any -c USE_UNICODE_COLLATION=0" if
> the XS modules are available or are there more restrictions?

It always happens with "texi2any -c USE_UNICODE_COLLATION=0" if the XS
modules are available.
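
To illustrate what comparing untransformed keys with strcmp means for
entries containing spaces, here is a minimal sketch.  It is not the code
from manipulate_indices.c: the INDEX_ENTRY_SKETCH type and the
sort_entries function are hypothetical names used only for this example,
and the sort keys are assumed to be plain copies of the entry text, as
in the untransformed case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical index entry: the text shown and its precomputed sort key.
   With no transformation the key is simply the entry text. */
typedef struct {
  const char *text;
  const char *sort_key;
} INDEX_ENTRY_SKETCH;

/* qsort callback: compare two entries by their sort keys, byte by byte. */
static int
compare_sort_keys (const void *a, const void *b)
{
  const INDEX_ENTRY_SKETCH *ea = a;
  const INDEX_ENTRY_SKETCH *eb = b;
  return strcmp (ea->sort_key, eb->sort_key);
}

/* Sort N entries in place according to strcmp on the sort keys. */
void
sort_entries (INDEX_ENTRY_SKETCH *entries, size_t n)
{
  assert (entries != NULL || n == 0);
  if (n > 1)
    qsort (entries, n, sizeof (INDEX_ENTRY_SKETCH), compare_sort_keys);
}
```

With the keys "ab", "a b" and "B", this ordering puts "B" first (upper
case sorts before lower case in byte order), then "a b", then "ab", since
the space byte compares lower than any letter.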

> I noticed a potential problem:
> 
> static void
> set_sort_key (locale_t collation_locale, const char *input_string,
>               char **result_key)
> {
>   if (collation_locale)
>     {
>   #ifdef HAVE_STRXFRM_L
>       size_t len = strxfrm_l (0, input_string, 0, collation_locale);
>       size_t check_len;
> 
>       *result_key
>         = (char *) malloc ((len +1) * sizeof (char));
>       check_len = strxfrm_l (*result_key, input_string, len+1,
>                              collation_locale);
>       if (check_len != len)
>         fatal ("strxfrm_l returns a different length");
>   #endif
>     }
>   else
>     *result_key = strdup (input_string);
> }
> 
> It looks like *result_key is not set if HAVE_STRXFRM_L is not defined.

No, that cannot happen, because collation_locale will be 0 if HAVE_STRXFRM_L
is not defined, so the else branch always sets *result_key.  I agree that
the code is somewhat confusing, though; maybe the following would be clearer:

static void
set_sort_key (locale_t collation_locale, const char *input_string,
              char **result_key)
{
#ifdef HAVE_STRXFRM_L
  if (collation_locale)
    {
      size_t len = strxfrm_l (0, input_string, 0, collation_locale);
      size_t check_len;

      *result_key
        = (char *) malloc ((len +1) * sizeof (char));
      check_len = strxfrm_l (*result_key, input_string, len+1,
                             collation_locale);
      if (check_len != len)
        fatal ("strxfrm_l returns a different length");
    }
  else
#endif
    *result_key = strdup (input_string);
}
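
For readers who want to try the two-call strxfrm_l pattern outside of
texi2any, here is a self-contained sketch.  It is not the function from
the tree: make_sort_key is a hypothetical name, it returns the key instead
of storing it through a pointer, and it returns NULL on error rather than
calling fatal.  It assumes a POSIX.1-2008 system (for instance glibc)
where newlocale and strxfrm_l are declared in <locale.h> and <string.h>.

```c
#ifndef _POSIX_C_SOURCE
/* Assumption: this exposes newlocale, strxfrm_l and strdup. */
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the texi2any function.  Returns a malloc'd
   sort key for INPUT, or NULL on failure.  A zero LOC means "no
   collation": the key is then a plain copy of INPUT. */
char *
make_sort_key (locale_t loc, const char *input)
{
  size_t len, check_len;
  char *key;

  assert (input != NULL);
  if (!loc)
    return strdup (input);

  /* First call with size 0: only the required length is returned. */
  len = strxfrm_l (NULL, input, 0, loc);
  key = malloc (len + 1);
  if (!key)
    return NULL;

  /* Second call fills the buffer; the length should be unchanged. */
  check_len = strxfrm_l (key, input, len + 1, loc);
  if (check_len != len)
    {
      free (key);
      return NULL;
    }
  return key;
}
```

Two keys obtained with the same locale can then be compared with strcmp,
which should give the same ordering as strcoll_l on the original strings.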
 

> > If COLLATION_LOCALE is set in Perl, it is not clear to me what would be
> > the output.  Would it be ignored?
> 
> If by "set in Perl" you mean in an output converter module that is written
> in Perl, then we should try to honour the variable and sort exactly as
> specified in the stated locale.  This would probably be done by calling
> into C code to do it, or doing it in Perl somehow (perhaps with "use locale"
> and "cmp", if that actually works).  If it is not possible to do it in
> Perl, and XS modules are not available, it is not a big deal: we just
> print a warning message saying that sorting according to a locale's
> rules is not available.

In that case there is no XS, so it would have to be done with "use locale"
and "cmp".  If that does not work, using POSIX::strxfrm() looks possible,
although the end of the "Category LC_COLLATE: Collation: Text Comparisons
and Sorting" section of perllocale says that it is useless because Perl
already does it (which seems to me to contradict the information in perlop
stating that "use locale" and "cmp" do not work well):
https://perldoc.perl.org/perllocale#Category-LC_COLLATE:-Collation:-Text-Comparisons-and-Sorting

But not doing it in Perl at all actually seems much better to me: in
addition to the unclear Perl documentation, there is already
Unicode::Collate::Locale, which does better.  Ignoring this completely in
pure Perl seems even better to me than printing any message.

> > However, if there is a
> > possibility to get variable elements set to "non-ignorable" in C,
> > possibly by using a hardcoded locale of en_US, it will not be possible to
> > get automatically both the correct and more rapid option.  The user
> > would still have to set COLLATION_LOCALE to get it.
> 
> If this is possible, then we silently switch to using the C sorting 
> if we can detect that we can treat variable elements in such a way.
> This would be the "default", non-tailored collation.

Ok.  In that case, it is even clearer to me that COLLATION_LOCALE is not
really useful.  I think that it will be quite challenging to understand
what it really does, and it will probably go unused.  I think that we
should not propose COLLATION_LOCALE to users at all.  It seems that we
should consider it a short-term way to use collation in C that may not be
worth having in the long term, and document it as being only for XS and
only for development/testing: it lets developers check what output it
leads to, and it could be removed at any time.  We should also call it
something else, like XS_STRXFRM_COLLATION_LOCALE.


> > > For 3), accessing @documentlanguage seems like an unnecessary extra
> > > at the moment.  Again, there would be the problem of strxfrm_l and
> > > Unicode::Collate::Locale doing different things with variable collation
> > > elements.  There is no guarantee that the user has the appropriate
> > > locale installed either (for use with strxfrm_l) 
> > 
> > It seems to me that following @documentlanguage would be more desirable
> > than being able to have the user specify a specific COLLATION_LANGUAGE
> > (or COLLATION_LOCALE).  Indeed, it seems to me to be more aligned with
> > Texinfo, in which information is supposed to come primarily from the
> > Texinfo manual.  Also COLLATION_LANGUAGE and COLLATION_LOCALE suffer from
> > the same problems that you describe for @documentlanguage based
> > customization.  Also, if COLLATION_LANGUAGE and/or COLLATION_LOCALE is
> > implemented, it would be very easy to use what comes from @documentlanguage
> > instead for any of these user-supplied values, so it is a bit strange
> > not to do it.
> 
> There wouldn't be any harm in implementing it as an option.  We'd have to
> decide if it went via strxfrm_l, Unicode::Collate::Locale, or configurable
> for either.

I think that we should decide it now in order to have a fully specified
interface, even if it is not fully implemented.  To me it would seem
much more logical to have it be similar to COLLATION_LANGUAGE, as it is
the correct one.

> > As a side note, transliteration of file names is also different from C
> > and from Perl, the Perl function is used if TEST=1, but otherwise the
> > results are different if TEXINFO_XS_CONVERT=1.
> 
> I don't know what "transliteration of file names" refers to here.  Does
> this refer to the --transliterate-file-names option?

Yes, and also to the transliteration of file names based on sectioning
commands.

-- 
Pat
