[no subject]
From: Patrice Dumas
Date: Tue, 30 Jan 2024 17:28:30 -0500 (EST)
branch: master
commit 5f6abe5eb4cb95684aa205429512f10df07993e4
Author: Patrice Dumas <pertusus@free.fr>
AuthorDate: Tue Jan 30 23:14:56 2024 +0100
tp/TODO: update
---
tp/TODO | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/tp/TODO b/tp/TODO
index f2197e999a..ed9e3901dc 100644
--- a/tp/TODO
+++ b/tp/TODO
@@ -65,12 +65,19 @@ Document *XS_EXTERNAL_FORMATTING *XS_EXTERNAL_CONVERSION?
Delayed bugs
============
-For unicode sorting in C for index sorting, strcoll_l or even better strxfrm_l
-could be used, it sorts according to a locale specified in argument. However,
-it does not seems to be that portable, there is no associated gnulib
-module, and the _l variants seem to be in glibc but are are not in the glibc
-documentation. According to Eli, if the locale's codeset is UTF-8, glibc
-uses the full Unicode CLDR, which is what we want.
+Sorting indices in C with strxfrm_l using the "en_US.utf-8" locale with
+LC_COLLATE_MASK is quite consistent with Perl for numbers and letters, but
+leads to a different output than with Perl for non-alphanumeric characters,
+which is probably somewhat incidental. There are also differences that seem to
+be related to spaces, with a result that looks better in Perl. It could be the
+effect of 'variable' => 'Non-Ignorable' in Perl, as it allows spaces
+and punctuation marks to sort before letters.
+
+Transliteration/protection with iconv in C leads to a result different from
+Perl for some characters. It seems that the iconv result depends on the locale,
+and there are quite a few '?' in the output, probably when there is no obvious
+transliteration. In those cases, the Unidecode transliterations are not
+necessarily very good, either.
hyphenation: should only appear in toplevel.
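
For anyone wanting to reproduce the two "Delayed bugs" observations above,
here is a minimal sketch, not the code used in texinfo. The helper names
collate_cmp and translit_to_ascii are hypothetical. It assumes a glibc-like
system where newlocale, strxfrm_l and the "//TRANSLIT" iconv suffix are
available. The fallback to the "C" locale is an addition for illustration
only, so that the sketch still runs when "en_US.utf-8" is not installed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for newlocale() and strxfrm_l() */
#endif
#include <iconv.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compare two strings through their strxfrm_l()
   collation keys in the named locale.  Falls back to the "C" locale
   (an assumption of this sketch) when the locale is not installed.
   Returns <0, 0 or >0 like strcmp().  */
int
collate_cmp (const char *a, const char *b, const char *locale_name)
{
  locale_t loc = newlocale (LC_COLLATE_MASK, locale_name, (locale_t) 0);
  if (loc == (locale_t) 0)
    loc = newlocale (LC_COLLATE_MASK, "C", (locale_t) 0);
  if (loc == (locale_t) 0)
    return strcmp (a, b);

  /* A call with size 0 returns the length needed for the key.  */
  size_t len_a = strxfrm_l (NULL, a, 0, loc) + 1;
  size_t len_b = strxfrm_l (NULL, b, 0, loc) + 1;
  char *key_a = malloc (len_a);
  char *key_b = malloc (len_b);
  int result = strcmp (a, b);   /* used only if allocation fails */
  if (key_a != NULL && key_b != NULL)
    {
      strxfrm_l (key_a, a, len_a, loc);
      strxfrm_l (key_b, b, len_b, loc);
      result = strcmp (key_a, key_b);
    }
  free (key_a);
  free (key_b);
  freelocale (loc);
  return result;
}

/* Hypothetical helper: transliterate UTF-8 text to ASCII with iconv.
   Returns a malloc'ed string, or NULL on failure.  What comes out for
   non-ASCII input depends on the iconv implementation and, as noted
   above, on the current locale.  */
char *
translit_to_ascii (const char *utf8)
{
  iconv_t cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    return NULL;

  size_t in_left = strlen (utf8);
  size_t out_size = 8 * in_left + 1;  /* generous room for expansions */
  size_t out_left = out_size - 1;
  char *out = malloc (out_size);
  char *in_ptr = (char *) utf8;       /* iconv does not modify the input */
  char *out_ptr = out;

  if (out == NULL
      || iconv (cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t) -1)
    {
      free (out);
      out = NULL;
    }
  else
    *out_ptr = '\0';
  iconv_close (cd);
  return out;
}
```

Calling collate_cmp with "en_US.utf-8" on index entries that contain spaces
or punctuation, and translit_to_ascii after different setlocale calls, should
be enough to compare against the Perl output. No particular result is claimed
here for non-ASCII input.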