emacs-devel

Re: CSV parsing and other issues (Re: LC_NUMERIC)


From: Maxim Nikulin
Subject: Re: CSV parsing and other issues (Re: LC_NUMERIC)
Date: Thu, 10 Jun 2021 23:28:59 +0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

On 09/06/2021 01:52, Eli Zaretskii wrote:
From: Maxim Nikulin Date: Tue, 8 Jun 2021 23:35:51 +0700

I have reordered some parts of the discussion.

I have just realized that nl_langinfo(3) (and
nl_langinfo_l(3) as well) from libc accepts RADIXCHAR
(decimal dot) and THOUSEP (group separator)
arguments. They are good candidates for a `locale-info'
extension.

We already use nl_langinfo in locale-info, so what exactly
are you suggesting here? adding more items?  You don't
really expect Lisp programs to format numbers such as
123,456 by hand after learning from locale-info that the
thousands separator is a comma, do you?

I have hijacked Boruch's thread and changed the subject to "CSV parsing". There are plenty of CSV dialects. If the decimal separator is ",", office software uses ";" instead of a comma as the cell (field) separator. So to parse a CSV file it is necessary to know the decimal separator of the specified locale. RADIXCHAR as an argument of nl_langinfo(3) is a first step toward a better user experience with CSV files.

Unfortunately it only allows one to get a reasonable visual representation. Taking advantage of Org spreadsheet calculations requires parsing cell contents, and thus parsing numbers (and maybe dates).
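
To make the two steps concrete, here is a minimal JavaScript sketch, not a proposal for Emacs. It assumes that Intl.NumberFormat can stand in for RADIXCHAR and THOUSEP, and that the field separator follows the rule of thumb above (";" when the decimal separator is ","). The helper names getSeparators, parseLocalizedNumber and parseCsvLine are hypothetical, and quoted fields are deliberately not handled.

```javascript
// Hypothetical helper: ask Intl which separators a locale uses
// (roughly what RADIXCHAR and THOUSEP report in libc).
function getSeparators(locale) {
  const parts = new Intl.NumberFormat(locale, { useGrouping: true })
    .formatToParts(1234567.5);
  const find = (type) => (parts.find((p) => p.type === type) || {}).value;
  return { decimal: find("decimal"), group: find("group") };
}

// Parse one cell as a number written with the given separators.
// Returns NaN when the cell does not look like a plain number.
// Note: the group separator must match exactly; some locales use a
// no-break space rather than an ordinary space.
function parseLocalizedNumber(text, { decimal, group }) {
  let s = text.trim();
  if (group) s = s.split(group).join("");
  s = s.replace(decimal, ".");
  return /^[-+]?\d+(\.\d+)?$/.test(s) ? Number(s) : NaN;
}

// Naive line splitter (no quoting support). Assumption: ";" separates
// fields when the decimal separator is ",", otherwise ",".
function parseCsvLine(line, seps) {
  const fieldSep = seps.decimal === "," ? ";" : ",";
  return line.split(fieldSep).map((cell) => {
    const n = parseLocalizedNumber(cell, seps);
    return Number.isNaN(n) ? cell : n;
  });
}
```

With separators passed explicitly, parseCsvLine("1.234,5;abc;-7", { decimal: ",", group: "." }) yields the number 1234.5, the string "abc" and the number -7. Numeric cells come back as numbers, which is what spreadsheet calculations need.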

I mentioned earlier https://debbugs.gnu.org/47885 and a part of the discussion that is missing from the bug tracker:
https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00693.html

I have seen nl_langinfo without RADIXCHAR in the Emacs sources:
http://git.savannah.gnu.org/cgit/emacs.git/tree/src/w32proc.c#n3258
http://git.savannah.gnu.org/cgit/emacs.git/tree/lib-src/ntlib.c#n520

Originally, during the discussion on emacs-orgmode, I did not plan to
raise the question of number formatting and parsing, since I had no
hope for any positive outcome without a consistent proposal.  By chance
I noticed Boruch's message and decided to add another use case.

On 08/06/2021 09:35, Eli Zaretskii wrote:
 > From: Boruch Baum
 >> No? If an Emacs user has two buffers in two separate languages, the
 >> buffer-local settings aren't / won't be respected?
 >
 > First, language is different from locale.  And second, we don't even
 > have a buffer-local notion of language yet.

Certainly locale is more precise than just language since it includes
region and other variants, moreover it can be granularly tuned (date,
numbers, sorting can be adjusted independently), but I still think that
all these properties can be sometimes broadly referred to as language.

No, they cannot, not in general.  A locale comes with a whole database
of different settings: language, encoding (a.k.a. "codeset"), formats
of date and time, names of days of the week and of the months, rules
for collation and capitalization, etc. etc.  You can easily find
several locales whose language is English, but some/many/all of the
other locale-dependent settings are different.  It isn't a coincidence
that a locale's name includes more than just the language part.

I wrote almost the same about locale variants and components, so I am somewhat confused and cannot find the source of the disagreement. I was trying to support Boruch's point that buffer-local variables may be an important part of the locale context: more precise than global settings, and a fallback if no locale is specified for a particular span of text. With respect to such a hierarchy, the difference between language and locale does not matter.

Low level functions can accept explicit locale.

Which ones?  Most libc routines don't, they use the locale
as a global identifier.  And many libc's (with the prominent
exception of glibc) don't support efficient change of a
locale in the middle of a program, they assume that the
program's locale is set once at program startup.

Hypothetical functions in a new elisp API, maybe relying on some external
libraries. I believed you agreed that the global LC_NUMERIC must be "C" to avoid various sorts of problems with data exchange. I am not aware of libc functions for number formatting or parsing that can take an explicit locale (I have seen such a feature in the C++ standard library, Qt, and other languages). The totalitarian approach of libc, with a single locale and a single time zone, imposes limitations too severe to consider some libc functions useful and reliable in a more or less complex application. Its API is suitable for simple tools that can quickly do their work and do not assume any conversion. A more flexible base layer is required when a mix of environments is expected. Full support of locale features requires a lot of work, which is why I am asking whether some external library can be used instead.

A higher-level API can obtain it implicitly from
buffer-local variables and the global locale. For example, the
LOCALE argument of `string-collate-lessp' is
optional. I can even anticipate that the locale may sometimes
be stored in text properties. A random message from the recent
"About multilingual documents" thread on the emacs-orgmode
mailing list:
https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00252.html

That's mostly about input methods and org-export, I don't
see how it's relevant to what Boruch asked.

I added this link to show you that the demand for multilingual documents is real. Notice that problems with spell checking were mentioned in that discussion. Earlier I saw suggestions to switch the ispell language together with the input method. In my opinion that is ridiculous. Personally, I would rather have a combined dictionary than explicitly marked text regions.

I expect that new features will be more widely used once it becomes possible to use them.

At first, basic functionality may be implemented. The
problem is to choose an extensible API.

No, the problem is to have a design that would allow an
efficient implementation.  Given what the underlying libc
does, it isn't easy.

That is why I am looking for an alternative to libc. Previously you wrote
"locale switching". I would rather say constructing and destroying
locales on demand. Switching may not behave so well when threads are involved.
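
As an illustration of "constructing and destroying on demand" (using JavaScript's Intl only as a stand-in; this says nothing about how libc or Emacs would do it), each locale gets its own formatter object and no global state is switched. The names formatterFor, formatIn and dropFormatter are hypothetical.

```javascript
// Hypothetical cache: one formatter object per locale tag.
// Nothing global is switched; callers always name the locale.
const formatters = new Map();

// Construct the formatter for a locale the first time it is needed.
function formatterFor(locale) {
  if (!formatters.has(locale)) {
    formatters.set(locale, new Intl.NumberFormat(locale));
  }
  return formatters.get(locale);
}

// Format a value with an explicit locale.
function formatIn(locale, value) {
  return formatterFor(locale).format(value);
}

// "Destroy" a locale object when it is no longer needed.
function dropFormatter(locale) {
  return formatters.delete(locale);
}
```

Since every call passes the locale explicitly, two callers working with different locales do not interfere with each other's results.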

And then we have conceptual problems.  For example, in a
multilingual editor such as Emacs, the notion of a "buffer
language" does not always make sense, you'd need to support
portions of text that have different language properties.
Imagine switching locales as Emacs processes adjacent
stretches of text and other complications.  For example,
changing letter-case for a stretch of Turkish text is
supposed to be different from the English or German text.
I'm all ears for ideas how to design such "language
support".  It definitely isn't easy, so if you have ideas,
please voice them!

I do not have a consistent vision, but I do not see a conceptual problem either. Buffer-local settings are just more specific than global ones. That is why I mentioned text properties as even more precise in my previous message. Maybe even the current mode can help to build a proper hierarchy of locale contexts. HTML has the "lang" attribute, there is "\foreignlanguage" in LaTeX, etc.
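
A minimal sketch of such a hierarchy, in JavaScript for brevity. The layer names, the position of the mode between buffer-local and global settings, and the final "C" fallback are all assumptions made for illustration; resolveLocale is a hypothetical name.

```javascript
// Hypothetical lookup order, most specific layer first.
// Placing "mode" between buffer-local and global is an assumption.
const LAYER_ORDER = ["textProperty", "bufferLocal", "mode", "global"];

// Return the locale from the most specific layer that defines one,
// together with the name of the layer it came from.
function resolveLocale(layers) {
  for (const name of LAYER_ORDER) {
    if (layers[name] != null) {
      return { locale: layers[name], source: name };
    }
  }
  // Assumed last resort when no layer specifies a locale.
  return { locale: "C", source: "fallback" };
}
```

For example, resolveLocale({ bufferLocal: "de_DE", global: "en_US" }) picks "de_DE", and adding textProperty: "tr_TR" for a span of text overrides both.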

I have heard that a special case exists in Turkish, but I was not curious
enough to find the details and the rules for when and how it should be applied.

If you are suggesting that we introduce ICU as a dependency,
we could discuss the pros and cons.

I consider it the most complete available implementation. Do you know a comparable alternative?

I have realized that since Emacs has support for dynamic modules, it is
possible to create a prototype with bindings to an external library
without rebuilding Emacs.

I don't think the problem is the API.

I think introducing features gradually will be more of a headache for developers of external packages than having no support at all. The API determines the scope of such features.

E.g. I was completely unaware that a negative sign may be
represented by parentheses

Really? it's standard in financial applications.

Is it really so standard? Maybe I have seen such a format and even guessed from context that, e.g., a table column with such numbers holds negative values, or that it is a discount entry. At least I did not recognize this format as a general rule.

new Intl.NumberFormat('de-DE', {style: 'currency', currency: 'USD', currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"-3.500,00 $"
new Intl.NumberFormat('es-ES', {style: 'currency', currency: 'USD', currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"-3500,00 US$"
new Intl.NumberFormat('fr-FR', {style: 'currency', currency: 'USD', currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"(3 500,00 $US)"
new Intl.NumberFormat('ru-RU', {style: 'currency', currency: 'USD', currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"-3 500,00 $"
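
Going the other way, a parser has to accept both conventions. The following is a rough sketch only: parseAccounting is a hypothetical helper, the decimal separator is passed in explicitly, and everything that is not a digit, the decimal separator or a minus sign (currency symbols, any kind of space, group separators) is simply discarded.

```javascript
// Hypothetical parser for amounts such as "(3 500,00 $US)" or
// "-3.500,00 $". Surrounding parentheses are treated as a negative
// sign. Assumes the decimal separator is known.
function parseAccounting(text, decimal) {
  let s = text.trim();
  const negative = s.startsWith("(") && s.endsWith(")");
  if (negative) s = s.slice(1, -1);
  // Keep ASCII digits, the decimal separator and "-"; drop the rest.
  const kept = [...s]
    .filter((ch) => /\d/.test(ch) || ch === decimal || ch === "-")
    .join("");
  const value = Number(kept.replace(decimal, "."));
  return negative ? -value : value;
}
```

Both parseAccounting("(3 500,00 $US)", ",") and parseAccounting("-3.500,00 $", ",") give -3500 here. A real implementation would have to validate the input instead of silently dropping characters.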

I expect enough surprises and unexpected "discoveries"
during implementation of better locale support. That is
why I would consider adapting some more or less
established API for this purpose.

I don't think "consider" cuts it.  We have already a lot of
stuff in Emacs; what we don't have needs serious design and
comparison of available implementation options.  Emacs's
needs are quite special and unlike those of most other
programs.

I still think that the expectations of users around the globe are more
special than Emacs' needs, at least with respect to the format of numbers.



