emacs-devel

Re: CSV parsing and other issues (Re: LC_NUMERIC)

From:	Maxim Nikulin
Subject:	Re: CSV parsing and other issues (Re: LC_NUMERIC)
Date:	Thu, 10 Jun 2021 23:28:59 +0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

On 09/06/2021 01:52, Eli Zaretskii wrote:

From: Maxim Nikulin Date: Tue, 8 Jun 2021 23:35:51 +0700


I have reordered some parts of the discussion.

I just have realized that nl_langinfo(3) (and
nl_langinfo_l(3) as well) from libc accepts RADIXCHAR
(decimal dot) and THOUSEP (group separator)
arguments. They are good candidates for `locale-info'
extension.


We already use nl_langinfo in locale-info, so what exactly
are you suggesting here? adding more items?  You don't
really expect Lisp programs to format numbers such as
123,456 by hand after learning from locale-info that the
thousands separator is a comma, do you?

I have hijacked Boruch's thread and changed the subject to "CSV parsing". There are plenty of CSV dialects. If the decimal separator is ",", then office software uses ";" instead of a comma as the cell (field) separator. So to parse a CSV file it is necessary to know the decimal separator in the specified locale. RADIXCHAR as an argument of nl_langinfo(3) is a first step to a better user experience with CSV files.
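To illustrate the dependency between the decimal separator and the field separator, here is a minimal JavaScript sketch. It uses Intl.NumberFormat as a stand-in for a RADIXCHAR lookup; the helper names (decimalSeparator, csvFieldSeparator, splitRecord) are hypothetical, and the ";" rule is only the office-software convention described above, not a standard.

```javascript
// Ask the Intl API which decimal separator a locale uses
// (a stand-in here for nl_langinfo(RADIXCHAR)).
function decimalSeparator(locale) {
  const parts = new Intl.NumberFormat(locale).formatToParts(1.5);
  const decimal = parts.find((part) => part.type === 'decimal');
  return decimal ? decimal.value : '.';
}

// Hypothetical helper: choose the CSV field separator the way office
// software tends to, i.e. ";" when "," is already used for decimals.
function csvFieldSeparator(decimalSep) {
  return decimalSep === ',' ? ';' : ',';
}

// Split one unquoted CSV record; a real parser also has to handle quoting.
function splitRecord(line, locale) {
  return line.split(csvFieldSeparator(decimalSeparator(locale)));
}
```

The cells returned by splitRecord are still strings; turning them into numbers is a separate step that needs the same locale information.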

Unfortunately, it only allows getting a reasonable visual representation. Taking advantage of Org spreadsheet calculations requires parsing cell contents, and thus parsing numbers (and maybe dates).
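A rough sketch of what parsing a cell could look like once the separators are known. The function parseLocalizedNumber and its option names are hypothetical, and the separators are passed in explicitly, as they would be after a RADIXCHAR and THOUSEP lookup. It does not verify digit grouping, and it ignores dates, currency symbols and non-ASCII digits.

```javascript
// Escape a string for literal use inside a RegExp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical parser: "decimal" and "group" are supplied by the caller.
function parseLocalizedNumber(text, { decimal = '.', group = ',' } = {}) {
  let normalized = text.trim();
  if (group !== '') {
    // Drop every group separator.
    normalized = normalized.replace(new RegExp(escapeRegExp(group), 'g'), '');
  }
  // Map the locale's decimal separator to the "." that Number() expects.
  normalized = normalized.replace(decimal, '.');
  // Reject anything that is not a plain decimal number after normalization.
  const plain = /^[+-]?(\d+(\.\d*)?|\.\d+)$/;
  return plain.test(normalized) ? Number(normalized) : NaN;
}
```

For example, parseLocalizedNumber('1.234,56', { decimal: ',', group: '.' }) and parseLocalizedNumber('1,234.56') both yield 1234.56.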

I mentioned earlier https://debbugs.gnu.org/47885 and a part of the discussion that is missing from the bug tracker:

https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00693.html

I have seen nl_langinfo without RADIXCHAR in the Emacs sources:
http://git.savannah.gnu.org/cgit/emacs.git/tree/src/w32proc.c#n3258
http://git.savannah.gnu.org/cgit/emacs.git/tree/lib-src/ntlib.c#n520

Originally, during the discussion in emacs-orgmode, I did not plan to raise
the question of number formatting and parsing, since I had no hope for
any positive outcome without a consistent proposal.  By chance I noticed
Boruch's message and decided to add another use case.

On 08/06/2021 09:35, Eli Zaretskii wrote:
 > From: Boruch Baum
 >> No? If an Emacs user has two buffers in two separate languages, the
 >> buffer-local settings aren't / won't be respected?
 >
 > First, language is different from locale.  And second, we don't even
 > have a buffer-local notion of language yet.

Certainly locale is more precise than just language since it includes
region and other variants, moreover it can be granularly tuned (date,
numbers, sorting can be adjusted independently), but I still think that
all these properties can be sometimes broadly referred to as language.


No, they cannot, not in general.  A locale comes with a whole database
of different settings: language, encoding (a.k.a. "codeset"), formats
of date and time, names of days of the week and of the months, rules
for collation and capitalization, etc. etc.  You can easily find
several locales whose language is English, but some/many/all of the
other locale-dependent settings are different.  It isn't a coincidence
that a locale's name includes more than just the language part.

I wrote almost the same concerning locale variants and components, so I feel some sort of confusion and cannot find its origin. I was trying to support Boruch's point that buffer-local variables may be an important part of the locale context, more precise than global settings, and a fallback if the locale is not specified for a particular span of text. With respect to such a hierarchy, the language vs. locale difference does not matter.

Low level functions can accept explicit locale.


Which ones?  Most libc routines don't, they use the locale
as a global identifier.  And many libc's (with the prominent
exception of glibc) don't support efficient change of a
locale in the middle of a program, they assume that the
program's locale is set once at program startup.


Hypothetical functions in a new elisp API, maybe relying on some external libraries. I believed you agreed that the global LC_NUMERIC must be "C" to avoid various sorts of problems with data exchange. I am not aware of libc functions for number formatting or parsing that can take an explicit locale (I have seen such a feature in the C++ standard library, Qt, and other languages). The totalitarian approach of libc, with its single locale and single timezone, imposes limitations too hard to consider some libc functions useful and reliable in a more or less complex application. Its API is suitable for simple tools that can quickly do their work and do not assume any conversion. A more flexible base layer is required when a mix of environments is expected. Full support of locale features requires a lot of work, which is why I am asking if some external library can be used instead.
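As an example of an API where the locale is an explicit argument rather than global state, here is a small JavaScript sketch using Intl.NumberFormat. The helper formatAll is hypothetical; the exact strings produced for non-English locales depend on the locale data shipped with the runtime.

```javascript
// Each formatter carries its own locale; no global setting is switched.
const english = new Intl.NumberFormat('en-US');
const german = new Intl.NumberFormat('de-DE');

// Hypothetical helper: format one value under several locales at once.
function formatAll(value, formatters) {
  return formatters.map((formatter) => formatter.format(value));
}

const rendered = formatAll(1234.5, [english, german]);
console.log(rendered);
```

Both formatters stay usable side by side, which is the property a single process-wide locale cannot offer.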

Higher level API can obtain it implicitly from
buffer-local variables and global locale. For example the
LOCALE argument of `string-collate-lessp' is optional
one. I can even anticipate that locale may be stored in
text properties some times. A random message from recent
"About multilingual documents" thread at emacs-orgmode
mail list:
https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00252.html


That's mostly about input methods and org-export, I don't
see how it's relevant to what Boruch asked.

I added this link to show you that the demand for multilanguage documents is real. Notice that problems with spell checking were mentioned in that discussion. Earlier I saw suggestions to switch the ispell language together with the input method. In my opinion it is ridiculous. Personally I need a combined dictionary rather than explicitly marked text regions.

I expect that new features will be more widely used once the possibility to use them appears.

At first basic functionality may be implemented. The
problem is to choose extensible API.


No, the problem is to have a design that would allow an
efficient implementation.  Given what the underlying libc
does, it isn't easy.


That is why I am looking for an alternative to libc. Previously you wrote "locale switching". I would rather say constructing and destroying locales on demand. Switching may not behave so well when threads are involved.

And then we have conceptual problems.  For example, in a
multilingual editor such as Emacs, the notion of a "buffer
language" not always makes sense, you'd need to support
portions of text that have different language properties.
Imagine switching locales as Emacs processes adjacent
stretches of text and other complications.  For example,
changing letter-case for a stretch of Turkish text is
supposed to be different from the English or German text.
I'm all ears for ideas how to design such "language
support".  It definitely isn't easy, so if you have ideas,
please voice them!

I neither have a consistent vision nor see a conceptual problem. Buffer-local settings are just more specific than global ones. That is why I mentioned text properties as even more precise in my previous message. Maybe even the current mode can help to build a proper hierarchy of locale contexts. HTML has the "lang" attribute, there is "\foreignlanguage" in LaTeX, etc.
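Such a hierarchy could be sketched as a simple fallback chain. This is only an illustration in JavaScript: the function resolveLocale, its field names, and the final "C" fallback are all assumptions, not an existing Emacs API.

```javascript
// Hypothetical resolver: the most specific context that names a locale wins.
// Order: text property, buffer-local setting, mode default, global locale.
function resolveLocale({ textProperty, bufferLocal, modeDefault, globalLocale } = {}) {
  const candidates = [textProperty, bufferLocal, modeDefault, globalLocale];
  const found = candidates.find(
    (locale) => typeof locale === 'string' && locale !== ''
  );
  // Assumed last resort: the neutral "C" locale.
  return found !== undefined ? found : 'C';
}
```

A span marked with a text property overrides the buffer setting, and the buffer setting overrides the global one, much as a nested "lang" attribute overrides the one on an enclosing HTML element.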


I have heard that a special case exists in Turkish, but I was not curious
enough to find the details and the rules for when and how it should be applied.
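The case usually cited is the dotted and dotless "i": in Turkish, lowercase "i" pairs with uppercase "İ" (U+0130), and uppercase "I" pairs with lowercase "ı" (U+0131). A small JavaScript sketch of locale-aware case conversion follows; the wrapper names lowerIn and upperIn are hypothetical, while toLocaleLowerCase and toLocaleUpperCase are the standard String methods.

```javascript
// Hypothetical wrappers around the standard locale-aware case methods.
function lowerIn(text, locale) {
  return text.toLocaleLowerCase(locale);
}

function upperIn(text, locale) {
  return text.toLocaleUpperCase(locale);
}

// The same input letter maps differently under English and Turkish rules.
console.log(lowerIn('I', 'en-US'), lowerIn('I', 'tr-TR'));
console.log(upperIn('i', 'en-US'), upperIn('i', 'tr-TR'));
```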

If you are suggesting that we introduce ICU as a dependency,
we could discuss the pros and cons.

I consider it the most complete available implementation. Do you know a comparable alternative?


I have realized that since Emacs has support of dynamic modules, it is
possible to create a prototype with bindings to external library without
rebuilding of Emacs.

I don't think the problem is the API.

I think introducing features gradually will be more of a headache for developers of external packages than the absence of support at all. The API determines the scope of such features.

E.g. I was completely unaware that negative sign may be
represented by parenthesis


Really? it's standard in financial applications.

Is it really so standard? Maybe I have seen such a format, and even guessed from context that e.g. a table column with such numbers should contain negative values, or e.g. in a discount entry. At least I did not recognize such a format as some general rule.

new Intl.NumberFormat('de-DE', {style: 'currency', currency: 'USD', currencySign: 'accounting', signDisplay: 'always'}).format(-3500);

"-3.500,00 $"

new Intl.NumberFormat('es-ES', {style: 'currency', currency: 'USD', currencySign: 'accounting', signDisplay: 'always'}).format(-3500);

"-3500,00 US$"

new Intl.NumberFormat('fr-FR', {style: 'currency', currency: 'USD', currencySign: 'accounting', signDisplay: 'always'}).format(-3500);

"(3 500,00 $US)"

new Intl.NumberFormat('ru-RU', {style: 'currency', currency: 'USD', currencySign: 'accounting', signDisplay: 'always'}).format(-3500);

"-3 500,00 $"
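The same check can be written as a small probe. This is a sketch only: the function usesParenthesesForNegative is hypothetical, and its answer for a given locale depends on the locale data shipped with the JavaScript runtime.

```javascript
// Hypothetical probe: does accounting style wrap negative amounts in
// parentheses for this locale, or does it keep a minus sign?
function usesParenthesesForNegative(locale, currency) {
  const formatter = new Intl.NumberFormat(locale, {
    style: 'currency',
    currency,
    currencySign: 'accounting',
  });
  const sample = formatter.format(-1);
  // Accept both ASCII hyphen-minus and U+2212 MINUS SIGN.
  const hasMinus = /[-\u2212]/.test(sample);
  return !hasMinus && sample.includes('(') && sample.includes(')');
}

for (const locale of ['en-US', 'de-DE', 'es-ES', 'fr-FR', 'ru-RU']) {
  console.log(locale, usesParenthesesForNegative(locale, 'USD'));
}
```

A parser that wants to accept such input has to treat "(...)" as one more spelling of the negative sign, and only for the locales where the probe says so.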

I expect enough surprises and unexpected "discoveries"
during implementation of better locale support. That is
why I would consider adapting some more or less
established API for this purpose.


I don't think "consider" cuts it.  We have already a lot of
stuff in Emacs; what we don't have needs serious design and
comparison of available implementation options.  Emacs's
needs are quite special and unlike those of most other
programs.


I still think that the expectations of users around the globe are more
special than Emacs' needs, at least with respect to the format of numbers.
