Re: [bug-gawk] feature request: iconv/recode dynamic extension
From: Franta Hanzlík
Subject: Re: [bug-gawk] feature request: iconv/recode dynamic extension
Date: Sun, 23 Dec 2018 03:44:28 +0100
On Sat, 22 Dec 2018 13:32:48 +0100
Franta Hanzlík <address@hidden> wrote:
> On Sat, 22 Dec 2018 12:37:35 +0100
> Wolfgang Laun <address@hidden> wrote:
>
> > It is correct that the Unicode Database contains a wealth of information
> > but would you like to process 31MB of XML just to learn that your Á is
> > composed from codepoints 0041 and 0301?
> >
> > But I *guess* that OP's problem may be related to establishing a (more or
> > less) correct sort order. Some sort orders for European languages
> > containing letters with diacritical marks equate such letters to the
> > "stripped" letter, e.g., "dd" < "de" = "dé" = "dè" < "df". This is where
> > stripping the accents works. But even within the same language there may be
> > different sort orders, and there may be one where stripping the diacritical
> > mark would not work. gawk's sort has an extension that can handle that, and
> > it would be just as easy to generate a suitable function from a string with
> > minimal mark-up: "...no=öp..." in contrast to "...noöp...".
>
> In my case, I process data from a web form, where users fill in their
> name, address etc., and compare this data with a 'database' of users (which
> is a simple text file, one user per row, with <TAB>-separated items) to
> decide whether the user is already in the database. Because some users fill
> in the form without diacritics, for better accuracy I also want to compare
> the data without diacritics.
> The user DB has approx. 40 thousand rows (users). One solution could also be
> to convert the DB file (with the recode or iconv utility) to a file without
> diacritics and join both files, or process them in awk separately; this
> should avoid long processing time. The subsequent removal of diacritics from
> the form data can be done with some slow method, as it involves only a small
> number of strings (<10).
>
> I will try your dedia() function and measure its speed. If it is
> sufficient, the problem will be solved.
After implementing Wolfgang's dedia() function and running some tests, I got
these results (tested on a flat-file UserDB with 55000 users, average length
approx. 160 chars/user; dedia() knows 52 accented chars):
iconv   - 0.12 sec
recode  - 0.17 sec
dedia() - 4.0 sec (without dedia() I fill my arrays in <2 sec; with dedia()
          the time increased to about 6 sec).
Although dedia() is much slower than iconv/recode, the result is very
nice and good enough for me. Conversion with awk dedia() is much faster than
I expected.
Thus, IMO, the conclusion is that conversion using an iconv/recode dynamic
extension would be much quicker and could handle more complex conversions,
but for less demanding needs (as here) a solution using awk's own routines
is enough.
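For readers who want to see the general idea of a table-driven approach, here
is a minimal sketch. It is written in Python rather than awk, and it is not
Wolfgang's dedia() (which is not reproduced in this message). The character
table is a small illustrative subset, not the 52 characters mentioned above,
and the names ACCENTED, PLAIN and match_key are hypothetical.

```python
# Hypothetical table-driven accent stripping (illustrative subset only).
ACCENTED = "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"
PLAIN = "acdeeinorstuuyzACDEEINORSTUUYZ"

# Build the character-to-character mapping once, then reuse it for every string.
DEDIA_TABLE = str.maketrans(ACCENTED, PLAIN)


def dedia(text):
    """Replace each accented character listed in the table by its plain letter."""
    return text.translate(DEDIA_TABLE)


def match_key(fields):
    """Build a comparison key from a list of fields, ignoring diacritics."""
    return tuple(dedia(field.strip()) for field in fields)


# One TAB-separated row from the user file, and the same user typed without accents.
db_row = "Jiří\tNovák\tŽďár"
form_fields = ["Jiri", "Novak", "Zdar"]

print(match_key(db_row.split("\t")) == match_key(form_fields))
```

Characters missing from the table pass through unchanged, so the table has to
cover every accented character expected in the data.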
Thanks, Franta
> > On Sat, 22 Dec 2018 at 11:08, Eli Zaretskii <address@hidden> wrote:
> >
> [...]
> > > individual
> [...]
> > >
> > > As I tried to explain, it isn't transliteration that is being sought
> > > here, it's removal of combining accents and diacritics.
> > >
> [...]
> > >
> > > Yes, of course. But coming up with the list of such translations on
> > > one's own is a huge job, and the Unicode database already has all that
> > > figured out. So my suggestion would be to import their tables, rather
> > > than create them from scratch manually.
> > >
> > > Of course, for one-off jobs that need to handle only a small set of
> > > accented characters, what you suggest is sufficient. My
> > > interpretation of the question was that a solution for a more general
> > > problem was sought.
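As an illustration of the point about the Unicode database quoted above, here
is a minimal sketch in Python (not awk) that uses the decomposition data shipped
with Python's standard unicodedata module instead of a hand-made table. The
function name strip_marks is hypothetical. It decomposes the text to NFD and
drops the combining marks, so Á (U+00C1) becomes A (U+0041) once U+0301 is
removed.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD, then drop every combining character."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a non-zero
    # canonical combining class for combining marks such as U+0301.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


# U+00C1 decomposes into U+0041 followed by U+0301.
print([hex(ord(ch)) for ch in unicodedata.normalize("NFD", "\u00c1")])
print(strip_marks("d\u00e9") == strip_marks("d\u00e8") == "de")
```

Letters that have no canonical decomposition (for example ł, U+0142) are left
unchanged by this approach, so it is not a complete transliteration.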
--
Franta Hanzlik