Re: [bug-gawk] feature request: iconv/recode dynamic extension


From: Franta Hanzlík
Subject: Re: [bug-gawk] feature request: iconv/recode dynamic extension
Date: Sat, 22 Dec 2018 13:32:48 +0100

On Sat, 22 Dec 2018 12:37:35 +0100
Wolfgang Laun <address@hidden> wrote:

> It is correct that the Unicode Database contains a wealth of information
> but would you like to process 31MB of XML just to learn that your Á is
> composed from codepoints 0041 and 0301?
> 
> But I *guess* that the OP's problem may be related to establishing a (more
> or less) correct sort order. Some sort orders for European languages
> containing letters with diacritical marks equate such letters to the
> "stripped" letter, e.g., "dd" < "de" = "dé" = "dè" < "df". This is where
> stripping the accents works. But even within the same language there may be
> different sort orders, and there may be one where stripping the diacritical
> mark would not work. gawk's sort has an extension that can handle that, and
> it would be just as easy to generate a suitable function from a string with
> minimal mark-up: "...no=öp..." in contrast to "...noöp...".

In my case, I process data from a web form, where users fill in their
name, address etc., and compare this data with a 'database' of users (a
simple text file, one user per row, with <TAB>-separated items) to decide
whether the user is already in the database. Because some users fill in
the form without diacritics, I also want to compare the data without
diacritics for better accuracy.
The user DB has approx. 40 thousand rows (users). One solution could also
be to convert the DB file (with the recode or iconv utility) to a file
without diacritics and join both files or process them in awk separately;
this should avoid a long processing time. The subsequent de-diacritics
conversion of the form data can be done with some slow method, as it is
only a small number of strings (<10).
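
The comparison described above could be sketched roughly as follows. This
is only an illustration, not the dedia() function generated by genf.awk in
this thread: the helper name match_users, the choice of the first two
TAB-separated fields as the comparison key, and the upper-case half of the
character table are all assumptions. The mapping uses plain gsub() on
literal strings, so it should not depend on the locale awk runs in.

```shell
# Hypothetical helper: match_users FORM_FILE DB_FILE
# Prints every DB row whose first two TAB-separated fields (assumed here to
# be name and address) equal those of a form record once the listed
# diacritics are removed on both sides.
match_users() {
    awk -F'\t' '
    BEGIN {
        # Lower-case pairs come from the example list in this thread;
        # the upper-case pairs are an added assumption.
        lower = "ü u ö o ó o ä a ě e š s č c ř r ž z ý y á a í i é e ú u ů u"
        upper = "Ü U Ö O Ó O Ä A Ě E Š S Č C Ř R Ž Z Ý Y Á A Í I É E Ú U Ů U"
        n = split(lower " " upper, p, " ")
        for (i = 1; i < n; i += 2) map[p[i]] = p[i + 1]
    }
    # Replace each listed accented letter by its base letter.
    function dedia(s,    k) {
        for (k in map) gsub(k, map[k], s)
        return s
    }
    # First file (form data): remember the de-accented keys.
    NR == FNR { seen[dedia($1 "\t" $2)] = 1; next }
    # Second file (user DB): print rows whose de-accented key was seen.
    (dedia($1 "\t" $2) in seen)
    ' "$1" "$2"
}
```

It would be called as: match_users form.tsv users.tsv (both file names are
placeholders). Every DB row is de-accented on the fly here; converting the
DB file once beforehand, as suggested above, would avoid repeating that
work on each run.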

I will try your dedia() function and measure its speed. If it is fast
enough, the problem will be solved.

Thanks!

> On Sat, 22 Dec 2018 at 11:08, Eli Zaretskii <address@hidden> wrote:
> 
> > > From: Wolfgang Laun <address@hidden>
> > > Date: Sat, 22 Dec 2018 09:17:07 +0100
> > >
> > > The most general case of transliteration is handled by defining
> > > individual characters.
> >
> > As I tried to explain, it isn't transliteration that is being sought
> > here, it's removal of combining accents and diacritics.
> >  
> > > You can add such a transliteration function ("dedia(str)") to
> > > any awk program ("foo.awk") using a simple generator like genf.awk:
> > >      gawk -- "`gawk -f genf.awk <<<"üöóäěščřžýáíéúů uooaescrzyaieuu foo.awk"`
> >
> > Yes, of course.  But coming up with the list of such translations on
> > one's own is a huge job, and the Unicode database already has all that
> > figured out.  So my suggestion would be to import their tables, rather
> > than create them from scratch manually.
> >
> > Of course, for one-off jobs that need to handle only a small set of
> > accented characters, what you suggest is sufficient.  My
> > interpretation of the question was that a solution for a more general
> > problem was sought.
-- 
Franta Hanzlík


