bug-gawk

Re: [bug-gawk] feature request: iconv/recode dynamic extension


From: Franta Hanzlík
Subject: Re: [bug-gawk] feature request: iconv/recode dynamic extension
Date: Sat, 22 Dec 2018 15:55:17 +0100

(Google repeatedly rejects my mail to Wolfgang, so I am now also sending
it to <address@hidden>.)

Hi Wolfgang,
I am not sure what your gawk DB recommendation is. Do you mean using some
gawk DB extension (pgsql/redis/lmdb)? I admit that my ambitions and
knowledge do not go that far. For now I simply use a set of arrays
NAME[1..N], SURNAME[1..N], STREET[1..N], ... where N is the number of
users. I fill these arrays in the BEGIN section with a while getline loop:

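# Parenthesize so that "> 0" tests getline's return value, not the file name
while ((getline < "userdb") > 0) {
  i++
  NAME[i] = $1
  SURNAME[i] = $2
  ...
}
N = i   # number of users loaded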

and then I search these arrays with a simple for loop:

for (i = 1; i <= N; i++) {
    if (SEARCHEDNAME == NAME[i] && SEARCHEDSURNAME == SURNAME[i] &&
        SEARCHEDSTREET == STREET[i]) {
        isInDBflag++
        break
    }
}

For the number of users I have in the 'DB', this works well and is fast.
I do not know whether a significant speed-up is possible.
Anyway, I will be grateful for any recommendations for improvement.
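
One possible speed-up, offered only as a sketch: instead of scanning
parallel arrays, index the table by a composite key, so that each check is
a hash lookup rather than a loop over N rows. A second index keyed on the
diacritics-free form gives the "at most two lookups" scheme from the quoted
reply below. The function names (add_user, load_db, in_db, strip_dia), the
PLAIN table, and the sample input are all hypothetical; strip_dia is a
stand-in and not the dedia() function mentioned in the thread. The field
layout (name, surname, street in the first three tab-separated columns) is
also an assumption.

```awk
# Hypothetical sketch, not code from this thread.

function init_plain() {
    # Partial table: lowercase Czech letters only; extend as needed.
    PLAIN["á"] = "a"; PLAIN["č"] = "c"; PLAIN["ď"] = "d"; PLAIN["é"] = "e"
    PLAIN["ě"] = "e"; PLAIN["í"] = "i"; PLAIN["ň"] = "n"; PLAIN["ó"] = "o"
    PLAIN["ř"] = "r"; PLAIN["š"] = "s"; PLAIN["ť"] = "t"; PLAIN["ú"] = "u"
    PLAIN["ů"] = "u"; PLAIN["ý"] = "y"; PLAIN["ž"] = "z"
    PLAIN_READY = 1
}

function strip_dia(s,    c) {
    # Replace every accented letter listed in PLAIN with its base letter.
    if (!PLAIN_READY) init_plain()
    for (c in PLAIN) gsub(c, PLAIN[c], s)
    return s
}

function add_user(name, surname, street) {
    # awk joins the subscripts with SUBSEP into one hash key.
    EXACT[name, surname, street] = 1
    STRIPPED[strip_dia(name), strip_dia(surname), strip_dia(street)] = 1
}

function load_db(file,    line, f, n) {
    # Assumed layout: name <TAB> surname <TAB> street ...
    while ((getline line < file) > 0) {
        split(line, f, "\t")
        add_user(f[1], f[2], f[3])
        n++
    }
    close(file)
    return n + 0
}

function in_db(name, surname, street) {
    # First lookup: exact match. Second lookup: match ignoring diacritics.
    if ((name, surname, street) in EXACT) return 1
    return ((strip_dia(name), strip_dia(surname), strip_dia(street)) in STRIPPED)
}

BEGIN {
    n = load_db("userdb")
    print n " users loaded"
    # Hypothetical form input, typed without diacritics.
    print (in_db("Jan", "Novak", "Lucni 502") ? "found" : "not found")
}
```

Loading does one extra conversion per row, but each later check costs at
most two "in" tests regardless of how many users the file holds.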

Franta H

On Sat, 22 Dec 2018 14:01:03 +0100
Wolfgang Laun <address@hidden> wrote:

> You probably know this: You may have to add a DB field "name without
> diacritics" as an additional key, which will slow down the initial loading
> but just marginally; but lookup will be fast: one conversion for the input
> and at most two lookups in the DB to decide found or not found.
> 
> I guess the function will be fast enough. ;-)
> -W
> 
> 
> On Sat, 22 Dec 2018 at 13:33, Franta Hanzlík <address@hidden> wrote:
> 
> > On Sat, 22 Dec 2018 12:37:35 +0100
> > Wolfgang Laun <address@hidden> wrote:
> >  
> > > It is correct that the Unicode Database contains a wealth of information
> > > but would you like to process 31MB of XML just to learn that your Á is
> > > composed from codepoints 0041 and 0301?
> > >
> > > But I *guess* that OP's problem may be related to establishing a (more
> > > or less) correct sort order. Some sort orders for European languages
> > > containing letters with diacritical marks equate such letters to the
> > > "stripped" letter, e.g., "dd" < "de" = "dé" = "dè" < "df". This is where
> > > stripping the accents works. But even within the same language there
> > > may be different sort orders, and there may be one where stripping the
> > > diacritical mark would not work. gawk's sort has an extension that can
> > > handle that, and it would be just as easy to generate a suitable
> > > function from a string with minimal mark-up: "...no=öp..." in contrast
> > > to "...noöp...".
> >
> > In my case, I process data from a web form where users fill in their
> > name, address, etc., and compare this data with a 'database' of users
> > (a simple text file, one user per row, with <TAB>-separated items) to
> > decide whether the user is already in the database. Because some users
> > fill in the form without diacritics, for better accuracy I also want to
> > compare the data without diacritics.
> > The user DB has approx. 40 thousand rows (users). One solution could be
> > to convert the DB file (with the recode or iconv utility) to a file
> > without diacritics and join both files, or process them in awk
> > separately; this should avoid a long processing time. The subsequent
> > removal of diacritics from the form data can be done with some slow
> > method, as it involves only a small number of strings (<10).
> >
> > I will try your dedia() function and measure its speed. If it is
> > fast enough, the problem will be solved.
> >
> > Thanks!
> >  
> > > On Sat, 22 Dec 2018 at 11:08, Eli Zaretskii <address@hidden> wrote:
> > >  
>  [...]  
> > --
> > Franta Hanzlík
> >  


-- 
Regards,
František Hanzlík

Luční 502           Linux/Unix/LAN/Internet       Tel: +420-372-222302
33209 Štěnovice    e-mail:address@hidden      Fax: +420-372-222302
Czech Republic        http://hanzlici.cz/         GSM: +420-604-117319
This mail contains no viruses; it was sent from the Linux operating system


