Re: [bug-gawk] feature request: iconv/recode dynamic extension


From: Wolfgang Laun
Subject: Re: [bug-gawk] feature request: iconv/recode dynamic extension
Date: Sat, 22 Dec 2018 09:17:07 +0100

The most general case of transliteration is handled by defining a
replacement for each individual character. You can add such a
transliteration function ("dedia(str)") to any awk program ("foo.awk")
using a simple generator like genf.awk:
     gawk -- "$(gawk -f genf.awk <<<"üöóäěščřžýáíéúů uooaescrzyaieuu foo.awk")"

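# genf.awk: reads one line "<from-chars> <to-chars> [program-file]" and
# prints an awk program that defines dedia(str), followed by the contents
# of program-file. Run gawk in a UTF-8 locale so that substr() and length()
# count characters rather than bytes.
{
    ind = $1;    # characters to replace
    val = $2;    # their replacements, position by position
    prg = $3;    # optional awk program to append
    print "BEGIN {";
    print "    dduse = \"" ind "\";";
    for( i = 1; i <= length(ind); ++i ){
        print "    dd[\"" substr( ind, i, 1 ) "\"] = \"" substr( val, i, 1 ) "\";";
    }
    print "}";
    # r, i and c are locals so that dedia() cannot clobber the caller's variables
    print "function dedia(s,    r, i, c){";
    print "  r = \"\";";
    print "  for( i = 1; i <= length(s); ++i ){";
    print "    c = substr( s, i, 1 );";
    print "    if( index( dduse, c ) > 0 ){";
    print "      c = dd[c];";
    print "    }";
    print "    r = r c;";
    print "  }";
    print "  return r;";
    print "}";
    if( prg == "" ){
        # no program given: just transliterate every input line
        print "{ print dedia($0); }";
    } else {
        while( (getline line < prg) > 0 )
            print line;
        close(prg);
    }
}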


On Sat, 22 Dec 2018 at 08:40, Eli Zaretskii <address@hidden> wrote:

> > Date: Sat, 22 Dec 2018 02:29:37 +0100
> > From: Franta Hanzlík <address@hidden>
> >
> > I'm not sure whether this is a good idea, but I think it may be useful
> > for others too: I'm doing some word processing in gawk, and part of it
> > is comparing two strings. These strings are plain ASCII
> > strings obtained by removing diacritics from the original Latin-1
> > and Latin-2 strings, so I need a conversion such as
> >  "äáéěóöščýíüúů" -> "aaeeooscyiuuu".
> > For now I solve this by calling an external conversion program, such as
> >
> > iconv -f UTF-8 -t US-ASCII//TRANSLIT <<< "üöóäěščřžýáíéúů"
> >    or
> > recode -f u8..flat <<< "üöóäěščřžýáíéúů"
> >
> > but for thousands of strings this is too slow (and resource-intensive).
>
> libiconv's TRANSLIT will only work for Latin characters, so it's not
> what you want in general.  What you want is the "decomposition" of
> each character into the base character and the diacriticals/combining
> accents; then you want to throw out the non-base parts.  How to do
> that is defined by the Unicode Standard, and needs to use the various
> data files provided by the UCD, the Unicode Character Database.
>
> > There are probably many similar text-conversion cases where a gawk
> > dynamic extension for this need would be very useful.
>
> It could be useful for such jobs, yes.  How frequently these jobs
> happen in typical Gawk usage is another question; I don't have an
> answer for that.
>
>
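
For comparison, the decomposition approach Eli describes above can be
sketched outside gawk. This is only an illustration, not something from the
thread: it assumes Python and its standard unicodedata module (which ships
data derived from the UCD), and the function name strip_diacritics is made
up for this example. It decomposes each character canonically (NFD) and
then drops the combining marks:

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Illustrative helper (hypothetical name): remove combining accents.

    Step 1: NFD normalization splits each precomposed character into its
            base character followed by its combining marks.
    Step 2: every code point with a non-zero canonical combining class
            (i.e. a combining mark) is thrown away.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # The two sample strings used earlier in this thread.
    print(strip_diacritics("üöóäěščřžýáíéúů"))  # uooaescrzyaieuu
    print(strip_diacritics("äáéěóöščýíüúů"))    # aaeeooscyiuuu
```

Letters that have no canonical decomposition, such as "ø" or "ł", pass
through unchanged, so a per-character table like the one genf.awk builds is
still needed for those.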

