Re: [bug-gawk] feature request: iconv/recode dynamic extension
From: Wolfgang Laun
Subject: Re: [bug-gawk] feature request: iconv/recode dynamic extension
Date: Sat, 22 Dec 2018 09:17:07 +0100
The most general case of transliteration is handled by defining the
replacement for each individual character. You can add such a
transliteration function, dedia(str), to any awk program (say, foo.awk)
using a simple generator such as genf.awk:
gawk -- "$(gawk -f genf.awk <<<"üöóäěščřžýáíéúů uooaescrzyaieuu foo.awk")"
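# genf.awk: reads one line "<accented> <plain> [<program file>]" and prints
# an awk program that defines dedia() followed by the given program file.
# Run gawk in a UTF-8 locale so that length() and substr() count characters.
{
    ind = $1;    # characters to replace
    val = $2;    # their replacements, in the same order
    prg = $3;    # optional awk program to append
    print "BEGIN {";
    print "  dduse = \"" ind "\";";
    for( i = 1; i <= length(ind); ++i ){
        print "  dd[\"" substr( ind, i, 1 ) "\"] = \"" substr( val, i, 1 ) "\";";
    }
    print "}";
    # r, i and c are extra parameters so that they stay local to dedia()
    print "function dedia(s,    r, i, c){"
    print "  r = \"\";"
    print "  for( i = 1; i <= length(s); ++i ){";
    print "    c = substr( s, i, 1 );"
    print "    if( index( dduse, c ) > 0 ){";
    print "      c = dd[c];";
    print "    }";
    print "    r = r c;";
    print "  }";
    print "  return r;";
    print "}";
    if( prg == "" ){
        # no program file given: just transliterate every input line
        print "{ print dedia($0); }";
    } else {
        while ((getline line < prg) > 0)
            print line;
        close(prg);
    }
}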
On Sat, 22 Dec 2018 at 08:40, Eli Zaretskii <address@hidden> wrote:
> > Date: Sat, 22 Dec 2018 02:29:37 +0100
> > From: Franta Hanzlík <address@hidden>
> >
> > I'm not sure whether this is a good idea, but I think it may be useful
> > for others as well: I'm doing some text processing in gawk, and part
> > of it is comparing two strings. These strings are plain ASCII
> > strings obtained by removing diacritics from the original Latin-1
> > and Latin-2 strings, so I need a conversion such as
> > "äáéěóöščýíüúů" -> "aaeeooscyiuuu".
> > For now I solve this by calling an external conversion program, such as
> >
> > iconv -f UTF-8 -t US-ASCII//TRANSLIT <<< "üöóäěščřžýáíéúů"
> > or
> > recode -f u8..flat <<< "üöóäěščřžýáíéúů"
> >
> > but for thousands of strings this is too slow (and resource-intensive).
>
> libiconv's TRANSLIT will only work for Latin characters, so it's not
> what you want in general. What you want is the "decomposition" of
> each character into the base character and the diacriticals/combining
> accents; then you want to throw out the non-base parts. How to do
> that is defined by the Unicode Standard, and needs to use the various
> data files provided by the UCD, the Unicode Character Database.
>
> > There are perhaps many similar text conversion cases where a gawk
> > dynamic extension for this need would be very useful.
>
> It could be useful for such jobs, yes. How frequently these jobs
> happen in typical Gawk usage is another question; I don't have an
> answer for that.
>
>
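
As a rough illustration of the decomposition approach Eli describes in the
quoted text above, here is a minimal sketch. It is in Python rather than gawk
because Python's standard unicodedata module exposes the UCD decomposition
data, and as far as I know gawk has no built-in equivalent. The function names
strip_diacritics and genf_line are made up for this example; genf_line only
shows one possible way to produce an input line for a generator like genf.awk.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose each character (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for accents such as U+030C (caron).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def genf_line(chars: str) -> str:
    """Build an "<accented> <plain>" line (hypothetical helper).

    Only characters that reduce to exactly one different character are
    kept, because the generator pairs the two strings by position.
    """
    kept = []
    for ch in unicodedata.normalize("NFC", chars):
        plain = strip_diacritics(ch)
        if len(plain) == 1 and plain != ch:
            kept.append((ch, plain))
    accented = "".join(ch for ch, _ in kept)
    replaced = "".join(plain for _, plain in kept)
    return accented + " " + replaced


if __name__ == "__main__":
    print(strip_diacritics("üöóäěščřžýáíéúů"))  # uooaescrzyaieuu
    print(genf_line("üöóäěščřžýáíéúů"))
```

Note that this only removes marks that Unicode defines as combining
characters. Letters such as "ø" or "ł" have no canonical decomposition and
pass through unchanged, so an explicit per-character table is still needed
for them.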