[aspell-devel] Diacritic restoration + new spell checking packages


From: Kevin Scannell
Subject: [aspell-devel] Diacritic restoration + new spell checking packages
Date: Tue, 7 Apr 2009 21:18:53 -0500

Hello all,

  This is an announcement of a new package called "charlifter" that
does statistical diacritic restoration:

https://sourceforge.net/project/showfiles.php?group_id=256316&package_id=317046

and two new open source word lists, one for Lingala (joint work with
Denis Jacquerye):

https://sourceforge.net/project/showfiles.php?group_id=256316&package_id=317051

and one for Hawaiian:

http://borel.slu.edu/ispell/haw_US.zip



The charlifter script is language-independent - all you need to do is
provide it with some plain text in the language of interest with all
of the diacritical marks in place.   From this the script "learns"
where the diacritics belong, statistically.   You can also improve
performance by feeding it a word list during the training phase.
I've built and packaged pre-trained models for several languages,
including Irish, French, Lingala, Samoan, and Hawaiian - see the
directories "charlifter-*" here:

http://lingala.svn.sourceforge.net/viewvc/lingala/
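
As a rough illustration of what "learning where the diacritics belong" can mean, here is a minimal Python sketch of about the simplest possible approach: a word-level lookup table that remembers, for each stripped form, the marked form seen most often in training. This is not a description of charlifter's actual algorithm, which this message does not spell out. The function names (strip_marks, train, restore) and the tiny training text are made up for the example.

```python
import re
import unicodedata
from collections import Counter, defaultdict

WORD = re.compile(r"\w+")

def strip_marks(word):
    """Remove combining marks (acute accents, macrons, ...) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def train(text):
    """Map each stripped, lowercased form to its most frequent marked form."""
    # Normalize to NFC so accented letters are single characters for \w.
    text = unicodedata.normalize("NFC", text).lower()
    counts = defaultdict(Counter)
    for word in WORD.findall(text):
        counts[strip_marks(word)][word] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}

def restore(text, model):
    """Replace each known word with the marked form stored in the model."""
    def fix(match):
        word = match.group(0)
        best = model.get(strip_marks(word.lower()))
        if best is None:
            return word  # unseen word: leave it unchanged
        # Keep an initial capital if the input had one.
        return best.capitalize() if word[0].isupper() else best
    return WORD.sub(fix, text)

# Hypothetical training text with diacritics in place.
training_text = "an chéad teanga oifigiúil; an chéad lá"
model = train(training_text)
print(restore("an chead teanga oifigiuil", model))
# prints: an chéad teanga oifigiúil
```

A table like this cannot handle words it has never seen, and it always picks the same form for an ambiguous word regardless of context, so treat it only as a baseline for thinking about the problem.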

Once you've trained a language model, or installed one of the models
above, you can feed plain ASCII text to the script and it restores the
diacritics or extended Unicode characters that are missing:

Irish:
$ echo "an chead teanga oifigiuil" | sf.pl -r ga
an chéad teanga oifigiúil

Lingala (note that the open vowel "ɔ" is restored correctly):
$ echo "Ngolo, nina, zambi ikamwisi bango." | sf.pl -r ln
Ngɔlɔ, niná, zambí ikamwísí bangó.

Hawaiian:
$ echo "Olelo aku 'o Papa" | sf.pl -r haw
ʻŌlelo aku ʻo Pāpā

etc....


This work ties in closely with my Crúbadán project, which is gathering
text corpora for 400+ languages with a web crawler:

http://borel.slu.edu/crubadan/

Lingala is a good example.  When written properly, it uses diacritics
to indicate tone, and also uses the open vowels "ɔ" and "ɛ", but 95%
of what is written on the web is in plain ASCII (no tone marks, "o"
and "e" in place of "ɔ" and "ɛ").  Therefore, to use the web corpus
effectively for language modelling purposes, it is important to
restore these ASCII texts to the proper encoding as well as possible.
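
To make the ASCII form concrete: tone marks are combining accents that Unicode decomposition can strip, but "ɔ" and "ɛ" are separate letters with no decomposition, so they need an explicit mapping. The Python sketch below shows the lossy direction (proper text to plain ASCII), which is the mapping that restoration has to invert. The function name asciify and the mapping table are assumptions for illustration, and the table covers only the characters mentioned here.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit table.
# Only the open vowels discussed above are listed (an assumption,
# not a complete inventory for any language).
OPEN_VOWELS = str.maketrans({"ɔ": "o", "ɛ": "e", "Ɔ": "O", "Ɛ": "E"})

def asciify(text):
    """Approximate how properly written text tends to appear on the web."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks such as the acute accents used for tone.
    no_tones = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_tones.translate(OPEN_VOWELS)

print(asciify("Ngɔlɔ, niná, zambí ikamwísí bangó."))
# prints: Ngolo, nina, zambi ikamwisi bango.
```

Running a function like this over correctly written text is also one way to produce input/output pairs for checking how well a restorer recovers the original.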

The spell checkers for Lingala and Hawaiian came directly from this
approach: train charlifter on the small amount (say 5%) of web text
with correct diacritics in place, then restore the other 95% and use
the resulting large corpus to generate frequency lists for
hand-editing, just as we've done with many other Crúbadán languages.
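
The workflow above can be sketched in Python roughly as follows. Everything here is illustrative: the "any non-ASCII character" test for separating the correctly written texts is a crude assumption (curly quotes would fool it), the sample documents are invented, and restore is a placeholder standing in for a trained restorer such as charlifter.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")

def has_diacritics(doc):
    """Crude heuristic: any non-ASCII character counts as proper spelling."""
    return any(ord(c) > 127 for c in doc)

def split_corpus(docs):
    """Separate texts that can be used for training from those needing restoration."""
    marked = [d for d in docs if has_diacritics(d)]
    plain = [d for d in docs if not has_diacritics(d)]
    return marked, plain

def frequency_list(docs):
    """Count word frequencies, most frequent first, for later hand-editing."""
    counts = Counter()
    for doc in docs:
        counts.update(WORD.findall(doc.lower()))
    return counts.most_common()

def restore(doc):
    """Placeholder for the trained restorer; returns the text unchanged."""
    return doc

# Invented sample documents standing in for crawled web text.
docs = ["an chéad teanga", "an chead teanga oifigiuil", "teanga oifigiuil"]
marked, plain = split_corpus(docs)
restored = marked + [restore(d) for d in plain]
for word, count in frequency_list(restored):
    print(count, word)
```

With a real restorer in place of the placeholder, the counts for the ASCII and marked spellings of a word would be merged, which is what makes the resulting frequency list useful for building a word list.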

Please contact me if you're interested in trying to develop a new word
list using this approach.  I'm particularly interested in African
languages.

Kevin



