[aspell-devel] Experiments with spell-checking...
From: Mike C. Fletcher
Subject: [aspell-devel] Experiments with spell-checking...
Date: Mon, 28 Oct 2002 16:19:16 -0500
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.1) Gecko/20020826
I've been working on and off for the last week on a contextualisable
spell-check engine for Python. So far I have a basically functional
system, but I'm running into a number of issues. I'm wondering if
others have suggestions about how to fix these problems (or better
approaches). I'm going to summarise the approach I've taken so the
problems have some useful context:
Overview of the system
I'm providing two major services with the spell-checker:
The first is the standard "is this word in the dictionary" "check",
which is fast, and very basic regardless of the underlying technology.
For any given word you do one lookup in the database, which returns the
(normally first, but optionally all) word-set object in which the word
was found, plus any variants on the normalised form of the word. For
example, if the word checked was "Form" and only "form" was in the
database, you'd get back a list containing only "form", and your app
could decide whether that constituted an error or not (e.g. by looking
at whether you are at the start of a sentence, in a title, etc.).
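In rough terms, the check boils down to something like this (a sketch
only; the function and variable names are illustrative, not the actual
pyspelling API, and the word-sets are just treated as mapping-like
objects):

    def check(word, word_sets):
        """Return (name, variants) for the first word-set containing the
        normalised word, or None if no word-set contains it."""
        key = word.lower()                  # stand-in for real normalisation
        for name, db in word_sets.items():
            value = db.get(key)             # a single lookup per word-set
            if value is not None:
                return name, value.split('\000')   # stored variants
        return None

The caller then decides whether a case-only variant (e.g. "form" offered
for "Form") actually counts as an error.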
The second service is the "what words are similar to this?"
"suggestion", which is comparatively very slow (from 0.03s at
edit-distance=1 to 2.5s at edit-distance=2, depending on the chosen
"fuzziness"). The algorithm is currently based on a phonetic
compression mechanism (with the rules read from Aspell's files), with
edit-distance calculations determining which records are considered
'hits'. At the moment this uses an approach based on the documentation
for the aspell engine: a heavily-trimmed query for single-edit-distance
fuzziness (~2,000 queries on a 500,000 word dictionary) and a
(slightly-trimmed) straight scan through the database for more lax
searches (which scans an average of ~200,000 records in a 500,000 word
dictionary).
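Conceptually the scan looks something like the following (a rough
sketch, assuming a phonetic() compression function and an
edit_distance() function; neither name comes from the actual code):

    def suggest(word, phonetic_db, phonetic, edit_distance, max_distance=2):
        """Return all words whose phonetic key is within max_distance
        edits of the phonetic form of `word` (the expensive full scan)."""
        target = phonetic(word)
        hits = []
        for key in phonetic_db.keys():
            if edit_distance(target, key) <= max_distance:
                hits.extend(phonetic_db[key].split('\000'))
        return hits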
Eventually I intend to provide a few other services:
Rank words according to their likely relevance to a context (this
assumes the words are in the dictionary). This will be built on a number
of mechanisms, including whole-word-list rankings by dictionary (e.g.
common English words having a higher ranking than Shakespearean English
words in most dictionaries, but the reverse in a Shakespearean
dictionary), statistical recording per-user (i.e. you always use the
word "ripping", so it's more likely to be what you mean), and
potentially algorithmic mechanisms based on the context of the word (for
example, noun-verb agreement). Beyond the obvious use-case of sorting
suggestions, it would be useful for technical writers to take a
Standard-English dictionary split by commonality of word (as is
available) and colourise each word in a corpus according to its
commonality. (A rough sketch of combining such rankings follows this
list of planned services.)
Word-completion (lookup of all completing words in the current
dictionary)
Correction and usage tracking (would feed some of the ranking
mechanisms above). This would likely be per-user (and optional). It
would allow you to store per-app/per-doc/whatever usage databases to
provide both more efficient and more personal/contextualised services to
the user. Correction tracking is a sub-set of usage tracking, used
solely by the spelling system. Usage tracking in the more general
approach allows you to make educated guesses, based on the current input
and past history, about what the user probably intended to say.
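The ranking service mentioned above might combine its inputs along
these lines (a hypothetical illustration; the weighting and the names
are invented for the example, not part of the current code):

    def rank(words, dictionary_rank, user_counts, usage_weight=2.0):
        """Order candidate words, most likely first.  dictionary_rank maps
        word -> commonality score for the loaded dictionaries; user_counts
        maps word -> how often this user has actually used the word."""
        def score(word):
            return (dictionary_rank.get(word, 0)
                    + usage_weight * user_counts.get(word, 0))
        decorated = [(score(word), word) for word in words]
        decorated.sort()
        decorated.reverse()
        return [word for (value, word) in decorated]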
The primary focus of the system's current design is on modularity and
flexibility, so the code allows for loading both in-memory and on-disk
word lists, and for mixing and matching word-lists within multiple
loaded dictionaries (note: aspell also offers mix-and-match mechanisms
for word-lists). The idea is that you load word-lists per-document,
per-project, and per-application and add them to the current user's
default dictionaries/correction-lists/usage-statistics to give you a
contextually-relevant dictionary.
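In code terms the mix-and-match idea is roughly this (a sketch; the
class and method names are illustrative rather than the project's
actual interfaces):

    class Dictionary:
        """An ordered collection of word-lists checked in turn: system
        lists first, then per-application, per-project, per-document and
        per-user lists added for the current context."""
        def __init__(self, word_lists):
            self.word_lists = list(word_lists)
        def add_word_list(self, word_list):
            self.word_lists.append(word_list)
        def check(self, word):
            key = word.lower()
            for word_list in self.word_lists:
                if word_list.has_key(key):
                    return word_list
            return None

An editor could then call add_word_list() with a document-specific list
when a file is opened and get contextual checking from the same object.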
Current Status and Problems
I'm currently using Python's bsddb wrappers for storing the
word-lists. I'm storing the phonetic database as "phonetic-compressed":
word [+ '\000'+word] and the non-phonetic database as "normalised": word
[+ '\000'+word] (i.e. the key is the compressed/normalised form and the
value is one or more words separated by nulls). That means that the
database requires (loosely) 4 times the storage space of the words
themselves, plus the overhead of the database. In practice that means
the 500,000 word phonetic database (7.1MB of plain-text words) takes
around 29MB when compiled. 500,000 is a fairly large dictionary (my
"official scrabble dictionary" (dead-trees) is 500,000 words), but I'm
thinking a real-world installation will want ~1,000,000 words to be
available in system word-sets (for each language), so we'd be talking
around 120MB*(number of languages) just for the system dictionaries.
So, some thoughts on how to compress the word-sets without dramatically
overloading the system (processing overhead) would be appreciated.
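For concreteness, the layout amounts to something like this with the
bsddb module (a sketch of the format described above, not the project's
actual build code; the key 'FRM' just stands in for a real phonetic
form):

    import bsddb

    def add_word(db, key, word):
        """Append word to the '\000'-separated list stored under key."""
        if db.has_key(key):
            if word not in db[key].split('\000'):
                db[key] = db[key] + '\000' + word
        else:
            db[key] = word

    db = bsddb.btopen('phonetic.db', 'c')
    add_word(db, 'FRM', 'form')
    add_word(db, 'FRM', 'forum')
    # db['FRM'] is now 'form\000forum'
    db.close()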
bsddb will, if a program using it crashes with a db open for
writing, corrupt the database and be unable to open it again. Is there
a better embeddable database system for this kind of work? I'm considering
testing meta-kit (which I used a few years ago), but the last time I
worked with it the same problem plagued it (it would corrupt files if
not properly closed).
The scanning speed is just not usable under the current bsddb system
(even if I use a non-thread-safe naive approach), so the aspell-style
searches for larger edit distances (i.e. more mistakes allowed) are
prohibitively expensive (i.e. a single suggestion taking 2.5 seconds).
The current code goes through around 200,000 iterations for a
distance=2 search (even after optimising out the 2-deletion cases).
That would likely be unnoticeably fast in C, but it's hideously slow in
Python.
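The per-record work in that scan is essentially a Levenshtein
edit-distance calculation like the one below (a generic textbook
implementation, not the project's code), which is cheap in C but
painful when Python runs it a couple of hundred thousand times per
suggestion:

    def edit_distance(a, b):
        """Dynamic-programming Levenshtein distance between two strings."""
        previous = list(range(len(b) + 1))
        for i in range(1, len(a) + 1):
            current = [i] + [0] * len(b)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    cost = 0
                else:
                    cost = 1
                current[j] = min(previous[j] + 1,        # deletion
                                 current[j - 1] + 1,     # insertion
                                 previous[j - 1] + cost) # substitution
            previous = current
        return previous[len(b)]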
I am considering creating a phonetic-feature-set database instead of
continuing work on the scanning phonetic database. This would map each
2-character sub-string in the phonetic compression to the set of words
having that feature, giving n+1 entries for each word (start and end are
considered characters in the feature-sets). To do that without creating
huge databases I would need a mechanism for storing pointers to words,
rather than the words themselves, within the database. I'm sure there
must be some useful way to do that, but everything I've seen relies on
storing integers and then doing a query to resolve each integer back to
a string. That seems like it would still be expensive (especially using
bsddb). Are there any better mechanisms people know of? As a note, this
is a toolkit for building spell-checkers, so I'll likely provide both
mechanisms, but it's unlikely anyone would use both at the same time (it
would mean ~180MB/language for system dictionaries).
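The feature extraction itself is simple enough; something like this
would generate the n+1 keys per word (the '^' and '$' boundary markers
are arbitrary choices for the example):

    def features(phonetic_form, start='^', end='$'):
        """Return the 2-character features of a phonetic form, with
        explicit start and end markers."""
        padded = start + phonetic_form + end
        return [padded[i:i + 2] for i in range(len(padded) - 1)]

    # features('FRM') -> ['^F', 'FR', 'RM', 'M$'], i.e. len('FRM') + 1 entries

The hard part, as noted, is storing the word references compactly under
each feature key.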
For those interested in playing with the current code, the project is on
sourceforge at:
http://sourceforge.net/projects/pyspelling/
Enjoy yourselves,
Mike
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/