[aspell-devel] Experiments with spell-checking...
From: Mike C. Fletcher
Subject: [aspell-devel] Experiments with spell-checking...
Date: Mon, 28 Oct 2002 16:19:16 -0500
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.1) Gecko/20020826
I've been working on and off for the last week on a contextualisable
spell-check engine for Python. So far I have a basically functional
system, but I'm running into a number of issues. I'm wondering if
others have suggestions about how to fix these problems (or better
approaches). I'm going to summarise the approach I've taken so the
problems have some useful context:
Overview of the system
I'm providing two major services with the spell-checker:
The first is the standard "is this word in the dictionary" "check",
which is fast, and very basic regardless of the underlying technology.
For any given word you do one lookup in the database, which returns the
(normally first, but optionally all) word-set object in which the word
was found, plus any variants on the normalised form of the word. For
example, if the word checked was "Form" and only "form" was in the
database, you'd get back a list containing only "form", and your app
could decide whether that constituted an error or not (e.g. by looking
at whether you are at the start of a sentence, in a title, etc.).
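In rough terms, the check boils down to something like this (a sketch
only; the function and variable names are illustrative, not the actual
pyspelling API, and the word-sets are just treated as mapping-like
objects):

    def check(word, word_sets):
        """Return (name, variants) for the first word-set containing the
        normalised word, or None if no word-set contains it."""
        key = word.lower()                  # stand-in for real normalisation
        for name, db in word_sets.items():
            value = db.get(key)             # a single lookup per word-set
            if value is not None:
                return name, value.split('\000')   # stored variants
        return None

The caller then decides whether a case-only variant (e.g. "form" offered
for "Form") actually counts as an error.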
The second service is the "what words are similar to this?"
"suggestion", which is comparatively very slow (from 0.03s at
edit-distance=1 to 2.5s at edit-distance=2, depending on the chosen
"fuzziness"). The algorithm is currently based on a phonetic
compression mechanism (with the rules read from Aspell's files), with
edit-distance calculations determining which records are considered
'hits'. At the moment this uses an approach based on the documentation
for the aspell engine: a heavily-trimmed query for single-edit-distance
fuzziness (~2,000 queries on a 500,000 word dictionary) and a
(slightly-trimmed) straight scan through the database for more lax
searches (which scans an average of ~200,000 records in a 500,000 word
dictionary).
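Conceptually the scan looks something like the following (a rough
sketch, assuming a phonetic() compression function and an
edit_distance() function; neither name comes from the actual code):

    def suggest(word, phonetic_db, phonetic, edit_distance, max_distance=2):
        """Return all words whose phonetic key is within max_distance
        edits of the phonetic form of `word` (the expensive full scan)."""
        target = phonetic(word)
        hits = []
        for key in phonetic_db.keys():
            if edit_distance(target, key) <= max_distance:
                hits.extend(phonetic_db[key].split('\000'))
        return hits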
Eventually I intend to provide a few other services:
Rank words according to their likely relevance to a context (this
assumes the words are in the dictionary). This will be built on a number
of mechanisms, including whole-word-list rankings by dictionary (e.g.
common English words having a higher ranking than Shakespearean English
words in most dictionaries, but the reverse in a Shakespearean
dictionary), statistical recording per-user (i.e. you always use the
word "ripping", so it's more likely to be what you mean), and
potentially algorithmic mechanisms based on the context of the word (for
example, noun-verb agreement). Beyond the obvious use-case of sorting
suggestions, it would be useful for technical writers to take a
Standard-English dictionary split by commonality of word (as is
available) and colourise each word in a corpus according to its
commonality. (A rough sketch of combining such rankings follows this
list of planned services.)
Word-completion (lookup of all completing words in the current
dictionary)
Correction and usage tracking (would feed some of the ranking
mechanisms above). This would likely be per-user (and optional). It
would allow you to store per-app/per-doc/whatever usage databases to
provide both more efficient and more personal/contextualised services to
the user. Correction tracking is a sub-set of usage tracking, used
solely by the spelling system. Usage tracking in the more general
approach allows you to make educated guesses, based on the current input
and past history, about what the user probably intended to say.
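The ranking service mentioned above might combine its inputs along
these lines (a hypothetical illustration; the weighting and the names
are invented for the example, not part of the current code):

    def rank(words, dictionary_rank, user_counts, usage_weight=2.0):
        """Order candidate words, most likely first.  dictionary_rank maps
        word -> commonality score for the loaded dictionaries; user_counts
        maps word -> how often this user has actually used the word."""
        def score(word):
            return (dictionary_rank.get(word, 0)
                    + usage_weight * user_counts.get(word, 0))
        decorated = [(score(word), word) for word in words]
        decorated.sort()
        decorated.reverse()
        return [word for (value, word) in decorated]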
The primary focus of the system's current design is on modularity and
flexibility, so the code allows for loading both in-memory and on-disk
word lists, and for mixing and matching word-lists within multiple
loaded dictionaries (note: aspell also offers mix-and-match mechanisms
for word-lists). The idea is that you load word-lists per-document,
per-project, and per-application and add them to the current user's
default dictionaries/correction-lists/usage-statistics to give you a
contextually-relevant dictionary.
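In code terms the mix-and-match idea is roughly this (a sketch; the
class and method names are illustrative rather than the project's
actual interfaces):

    class Dictionary:
        """An ordered collection of word-lists checked in turn: system
        lists first, then per-application, per-project, per-document and
        per-user lists added for the current context."""
        def __init__(self, word_lists):
            self.word_lists = list(word_lists)
        def add_word_list(self, word_list):
            self.word_lists.append(word_list)
        def check(self, word):
            key = word.lower()
            for word_list in self.word_lists:
                if word_list.has_key(key):
                    return word_list
            return None

An editor could then call add_word_list() with a document-specific list
when a file is opened and get contextual checking from the same object.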
Current Status and Problems
I'm currently using Python's bsddb wrappers for storing the
word-lists. I'm storing the phonetic database as "phonetic-compressed":
word [+ '\000'+word] and the non-phonetic database as "normalised": word
[+ '\000'+word] (i.e. the key is the compressed/normalised form and the
value is one or more words separated by nulls). That means that the
database requires (loosely) 4 times the storage space of the words
themselves, plus the overhead of the database. In practice that means
the 500,000 word phonetic database (7.1MB of plain-text words) takes
around 29MB when compiled. 500,000 is a fairly large dictionary (my
"official scrabble dictionary" (dead-trees) is 500,000 words), but I'm
thinking a real-world installation will want ~1,000,000 words to be
available in system word-sets (for each language), so we'd be talking
around 120MB*(number of languages) just for the system dictionaries.
So, some thoughts on how to compress the word-sets without dramatically
overloading the system (processing overhead) would be appreciated.
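For concreteness, the layout amounts to something like this with the
bsddb module (a sketch of the format described above, not the project's
actual build code; the key 'FRM' just stands in for a real phonetic
form):

    import bsddb

    def add_word(db, key, word):
        """Append word to the '\000'-separated list stored under key."""
        if db.has_key(key):
            if word not in db[key].split('\000'):
                db[key] = db[key] + '\000' + word
        else:
            db[key] = word

    db = bsddb.btopen('phonetic.db', 'c')
    add_word(db, 'FRM', 'form')
    add_word(db, 'FRM', 'forum')
    # db['FRM'] is now 'form\000forum'
    db.close()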
bsddb will, if a program using it crashes with a db open for
writing, corrupt the database and be unable to open it again. Is there
a better embeddable database system for this kind of work? I'm considering
testing meta-kit (which I used a few years ago), but the last time I
worked with it the same problem plagued it (it would corrupt files if
not properly closed).
The scanning speed is just not usable under the current bsddb system
(even if I use a non-thread-safe naive approach), so the aspell-style
searches for larger edit distances (i.e. more mistakes allowed) are
prohibitively expensive (i.e. a single suggestion taking 2.5 seconds).
The current code goes through around 200,000 iterations for a
distance=2 search (even after optimising out the 2-deletion cases).
That would likely be unnoticeably fast in C, but it's hideously slow in
Python.
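The per-record work in that scan is essentially a Levenshtein
edit-distance calculation like the one below (a generic textbook
implementation, not the project's code), which is cheap in C but
painful when Python runs it a couple of hundred thousand times per
suggestion:

    def edit_distance(a, b):
        """Dynamic-programming Levenshtein distance between two strings."""
        previous = list(range(len(b) + 1))
        for i in range(1, len(a) + 1):
            current = [i] + [0] * len(b)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    cost = 0
                else:
                    cost = 1
                current[j] = min(previous[j] + 1,        # deletion
                                 current[j - 1] + 1,     # insertion
                                 previous[j - 1] + cost) # substitution
            previous = current
        return previous[len(b)]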
I am considering creating a phonetic-feature-set database instead of
continuing work on the scanning phonetic database. This would map each
2-character sub-string in the phonetic compression to the set of words
having that feature, giving n+1 entries for each word (start and end are
considered characters in the feature-sets). To do that without creating
huge databases I would need a mechanism for storing pointers to words,
rather than the words themselves, within the database. I'm sure there
must be some useful way to do that, but everything I've seen relies on
storing integers and then doing a query to resolve each integer back to
a string. That seems like it would still be expensive (especially using
bsddb). Are there any better mechanisms people know of? As a note, this
is a toolkit for building spell-checkers, so I'll likely provide both
mechanisms, but it's unlikely anyone would use both at the same time (it
would mean ~180MB/language for system dictionaries).
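The feature extraction itself is simple enough; something like this
would generate the n+1 keys per word (the '^' and '$' boundary markers
are arbitrary choices for the example):

    def features(phonetic_form, start='^', end='$'):
        """Return the 2-character features of a phonetic form, with
        explicit start and end markers."""
        padded = start + phonetic_form + end
        return [padded[i:i + 2] for i in range(len(padded) - 1)]

    # features('FRM') -> ['^F', 'FR', 'RM', 'M$'], i.e. len('FRM') + 1 entries

The hard part, as noted, is storing the word references compactly under
each feature key.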
For those interested in playing with the current code, the project is on
sourceforge at:
http://sourceforge.net/projects/pyspelling/
Enjoy yourselves,
Mike
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/