aspell-devel

[aspell-devel] Re: Aspell and


From: Mike C. Fletcher
Subject: [aspell-devel] Re: Aspell and
Date: Mon, 21 Oct 2002 18:24:35 -0400
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.1) Gecko/20020826

Sorry if I caused offense by using your code without notifying you. I didn't really think you'd be interested in a project that's so early in its life cycle (I only set up the SourceForge project within the last hour). I cancelled a message I wrote to you Saturday night because I figured you'd be too busy to answer questions from the likes of me before I had anything that actually works :) .

As for compiling Aspell on Win32, I hadn't tried the MinGW version of GCC. I had noticed the post about the VC++ compilation patch, but your comment on it seemed to suggest that it would require quite a bit of work to be acceptable. Given that I have no great C/C++ skill, it is easier for me to build the infrastructure in Python and use C/C++ only for a few key algorithms than to try to modify a complex C/C++ project.

Too bad about not using the *.rws files directly, but on reflection I'm leaning toward giving both dictionary creators and users (GUI) tools for generating redistributable files for dictionaries and word-sets. From the sound of it, it should be easy to let users generate distributables for either system. If they have Aspell installed we'll offer the word-list-(de)compress functionality; otherwise I'll only accept/generate uncompressed lists.
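The "offer it only if Aspell is installed" idea can be sketched roughly as follows. This is an illustration, not code from the project: the function names are hypothetical, and it assumes the word-list-compress utility is on the PATH, reads standard input, writes standard output, and takes `d` as its decompress argument.

```python
import shutil
import subprocess


def find_tool(name="word-list-compress"):
    """Return the full path of the utility, or None if it is not on PATH."""
    return shutil.which(name)


def decompress_word_list(data, tool_name="word-list-compress"):
    """Decompress bytes with the external utility.

    Returns None when the tool is not installed, so the caller can fall
    back to accepting uncompressed lists only.
    Assumption: the tool filters stdin to stdout and 'd' means decompress.
    """
    tool = find_tool(tool_name)
    if tool is None:
        return None
    result = subprocess.run(
        [tool, "d"], input=data, stdout=subprocess.PIPE, check=True
    )
    return result.stdout
```

A caller would check for `None` and disable the compressed-list import/export options in the GUI in that case.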

I am somewhat at a loss as to how you access the "compressed" files. I'd thought they used a B-tree or similar index, but it doesn't seem that way when I look at the code for word-list-compress. Are you loading the whole word-set into memory? That should make it fast, but doesn't it consume a lot of space? I'm currently using bsddb tables on disk, with an in-memory hash-table implementation for temporary word-sets (such as per-document and per-application sets).

I'll have to look at the typo-weighting code, as I'm not sure where to hook it into the leditdistance algorithm. It would seem that you'd need each "swap" to be a lookup into the typo table. I'm looking at making a set of ranking algos based on:

   set meta-data
       user-specific sets have higher rank than system sets
       dictionaries declare set's "commonality" ranking (e.g. the english dict has levels 10,20,...90)
       might allow for "formality" rankings (e.g. slang word-sets have lower ranking in Business dictionaries and higher in Informal dictionaries). Similarly "technicality", "political correctness" or whatever key you want. Made a float factor, sets which don't include the meta-data just get the default values.
       each dictionary would then include the set meta-data to determine the ranking of suggestions within itself. Most likely would use a single float value at run time (basically the product of the various set weightings)
   frequency tracking
       individual user's word-frequency tracking (optional). If it's tracked, may as well use it.
       individual user's typo-frequency tracking (optional). It might be useful to track the frequency of typos for a given user to generate the weightings (i.e. if a correction is reported, increment the diff (i -> o) frequency record as well as the whole-word correction's frequency record).
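As a rough illustration of the "single float value at run time" idea, the sketch below multiplies per-set factors into one weight and sorts suggestions by it. The class name, the factor keys, the default of 1.0 and the user-set bonus of 1.5 are all assumptions made for the example; none of them come from the project.

```python
from dataclasses import dataclass, field

DEFAULT_FACTOR = 1.0  # assumed default for sets that carry no meta-data


@dataclass
class WordSet:
    name: str
    user_specific: bool = False
    # Hypothetical meta-data, e.g. {"commonality": 0.9, "formality": 0.2}
    factors: dict = field(default_factory=dict)


def set_weight(word_set, keys=("commonality", "formality"), user_bonus=1.5):
    """Collapse a set's meta-data into one float: the product of its factors."""
    weight = user_bonus if word_set.user_specific else 1.0
    for key in keys:
        weight *= word_set.factors.get(key, DEFAULT_FACTOR)
    return weight


def rank(suggestions):
    """Sort (word, WordSet) pairs so the highest-weighted set comes first."""
    return sorted(suggestions, key=lambda pair: set_weight(pair[1]), reverse=True)
```

Word-frequency or typo-frequency counts, if tracked, could be folded in as one more multiplicative factor in the same way.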

Anyway, rather than blathering on at you, suppose I'll do some more work now. Have fun,
Mike

Kevin Atkinson wrote:

[CC to Aspell-devel for a public record of our conversation, please continue to do so unless you have a good reason not to.]

I was browsing through Usenet groups for the search term "Aspell", as I do from time to time to see what people are saying about Aspell, and I came across your thread "Spell-check engine?" on comp.lang.python.

Although the LGPL gives you the right to reuse my code, I would have appreciated a note to that effect. You could have saved yourself a good deal of effort by contacting me first.

A few points I want to address:

The Aspell library should compile on Win32 using the MinGW version of GCC, which means that the Cygwin library does not need to be pulled in. It can now also be compiled with VC++ using a user-contributed patch, but that is completely unsupported by me.

Do not even think about using the *.rws files, as that is a compiled
dictionary format internal to Aspell and can change at any time.  For
example, the next Aspell release, 0.51, will change the format of the
compiled word lists in a nontrivial way.  However, using the *.cwl files
is rather easy.  All the *.cwl files are just word lists compressed with
the word-list-compress utility distributed with Aspell.  The process is
extremely simple and can easily be written in any language.

When edit distances are computed, each "edit" has a weight associated with it. When typo analysis is used, the weights are significantly different from those of the normal edit distance algorithm. The basic algorithm is the same, however.
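To make the "same algorithm, different weights" point concrete, here is a minimal weighted edit distance sketch. It is not Aspell's implementation: the function name, the cost parameters and the `sub_cost` table are illustrative assumptions, and it handles only insertions, deletions and substitutions (no transpositions).

```python
def weighted_edit_distance(source, target, sub_cost=None,
                           insert_cost=1.0, delete_cost=1.0, default_sub=1.0):
    """Dynamic-programming edit distance with a weight per edit.

    sub_cost maps (typed_char, intended_char) pairs to a substitution
    weight; pairs missing from the table cost default_sub.
    """
    sub_cost = sub_cost or {}
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):          # delete every char of source
        dist[i][0] = dist[i - 1][0] + delete_cost
    for j in range(1, cols):          # insert every char of target
        dist[0][j] = dist[0][j - 1] + insert_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            sub = 0.0 if a == b else sub_cost.get((a, b), default_sub)
            dist[i][j] = min(dist[i - 1][j] + delete_cost,
                             dist[i][j - 1] + insert_cost,
                             dist[i - 1][j - 1] + sub)
    return dist[-1][-1]
```

With an empty table this is the ordinary edit distance. Passing a table such as `{("i", "o"): 0.5}` makes that particular substitution cheaper, which is one way a typo table could be hooked in as a per-edit lookup.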

If you have any other questions I will be happy to address them.
_______________________________________
 Mike C. Fletcher
 Designer, VR Plumber, Coder
 http://members.rogers.com/mcfletch/






