Re: [aspell-devel] Problems with aspell-en license


From: Kevin Atkinson
Subject: Re: [aspell-devel] Problems with aspell-en license
Date: Mon, 21 Oct 2002 00:55:26 -0400 (EDT)

Here is the README as found in the DEC word list.  It lists all the word
lists used.  SCOWL is a compilation of several word lists, one of
which is DEC, which is itself a compilation of several word lists, so I
really do not think there is a problem.  Furthermore, the copyright of word
lists is questionable in and of itself.  Also, many word lists in SCOWL
are taken from listing the words in well-known copyrighted dictionaries.
So a case could be made that many of the other word lists SCOWL uses are
also of questionable copyright and should be removed, which would leave me
with nothing.

I will remove the DEC word list from my source only if Debian refuses
to include the English word list due to questionable copyright on some of
the sources that DEC uses.  But if I do, I will make a note of the reason
why it was removed, which will include a statement by me that more or less
states that I think debian-legal is being completely anal about the
matter.

Have a good day.

>>BEGIN DEC README

FILE: english.words
VERSION: DEC-SRC-92-04-05

EDITOR

    Jorge Stolfi <address@hidden>
    DEC Systems Research Center
  
AUTHORS OF ORIGINAL WORDLISTS

    Andy Tanenbaum <address@hidden>
    Barry Brachman <address@hidden>
    Geoff Kuenning <address@hidden>
    Henk Smit <address@hidden>
    Walt Buehring <address@hidden>

DESCRIPTION

    The file english.words is a list  of over 104,000
    English words compiled from several public domain wordlists.  

    The file has one word per line, and is sorted with sort(1)
    in plain ASCII collating sequence.

    The file is supposed to include all verb forms ("-s", "-ed",
    "-ing"), noun plurals and possessives, and forms derived by various
    prefixes and suffixes ("un-", "re-", "-ly", "-er", "-ation", etc.)
    However, the list is still highly incomplete and inconsistent: not
    all stems have all forms, and some forms (notably possessive
    plural) are missing altogether.

    The file is NOT supposed to contain any "proper" names, such as
    the names of ordinary persons, corporations and organizations;
    nations, countries and other geographical names; mythological
    figures; biological genera; and trademarked products.  It is also
    not supposed to contain abbreviations, measurement symbols, and
    acronyms. (Some of these are available in separate files; see
    below).

    The pronoun "I" and its contractions ("I'm", "I've") are
    capitalized as usual; the other words are all in lowercase.
    Besides the letters [a-zA-Z], the file uses only hyphen,
    apostrophe, and newline.
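
A minimal Python sketch of how one might check a file against the format
described above (one word per line, plain ASCII sort order, only letters,
hyphen, and apostrophe).  The name check_wordlist is hypothetical and is
not part of the DEC distribution; the sketch assumes pure-ASCII input, for
which Python's string comparison agrees with ASCII collation.

```python
import re

# Letters, hyphen, and apostrophe only, as the README describes.
ALLOWED = re.compile(r"[A-Za-z'-]+")

def check_wordlist(lines):
    """Return a list of (line_number, problem) pairs for a word list."""
    problems = []
    previous = None
    for number, line in enumerate(lines, start=1):
        word = line.rstrip("\n")
        if not ALLOWED.fullmatch(word):
            problems.append((number, "unexpected character or empty line"))
        # Python compares str values by code point, which matches ASCII
        # order when the text is pure ASCII (an assumption of this sketch).
        if previous is not None and word < previous:
            problems.append((number, "out of ASCII sort order"))
        previous = word
    return problems
```

Passing an open file object (for example open("english.words")) as the
lines argument would report every offending line number.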

AUXILIARY LISTS

    In the same directory as english.words there are a few
    complementary word lists, all derived from the same sources [1--8]
    as the main list:

    english.names

        A list of common English proper names and their derivatives.
        The list includes: person names ("John", "Abigail",
        "Barrymore"); countries, nations, and cities ("Germany",
        "Gypsies", "Moscow"); historical, biblical and mythological
        figures ("Columbus", "Isaiah", "Ulysses"); important
        trademarked products ("Xerox", "Teflon"); biological genera
        ("Aerobacter"); and some of their derivatives ("Germans",
        "Xeroxed", "Newtonian").
    
    misc.names

        A list of foreign-sounding names of persons and places
        ("Antonio", "Albuquerque", "Balzac", "Stravinski"), extracted
        from the lists [1--8].  (The distinction between
        "English-sounding" and "foreign-sounding" is of course rather
        arbitrary).

    org.names

        A short list of names of corporations and other institutions
        ("Pepsico", "Amtrak", "Medicare"), and a few derivatives.  

        The file also includes some initialisms --- acronyms and
        abbreviations that are generally pronounced as words rather
        than spelled out ("NASA", "UNESCO").

    english.abbrs

        A list of common abbreviations ("etc.", "Dr.", "Wed."),
        acronyms ("A&M", "CPU", "IEEE"), and measurement symbols
        ("ft", "cm", "ns", "kHz").

    english.trash
                
        A list of words from the original wordlists
        that I decided were either wrong or unsuitable for inclusion
        in the file english.words or any of the other auxiliary 
        lists. It includes
        
          typos ("accupy", "aquariia", "automatontons")
          spelling errors ("abcissa", "alleviater", "analagous")
          bogus derived forms ("homeown", "unfavorablies", "catched")
          uncapitalized proper names ("afghanistan", "algol", "decnet")
          uncapitalized acronyms ("apl", "ccw", "ibm")
          unpunctuated abbreviations ("amp", "approx", "etc")
          British spellings ("advertize", "archaeology")
          archaic words ("bedight")
          rare variants ("babirousa")
          unassimilated foreign words ("bambino", "oui", "caballero")
          mis-hyphenated compounds ("babylike", "backarrows")
          computer keywords and slang ("lconvert", "noecho", "prog")

        (I apologize for excluding British spellings.  I should have
        split the list in three sublists--- common English, British,
        American---as ispell does.  But there are only so many hours
        in a day...)

    english.maybe

        A list of about 5,000 lowercase words from the "mts.dict"
        wordlist [6] that weren't included in english.words.

        This list seems to include lots of "trash", like uncapitalized
        proper names and weird words.  It would take me several days
        to sort this mess, so I decided to leave it as a separate
        file.  Use at your own risk...
        
ORIGINAL LISTS 

    The original wordlists from which those files were compiled are
    listed below.  They were obtained by anonymous FTP on 92-Feb-10.

    [1] file: ispell/ispell/english.lrg
        size: 690778 bytes
        contact: Walt Buehring <address@hidden>
        from: phloem.uoregon.edu: /pub/src/ispell.3.0.tar.Z

          * The (unexpanded) "large" english wordlist for ispell 3.0.

    [2] file: ispell/ispell/english.sml+
        size: 575226 bytes
        contact: Walt Buehring <address@hidden>
        from: phloem.uoregon.edu: /pub/src/ispell.3.0.tar.Z

          * The (expanded) "small" english wordlist for ispell 3.0.

    [3] file: words.english.Z
        size: 217119 bytes (479261 bytes uncompressed)
        contact: Henk Smit <address@hidden>
        from: donau.et.tudelft.nl: /pub/words/

          * From the README file on ftp.cs.vu.nl:

                This list is made out of 2 lists,
                  the normal /usr/dict/words on most Unix systems,
                  TeX english wordlist (available at archive.cs.ruu.nl)

    [4] file: dict.2
        size:   274848 bytes
        contact: H Morrow Long <address@hidden>
        from: bulldog.cs.yale.edu: /pub/dict.shar

          * According to H. Morrow, it came with some version
            of the "ispell" package.

    [5] file: minix.dict
        size: 357226 bytes
        author: Andy Tanenbaum <address@hidden>
        from: cs.ubc.ca: /pub/wordlists-1.0.tar.Z

          * From the README file:

            Article 1997 of comp.os.minix:
            From: address@hidden
            Subject: A spelling checker for MINIX
            Date: 6 Jan 88 22:28:22 GMT
            Reply-To: address@hidden (Andy Tanenbaum)
            Organization: VU Informatica, Amsterdam

            This dictionary is NOT based on the UNIX dictionary so it
            is free of AT&T copyright.

            I built the dictionary from three sources.  First, I
            started by sorting and uniq'ing some public domain
            dictionaries.  Second, as some of you probably know, I
            have written somewhere between 3 and 6 books (depending on
            precisely what you count) and an additional 50 published
            papers on operating systems, networks, compilers,
            languages, etc.  This data base, which is online, is
            nonnegligible :-) Finally, I added a number of words that
            I thought ought to be in the dictionary including all the
            U.S. states, all the European and some other major
            countries, principal U.S. and world cities, and a bunch of
            technical terms.  I don't want my spelling checker to barf
            on arpanet, diskless, modem, login, internetwork,
            subdirectory, superuser, vlsi, or winchester just because
            Webster wouldn't approve of them.

            All in all, the dictionary is over 40,000 words.  If you
            have any suggestions for additions or deletions, please
            post them.  But please be sure you are not infringing on
            anyone's copyright in doing so.

              Andy Tanenbaum (address@hidden)

    [6] file: mts.dict
        size: 346983 bytes
        contact: Barry Brachman <address@hidden>
        from: cs.ubc.ca: /pub/wordlists-1.0.tar.Z

          * From the README file:

            These word lists were collected by Barry Brachman
            <address@hidden> at the University of British
            Columbia.  They may be freely distributed as long as this
            notice accompanies them.

            mts.dict contains only words that are not in
            /usr/dict/words.  [But note that your version of
            /usr/dict/words may be different from mine!  Use "sort -u"
            to get a list of unique words. ]

              From wc:

              24259   24259  198596 /usr/dict/words
              35475   35475  346992 mts.dict
              -----   ----- -------
              59734   59734  545588 total


    [7] file: words.english.Z
        size: 288385 bytes (644217 bytes uncompressed)
        from: ftp.hawaii.edu: /pub/editors/LEXICAL/word-lists/
        author: unknown.

    COMMENTS: The "large" list from ispell 3.0 [1] is the most
    complete, and contains almost all the words of the "small" ispell
    list [2], of Andy Tanenbaum's list minix.dict [5], and of the
    lists from Delft and Yale [3, 4], as well as /usr/dict/words. It
    leaves out some 500--1000 words from each of these lists.

    On the other hand, the file mts.dict from UBC [6] contains some 7000
    words that are not in the ispell list [1].  Therefore, mts.dict
    seems to be largely orthogonal to the lists [1--5].

    The file words.english from Hawaii [7] seems to be the union of
    mts.dict [6], Andy's file minix.dict [5], and /usr/dict/words,
    except that it omits some 250 words from the latter.
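
Overlap comparisons like the ones in these comments can be reproduced
with set arithmetic.  The following Python sketch is only an
illustration, not the tooling the README's editor used; load_words and
overlap_report are hypothetical names, and the latin-1 encoding is an
assumption chosen so that stray non-ASCII bytes do not abort the read.

```python
def load_words(path):
    """Read a one-word-per-line file into a set, skipping blank lines."""
    # latin-1 is an assumption: it decodes any byte, so reading never fails.
    with open(path, encoding="latin-1") as handle:
        return {line.strip() for line in handle if line.strip()}

def overlap_report(reference, other):
    """Count words shared by two sets and words unique to each one."""
    return {
        "shared": len(reference & other),
        "only_reference": len(reference - other),
        "only_other": len(other - reference),
    }
```

For instance, overlap_report(load_words("english.lrg"),
load_words("mts.dict")) would count how many words of mts.dict are
missing from the ispell list, assuming both files have been expanded to
one word per line.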

COMPILATION PROCESS

    The file english.words is a slightly cleaned-up version of
    the "large" english wordlist [1] that comes with the ispell
    3.0 package, which is available from phloem.uoregon.edu.  

    First, I expanded the prefixes and suffixes using "isexpand" and
    some Gnuemacs hacking, and removed all words with capitals or
    periods.  Then I compared the result with other publicly available
    wordlists [2--7], and did a little bit of manual cleanup.  That
    meant removing some 8500 words that were obviously wrong or
    inappropriate, and adding about 4800 new words.  Those 8500
    words were largely distributed among the other lists.
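
As a rough illustration of the "removed all words with capitals or
periods" step only, here is a Python sketch.  The function name is
hypothetical, and the sketch does not attempt to reproduce the affix
expansion or the manual cleanup described above.

```python
def drop_capitals_and_periods(words):
    """Keep only words that have no uppercase letters and no periods."""
    return [w for w in words if "." not in w and w == w.lower()]
```

Note that this filter would also drop "I" and its contractions, which the
final file does contain, so those would have to be handled separately.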

    The table below gives the number of lowercase words in each
    original list ("lcase"), and how many of such words were included
    ("accept") and not included ("reject") in the final file
    english.words:

      ref  site: file                lcase  accept  reject
      ---  ----------------------  -------  ------  ------
      [1]  uoregon: english.lrg     103124  102000    1124
      [2]  uoregon: english.sml+     56694   56223     471
      [3]  tudelft: words.english    48150   47305     845
      [4]  yale: dict.2              47355   46577     778
      [5]  ubc: minix.dict           38699   38394     305
      [6]  ubc: mts.dict             35215   28874    6341
      [7]  hawaii: words.english     65165   57558    7607
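
The table can be sanity-checked mechanically: for each row, "accept" plus
"reject" should equal "lcase".  The Python sketch below copies the
figures from the table; the names TABLE and summarize are hypothetical,
and the rejection percentage is a derived figure that does not appear in
the README.

```python
# (ref, lcase, accept, reject), copied from the table above.
TABLE = [
    ("[1]", 103124, 102000, 1124),
    ("[2]", 56694, 56223, 471),
    ("[3]", 48150, 47305, 845),
    ("[4]", 47355, 46577, 778),
    ("[5]", 38699, 38394, 305),
    ("[6]", 35215, 28874, 6341),
    ("[7]", 65165, 57558, 7607),
]

def summarize(rows):
    """Return (ref, row_is_consistent, reject_percentage) for each row."""
    result = []
    for ref, lcase, accept, reject in rows:
        consistent = accept + reject == lcase
        reject_pct = round(100 * reject / lcase, 1)
        result.append((ref, consistent, reject_pct))
    return result
```

Every row of the table passes the consistency check.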

(NON-)COPYRIGHT STATUS

  To the best of my knowledge, all the files I used to build these
  wordlists were available for public distribution and use, at least
  for non-commercial purposes.  I have confirmed this assumption with
  the authors of the lists, whenever they were known.
  
  Therefore, it is safe to assume that the wordlists in this package
  can also be freely copied, distributed, modified, and used for
  personal, educational, and research purposes.  (Use of these files in
  commercial products may require written permission from DEC and/or
  the authors of the original lists.)
  
  Whenever you distribute any of these wordlists, please distribute
  also the accompanying README file.  If you distribute a modified
  copy of one of these wordlists, please include the original README
  file with a note explaining your modifications.  Your users will
  surely appreciate that.

(NO-)WARRANTY DISCLAIMER

  These files, like the original wordlists on which they are based,
  are still very incomplete, uneven, and inconsistent, and probably
  contain many errors.  They are offered "as is" without any warranty
  of correctness or fitness for any particular purpose.  Neither I nor
  my employer can be held responsible for any losses or damages that
  may result from their use.


-- 
http://kevin.atkinson.dhs.org




