bug-gnu-utils

Re: 'box' missing from web2?


From: pacman
Subject: Re: 'box' missing from web2?
Date: Wed, 5 Jan 2011 23:33:20 -0500 (GMT+5)

Karl Berry writes:
> 
>     I just downloaded the 1.4.2 files
> 
> I also do not see "box" in the miscfiles-1.4.2 "web2" word list.
> 
> Miscfile maintainers, can you add it?

I found "box" in my /usr/dict/words, which inspired me to do 2 things:

1. wonder why there is more than one free wordlist (mine came from
SCOWL - wordlist.sourceforge.net), since this seems like an area where
combining effort should be easy

2. look for more missing words

And there are lots of interesting ones. Of course you expect a lot of
differences in the obscure, borderline-English words, but SCOWL is
organized into several tiers of words and even the lowest tier has 1191
words that are missing from web2, including:

  all box has hid kid adds ages aims asks

(Now I'm thinking, maybe web2 intentionally omits plurals and other suffixed
words. But doesn't that make it less useful than the alternative?)

skipping ahead...

  byte feet hang held

(even irregular plurals and past tenses are omitted? are you kidding?)

  near fewer nicer safer using women analog became lowest catalog

and my favorite

  software

I picked those examples from a list generated by:

  comm -13 <(tr A-Z a-z < miscfiles/web2 | sort -u) \
    <(cat wordlist/scowl/final/*.10 | fgrep -v \' | tr A-Z a-z | sort -u)

Try it yourself and see the whole list.
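
For shells without <( ) process substitution, the same comparison can be
sketched as a small function. This is only an illustration, not the exact
command used above: the name missing_from is made up here, it uses a
temporary file instead of process substitution, and it forces the C locale
so that sort and comm agree on ordering.

```shell
# Hypothetical helper (not part of miscfiles or SCOWL).
# Usage: missing_from REFERENCE_LIST OTHER_LIST...
# Prints the words found in the OTHER_LISTs but not in REFERENCE_LIST,
# compared case-insensitively, skipping words that contain an apostrophe.
missing_from() {
    ref=$1
    shift
    ref_sorted=$(mktemp) || return 1

    # Lowercase and de-duplicate the reference list into a temporary file.
    LC_ALL=C tr 'A-Z' 'a-z' < "$ref" | LC_ALL=C sort -u > "$ref_sorted"

    # Normalize the other lists the same way; comm -13 keeps only the
    # lines that are unique to its second input (read from stdin here).
    cat "$@" | grep -v "'" | LC_ALL=C tr 'A-Z' 'a-z' | LC_ALL=C sort -u |
        LC_ALL=C comm -13 "$ref_sorted" -

    rm -f "$ref_sorted"
}
```

Assuming the same directory layout as in the command above, something like
"missing_from miscfiles/web2 wordlist/scowl/final/*.10 | wc -l" should give
the size of the list rather than the list itself.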

If you include the next tier, *.20, there are a lot more plurals and
past tenses (regular and irregular), and some British spellings, but
also these which are missing from web2 for no apparent reason:

  fond debug disco proud bitmap cookie trendy desktop goodbye numeric
  automate postcard bandwidth lifestyle girlfriend mainstream subroutine

Going the other way, 162 words in web2 are missing from even the biggest
wordlist in that other collection. Most of them are foreign words with
accent marks dropped, which I think is fine - this is an English word
list and a word that can't be spelled in our 26 glorious letters isn't
English.

But there are 26 words in web2 for which I can't find a corresponding
entry in the big wordlist, even with added accents:

  yez yday youl podesta slainte yoursel baptisin cementin colletin
  esterlin gardenin gingerin latherin letterin sisterin smutchin
  syringin aggressin automatin batikulin batikuling catharping
  fluorescin hemorrhagin vasodilatin youdendrift

Some of those are chemicals, another area where any wordlist has to draw
an arbitrary line to avoid expanding to infinite size. "yez" and "yday"
are baffling though. And if "yoursel" isn't a typo, I don't know what it
is.

The obvious thing to do is deprecate web2 in favor of SCOWL (it only needs
"youdendrift" and a few others to become a superset).

-- 
Alan Curry


