Re: [lmi] Strip markup for spell checking?
From: Greg Chicares
Subject: Re: [lmi] Strip markup for spell checking?
Date: Thu, 29 Oct 2015 15:12:04 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.3.0
On 2015-10-28 17:19, Vadim Zeitlin wrote:
> On Wed, 28 Oct 2015 15:47:15 +0000 Greg Chicares <address@hidden> wrote:
>
> GC> I'm looking for a way to check spelling in XSL files, excluding markup.
[...]
> I don't know of any tools specifically for spell checking XSL, but I think
> any tool usable with XML should do and AFAIK many of them exist and aspell
> does seem to support this and so does hunspell.
Is one preferable to the other? Let's see...
https://wiki.ubuntu.com/ConsolidateSpellingLibs
| hunspell is the most modern implementation and considered the best choice
| in the free software world.
http://fedoraproject.org/wiki/Releases/FeatureDictionary
| Fix the proliferation of dictionaries in the OS.
...
| This is complete, all major applications and default GNOME/KDE spell
| checking now goes through hunspell.
https://lists.gnu.org/archive/html/aspell-announce/2011-09/msg00000.html
| For a long time I thought about ways to regain Aspell status as the
| standard system spell checker, but after giving it a lot of through I
| have decided that this goal that is no longer worth pursuing. ...
| I thought that ... I could ... convince Linux distributions to consider
| Aspell over Hunspell as the one true spell checker; however after many
| years, I finally decided that it wasn't worth it.
It seems clear that 'hunspell' wins, although maybe Debian didn't get
that memo...
/home/greg[0]$uname --all
Linux turgon 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2+deb7u2 x86_64 GNU/Linux
/home/greg[0]$whence aspell
/usr/bin/aspell
/home/greg[0]$whence hunspell
/home/greg[1]$aptitude show hunspell |head -2
Package: hunspell
State: not installed
...but I can just install it:
# apt-get install hunspell
> But I'd just like to say how I do it, which is definitely very low
> technological but works well for me: I open the file in Vim and do "set
> spell", then use "]s" to go the next spelling error, correct it (usually by
> just pressing "1z=" to select the first suggested replacement), press "]s"
> again and so on.
>
> Unfortunately in this particular case, it doesn't work well out of the box
I seek an easy method for people who don't know vim.
> After doing this I could spell check the entire file and, in addition to
> the typo you found, only found one other one and that one in a comment:
[...]
> - <!-- The data to be diplayed in the pages, cover page first -->
> + <!-- The data to be displayed in the pages, cover page first -->
That's been there ever since the file was added to svn:
http://svn.savannah.nongnu.org/viewvc?view=rev&root=lmi&revision=696
I wonder how I missed it in my initial message in this thread. Oh...I
filtered with "sed -e'/<.*>/d'", which deletes every line that contains
a tag, comment lines included.
hunspell's '-H' also seems to remove <!-- comments -->, so I'll avoid
that. And hunspell has widely-reported problems with apostrophes, so
I'll filter them. Removing just a few words that I know are okay, I
come to this casual but useful command:
< /opt/lmi/src/lmi/nasd.xsl sed -e'/<[^!].*>/d' \
| hunspell -L | tr --delete "'" | hunspell | sed -e'/^&/!d' \
-e'/^& \(MEC\|Sep\|nbsp\|DOCTYPE\|stylesheet\|xsl\|[Cc]hicares\) /d'
& nasd 6 9: ands, sand, NASDAQ, NASA, nasty, nosed
& xA0 3 17: Alexa, Xmas, Xian
& diplayed 7 26: displayed, played, diplomaed, display, dismayed, employed,
swordplayer
& inital 9 61: initial, in ital, in-ital, genital, Vinita, Ritalin, Italian,
Intel, entail
It caught both "diplayed" and "inital". The other lines are okay.
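To illustrate why the earlier filter hid "diplayed" while this one does not, here is a minimal sketch that runs both sed expressions over a two-line sample. The comment line is the one quoted above; the element line is made up for illustration, and GNU sed is assumed.

```shell
# Sample input: the comment line quoted above, plus a hypothetical
# element line (not copied from any lmi file).
sample='<!-- The data to be diplayed in the pages, cover page first -->
<xsl:template match="/">'

# Earlier filter: deletes every line containing <...>, so the comment
# (and its typo) never reaches the spell checker.
old=$(printf '%s\n' "$sample" | sed -e'/<.*>/d')

# Filter used in the command above: the character after '<' must not be
# '!', so a line whose only markup is a comment survives.
new=$(printf '%s\n' "$sample" | sed -e'/<[^!].*>/d')

printf 'old: [%s]\nnew: [%s]\n' "$old" "$new"
```

Both filters still work line by line, so a comment that shares a line with an ordinary tag would be dropped by either one.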
> If you'd like to automate this to ensure that new typos don't get checked
> in, I think aspell is still the best solution.
Yes, that's the goal.
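Toward that goal, one possible shape for a check that could run before a commit is sketched below. It only wraps the hunspell command from this message in two functions; the names 'filter_misspellings' and 'check_file' and the exit-status convention are assumptions of mine, not anything lmi has, and GNU sed and tr are assumed.

```shell
#!/bin/sh
# Sketch only: function names and the nonzero-on-typo convention are
# assumptions, not part of lmi.

# Read hunspell output on stdin; keep only lines starting with '&'
# (the form in which misspellings appear in the output shown in this
# message), minus an allow-list of words known to be okay.
filter_misspellings() {
    sed -e'/^&/!d' \
        -e'/^& \(MEC\|Sep\|nbsp\|DOCTYPE\|stylesheet\|xsl\|[Cc]hicares\) /d'
}

# Run the pipeline from this message on one file. Print whatever
# survives the filter and return nonzero if anything did.
check_file() {
    found=$(
        < "$1" sed -e'/<[^!].*>/d' \
        | hunspell -L | tr --delete "'" | hunspell | filter_misspellings
    )
    [ -z "$found" ] && return 0
    printf '%s\n' "$found"
    return 1
}
```

It might be used as "check_file /opt/lmi/src/lmi/nasd.xsl || echo 'possible typos'". With the allow-list as written, that file would still be flagged for "nasd" and "xA0", so the list would need to grow before the exit status means much.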
Let's try the command above with 'illustration_reg.xsl'. Manually
filtering the output, it identifies:
& diplayed 7 26: displayed, played, diplomaed, display, dismayed, employed,
swordplayer
& unaffilliated 7 16: unaffiliated, affiliated, affiliate, unaffectionate,
unillustrated, overinflated, unflappability
& guaranted 9 18: guaranteed, guarantied, guarantee, guarantor, guaranty,
quarantined, warranted, granddaddy, granddad
And in 'fo_common.xsl':
& acces 12 5: aces, access, acnes, acres, acmes, aches, accedes, accuses,
accepts, accents, accent, accede
& differencies 7 51: differences, difference's, difference, differentness,
differentiates, differential, interference
& paranteses 6 8: parentheses, separateness, separates, guarantees, printers,
Prentiss
& Simlpy 4 2: Simply, Simplify, Smallpox, Simla
& recursivly 7 35: recursively, recursive, cursively, recursion, recessively,
reflexivity, aggressively
& adjucent 7 44: adjacent, adjustment, adjutant, adjunct, antecedent,
adjacency, adjusted
[I would have suggested "adjuvant".]
& recursivly 7 24: recursively, recursive, cursively, recursion, recessively,
reflexivity, aggressively
& splitted 11 66: spitted, slitted, splatted, splinted, splitter, splittable,
splintered, splitting, splattered, exploited, splatter
& appox 5 96: approx, APO, pox, apex, Ampex
...though I imagine those are commentary in "common" macros, which
shouldn't affect the PDF.
This is already immediately useful. Refinements to the command I
cobbled together are welcome.