[Bug-ocrad] Possibility of dictionary-enhanced OCR in Ocrad

From:	Martin C. Doege
Subject:	[Bug-ocrad] Possibility of dictionary-enhanced OCR in Ocrad
Date:	Tue, 5 Jul 2005 21:32:07 +0200

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Antonio!

First of all, thanks for taking the time to develop Ocrad! I particularly like how incredibly fast it is in comparison to, say, GOCR. And the recognition rate is not too shabby for a character-based OCR program.

That being said, I do wonder if it would be possible to extend Ocrad with dictionary lookups to improve its recognition rate. So for example, it would work on a word and assign each character a confidence value. So it might recognize the word "heflo", and since that word is not in the dictionary, it would tweak the characters with the lowest confidence value first.

So in this example, maybe it would happen to have a low confidence value for the "h" and the "f" and therefore try to find permutations in the dictionary: "beflo", "bello", "hello", ... If this is not successful it could try replacements like "m" -> "ch", "l_" -> "n", "ii" -> "ü", or whatever other common errors are found in OCR output. Of course the actual dictionary lookups could be handled by an external program like aspell.
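To make the idea concrete, here is a minimal Python sketch of the confidence-guided lookup described above. It is only an illustration, not anything Ocrad does today: the function name correct_word, the tiny DICTIONARY set, the LOOKALIKES table and the confidence numbers are all made up for the example, and a real version would ask aspell or a proper word list instead.

```python
from itertools import product

# Hypothetical stand-in for a real dictionary (e.g. a word list from aspell).
DICTIONARY = {"hello", "bello", "help", "world"}

# Hypothetical look-alike table: each character maps to the characters it
# might really be (including itself). Real confusions would have to be measured.
LOOKALIKES = {
    "h": "hb",
    "f": "flt",
    "l": "l1i",
    "o": "o0",
    "e": "ec",
}


def correct_word(chars, dictionary=DICTIONARY, max_suspects=2):
    """Try to turn an OCR result into a dictionary word.

    chars is a list of (character, confidence) pairs, confidence in [0, 1].
    Returns the first dictionary word found by substituting the
    lowest-confidence characters, or the word unchanged if none is found.
    """
    word = "".join(c for c, _ in chars)
    if word in dictionary:
        return word
    # Character positions ordered from least to most confident.
    order = sorted(range(len(chars)), key=lambda i: chars[i][1])
    # Start with the single least confident position, then widen the search.
    for n in range(1, min(max_suspects, len(order)) + 1):
        suspects = sorted(order[:n])
        options = [LOOKALIKES.get(word[i], word[i]) for i in suspects]
        for combo in product(*options):
            candidate = list(word)
            for i, ch in zip(suspects, combo):
                candidate[i] = ch
            candidate = "".join(candidate)
            if candidate in dictionary:
                return candidate
    return word


# "heflo" with made-up confidences: "f" and "h" are the weakest characters.
heflo = [("h", 0.4), ("e", 0.9), ("f", 0.3), ("l", 0.9), ("o", 0.95)]
print(correct_word(heflo))
```

Limiting the search to the few weakest positions keeps the number of dictionary lookups small, which is the point of using the recognizer's own confidence values rather than trying every character.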

Given that Ocrad is so insanely fast, I think this kind of (optional, of course) overhead could be worthwhile. I have been working on a larger project with Ocrad for a few days, and while I am pretty content with Ocrad's recognition rate, I wish there was an easy way to identify and correct the kinds of typical OCR errors which standard spell checkers do not know how to handle.

Of course much of this could perhaps be done with a filter on the Ocrad-generated text file after the fact, like with a modified aspell (http://lists.gnu.org/archive/html/aspell-user/2002-07/msg00003.html). But of course doing some of this in Ocrad itself might be beneficial because the internal knowledge of the program about the characters being worked on could be used. And in terms of programming work, this would probably be cheaper than trying to improve the OCR engine itself...
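For comparison, the after-the-fact filter could look roughly like the Python sketch below. It has no access to confidence values, so it simply tries a fixed list of rewrite rules on every word that is not in the dictionary. The names fix_word, fix_text, RULES and the small DICTIONARY are hypothetical, the rules are read as "replace the left side with the right side", and a real filter would query aspell rather than a hard-coded set.

```python
import re

# Hypothetical word list; in practice this could come from aspell.
DICTIONARY = {"nicht", "und", "für", "hello"}

# Rewrite rules as (misread, intended) pairs. The direction is an assumption.
RULES = [("m", "ch"), ("l_", "n"), ("ii", "ü")]


def fix_word(word, dictionary=DICTIONARY, rules=RULES):
    """Return a dictionary word reachable by applying one rule once, else word."""
    if word.lower() in dictionary:
        return word
    for wrong, right in rules:
        # Try the rule at every position where the misread pattern occurs.
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate.lower() in dictionary:
                return candidate
            start = word.find(wrong, start + 1)
    return word


def fix_text(text):
    """Apply fix_word to every run of word characters, leaving the rest alone."""
    return re.sub(r"\w+", lambda m: fix_word(m.group(0)), text)


print(fix_text("nimt ul_d fiir hello xyz"))
```

Because the filter only sees the final text, it has to try each rule blindly on every unknown word, whereas inside Ocrad the same rules could be restricted to the characters the recognizer was unsure about.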


Any thoughts on this?

Martin
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (Darwin)

iD8DBQFCyuA4mifxvst1lQIRAgBhAKDJwj2WL6UMkslCSjLDrvNZMQA7GQCdFpYe
6bxhFoCBb7Xw980EIF1t/lM=
=t67Q
-----END PGP SIGNATURE-----

Current Thread

[Bug-ocrad] Possibility of dictionary-enhanced OCR in Ocrad, Martin C. Doege <=
- [Bug-ocrad] Re: Possibility of dictionary-enhanced OCR in Ocrad, Antonio Diaz Diaz, 2005/07/09
