[elpa] externals/doc-toc ae455b4863 52/84: Implement language customizat

emacs-elpa-diffs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[elpa] externals/doc-toc ae455b4863 52/84: Implement language customizat

From:	ELPA Syncer
Subject:	[elpa] externals/doc-toc ae455b4863 52/84: Implement language customization for OCR
Date:	Mon, 26 Sep 2022 13:58:38 -0400 (EDT)

branch: externals/doc-toc
commit ae455b486327205e5e6df786d42af188bddd8fa5
Author: Daniel Nicolai <dalanicolai@gmail.com>
Commit: Daniel Nicolai <dalanicolai@gmail.com>

    Implement language customization for OCR
---
 README.org  |  3 ++-
 toc-mode.el | 37 ++++++++++++++++++++++++++++++-------
 2 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/README.org b/README.org
index 8e268b752e..86a38c2d93 100644
--- a/README.org
+++ b/README.org
@@ -45,7 +45,8 @@ toc-extract-pages-ocr= if doc has no text layer or text layer 
is bad, and answer
 the subsequent prompts by entering the pagenumbers for the first and the last
 page each followed by =RET=. *For PDF extraction with OCR, currently it is 
required*
 *to view all contents pages once before extraction* (toc-mode uses the cached 
file
-data).
+data). Also the languages used for tesseract OCR can be customized via the
+`toc-ocr-languages' variable.
 
 [[toc-mode-extract.gif]]
 
diff --git a/toc-mode.el b/toc-mode.el
index 3ce807deb5..d2dbfa279a 100644
--- a/toc-mode.el
+++ b/toc-mode.el
@@ -48,11 +48,12 @@
 ;; text layer is bad, and answer the subsequent prompts by entering the
 ;; pagenumbers for the first and the last page each followed by RET. For PDF
 ;; extraction with OCR, currently it is required to view all contents pages 
once
-;; before extraction (toc-mode uses the cached file data). A buffer with the,
-;; somewhat cleaned up, extracted text will open in TOC-cleanup mode. Prefix
-;; command with the universal argument (C-u) to omit clean and get the raw 
text.
-;; 2. TOC-Cleanup In this mode you can further cleanup the contents to create a
-;; list where each line has the structure:
+;; before extraction (toc-mode uses the cached file data). Also the languages
+;; used for tesseract OCR can be customized via the `toc-ocr-languages'
+;; variable. A buffer with the, somewhat cleaned up, extracted text will open 
in
+;; TOC-cleanup mode. Prefix command with the universal argument (C-u) to omit
+;; clean and get the raw text. 2. TOC-Cleanup In this mode you can further
+;; cleanup the contents to create a list where each line has the structure:
 
 ;; TITLE (SOME) PAGENUMBER
 
@@ -167,6 +168,14 @@ For DJVU the old DJVU file is replaced by default"
   :type 'file
   :group 'toc)
 
+(defcustom toc-ocr-languages nil
+  "Languages used for extraction with ocr.
+Should be one or multiple language codes as recognized
+by tesseract -l flag, e.g. eng or eng+nld. Use
+\\[execute-extended-command] `toc-list-languages' to list the
+available languages."
+  :type 'string
+  :group 'toc)
 ;;;; toc-extract and cleanup
 
 ;;; toc-cleanup
@@ -311,6 +320,15 @@ text."
             (toc--cleanup startpage)))
       (message "Buffer not in pdf-view- or djvu-read-mode"))))
 
+(defun toc-list-languages ()
+  "List languages available for ocr.
+For use in `toc-ocr-languages'."
+  (interactive)
+  (let ((print-length nil))
+    (message (format "%s" (seq-subseq
+                         (split-string
+                          (shell-command-to-string "tesseract --list-langs"))
+                         5)))))
 
 ;;;###autoload
 (defun toc-extract-pages-ocr (arg)
@@ -327,7 +345,10 @@ unprocessed text."
                          (read-string "Enter end-pagenumber for extraction: 
")))
                (source-buffer (current-buffer))
                (ext (url-file-extension (buffer-file-name (current-buffer))))
-               (buffer (file-name-sans-extension (buffer-name))))
+               (buffer (file-name-sans-extension (buffer-name)))
+               (args (list "stdout" "--psm" "6")))
+          (when toc-ocr-languages
+            (setq args (append args (list "-l" toc-ocr-languages))))
           (while (<= page (+ endpage))
             (let ((file (cond ((string= ".pdf" ext)
                                (make-temp-file "pageimage"
@@ -340,7 +361,9 @@ unprocessed text."
                                                nil
                                                (number-to-string page)
                                                (image-property djvu-doc-image 
:data))))))
-              (call-process "tesseract" nil (list buffer nil) nil file 
"stdout" "--psm" "6")
+              (apply 'call-process
+                     (append (list "tesseract" nil (list buffer nil) nil file)
+                             args))
               (setq page (1+ page))))
           (switch-to-buffer buffer)
           (toc-cleanup-mode) ;; required before setting local variable

[Prev in Thread]

Current Thread

[Next in Thread]

[elpa] externals/doc-toc 74c68f0cda 24/84: Update README.org, (continued)
- [elpa] externals/doc-toc 74c68f0cda 24/84: Update README.org, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc f8fbca0c41 23/84: README add features and keybindings, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc f643745b06 32/84: Set windows encoding for djvu, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc c521029525 36/84: Change (beginning-og-buffer) to (goto-char (point-min)) for MELPA, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc a1d26eceb5 37/84: Add defvar and declare-function to avoid warning MELPA, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc 05cd1287f1 42/84: Fix toc--add-to-djvu. Don't ask save location., ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc f430243a88 43/84: Add version: 0 header, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc 59f4471e6a 50/84: Update README.org, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc 2d95c466a3 48/84: Add MELPA and GPL3 badges, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc 734043bdc7 47/84: Improve documentation in toc-mode.el, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc ae455b4863 52/84: Implement language customization for OCR, ELPA Syncer <=
- [elpa] externals/doc-toc 23e1fb2fde 54/84: Implement HandyOutliner option, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc b45b78102c 55/84: Update README, add extract-only documentation, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc b1a843fd6f 57/84: Implement roman-to-arabic and add pdf djvu keybindings, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc 719f6a03a7 64/84: Return page text when pdfxmeta fails, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc dd1dfd83ac 63/84: Fix docstrings and warnings for MELPA, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc 7e2e6be947 69/84: Update/improve README, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc 782d0cd6b5 80/84: Update README.org, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc 977bec00d8 74/84: Tiny bug fix in toc--tablist-to-handyoutliner, ELPA Syncer, 2022/09/26
- [elpa] externals/doc-toc 448a0ac00c 82/84: Small fixes before release on ELPA (fix compiler warnings), ELPA Syncer, 2022/09/26

Prev by Date: [elpa] externals/doc-toc 734043bdc7 47/84: Improve documentation in toc-mode.el
Next by Date: [elpa] externals/doc-toc 23e1fb2fde 54/84: Implement HandyOutliner option
Previous by thread: [elpa] externals/doc-toc 734043bdc7 47/84: Improve documentation in toc-mode.el
Next by thread: [elpa] externals/doc-toc 23e1fb2fde 54/84: Implement HandyOutliner option
Index(es):
- Date
- Thread