emacs-elpa-diffs

[elpa] externals/doc-toc 7e2e6be947 69/84: Update/improve README


From: ELPA Syncer
Subject: [elpa] externals/doc-toc 7e2e6be947 69/84: Update/improve README
Date: Mon, 26 Sep 2022 13:58:39 -0400 (EDT)

branch: externals/doc-toc
commit 7e2e6be947b4da96cb12c1db833cf6e076ae328d
Author: Daniel Nicolai <dalanicolai@gmail.com>
Commit: Daniel Nicolai <dalanicolai@gmail.com>

    Update/improve README
---
 README.org  | 39 +++++++++++++++++++++++++++++----------
 toc-mode.el | 58 ++++++++++++++++++++++++++++++++--------------------------
 2 files changed, 61 insertions(+), 36 deletions(-)

diff --git a/README.org b/README.org
index f0932cc197..141ac1aabd 100644
--- a/README.org
+++ b/README.org
@@ -61,14 +61,19 @@ or with two dashes in the mode name (e.g. =M-x toc--cleanup=). Of course if you
 use packages like Ivy or Helm you just use the fuzzy search functionality.
 
 ** 1. Extraction
-Open some pdf or djvu file in Emacs (pdf-tools and djvu package recommended).
-Find the pagenumbers for the TOC. Then type =M-x toc-extract-pages=, or =M-x
-toc-extract-pages-ocr= if doc has no text layer or text layer is bad, and answer
-the subsequent prompts by entering the pagenumbers for the first and the last
-page each followed by =RET=. *For PDF extraction with OCR, currently it is required*
-*to view all contents pages once before extraction* (toc-mode uses the cached file
-data). Also the languages used for tesseract OCR can be customized via the
-`toc-ocr-languages' variable.
+For PDFs without TOC pages, with a very complicated TOC (i.e. one that
+requires much cleanup work) or with headlines well fitted for automatic
+extraction (you will have to decide for yourself by trying it), consider using
+the [[https://krasjet.com/voice/pdf.tocgen/][pdf.tocgen]] functionality described below.
+
+Otherwise, start by opening some pdf or djvu file in Emacs (pdf-tools and djvu
+package recommended). Find the pagenumbers for the TOC. Then type =M-x
+toc-extract-pages=, or =M-x toc-extract-pages-ocr= if doc has no text layer or text
+layer is bad, and answer the subsequent prompts by entering the pagenumbers for
+the first and the last page each followed by =RET=. *For PDF extraction with OCR,
+currently it is required* *to view all contents pages once before extraction*
+(toc-mode uses the cached file data). Also the languages used for tesseract OCR
+can be customized via the ~toc-ocr-languages~ variable.
 
 [[toc-mode-extract.gif]]
 
@@ -80,6 +85,20 @@ to extract the text with the [[https://pypi.org/project/document-contents-extrac
 more configurable (you are also welcome to hack on and improve that script). For
 this the [[https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html][tesseract]] documentation might be useful.
 
+*** Software-generated PDF's with pdf.tocgen ( [[https://krasjet.com/voice/pdf.tocgen/]])
+For 'software-generated' (i.e. PDF's not created from scans) PDF-files it is
+sometimes easier to use ~toc-extract-with-pdf-tocgen~. To use this function
+you first have to provide the font properties for the different headline
+levels. For that select the word in a headline of a certain level and then
+type M-x ~toc-gen-set-level~. This function will ask which level you are
+setting; the highest level should be level 1. After you have set the various
+levels (1, 2, etc.) then it is time to run M-x ~toc-extract-with-pdf-tocgen~.
+If a TOC is extracted successfully, then in the pdftocgen-mode buffer simply
+press C-c C-c to add the contents to the PDF. The contents will be added to a
+copy of the original PDF with the filename output.pdf and this copy will be
+opened in a new buffer. If the pdf-tocgen option does not work well then
+continue with the steps below.
+
 If you merely want to extract text without further processing then you can
 use the command [[help:toc-extract-only][toc-extract-only]].
 
@@ -181,8 +200,8 @@ toc-mode (tablist)
 
 
 * Alternatives
-For TOC extraction: [[https://pypi.org/project/document-contents-extractor/][documents-contents-extractor]]
-For adding TOC to document (pdf and djvu): [[http://handyoutlinerfo.sourceforge.net/][HandyOutliner]]
+- For TOC extraction: [[https://pypi.org/project/document-contents-extractor/][documents-contents-extractor]]
+- For adding TOC to document (pdf and djvu): [[http://handyoutlinerfo.sourceforge.net/][HandyOutliner]]
 
 *** Donate
 
diff --git a/toc-mode.el b/toc-mode.el
index b3c45968e9..08adc9b4f7 100644
--- a/toc-mode.el
+++ b/toc-mode.el
@@ -44,18 +44,6 @@
 
 ;; Usage:
 
-;; For 'software-generated' (i.e. PDF's not created from scans) PDF-files it is
-;; recommend to use `toc-extract-with-pdf-tocgen'. To use this function you
-;; first have to provide the font properties for the different headline levels.
-;; For that select the word in a headline of a certain level and then type M-x
-;; `toc-gen-set-level'. This function will ask which level you are setting, the
-;; highest level should be level 1. After you have set the various levels (1,2,
-;; etc.) then it is time to run M-x `toc-extract-with-pdf-tocgen'. If a TOC is
-;; extracted succesfully, then in the pdftocgen-mode buffer simply press C-c C-c
-;; to add the contents to the PDF. The contents will be added to a copy of the
-;; original PDF with the filename output.pdf and this copy will be opened in a
-;; new buffer. If the pdf-tocgen option does not work well then continue with
-;; the steps below.
 
 ;; In each step below, check out available shortcuts using C-h m. Additionally
 ;; you can find available functions by typing the M-x mode-name (e.g. M-x
@@ -69,20 +57,24 @@
 ;; 3 adjust/correct pagenumbers
 ;; 4 add TOC to document
 
-;; 1. Extraction Open some pdf or djvu file in Emacs (pdf-tools and djvu package
-;; recommended). Find the pagenumbers for the TOC. Then type M-x
-;; `toc-extract-pages', or M-x `toc-extract-pages-ocr' if doc has no text layer
-;; or text layer is bad, and answer the subsequent prompts by entering the
-;; pagenumbers for the first and the last page each followed by RET. For PDF
-;; extraction with OCR, currently it is required to view all contents pages once
-;; before extraction (toc-mode uses the cached file data). Also the languages
-;; used for tesseract OCR can be customized via the `toc-ocr-languages'
-;; variable. A buffer with the, somewhat cleaned up, extracted text will open in
-;; TOC-cleanup mode. Prefix command with the universal argument (C-u) to omit
-;; clean and get the raw text. If the extracted text is of too low quality you
-;; either can hack/extend the `toc-extract-pages-ocr' definition, or
-;; alternatively you can try to extract the text with the python
-;; document-contents-extractor script (see URL
+;; 1. Extraction For PDFs without TOC pages, with a very complicated TOC (i.e.
+;; one that requires much cleanup work) or with headlines well fitted for automatic
+;; extraction (you will have to decide for yourself by trying it), consider
+;; using the pdf.tocgen (URL `https://krasjet.com/voice/pdf.tocgen/')
+;; functionality described below. Otherwise, start by opening some pdf or djvu
+;; file in Emacs (pdf-tools and djvu package recommended). Find the pagenumbers
+;; for the TOC. Then type M-x `toc-extract-pages', or M-x
+;; `toc-extract-pages-ocr' if doc has no text layer or text layer is bad, and
+;; answer the subsequent prompts by entering the pagenumbers for the first and
+;; the last page each followed by RET. For PDF extraction with OCR, currently it
+;; is required to view all contents pages once before extraction (toc-mode uses
+;; the cached file data). Also the languages used for tesseract OCR can be
+;; customized via the `toc-ocr-languages' variable. A buffer with the, somewhat
+;; cleaned up, extracted text will open in TOC-cleanup mode. Prefix command with
+;; the universal argument (C-u) to omit cleanup and get the raw text. If the
+;; extracted text is of too low quality you either can hack/extend the
+;; `toc-extract-pages-ocr' definition, or alternatively you can try to extract
+;; the text with the python document-contents-extractor script (see URL
 ;; `https://pypi.org/project/document-contents-extractor/'), which is more
 ;; configurable (you are also welcome to hack and improve that script).
 
@@ -90,6 +82,20 @@
 ;; `https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html' might be
 ;; useful.
 
+;; Software-generated PDF's with pdf.tocgen
+;; For 'software-generated' (i.e. PDF's not created from scans) PDF-files it is
+;; sometimes easier to use `toc-extract-with-pdf-tocgen'. To use this function
+;; you first have to provide the font properties for the different headline
+;; levels. For that select the word in a headline of a certain level and then
+;; type M-x `toc-gen-set-level'. This function will ask which level you are
+;; setting; the highest level should be level 1. After you have set the various
+;; levels (1, 2, etc.) then it is time to run M-x `toc-extract-with-pdf-tocgen'.
+;; If a TOC is extracted successfully, then in the pdftocgen-mode buffer simply
+;; press C-c C-c to add the contents to the PDF. The contents will be added to a
+;; copy of the original PDF with the filename output.pdf and this copy will be
+;; opened in a new buffer. If the pdf-tocgen option does not work well then
+;; continue with the steps below.
+
 ;; If you merely want to extract text without further processing then you can
 ;; use the command `toc-extract-only'.
 

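Note on the OCR language setting mentioned in the README text above: a minimal init-file sketch, not taken from the patch. It assumes that `toc-ocr-languages' holds a single string in the form accepted by tesseract's -l option (language codes joined with "+"). The patch does not state the value format, so check the variable's docstring with C-h v before relying on this.

```elisp
;; Sketch only: the value format below is an assumption, not confirmed by
;; the patch.  "eng+deu" is an example value; tesseract's -l option takes
;; installed language codes joined by "+".
(setq toc-ocr-languages "eng+deu")
```

The language codes installed on a system can be listed with `tesseract --list-langs`.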

