[elpa] externals/doc-toc 7e2e6be947 69/84: Update/improve README
From: ELPA Syncer
Subject: [elpa] externals/doc-toc 7e2e6be947 69/84: Update/improve README
Date: Mon, 26 Sep 2022 13:58:39 -0400 (EDT)
branch: externals/doc-toc
commit 7e2e6be947b4da96cb12c1db833cf6e076ae328d
Author: Daniel Nicolai <dalanicolai@gmail.com>
Commit: Daniel Nicolai <dalanicolai@gmail.com>
Update/improve README
---
README.org | 39 +++++++++++++++++++++++++++++----------
toc-mode.el | 58 ++++++++++++++++++++++++++++++++--------------------------
2 files changed, 61 insertions(+), 36 deletions(-)
diff --git a/README.org b/README.org
index f0932cc197..141ac1aabd 100644
--- a/README.org
+++ b/README.org
@@ -61,14 +61,19 @@ or with two dashes in the mode name (e.g. =M-x toc--cleanup=). Of course if you
use packages like Ivy or Helm you just use the fuzzy search functionality.
** 1. Extraction
-Open some pdf or djvu file in Emacs (pdf-tools and djvu package recommended).
-Find the pagenumbers for the TOC. Then type =M-x toc-extract-pages=, or =M-x
-toc-extract-pages-ocr= if doc has no text layer or text layer is bad, and answer
-the subsequent prompts by entering the pagenumbers for the first and the last
-page each followed by =RET=. *For PDF extraction with OCR, currently it is required*
-*to view all contents pages once before extraction* (toc-mode uses the cached file
-data). Also the languages used for tesseract OCR can be customized via the
-`toc-ocr-languages' variable.
+For PDFs without TOC pages, with a very complicated TOC (i.e. that
+require much cleanup work) or with headlines well fitted for automatic
+extraction (you will have to decide for yourself by trying it), consider to use
+the [[https://krasjet.com/voice/pdf.tocgen/][pdf.tocgen]] functionality described below.
+
+Otherwise, start with opening some pdf or djvu file in Emacs (pdf-tools and djvu
+package recommended). Find the pagenumbers for the TOC. Then type =M-x
+toc-extract-pages=, or =M-x toc-extract-pages-ocr= if doc has no text layer or text
+layer is bad, and answer the subsequent prompts by entering the pagenumbers for
+the first and the last page each followed by =RET=. *For PDF extraction with OCR,
+currently it is required* *to view all contents pages once before extraction*
+(toc-mode uses the cached file data). Also the languages used for tesseract OCR
+can be customized via the ~toc-ocr-languages~ variable.
[[toc-mode-extract.gif]]
@@ -80,6 +85,20 @@ to extract the text with the [[https://pypi.org/project/document-contents-extrac
more configurable (you are also welcome to hack on and improve that script). For
this the [[https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html][tesseract]] documentation might be useful.
+*** Software-generated PDF's with pdf.tocgen ([[https://krasjet.com/voice/pdf.tocgen/]])
+For 'software-generated' (i.e. PDF's not created from scans) PDF-files it is
+sometimes easier to use ~toc-extract-with-pdf-tocgen~. To use this function
+you first have to provide the font properties for the different headline
+levels. For that select the word in a headline of a certain level and then
+type M-x ~toc-gen-set-level~. This function will ask which level you are
+setting, the highest level should be level 1. After you have set the various
+levels (1,2, etc.) then it is time to run M-x ~toc-extract-with-pdf-tocgen~.
+If a TOC is extracted succesfully, then in the pdftocgen-mode buffer simply
+press C-c C-c to add the contents to the PDF. The contents will be added to a
+copy of the original PDF with the filename output.pdf and this copy will be
+opened in a new buffer. If the pdf-tocgen option does not work well then
+continue with the steps below.
+
If you merely want to extract text without further processing then you can
use the command [[help:toc-extract-only][toc-extract-only]].
@@ -181,8 +200,8 @@ toc-mode (tablist)
* Alternatives
-For TOC extraction: [[https://pypi.org/project/document-contents-extractor/][documents-contents-extractor]]
-For adding TOC to document (pdf and djvu): [[http://handyoutlinerfo.sourceforge.net/][HandyOutliner]]
+- For TOC extraction: [[https://pypi.org/project/document-contents-extractor/][documents-contents-extractor]]
+- For adding TOC to document (pdf and djvu): [[http://handyoutlinerfo.sourceforge.net/][HandyOutliner]]
*** Donate
diff --git a/toc-mode.el b/toc-mode.el
index b3c45968e9..08adc9b4f7 100644
--- a/toc-mode.el
+++ b/toc-mode.el
@@ -44,18 +44,6 @@
;; Usage:
-;; For 'software-generated' (i.e. PDF's not created from scans) PDF-files it is
-;; recommend to use `toc-extract-with-pdf-tocgen'. To use this function you
-;; first have to provide the font properties for the different headline levels.
-;; For that select the word in a headline of a certain level and then type M-x
-;; `toc-gen-set-level'. This function will ask which level you are setting, the
-;; highest level should be level 1. After you have set the various levels (1,2,
-;; etc.) then it is time to run M-x `toc-extract-with-pdf-tocgen'. If a TOC is
-;; extracted succesfully, then in the pdftocgen-mode buffer simply press C-c C-c
-;; to add the contents to the PDF. The contents will be added to a copy of the
-;; original PDF with the filename output.pdf and this copy will be opened in a
-;; new buffer. If the pdf-tocgen option does not work well then continue with
-;; the steps below.
;; In each step below, check out available shortcuts using C-h m. Additionally
;; you can find available functions by typing the M-x mode-name (e.g. M-x
@@ -69,20 +57,24 @@
;; 3 adjust/correct pagenumbers
;; 4 add TOC to document
-;; 1. Extraction Open some pdf or djvu file in Emacs (pdf-tools and djvu package
-;; recommended). Find the pagenumbers for the TOC. Then type M-x
-;; `toc-extract-pages', or M-x `toc-extract-pages-ocr' if doc has no text layer
-;; or text layer is bad, and answer the subsequent prompts by entering the
-;; pagenumbers for the first and the last page each followed by RET. For PDF
-;; extraction with OCR, currently it is required to view all contents pages once
-;; before extraction (toc-mode uses the cached file data). Also the languages
-;; used for tesseract OCR can be customized via the `toc-ocr-languages'
-;; variable. A buffer with the, somewhat cleaned up, extracted text will open in
-;; TOC-cleanup mode. Prefix command with the universal argument (C-u) to omit
-;; clean and get the raw text. If the extracted text is of too low quality you
-;; either can hack/extend the `toc-extract-pages-ocr' definition, or
-;; alternatively you can try to extract the text with the python
-;; document-contents-extractor script (see URL
+;; 1. Extraction For PDFs without TOC pages, with a very complicated TOC (i.e.
+;; that require much cleanup work) or with headlines well fitted for automatic
+;; extraction (you will have to decide for yourself by trying it) consider to
+;; use the pdf.tocgen (URL `https://krasjet.com/voice/pdf.tocgen/')
+;; functionality described below. Otherwise, start with opening some pdf or djvu
+;; file in Emacs (pdf-tools and djvu package recommended). Find the pagenumbers
+;; for the TOC. Then type M-x `toc-extract-pages', or M-x
+;; `toc-extract-pages-ocr' if doc has no text layer or text layer is bad, and
+;; answer the subsequent prompts by entering the pagenumbers for the first and
+;; the last page each followed by RET. For PDF extraction with OCR, currently it
+;; is required to view all contents pages once before extraction (toc-mode uses
+;; the cached file data). Also the languages used for tesseract OCR can be
+;; customized via the `toc-ocr-languages' variable. A buffer with the, somewhat
+;; cleaned up, extracted text will open in TOC-cleanup mode. Prefix command with
+;; the universal argument (C-u) to omit clean and get the raw text. If the
+;; extracted text is of too low quality you either can hack/extend the
+;; `toc-extract-pages-ocr' definition, or alternatively you can try to extract
+;; the text with the python document-contents-extractor script (see URL
;; `https://pypi.org/project/document-contents-extractor/'), which is more
;; configurable (you are also welcome to hack and improve that script).
@@ -90,6 +82,20 @@
;; `https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html' might be
;; useful.
+;; Software-generated PDF's with pdf.tocgen
+;; For 'software-generated' (i.e. PDF's not created from scans) PDF-files it is
+;; sometimes easier to use `toc-extract-with-pdf-tocgen'. To use this function
+;; you first have to provide the font properties for the different headline
+;; levels. For that select the word in a headline of a certain level and then
+;; type M-x `toc-gen-set-level'. This function will ask which level you are
+;; setting, the highest level should be level 1. After you have set the various
+;; levels (1,2, etc.) then it is time to run M-x `toc-extract-with-pdf-tocgen'.
+;; If a TOC is extracted succesfully, then in the pdftocgen-mode buffer simply
+;; press C-c C-c to add the contents to the PDF. The contents will be added to a
+;; copy of the original PDF with the filename output.pdf and this copy will be
+;; opened in a new buffer. If the pdf-tocgen option does not work well then
+;; continue with the steps below.
+
;; If you merely want to extract text without further processing then you can
;; use the command `toc-extract-only'.
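
The documentation text in this patch says the languages used for tesseract OCR can be customized via `toc-ocr-languages'. A minimal sketch of what that might look like in an init file follows. The value format is not stated in the patch; the snippet assumes the variable takes a language string of the kind passed to tesseract's -l option, so check the variable's docstring (C-h v toc-ocr-languages) before relying on it.

```emacs-lisp
;; Hypothetical init-file snippet; the value format is an assumption.
;; tesseract itself accepts language codes joined with "+", e.g. "eng+deu".
(setq toc-ocr-languages "eng+deu")
```

If that assumption holds, a later M-x toc-extract-pages-ocr would be expected to run OCR with both languages, provided the corresponding tesseract language data is installed.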