DOC: Post-processing page (#2052)

Closes #2046
py-pdf · Aug 2, 2023 · c04a6bb
1 parent edb38a3
Showing 3 changed files with 117 additions and 2 deletions.
diff --git a/docs/index.rst b/docs/index.rst
@@ -24,6 +24,7 @@ You can contribute to `pypdf on GitHub <https://github.com/py-pdf/pypdf>`_.
  user/suppress-warnings
  user/metadata
  user/extract-text
+ user/post-processing-in-text-extraction
  user/extract-images
  user/extract-attachments
  user/encryption-decryption

diff --git a/docs/user/post-processing-in-text-extraction.md b/docs/user/post-processing-in-text-extraction.md
@@ -0,0 +1,113 @@
+# Post-Processing in Text Extraction
+
+Post-processing can noticeably improve the results of text extraction.
+It is, however, a natural language processing (NLP) task and outside the
+scope of pypdf itself, so the library does not support it directly.
+
+This page lists a few examples of what can be done, as well as a community
+recipe that can be used as a best-practice, general-purpose post-processing
+step. If you know more about the specific domain of your documents, e.g. the
+language, it is likely that you can find custom solutions that work better in
+your context.
+
+## Ligature Replacement
+
+```python
+def replace_ligatures(text: str) -> str:
+    # Map each ligature code point to its plain-letter spelling
+    ligatures = {
+        "ﬀ": "ff",
+        "ﬁ": "fi",
+        "ﬂ": "fl",
+        "ﬃ": "ffi",
+        "ﬄ": "ffl",
+        "ﬅ": "st",  # U+FB05 is the long-s + t ligature, not "ft"
+        "ﬆ": "st",
+        # "Ꜳ": "AA",
+        # "Æ": "AE",
+        "ꜳ": "aa",
+    }
+    for search, replace in ligatures.items():
+        text = text.replace(search, replace)
+    return text
+```
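
As a possible alternative to a hand-written table, the standard library's `unicodedata.normalize` with the `NFKC` form expands the compatibility ligatures in the range U+FB00 to U+FB06. The sketch below is not part of the commit; the function name `expand_ligatures_nfkc` is made up for illustration. It normalizes only characters in that range, because applying NFKC to the whole text would also rewrite unrelated characters such as fractions or superscripts. It does not cover "ꜳ", which lies outside that range.

```python
import unicodedata


def expand_ligatures_nfkc(text: str) -> str:
    """Expand the Latin ligatures U+FB00..U+FB06 and leave everything else alone."""
    return "".join(
        # NFKC replaces a compatibility ligature with its plain letters
        unicodedata.normalize("NFKC", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


print(expand_ligatures_nfkc("e\ufb03cient \ufb01le"))  # efficient file
```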
+
+## De-Hyphenation
+
+Typeset text often splits a word across two lines with a hyphen so that the
+page looks more even. After extraction, the two halves have to be joined again.
+
+```python
+from typing import List
+
+
+def remove_hyphens(text: str) -> str:
+    """
+    Join words that were split with a hyphen at the end of a line.
+
+    This fails for:
+    * Natural dashes: well-known, self-replication, use-cases, non-semantic,
+      Post-processing, Window-wise, viewpoint-dependent
+    * Trailing math operators: 2 - 4
+    * Names: Lopez-Ferreras, VGG-19, CIFAR-100
+    """
+    lines = [line.rstrip() for line in text.split("\n")]
+
+    # Find dashes
+    line_numbers = []
+    for line_no, line in enumerate(lines[:-1]):
+        if line.endswith("-"):
+            line_numbers.append(line_no)
+
+    # Replace
+    for line_no in line_numbers:
+        lines = dehyphenate(lines, line_no)
+
+    return "\n".join(lines)
+
+
+def dehyphenate(lines: List[str], line_no: int) -> List[str]:
+    next_line = lines[line_no + 1]
+    word_suffix = next_line.split(" ")[0]
+
+    # Drop the trailing hyphen and pull the rest of the word up
+    lines[line_no] = lines[line_no][:-1] + word_suffix
+    # Strip the space that separated the moved fragment from the rest of the line
+    lines[line_no + 1] = next_line[len(word_suffix) :].lstrip()
+    return lines
+```
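
Some of the failure cases listed in the docstring can be reduced with two cheap checks. The following is a minimal sketch and not part of the commit; `join_hyphenated_lines` is a hypothetical name. It assumes that a real line-break hyphen directly follows a letter and that the continuation starts with a lowercase letter. This skips trailing math operators such as `2 -` and capitalized names such as `Lopez-` / `Ferreras`, but it still merges natural dashes like `well-` / `known` incorrectly.

```python
from typing import List


def join_hyphenated_lines(text: str) -> str:
    out: List[str] = []
    for line in (raw.rstrip() for raw in text.split("\n")):
        prev = out[-1] if out else ""
        mergeable = (
            len(prev) >= 2
            and prev.endswith("-")
            and prev[-2].isalpha()  # skips "2 -"
            and line[:1].islower()  # skips "Ferreras"
        )
        if mergeable:
            head, _, rest = line.partition(" ")
            out[-1] = prev[:-1] + head  # drop the hyphen, pull the fragment up
            if rest.strip():
                out.append(rest.lstrip())
        else:
            out.append(line)
    return "\n".join(out)


sample = "post-pro-\ncessing is use-\nful\n2 -\n4 and Lopez-\nFerreras"
print(join_hyphenated_lines(sample))
```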
+
+## Header/Footer Removal
+
+The following header/footer removal has several drawbacks:
+
+* False-positives, e.g. for the first page when there is a date like 2021.
+* False-negatives in many cases:
+  * Dynamic part, e.g. page label is in the header
+  * Even/odd pages have different headers
+  * Some pages, e.g. the first one or chapter pages, don't have a header
+
+```python
+from typing import List
+
+
+def remove_footer(extracted_texts: List[str], page_labels: List[str]) -> List[str]:
+    def remove_page_labels(extracted_texts, page_labels):
+        processed = []
+        for text, label in zip(extracted_texts, page_labels):
+            # An empty label matches every string, and text[:-0] would empty the page
+            if label:
+                text_left = text.lstrip()
+                if text_left.startswith(label):
+                    text = text_left[len(label) :]
+
+                text_right = text.rstrip()
+                if text_right.endswith(label):
+                    text = text_right[: -len(label)]
+
+            processed.append(text)
+        return processed
+
+    extracted_texts = remove_page_labels(extracted_texts, page_labels)
+    return extracted_texts
+```
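
A different heuristic, sketched here as an assumption and not as part of the commit, is to treat a first line as a header when it repeats verbatim on most pages. The name `remove_repeated_first_lines` and the `min_share` threshold are hypothetical. This variant shares several of the drawbacks listed above: it misses headers with a dynamic part and headers that alternate between even and odd pages.

```python
from collections import Counter
from typing import List


def remove_repeated_first_lines(pages: List[str], min_share: float = 0.6) -> List[str]:
    # First non-blank line of every page, stripped for comparison
    first_lines = [page.lstrip().split("\n", 1)[0].strip() for page in pages]
    counts = Counter(line for line in first_lines if line)

    # A line counts as a header if it opens at least min_share of all pages
    threshold = min_share * len(pages)
    headers = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page, first in zip(pages, first_lines):
        if first in headers:
            # Keep everything after the first line break
            _, _, rest = page.lstrip().partition("\n")
            cleaned.append(rest)
        else:
            cleaned.append(page)
    return cleaned


pages = ["My Report\nIntro text", "My Report\nMore text", "Appendix\nTables"]
print(remove_repeated_first_lines(pages))
```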
+
+## Other Ideas
+
+* Whitespace between units: There should be a space between a number and its unit
+  ([source](https://tex.stackexchange.com/questions/20962/should-i-put-a-space-between-a-number-and-its-unit)).
+  That means: 42 ms, 42 GHz, 42 GB.
+* Percent: English style guides prescribe writing the percent sign directly after the number, without a space (e.g. 50%).
+* Whitespace before dots: Should typically be removed.
+* Whitespace after dots: Should typically be added.
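
The whitespace rules in this list could be approximated with a few regular expressions. This is a rough sketch and not part of the commit: the function name `normalize_spacing` and the three-entry unit list are illustrative assumptions, and the dot rules will misfire on inputs such as file names (`see .gitignore`) or abbreviations.

```python
import re


def normalize_spacing(text: str) -> str:
    # Percent: no space between the number and the sign ("50 %" -> "50%")
    text = re.sub(r"(\d)\s+%", r"\1%", text)
    # Units: one space between number and unit (tiny illustrative unit list)
    text = re.sub(r"(\d)(ms|GHz|GB)\b", r"\1 \2", text)
    # Remove spaces or tabs before a dot, unless a digit follows (" .5")
    text = re.sub(r"[ \t]+\.(?!\d)", ".", text)
    # Add a space after a dot that sits between a lowercase and an uppercase letter
    text = re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)
    return text


print(normalize_spacing("The run took 42ms on 3GHz hardware .It used 50 % of 8GB."))
# The run took 42 ms on 3 GHz hardware. It used 50% of 8 GB.
```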
diff --git a/pypdf/_page.py b/pypdf/_page.py
@@ -557,13 +557,14 @@ def images(self) -> List[ImageFile]:
  Examples:
  reader.pages[0].images[0] # return first image
  reader.pages[0].images['/I0'] # return image '/I0'
- reader.pages[0].images['/TP1','/Image1'] # return image '/Image1'
-  within '/TP1' Xobject/Form
+ # return image '/Image1' within '/TP1' Xobject/Form:
+ reader.pages[0].images['/TP1','/Image1']
  for img in reader.pages[0].images: # loop within all objects
 
  images.keys() and images.items() can be used.
 
  The ImageFile has the following properties:
+
  `.name` : name of the object
  `.data` : bytes of the object
  `.image` : PIL Image Object