From b108484e29eb16e4e117f36338f132e186b0893c Mon Sep 17 00:00:00 2001
From: j-t-1 <120829237+j-t-1@users.noreply.github.com>
Date: Tue, 26 Mar 2024 07:19:56 +0000
Subject: [PATCH 1/3] DOC: Minor improvements

---
 docs/dev/pdf-format.md | 16 ++++-----
 docs/dev/pypdf-parsing.md | 8 ++---
 .../post-processing-in-text-extraction.md | 34 +++++++++----------
 docs/user/streaming-data.md | 2 +-
 4 files changed, 29 insertions(+), 31 deletions(-)

diff --git a/docs/dev/pdf-format.md b/docs/dev/pdf-format.md
index 2585dba2f..9af1d4625 100644
--- a/docs/dev/pdf-format.md
+++ b/docs/dev/pdf-format.md
@@ -1,6 +1,6 @@
 # The PDF Format

-It's recommended to look in the PDF specification for details and clarifications.
+It is recommended to look in the PDF specification for details and clarifications.
 This is only intended to give a very rough overview of the format.

 ## Overall Structure
@@ -32,7 +32,7 @@ Let's go through it step-by-step:
 * `xref` is just a keyword that specifies the start of the xref table.
 * `42` is the numerical ID of the first object in this xref section;
   `5` is the number of entries in the xref table.
-* Now every object has 3 entries `nnnnnnnnnn ggggg n`: The 10-digit byte offset,
+* Now every object has 3 entries `nnnnnnnnnn ggggg n`: a 10-digit byte offset,
   a 5-digit generation number, and a literal keyword which is either `n` or `f`.
   * `nnnnnnnnnn` is the byte offset of the object. It tells the reader where
     the object is in the file.
@@ -49,10 +49,10 @@ Let's go through it step-by-step:

 The body is a sequence of indirect objects:

-`counter generationnumber << the_object >> endobj`
+`counter generation_number << the_object >> endobj`

 * `counter` (integer) is a unique identifier for the object.
-* `generationnumber` (integer) is the generation number of the object.
+* `generation_number` (integer) is the generation number of the object.
 * `the_object` is the object itself. It can be empty. Starts with `/Keyword` to
   specify which kind of object it is.
 * `endobj` marks the end of the object.
@@ -91,11 +91,11 @@ Let's go through it:
 * `%%EOF` is the end-of-file marker.

 The trailer dictionary is a key-value list. The keys are specified in
-Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).
+Table 15 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).

 * `/Root` (dictionary) contains the document catalog.
-  * The `5` is the object number of the catalog dictionary
-  * `0` is the generation number of the catalog dictionary
+  * The `5` is the object number of the catalog dictionary.
+  * `0` is the generation number of the catalog dictionary.
   * `R` is the keyword that indicates that the object is a reference to the
     catalog dictionary.
 * `/Size` (integer) contains the total number of entries in the files xref table.
@@ -110,4 +110,4 @@ pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
 ```

 Then rename `crazyones-uncomp.pdf` to `crazyones-uncomp.txt` and open it in
-our favorite IDE / text editor.
+your favorite IDE / text editor.
diff --git a/docs/dev/pypdf-parsing.md b/docs/dev/pypdf-parsing.md
index 844aab789..7fa2c6f95 100644
--- a/docs/dev/pypdf-parsing.md
+++ b/docs/dev/pypdf-parsing.md
@@ -13,14 +13,14 @@ structure of parsing:
    proceeds to parse the objects in the PDF. Objects in a PDF can be of various
    types such as dictionaries, arrays, streams, and simple data types (e.g.,
    integers, strings). pypdf parses these objects and stores them in
-   {py:meth}`PdfReader.resolved_objects `
-   via {py:meth}`cache_indirect_object `.
+   {py:meth}`PdfReader.resolved_objects `,
+   populated by {py:meth}`cache_indirect_object `.
 3. **Decoding content streams**: The content of a PDF is typically stored in
    content streams, which are sequences of PDF operators and operands. pypdf
    decodes these content streams by applying filters (e.g., `FlateDecode`,
    `LZWDecode`) specified in the stream's dictionary.
    This is only done when the
-   object is requested via {py:meth}`PdfReader.get_object
-   ` in the `PdfReader._get_object_from_stream` method.
+   object is requested by {py:meth}`PdfReader.get_object
+   ` which uses the `PdfReader._get_object_from_stream` method.

 ## References

diff --git a/docs/user/post-processing-in-text-extraction.md b/docs/user/post-processing-in-text-extraction.md
index 464181823..614b6cc31 100644
--- a/docs/user/post-processing-in-text-extraction.md
+++ b/docs/user/post-processing-in-text-extraction.md
@@ -1,15 +1,13 @@
-# Post-Processing in Text Extraction
+# Post-Processing of Text Extraction

-Post-processing can recognizably improve the results of text extraction.
-It is, however, outside of the scope of pypdf itself. Hence the library will
-not give any direct support for it. It is a natural language processing (NLP)
-task.
+Post-processing can recognizably improve the results of text extraction. It is,
+however, outside of the scope of pypdf itself. Hence the library will not give
+any direct support for it. It is a natural language processing (NLP) task.

-This page lists a few examples what can be done as well as a community
-recipie that can be used as a best-practice general purpose post processing
-step. If you know more about the specific domain of your documents, e.g. the
-language, it is likely that you can find custom solutions that work better in
-your context
+This page lists a few examples what can be done as well as a community recipe
+that can be used as a general purpose post-processing step. If you know more
+about the specific domain of your documents, e.g. the language, it is likely
+that you can find custom solutions that work better in your context.

 ## Ligature Replacement

@@ -32,7 +30,7 @@ def replace_ligatures(text: str) -> str:
     return text
 ```

-## De-Hyphenation
+## Dehyphenation

 Hyphens are used to break words up so that the appearance of the page is nicer.
@@ -77,11 +75,11 @@ def dehyphenate(lines: List[str], line_no: int) -> List[str]:

 The following header/footer removal has several drawbacks:

-* False-positives, e.g. for the first page when there is a date like 2021.
+* False-positives, e.g. for the first page when there is a date like 2024.
 * False-negatives in many cases:
-  * Dynamic part, e.g. page label is in the header
-  * Even/odd pages have different headers
-  * Some pages, e.g. the first one or chapter pages, don't have a header
+  * Dynamic part, e.g. page label is in the header.
+  * Even/odd pages have different headers.
+  * Some pages, e.g. the first one or chapter pages, do not have a header.

 ```python
 def remove_footer(extracted_texts: list[str], page_labels: list[str]):
@@ -105,9 +103,9 @@ def remove_footer(extracted_texts: list[str], page_labels: list[str]):

 ## Other ideas

-* Whitespaces between Units: Between a number and it's unit should be a space
+* Whitespaces in units: Between a number and its unit should be a space.
   ([source](https://tex.stackexchange.com/questions/20962/should-i-put-a-space-between-a-number-and-its-unit)).
   That means: 42 ms, 42 GHz, 42 GB.
 * Percent: English style guides prescribe writing the percent sign following the number without any space between (e.g. 50%).
-* Whitespaces before dots: Should typically be removed
-* Whitespaces after dots: Should typically be added
+* Whitespaces before dots: Should typically be removed.
+* Whitespaces after dots: Should typically be added.
diff --git a/docs/user/streaming-data.md b/docs/user/streaming-data.md
index 9a0032f08..3b044ddec 100644
--- a/docs/user/streaming-data.md
+++ b/docs/user/streaming-data.md
@@ -73,4 +73,4 @@ obj = s3.get_object(Body=csv_buffer.getvalue(), Bucket="my-bucket", Key="my/doc.
 reader = PdfReader(BytesIO(obj["Body"].read()))
 ```

-It works similarly for Google Cloud Storage ([example](https://stackoverflow.com/a/68403628/562769))
+It works similarly for Google Cloud Storage ([example](https://stackoverflow.com/a/68403628/562769)).

From 3368cd5ac48750150d3fabc39bed78f2928142c7 Mon Sep 17 00:00:00 2001
From: j-t-1 <120829237+j-t-1@users.noreply.github.com>
Date: Tue, 26 Mar 2024 07:56:32 +0000
Subject: [PATCH 2/3] DOC: Minor improvements

---
 docs/user/extract-text.md | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/docs/user/extract-text.md b/docs/user/extract-text.md
index 573824d2b..16803c0fe 100644
--- a/docs/user/extract-text.md
+++ b/docs/user/extract-text.md
@@ -1,6 +1,6 @@
 # Extract Text from a PDF

-You can extract text from a PDF like this:
+You can extract text from a PDF:

 ```python
 from pypdf import PdfReader
@@ -10,7 +10,7 @@ page = reader.pages[0]
 print(page.extract_text())
 ```

-You can also choose to limit the text orientation you want to extract, e.g:
+You can also choose to limit the text orientation you want to extract:

 ```python
 # extract only text oriented up
@@ -42,7 +42,7 @@ Refer to [extract\_text](../modules/PageObject.html#pypdf._page.PageObject.extra

 ## Using a visitor

-You can use visitor-functions to control which part of a page you want to process and extract. The visitor-functions you provide will get called for each operator or for each text fragment.
+You can use visitor functions to control which part of a page you want to process and extract. The visitor functions you provide will get called for each operator or for each text fragment.
 The function provided in argument visitor_text of function extract_text has five arguments:
 * text: the current text (as long as possible, can be up to a full line)
@@ -51,19 +51,19 @@ The function provided in argument visitor_text of function extract_text has five
 * font-dictionary: full font dictionary
 * font-size: the size (in text coordinate space)

-The matrix stores 6 parameters. The first 4 provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical)
+The matrix stores six parameters. The first four provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical).
 It is recommended to use the user_matrix as it takes into all transformations.

 Notes :

- - as indicated in the PDF 1.7 reference, page 204 the user matrix applies to text space/image space/form space/pattern space.
- - if you want to get the full transformation from text to user space, you can use the `mult` function (availalbe in global import) as follows:
-`txt2user = mult(tm, cm))`
-The font-size is the raw text size, that is affected by the `user_matrix`
+ - As indicated in §8.3.3 of the PDF 1.7 or PDF 2.0 specification, the user matrix applies to text space/image space/form space/pattern space.
+ - If you want to get the full transformation from text to user space, you can use the `mult` function (availalbe in global import) as follows:
+`txt2user = mult(tm, cm))`.
+The font-size is the raw text size, that is affected by the `user_matrix`.


 The font-dictionary may be None in case of unknown fonts.
-If not None it may e.g. contain key "/BaseFont" with value "/Arial,Bold".
+If not None it could contain something like key "/BaseFont" with value "/Arial,Bold".

 **Caveat**: In complicated documents the calculated positions may be difficult to
 (if you move from multiple forms to page user space for example).
@@ -72,7 +72,7 @@ operator, operand-arguments, current transformation matrix and text matrix.
 ### Example 1: Ignore header and footer

-The following example reads the text of page 4 of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores header (y < 720) and footer (y > 50).
+The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y < 720) and footer (y > 50).

 ```python
 from pypdf import PdfReader
@@ -97,10 +97,10 @@ print(text_body)

 ### Example 2: Extract rectangles and texts into a SVG-file

-The following example converts page 3 of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into a
+The following example converts page three of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into a
 [SVG file](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics).

-Such a SVG export may help to understand whats going on in a page.
+Such a SVG export may help to understand what is going on in a page.

 ```python
 from pypdf import PdfReader
@@ -131,13 +131,13 @@ dwg.save()

 The SVG generated here is bottom-up because the coordinate systems of PDF and SVG differ.

-Unfortunately in complicated PDF documents the coordinates given to the visitor-functions may be wrong.
+Unfortunately in complicated PDF documents the coordinates given to the visitor functions may be wrong.

 ## Why Text Extraction is hard

 ### Unclear Objective

-Extracting text from a PDF can be pretty tricky. In several cases there is no
+Extracting text from a PDF can be tricky. In several cases there is no
 clear answer what the expected result should look like:

 1. **Paragraphs**: Should the text of a paragraph have line breaks at the same places
@@ -191,7 +191,7 @@ printing. It was not created for parsing the content.

 PDF files don't contain a semantic layer.
 Specifically, there is no information what the header, footer, page numbers,
-tables, and paragraphs are. The visual appearence is there and people might
+tables, and paragraphs are. The visual appearance is there and people might
 find heuristics to make educated guesses, but there is no way of being certain.

 This is a shortcoming of the PDF file format, not of pypdf.

From 139840b55b931eab17a587c2499fa123c27300b2 Mon Sep 17 00:00:00 2001
From: Stefan <96178532+stefan6419846@users.noreply.github.com>
Date: Tue, 26 Mar 2024 13:37:27 +0100
Subject: [PATCH 3/3] Apply suggestions from code review

---
 docs/user/extract-text.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/user/extract-text.md b/docs/user/extract-text.md
index 16803c0fe..21ddf73ed 100644
--- a/docs/user/extract-text.md
+++ b/docs/user/extract-text.md
@@ -57,9 +57,9 @@ It is recommended to use the user_matrix as it takes into all transformations.
 Notes :

  - As indicated in §8.3.3 of the PDF 1.7 or PDF 2.0 specification, the user matrix applies to text space/image space/form space/pattern space.
- - If you want to get the full transformation from text to user space, you can use the `mult` function (availalbe in global import) as follows:
+ - If you want to get the full transformation from text to user space, you can use the `mult` function (available in global import) as follows:
 `txt2user = mult(tm, cm))`.
-The font-size is the raw text size, that is affected by the `user_matrix`.
+The font size is the raw text size and affected by the `user_matrix`.


 The font-dictionary may be None in case of unknown fonts.
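
The note in `docs/user/extract-text.md` about combining the text matrix and the current transformation matrix with `mult(tm, cm)` can be illustrated with a small stand-alone sketch. It assumes the six-parameter layout described in the hunk (four rotation/scaling values followed by two translation values) and the row-vector convention of the PDF specification. The helper names `compose` and `apply` are hypothetical and written only for this illustration; this is not pypdf's own implementation of `mult`.

```python
from typing import List, Tuple

# Six parameters [a, b, c, d, e, f]: a..d rotate/scale, e and f translate.
Matrix = List[float]


def compose(first: Matrix, second: Matrix) -> Matrix:
    """Hypothetical helper: matrix that applies `first`, then `second`."""
    a1, b1, c1, d1, e1, f1 = first
    a2, b2, c2, d2, e2, f2 = second
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


def apply(matrix: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Hypothetical helper: transform the point (x, y) with `matrix`."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# Assumed sample values: the text matrix translates by (10, 20),
# the current transformation matrix scales by 2.
tm = [1.0, 0.0, 0.0, 1.0, 10.0, 20.0]
cm = [2.0, 0.0, 0.0, 2.0, 0.0, 0.0]

txt2user = compose(tm, cm)
origin_in_user_space = apply(txt2user, 0.0, 0.0)
```

With these sample matrices, applying `txt2user` to a point gives the same result as applying `tm` first and `cm` second, which is the property the note relies on when it speaks of the full transformation from text to user space.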