Fix typos in converting_pdf_to_text.rst (pdfminer#611)
* Fix typos in converting_pdf_to_text.rst

* The word "pdfminer.six" as a whole should not be separated by newline, otherwise they are treated as two separated words by renderer, and incorrectly displayed as separated.

* Trim redundant spaces

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
MapleCCC and pietermarsman authored Aug 31, 2021
1 parent 46fa214 commit 8ea9f10
Showing 1 changed file with 8 additions and 8 deletions.
16 changes: 8 additions & 8 deletions docs/source/topic/converting_pdf_to_text.rst
@@ -11,7 +11,7 @@ the characters and their placement.
This makes extracting meaningful pieces of text from PDF files difficult.
The characters that compose a paragraph are no different from those that
compose the table, the page footer or the description of a figure. Unlike
-other documents formats, like a `.txt` file or a word document, the PDF format
+other document formats, like a `.txt` file or a word document, the PDF format
does not contain a stream of text.

A PDF document does consists of a collection of objects that together describe
@@ -29,10 +29,10 @@ PDFMiner attempts to reconstruct some of those structures by using heuristics
on the positioning of characters. This works well for sentences and
paragraphs because meaningful groups of nearby characters can be made.

-The layout analysis consist of three different stages: it groups characters
+The layout analysis consists of three different stages: it groups characters
into words and lines, then it groups lines into boxes and finally it groups
textboxes hierarchically. These stages are discussed in the following
-sections. The resulting output of the layout analysis is an ordered hierarchy
+sections. The resulting output of the layout analysis is an ordered hierarchy
of layout objects on a PDF page.

.. figure:: ../_static/layout_analysis_output.png
@@ -48,8 +48,8 @@ Grouping characters into words and lines

The first step in going from characters to text is to group characters in a
meaningful way. Each character has an x-coordinate and a y-coordinate for its
-bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
-.six uses these bounding boxes to decide which characters belong together.
+bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer.six
+uses these bounding boxes to decide which characters belong together.

Characters that are both horizontally and vertically close are grouped onto
one line. How close they should be is determined by the `char_margin`
@@ -74,7 +74,7 @@ relative to the maximum width or height of the new character. Having a smaller
least be smaller than the `char_margin` otherwise none of the characters will
be separated by a space.

-The result of this stage is a list of lines. Each line consists a list of
+The result of this stage is a list of lines. Each line consists of a list of
characters. These characters are either original `LTChar` characters that
originate from the PDF file, or inserted `LTAnno` characters that
represent spaces between words or newlines at the end of each line.
@@ -91,14 +91,14 @@ Lines that are both horizontally overlapping and vertically close are grouped.
How vertically close the lines should be is determined by the `line_margin`.
This margin is specified relative to the height of the bounding box. Lines
are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
-(see L :sub:`2`) in the figure) of the bounding boxes are closer together
+(see L :sub:`2`) in the figure) of the bounding boxes is closer together
than the absolute line margin, i.e. the `line_margin` multiplied by the
height of the bounding box.

.. raw:: html
:file: ../_static/layout_analysis_group_lines.html

-The result of this stage is a list of text boxes. Each box consist of a list
+The result of this stage is a list of text boxes. Each box consists of a list
of lines.
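
The grouping rule described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the pdfminer.six
implementation: the ``Line`` class and the helper functions are hypothetical
names, "vertically close" is read here as the empty vertical space between
two bounding boxes being smaller than ``line_margin`` times the height of the
first line, and the value ``0.5`` is just an example margin.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical bounding box of a text line (not a pdfminer.six class)."""

    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def horizontally_overlapping(a: Line, b: Line) -> bool:
    """True if the x-ranges of the two boxes share more than a single point."""
    return a.x0 < b.x1 and b.x0 < a.x1


def vertical_gap(a: Line, b: Line) -> float:
    """Empty vertical space between two boxes; 0 if they overlap vertically."""
    return max(0.0, max(a.y0, b.y0) - min(a.y1, b.y1))


def should_group(a: Line, b: Line, line_margin: float = 0.5) -> bool:
    """Assumed reading of the rule: overlap in x and a small enough gap in y."""
    # The absolute margin is the relative line_margin times the line height.
    absolute_margin = line_margin * a.height
    return horizontally_overlapping(a, b) and vertical_gap(a, b) < absolute_margin


first = Line(x0=0, y0=100, x1=50, y1=110)   # height 10, absolute margin 5
second = Line(x0=0, y0=86, x1=40, y1=96)    # 4 units below `first`
far_below = Line(x0=0, y0=60, x1=50, y1=70)  # 30 units below `first`

print(should_group(first, second))
print(should_group(first, far_below))
```

With these example boxes, ``second`` lies within the absolute margin of
``first`` while ``far_below`` does not, so only the first pair would end up
in the same text box under this reading.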

Grouping textboxes hierarchically