Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix typos in converting_pdf_to_text.rst #611

Merged
merged 4 commits into from
Aug 31, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 8 additions & 8 deletions docs/source/topic/converting_pdf_to_text.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ the characters and their placement.
This makes extracting meaningful pieces of text from PDF files difficult.
The characters that compose a paragraph are no different from those that
compose the table, the page footer or the description of a figure. Unlike
other documents formats, like a `.txt` file or a word document, the PDF format
other document formats, like a `.txt` file or a word document, the PDF format
does not contain a stream of text.

A PDF document does consists of a collection of objects that together describe
Expand All @@ -29,10 +29,10 @@ PDFMiner attempts to reconstruct some of those structures by using heuristics
on the positioning of characters. This works well for sentences and
paragraphs because meaningful groups of nearby characters can be made.

The layout analysis consist of three different stages: it groups characters
The layout analysis consists of three different stages: it groups characters
into words and lines, then it groups lines into boxes and finally it groups
textboxes hierarchically. These stages are discussed in the following
sections. The resulting output of the layout analysis is an ordered hierarchy
sections. The resulting output of the layout analysis is an ordered hierarchy
of layout objects on a PDF page.

.. figure:: ../_static/layout_analysis_output.png
Expand All @@ -48,8 +48,8 @@ Grouping characters into words and lines

The first step in going from characters to text is to group characters in a
meaningful way. Each character has an x-coordinate and a y-coordinate for its
bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
.six uses these bounding boxes to decide which characters belong together.
bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer.six
uses these bounding boxes to decide which characters belong together.

Characters that are both horizontally and vertically close are grouped onto
one line. How close they should be is determined by the `char_margin`
Expand All @@ -74,7 +74,7 @@ relative to the maximum width or height of the new character. Having a smaller
least be smaller than the `char_margin` otherwise none of the characters will
be separated by a space.

The result of this stage is a list of lines. Each line consists a list of
The result of this stage is a list of lines. Each line consists of a list of
characters. These characters are either original `LTChar` characters that
originate from the PDF file, or inserted `LTAnno` characters that
represent spaces between words or newlines at the end of each line.
Expand All @@ -91,14 +91,14 @@ Lines that are both horizontally overlapping and vertically close are grouped.
How vertically close the lines should be is determined by the `line_margin`.
This margin is specified relative to the height of the bounding box. Lines
are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
(see L :sub:`2`) in the figure) of the bounding boxes are closer together
(see L :sub:`2`) in the figure) of the bounding boxes is closer together
than the absolute line margin, i.e. the `line_margin` multiplied by the
height of the bounding box.

.. raw:: html
:file: ../_static/layout_analysis_group_lines.html

The result of this stage is a list of text boxes. Each box consist of a list
The result of this stage is a list of text boxes. Each box consists of a list
of lines.

Grouping textboxes hierarchically
Expand Down