ENH: Extract Text Enhancement (whitespaces) #1084

pubpub-zz · 2022-07-09T20:31:15Z

ENH : extract width from CIDFontType0/2
ENH : improve cr/lf and space extraction
BUG : fix error in decoding #1075
FIX: in ToUnicode ignore comments (starting with %)
FIX: extend utf16 for min of 4 characters
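
For readers unfamiliar with the "extract width from CIDFontType0/2" item: CID fonts store glyph widths in a `/W` array whose entries take one of two forms, `c [w1 w2 ... wn]` or `c_first c_last w`. The sketch below expands such an array into a plain dictionary. It is a minimal illustration on ordinary Python lists, not the code from this PR; the function name `parse_cid_widths` is hypothetical.

```python
from typing import Dict, List, Union

WidthArray = List[Union[int, float, List[Union[int, float]]]]


def parse_cid_widths(w: WidthArray) -> Dict[int, float]:
    """Expand a CIDFont /W array into {cid: width}.

    Hypothetical helper for illustration only; it works on plain
    Python lists rather than on PyPDF2 objects.
    """
    widths: Dict[int, float] = {}
    i = 0
    while i < len(w):
        first = w[i]
        if isinstance(w[i + 1], list):
            # Form 1: c [w1 w2 ... wn] gives consecutive CIDs individual widths
            for offset, width in enumerate(w[i + 1]):
                widths[first + offset] = width
            i += 2
        else:
            # Form 2: c_first c_last w gives one width for the whole CID range
            last, width = w[i + 1], w[i + 2]
            for cid in range(first, last + 1):
                widths[cid] = width
            i += 3
    return widths


example = parse_cid_widths([120, [400, 325, 500], 7080, 7082, 1000])
```

A space-width estimate could then look up the CID of the space glyph in the resulting dictionary, falling back to the font's default width when it is absent.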

Improves #234
Improves #957
Closes #1003
Closes #1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing

Two types of fix:
a) ignore comments (starting with %)
b) extend utf16 for min of 4 characters

Test file to be added, checking https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf, pages [0, 5, 8, 9, 12, 13, 14, 15, 17, 18, 20, 23, 24, 26, 27, 29, 30, 35, 38, 50, 52, 60, 61, 72, 87, 88, 90, 93, 94, 96, 99, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 118, 123, 124, 125, 126, 128, 129, 132, 133, 156, 160, 162, 178, 189, 195, 198, 235, 254, 255, 256, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 275, 276, 277, 278, 280, 281, 282, 283, 285, 287, 290, 292, 293, 295, 296, 297, 298, 299, 301, 302, 306, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 326, 327, 329, 330, 331, 332, 333, 336, 338, 339, 340, 341, 342, 344, 345, 346, 367, 368, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 381, 382, 383, 384, 385, 386, 387, 388, 404, 405, 406, 414, 415, 416, 417, 419, 420]
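
The two ToUnicode fixes described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual `_cmap.py` code: the helper names `strip_cmap_comments` and `decode_dst` are hypothetical, and the comment stripping assumes `%` never occurs inside a literal string, which is a simplification.

```python
import re


def strip_cmap_comments(cmap: bytes) -> bytes:
    """Remove everything from '%' up to the end of each line.

    Simplification: assumes '%' never appears inside a literal string.
    """
    return re.sub(rb"%[^\r\n]*", b"", cmap)


def decode_dst(hex_digits: str) -> str:
    """Decode a destination hex string as UTF-16BE.

    Short values are zero-padded to at least 4 hex digits (2 bytes),
    so that "41" is treated like "0041".
    """
    padded = hex_digits.zfill(4)
    return bytes.fromhex(padded).decode("utf-16-be")


cleaned = strip_cmap_comments(b"1 beginbfchar % note\n<01> <0041>\nendbfchar")
short = decode_dst("41")
full = decode_dst("0041")
pair = decode_dst("D835DC00")  # surrogate pair for a character outside the BMP
```

Without the padding, a one-byte value such as `41` cannot be decoded as UTF-16BE at all, because that encoding needs two bytes per code unit.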

PyPDF2/_page.py

codecov · 2022-07-09T21:00:21Z

Codecov Report

Merging #1084 (5715653) into main (af5a0c3) will increase coverage by 0.30%.
The diff coverage is 92.74%.

@@            Coverage Diff             @@
##             main    #1084      +/-   ##
==========================================
+ Coverage   91.64%   91.94%   +0.30%     
==========================================
  Files          24       24              
  Lines        4559     4642      +83     
  Branches      932      957      +25     
==========================================
+ Hits         4178     4268      +90     
+ Misses        241      229      -12     
- Partials      140      145       +5

Impacted Files	Coverage Δ
PyPDF2/_page.py	`92.13% <91.48%> (-0.61%)`	⬇️
PyPDF2/_cmap.py	`93.54% <96.66%> (+9.15%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update af5a0c3...5715653.
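
As a sanity check on the report, the percentages can be reproduced from the raw counts, assuming coverage is computed as hits divided by total lines (hits + misses + partials). The snippet is only an illustrative recomputation; the variable and function names are made up.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of tracked lines."""
    return 100.0 * hits / lines


# Counts copied from the coverage diff table in this report
main = {"hits": 4178, "misses": 241, "partials": 140, "lines": 4559}
pr = {"hits": 4268, "misses": 229, "partials": 145, "lines": 4642}

main_pct = round(coverage_pct(main["hits"], main["lines"]), 2)
pr_pct = round(coverage_pct(pr["hits"], pr["lines"]), 2)
delta = round(pr_pct - main_pct, 2)
```

The counts are also internally consistent: in both columns, hits, misses and partials add up to the reported number of lines.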

MartinThoma · 2022-07-10T12:23:39Z

Running the benchmark, one outlier is much worse; the rest are similar:

    "1601.03642": 98.8% -> 98.9%  +0.1
    "1602.06541": 98.0% -> 97.8%  -0.2
    "1707.09725": 94.0% -> 94.1%  +0.1
    "2201.00021": 96.7% -> 96.5%  -0.2
    "2201.00022": 97.2% -> 97.3%  +0.1
    "2201.00029": 97.6% -> 96.7%  -0.9 ---
    "2201.00037": 93.8% -> 94.0%  +0.2
    "2201.00069": 96.3% -> 86.7%  -9.6 !!!!
    "2201.00151": 93.3% -> 93.8%  +0.5
    "2201.00178": 92.9% -> 93.1%  +0.2
    "2201.00200": 97.4% -> 97.4%  ----
    "2201.00201": 98.2% -> 98.2%  -----
    "2201.00214": 97.2% -> 97.4%  +0.2
  "GeoTopo-book": 86.5% -> 87.1%  +0.6
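
A comparison like the one above can be automated so that regressions stand out. The sketch below uses a few of the rows listed; the -5.0 threshold is an arbitrary, hypothetical cut-off and not part of the benchmark itself.

```python
# (before, after) extraction scores in percent, taken from a few rows above
scores = {
    "1601.03642": (98.8, 98.9),
    "2201.00029": (97.6, 96.7),
    "2201.00069": (96.3, 86.7),
    "2201.00151": (93.3, 93.8),
}

THRESHOLD = -5.0  # hypothetical: flag drops of 5 points or more

# Round to one decimal to avoid floating-point noise in the differences
deltas = {doc: round(after - before, 1) for doc, (before, after) in scores.items()}
outliers = [doc for doc, delta in deltas.items() if delta <= THRESHOLD]
```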

MartinThoma · 2022-07-10T12:25:38Z

Changes where it improved:

MartinThoma · 2022-07-10T12:29:20Z

Where it became so much worse:

MartinThoma · 2022-07-10T12:30:36Z

Hence PyPDF2 now extracts a lot of ␣ characters.

MartinThoma · 2022-07-10T19:49:02Z

@pubpub-zz Should I test again? I'm always excited when I see a PR from you 😄

pubpub-zz · 2022-07-10T19:55:06Z

You can: the latest push should normally fix the "underscores".
However, some commits may have been undone.

MartinThoma · 2022-07-11T05:50:08Z

The latest results:

            "1601.03642": 98.8% -> 98.9% +0.1
            "1602.06541": 98.0% -> 97.8% -0.2
            "1707.09725": 94.0% -> 94.1% +0.1
            "2201.00021": 96.7% -> 96.5% -0.2
            "2201.00022": 97.2% -> 97.3% +0.1
            "2201.00029": 97.6% -> 96.7% -0.9
            "2201.00037": 93.8% -> 94.0% +0.2
            "2201.00069": 96.3% -> 96.4% +0.1
            "2201.00151": 93.3% -> 93.8% +0.5
            "2201.00178": 92.9% -> 93.1% +0.2
            "2201.00200": 97.4% -> 97.4% ---
            "2201.00201": 98.2% -> 98.2% ----
            "2201.00214": 97.2% -> 97.4% +0.2
          "GeoTopo-book": 86.5% -> 87.1% +0.6

MartinThoma · 2022-07-11T05:58:00Z

The worst:

The best: Having a space between text and math content (formulas and variables)

MartinThoma · 2022-07-11T05:58:39Z

@pubpub-zz To me, this looks good 👍 If you think it's ready, I would merge it :-)

pubpub-zz · 2022-07-11T06:08:10Z

Let me check tonight that it covers most/all of the remaining open issues.
Now that the list is limited, it should be quick 😊

MartinThoma · 2022-07-11T06:16:12Z

Please have a look at your first comment - I've edited it to contain what I would use as a squash commit message. Feel free to adjust it :-)

pubpub-zz · 2022-07-11T06:33:31Z

Let me check tonight that it covers most/all of the remaining open issues.
Now that the list is limited, it should be quick 😊

MartinThoma · 2022-07-12T20:26:20Z

Interestingly, the multiplication error didn't make a noticeable difference:

            "1601.03642": 98.8% -> 98.9% +0.1
            "1602.06541": 98.0% -> 97.8% -0.2
            "1707.09725": 94.0% -> 94.1% +0.1
            "2201.00021": 96.7% -> 96.5% -0.2
            "2201.00022": 97.2% -> 97.3% +0.1
            "2201.00029": 97.6% -> 96.7% -0.9
            "2201.00037": 93.8% -> 94.0% +0.2
            "2201.00069": 96.3% -> 96.4% +0.1
            "2201.00151": 93.3% -> 93.8% +0.5
            "2201.00178": 92.9% -> 93.1% +0.2
            "2201.00200": 97.4% -> 97.4% ---
            "2201.00201": 98.2% -> 98.2% ---
            "2201.00214": 97.2% -> 97.4% +0.2
          "GeoTopo-book": 86.5% -> 87.1% +0.6

pubpub-zz · 2022-07-12T20:31:02Z

I don't think I will be able to go further for the moment.
I'm having a look at #1091. Apart from the "Float Object" issue, I'm getting poor results (e.g. on page 1), but I do not understand how this file can be interpreted correctly.
For me, this is good to merge.

MartinThoma · 2022-07-12T21:02:38Z

Nice! Thank you so much! I'll merge tomorrow morning :-)

MartinThoma · 2022-07-13T05:21:31Z

@pubpub-zz Your PR is now part of main and will be in PyPDF2==2.5.1. I will likely release on Sunday the 17.07.2022.

Thank you for improving the text extraction once again 🤗

* ENH : extract width from CIDFontType0/2
* ENH : improve cr/lf and space extraction
* BUG : fix error in decoding py-pdf#1075
* FIX: in ToUnicode ignore comments (starting with %)
* FIX: extend utf16 for min of 4 characters

Improves py-pdf#234
Improves py-pdf#957
Closes py-pdf#1003
Closes py-pdf#1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing

New Features (ENH):
- Add color and font_format to PdfReader.outlines[i] (#1104)
- Extract Text Enhancement (whitespaces) (#1084)

Bug Fixes (BUG):
- Use `build_destination` for named destination outlines (#1128)
- Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)
- Prevent deduplication of PageObject (#1105)
- None-check in DictionaryObject.read_from_stream (#1113)
- Avoid IndexError in _cmap.parse_to_unicode (#1110)

Documentation (DOC):
- Explanation for git submodule
- Watermark and stamp (#1095)

Maintenance (MAINT):
- Text extraction improvements (#1126)
- Destination.color returns ArrayObject instead of tuple as fallback (#1119)
- Use add_bookmark_destination in add_bookmark (#1100)
- Use add_bookmark_destination in add_bookmark_dict (#1099)

Testing (TST):
- Remove xfail from test_outline_title_issue_1121
- Add test for arab text (#1127)
- Add xfail for decryption fail (#1125)
- Add xfail test for IndexError when extracting text (#1124)
- Add MCVE showing outline title issue (#1123)

Code Style (STY):
- Apply black and isort
- Use IntFlag for permissions_flag / update_page_form_field_values (#1094)
- Simplify code (#1101)

Full Changelog: 2.5.0...2.6.0

biredel · 2024-05-08T18:49:36Z

PyPDF2/_page.py

+ if isinstance(operands[0], str):
+     text += operands[0]
+ else:
+     t: str = ""
+     tt: bytes = (
+         encode_pdfdocencoding(operands[0])
+         if isinstance(operands[0], str)
+         else operands[0]
      )


The second isinstance(operands[0], str) branch looks unreachable here (since it was moved here).

stefan6419846 · 2024-05-08

Good catch. Do you want to submit a corresponding PR?

biredel · 2024-05-08

@stefan6419846 At least not immediately. Clearly there was or is something about that code that I do not understand well enough to just delete it from a version that no longer contains the explanation - #2440 was merged much later!

pubpub-zz · 2024-05-09

@biredel I'm confused: you are looking at an obsolete branch (PyPDF2) instead of pypdf.
The code seems to be this one:
https://github.com/py-pdf/pypdf/blob/a435eaaa08c71e3f66320edd06be24637ef32986/pypdf/_text_extraction/__init__.py#L225C18-L234C18

Codecov indicates some test coverage.
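
The observation that the second isinstance check is unreachable can be illustrated with a stripped-down stand-in for the reviewed snippet. The function `classify_operand` is hypothetical and only mirrors the control flow; it is not the PyPDF2 or pypdf code.

```python
def classify_operand(operand: object) -> str:
    """Report which path a value takes through the nested checks."""
    if isinstance(operand, str):
        return "outer-str"
    # From here on, operand is known not to be a str, so the
    # conditional expression below can only take its else side.
    return "inner-str" if isinstance(operand, str) else "inner-other"


# Try a mix of str and non-str values and collect the paths taken
paths = {classify_operand(value) for value in ["abc", b"abc", 42, None]}
```

Whatever the input, the "inner-str" path never shows up, which is the pattern the review comment points at.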

pubpub-zz added 7 commits June 10, 2022 22:46

mypy fix attempt

72e24ca

FIX : prevent warning in test

89170b1

ENH : extract width from CIDFontType0/2

ac145bd

ENH : improve cr/lf and space extraction

99abc53

Merge remote-tracking branch 'origin/main' into extract_text_enh

589b4ee

fix mypy

e74b4e2

MartinThoma reviewed Jul 9, 2022

PyPDF2/_page.py (outdated)

fix flake8

fa7d8fe

This was referenced Jul 9, 2022

Inter/Intra Word width issue with block-formatted text #1019

Closed

compute_space_width: Is 'if "/W" in ft' unreachable? #1003

Closed

v2.1 extract_text() misses newline characters #957

Closed

MartinThoma changed the title ~~Extract Text Enhancement~~ ENH: Extract Text Enhancement Jul 10, 2022

MartinThoma added the workflow-text-extraction label (From a user's perspective, text extraction is the affected feature/workflow) Jul 10, 2022

pubpub-zz mentioned this pull request Jul 10, 2022

BUG: Added line-breaks at dashes #234

Closed

fix odd space

758738d

pubpub-zz force-pushed the main branch from 0e7ae71 to 758738d Compare July 10, 2022 19:15

MartinThoma changed the title ~~ENH: Extract Text Enhancement~~ ENH: Extract Text Enhancement (whitespaces) Jul 11, 2022

Fix : mult error

5959cf7

Merge branch 'main' into main

5715653

MartinThoma merged commit 682eff9 into py-pdf:main Jul 13, 2022

pubpub-zz deleted the main branch July 19, 2022 19:51

MartinThoma added the whitespace label (While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.) Jan 14, 2023

biredel reviewed May 8, 2024
