DOC: Post-processing page #2052

MartinThoma · 2023-07-31T21:17:15Z

See https://pypdf--2052.org.readthedocs.build/en/2052/user/post-processing-in-text-extraction.html

Closes #2046

codecov · 2023-07-31T21:31:54Z

Codecov Report

Patch and project coverage have no change.

Comparison is base (534c7b4) 94.17% compared to head (4e7519e) 94.17%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2052   +/-   ##
=======================================
  Coverage   94.17%   94.17%           
=======================================
  Files          41       41           
  Lines        7332     7332           
  Branches     1441     1441           
=======================================
  Hits         6905     6905           
  Misses        266      266           
  Partials      161      161

Files Changed	Coverage Δ
pypdf/_page.py	`93.61% <ø> (ø)`


MartinThoma · 2023-08-01T06:05:16Z

I could improve the ground truth of the benchmark with those post-processing steps: py-pdf/benchmarks@38a4fa6

MartinThoma · 2023-08-01T06:30:53Z

@pubpub-zz With post-processing, pypdf now has a slightly better score than Tika 🎉 py-pdf/benchmarks@e7fb117 (only results)

We're doing noticeably worse on https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf and https://arxiv.org/pdf/2201.00178.pdf , but a lot better on https://arxiv.org/pdf/1602.06541.pdf .

This evening I will fix more of the ground truth and check why we do worse than Tika on those two PDFs.

MartinThoma · 2023-08-02T20:26:36Z

I've tried the following to find / remove a common header:

    from typing import List


    def find_common_prefix(texts: List[str]) -> str:
        """Return the longest prefix shared by all texts."""
        if not texts:
            return ""
        common_prefix = ""
        # zip stops at the shortest text, so no index can run out of range
        for i, chars in enumerate(zip(*texts)):
            if len(set(chars)) != 1:
                # Debug output: first position where the texts differ
                print(f"i={i}, {list(chars)}")
                break
            common_prefix += chars[0]
        return common_prefix


    def remove_common_prefix(texts: List[str]) -> List[str]:
        """Strip the shared prefix (e.g. a common header) from every text."""
        common_prefix = find_common_prefix(texts)
        print(f"common_prefix={common_prefix}")
        return [text[len(common_prefix):] for text in texts]

There are 3 issues:

1. Several headers / footers contain the page label, so the header is not the same string on every page.
2. Even and odd pages might have different headers / footers.
3. Chapter pages and similar don't have headers / footers. The same goes for the first page.
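
One possible way to work around the three issues above, as a rough sketch rather than anything pypdf provides: compare only the first line of each page, replace digit runs with a placeholder so that page labels do not break the match, handle even and odd pages separately, and strip a line only from pages that actually carry it. The names `normalize_line`, `remove_repeated_first_lines`, and `min_share` are hypothetical, and the sketch assumes the header is exactly the first line and that page labels are Arabic numerals (Roman numerals are not handled).

```python
import re
from collections import Counter
from typing import List


def normalize_line(line: str) -> str:
    """Replace digit runs with '#' so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def remove_repeated_first_lines(texts: List[str], min_share: float = 0.6) -> List[str]:
    """Drop a page's first line if its normalized form recurs on most pages
    of the same parity (hypothetical helper, not part of pypdf)."""
    result = list(texts)
    for parity in (0, 1):
        # Even- and odd-indexed pages are checked separately (issue 2)
        indices = [i for i in range(len(texts)) if i % 2 == parity]
        firsts = {i: normalize_line(texts[i].split("\n", 1)[0]) for i in indices}
        counts = Counter(line for line in firsts.values() if line)
        if not counts:
            continue
        header, count = counts.most_common(1)[0]
        # Require the candidate on at least two pages and on a share of them
        if count < 2 or count / len(indices) < min_share:
            continue
        for i in indices:
            # Pages without the header, e.g. chapter pages, stay untouched (issue 3)
            if firsts[i] == header:
                parts = texts[i].split("\n", 1)
                result[i] = parts[1] if len(parts) > 1 else ""
    return result
```

The threshold `min_share` is an arbitrary assumption; a real document would likely need it tuned, and footers would need the same treatment applied to the last line.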

DOC: Post processing page

db95224

MartinThoma marked this pull request as draft July 31, 2023 21:18

Add more details

4e7519e

Update

86017d9

MartinThoma changed the title ~~DOC: Post processing page~~ DOC: Post-processing page Aug 2, 2023

MartinThoma marked this pull request as ready for review August 2, 2023 20:41

MartinThoma merged commit c04a6bb into main Aug 2, 2023
11 checks passed

MartinThoma deleted the postprocessing branch August 2, 2023 20:42
