DOC: Post-processing page #2052

Merged
merged 3 commits into from
Aug 2, 2023

@MartinThoma MartinThoma commented Jul 31, 2023

@MartinThoma MartinThoma marked this pull request as draft July 31, 2023 21:18
@codecov

codecov bot commented Jul 31, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (534c7b4) 94.17% compared to head (4e7519e) 94.17%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2052   +/-   ##
=======================================
  Coverage   94.17%   94.17%           
=======================================
  Files          41       41           
  Lines        7332     7332           
  Branches     1441     1441           
=======================================
  Hits         6905     6905           
  Misses        266      266           
  Partials      161      161           
Files Changed Coverage Δ
pypdf/_page.py 93.61% <ø> (ø)


@MartinThoma
Member Author

I could improve the ground truth of the benchmark with those post-processing steps: py-pdf/benchmarks@38a4fa6

@MartinThoma
Member Author

@pubpub-zz With post-processing, pypdf now has a slightly better score than Tika 🎉 py-pdf/benchmarks@e7fb117 (only results)

We're doing noticeably worse on https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf and https://arxiv.org/pdf/2201.00178.pdf, but a lot better on https://arxiv.org/pdf/1602.06541.pdf.

This evening I will improve the ground truth further and check why we are worse than Tika on those two PDFs.

@MartinThoma
Member Author

I've tried the following to find / remove a common header:

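    from typing import List


    def find_common_prefix(texts: List[str]) -> str:
        """Return the longest prefix shared by all texts."""
        if not texts:
            return ""
        common_prefix = ""
        # Only compare up to the length of the shortest text.
        for i in range(min(len(text) for text in texts)):
            chars = [text[i] for text in texts]
            if len(set(chars)) == 1:
                common_prefix += texts[0][i]
            else:
                # Debug output: position and characters of the first mismatch.
                print(f"i={i}, {chars}")
                break
        return common_prefix


    def remove_common_prefix(texts: List[str]) -> List[str]:
        """Strip the shared prefix from every text."""
        common_prefix = find_common_prefix(texts)
        print(f"common_prefix={common_prefix}")
        return [text[len(common_prefix):] for text in texts]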
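
For reference, the standard library's `os.path.commonprefix` performs the same character-by-character comparison, so the prefix search could probably be delegated to it. A minimal sketch; the function name `remove_common_prefix_stdlib` is made up for illustration and is not part of pypdf:

```python
import os.path
from typing import List


def remove_common_prefix_stdlib(texts: List[str]) -> List[str]:
    """Strip the longest shared prefix from every text (hypothetical helper)."""
    # os.path.commonprefix compares character by character, not by path
    # component, so it also works on arbitrary strings. It returns ""
    # for an empty list.
    prefix = os.path.commonprefix(texts)
    return [text[len(prefix):] for text in texts]
```

It has the same limitations as the hand-written version.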

This approach has 3 issues:

  1. Several headers / footers contain the page label, so the header string is not identical from page to page.
  2. Even and odd pages might have different headers / footers.
  3. Chapter pages and similar pages don't have headers / footers. The same goes for the first page.
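
One possible way to work around these three issues, as a rough sketch rather than a tested solution: mask digit runs in the first line of each page so that page labels compare equal, count the masked first lines separately for even and odd pages, and only strip the first line from pages where it matches the most frequent one. The names `header_key`, `remove_repeated_headers` and `min_share` are hypothetical and not part of pypdf. The sketch also assumes that the header is exactly the first line of the page text and that page labels are Arabic numerals; Roman-numeral labels and footers are not handled.

```python
import re
from collections import Counter
from typing import List

# Hypothetical helpers for illustration; not part of pypdf.
_DIGITS = re.compile(r"\d+")


def header_key(page_text: str) -> str:
    """Return the first line with digit runs masked, so page labels compare equal."""
    lines = page_text.splitlines()
    first_line = lines[0] if lines else ""
    return _DIGITS.sub("#", first_line).strip()


def remove_repeated_headers(texts: List[str], min_share: float = 0.5) -> List[str]:
    """Drop the first line of pages whose masked first line repeats often.

    Even and odd positions are counted separately (issue 2). Pages whose
    first line is not the frequent one, such as chapter pages or the first
    page, are left untouched (issue 3).
    """
    result = list(texts)
    for parity in (0, 1):
        keys = {i: header_key(texts[i]) for i in range(parity, len(texts), 2)}
        counts = Counter(key for key in keys.values() if key)
        if not counts:
            continue
        common_key, count = counts.most_common(1)[0]
        # Require at least two occurrences and a minimum share of the pages.
        if count < 2 or count / len(keys) < min_share:
            continue
        for i, key in keys.items():
            if key == common_key:
                # Keep everything after the first line break.
                _, _, rest = texts[i].partition("\n")
                result[i] = rest
    return result
```

The `min_share` threshold is an arbitrary choice: a lower value removes more aggressively, at the risk of deleting real content that happens to repeat at the top of several pages.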

@MartinThoma MartinThoma changed the title DOC: Post processing page DOC: Post-processing page Aug 2, 2023
@MartinThoma MartinThoma marked this pull request as ready for review August 2, 2023 20:41
@MartinThoma MartinThoma merged commit c04a6bb into main Aug 2, 2023
11 checks passed
@MartinThoma MartinThoma deleted the postprocessing branch August 2, 2023 20:42
Development

Successfully merging this pull request may close these issues.

DOC: Provide a post-processing function to replace ligatures