Evaluate PDF parsing libraries #127

Open
djbrown opened this issue Oct 17, 2023 · 3 comments

Comments

Owner

djbrown commented Oct 17, 2023

currently using https://github.com/chezou/tabula-py/tree/master
problems:

  • needs Java
  • doesn't handle some corner cases well

Alternatives:

filgit commented Sep 15, 2024

I have tried both camelot and tabula-py on the game reports. Both worked well in my cases. Can you explain the "problem" cases? Are they related to correct_data.py?

Why is needing Java an issue, and what are the corner cases you mention?

Owner Author

djbrown commented Oct 1, 2024

@filgit the "problem" with Java is that it's an otherwise unnecessary technology/dependency in the system,
though it doesn't have a big impact on the code base:

1. Java (>=1.6) for parsing game report PDFs

&& apt install -y --no-install-recommends default-jre fonts-liberation gsfonts locales \

as for the "corner cases", I don't remember exactly what they were, but I have a long list of "erroneous reports",
e.g. where some players' names overflow the column/cell like here. I guess my hope was that another library would handle them better (currently overflows are clipped off).

performance might be another reason, but that should be measured first to really count as an argument.
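
A minimal sketch of how such a measurement could look, assuming both libraries are installed and a handful of representative report PDFs are at hand. The wrapper names `parse_with_tabula`, `parse_with_camelot` and the helper `time_parser` are made up for illustration and are not part of this project; the library imports are deferred so the timing helper itself runs without either library.

```python
import statistics
import time
from typing import Callable, Iterable


def parse_with_tabula(path: str) -> list:
    """Extract all tables with tabula-py (needs a Java runtime)."""
    import tabula  # deferred import: only needed when this parser is used

    return tabula.read_pdf(path, pages="all")


def parse_with_camelot(path: str) -> list:
    """Extract all tables with camelot and return them as DataFrames."""
    import camelot  # deferred import: only needed when this parser is used

    return [table.df for table in camelot.read_pdf(path, pages="all")]


def time_parser(
    parser: Callable[[str], object],
    paths: Iterable[str],
    repeats: int = 3,
) -> dict:
    """Run `parser` on every path `repeats` times and summarize wall-clock time.

    Times are in seconds. At least one path is required, otherwise
    statistics.mean raises StatisticsError.
    """
    samples = []
    for path in paths:
        for _ in range(repeats):
            start = time.perf_counter()
            parser(path)
            samples.append(time.perf_counter() - start)
    return {
        "runs": len(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

Calling `time_parser(parse_with_tabula, reports)` and `time_parser(parse_with_camelot, reports)` on the same list of files gives comparable numbers. The first tabula-py call probably includes JVM startup, so looking at the median next to the mean seems sensible.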

but this issue didn't really have priority, otherwise it wouldn't be celebrating its birthday soon 😅

filgit commented Oct 7, 2024

Okay, I see. With camelot you will have Ghostscript as an additional dependency.
And the error-prone report you linked is a great example of where the concept reaches its limits. I wonder how the reports are created, btw.
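
To make the dependency trade-off concrete, here is a small sketch that checks which external runtimes are available on the current machine. The executable names are assumptions: `java` for tabula-py and `gs` for Ghostscript as it is usually installed on Linux/macOS (Windows builds use different names). `find_external_tools` is a hypothetical helper, not part of either library.

```python
import shutil

# Assumed executable names; adjust for the target platform.
EXTERNAL_TOOLS = {
    "tabula-py": "java",  # Java runtime
    "camelot": "gs",      # Ghostscript on Linux/macOS
}


def find_external_tools(tools: dict = EXTERNAL_TOOLS) -> dict:
    """Map each library to the resolved path of its external tool, or None."""
    return {library: shutil.which(executable) for library, executable in tools.items()}


if __name__ == "__main__":
    for library, location in find_external_tools().items():
        print(f"{library}: {location or 'external tool not found on PATH'}")
```

Either way one non-Python dependency remains in the image, so switching libraries would swap the dependency, not remove it.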
