Customize Minimum Column Number in find_tables #3987

Open
lmX2015 opened this issue Oct 23, 2024 · 1 comment

lmX2015 commented Oct 23, 2024

Is your feature request related to a problem? Please describe.

I recently came across tables with only one column in form- and FAQ-style documents (see the example attached below).

It still makes sense to treat this data as a table, because the cells help split the text properly.

However, it seems that the following snippet removes such tables, even though the cells are recognized correctly.

    # PyMuPDF modification:
    # Remove tables without text or having only 1 column
    for i in range(len(tables) - 1, -1, -1):
        r = EMPTY_RECT()
        x1_vals = set()
        x0_vals = set()
        for c in tables[i]:
            r |= c
            x1_vals.add(c[2])
            x0_vals.add(c[0])
        if (
            len(x1_vals) < 2
            or len(x0_vals) < 2
            or white_spaces.issuperset(
                page.get_textbox(
                    r,
                    textpage=TEXTPAGE,
                )
            )
        ):
            del tables[i]
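
To show the column check in isolation, here is a minimal standalone sketch. It uses plain `(x0, y0, x1, y1)` tuples instead of PyMuPDF rectangles, and the coordinates are invented for illustration:

```python
# Cells of a one-column table as (x0, y0, x1, y1) tuples.
# The coordinates are made up; they are not taken from the attached PDF.
single_column_cells = [
    (50.0, 100.0, 500.0, 130.0),
    (50.0, 130.0, 500.0, 160.0),
    (50.0, 160.0, 500.0, 190.0),
]

# Distinct left and right edges, collected the same way as in the snippet above.
x0_vals = {c[0] for c in single_column_cells}
x1_vals = {c[2] for c in single_column_cells}

# Mirrors only the `len(x1_vals) < 2 or len(x0_vals) < 2` part of the condition.
would_be_removed = len(x1_vals) < 2 or len(x0_vals) < 2
print(would_be_removed)
```

All cells share the same left and right edge, so each set holds a single value and the table is dropped regardless of its text content.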

Describe the solution you'd like

Would it be possible to make the minimum number of columns / rows a configurable attribute of the TableFinder settings?
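
A rough sketch of the idea, not PyMuPDF's actual API: `filter_tables`, `count_columns` and the `min_cols` parameter are hypothetical names, cells are plain `(x0, y0, x1, y1)` tuples, and the empty-text check from the snippet above is left out for brevity.

```python
def count_columns(cells):
    """Estimate the column count from distinct left and right cell edges."""
    x0_vals = {c[0] for c in cells}
    x1_vals = {c[2] for c in cells}
    return min(len(x0_vals), len(x1_vals))


def filter_tables(tables, min_cols=2):
    """Keep only tables with at least `min_cols` columns.

    `min_cols` is a hypothetical setting; the default of 2 reproduces
    the column part of the current behavior.
    """
    return [cells for cells in tables if count_columns(cells) >= min_cols]


# Invented example data: one single-column and one two-column table.
one_col = [(50, 100, 500, 130), (50, 130, 500, 160)]
two_col = [(50, 100, 250, 130), (250, 100, 500, 130)]

default_result = filter_tables([one_col, two_col])
relaxed_result = filter_tables([one_col, two_col], min_cols=1)
print(len(default_result), len(relaxed_result))
```

With the default, only the two-column table survives; with `min_cols=1`, the single-column table is kept as well.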

Describe alternatives you've considered

I can open a PR for this if you are OK with the idea.

Additional context

Basic example
Question.pdf
