Conversation

@MthwRobinson
Contributor

@MthwRobinson MthwRobinson commented May 16, 2024

Summary

Closes #3021. Turns table extraction for PDFs and images off by default. The default behavior was originally changed in #2588. The reason for the reversion is that some users did not realize that turning off table extraction was an option, and they experienced long processing times for PDFs and images under the new default behavior.
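
As a rough illustration of what the new default means, the following sketch models a skip list that decides whether table extraction runs for a given file type. The helper name `should_infer_tables` and the constant `DEFAULT_SKIP_INFER_TABLE_TYPES` are hypothetical and are not taken from the library; only the parameter name `skip_infer_table_types` and the default values come from the diff in this PR.

```python
from typing import Optional

# Default skip list as introduced by this PR's diff.
DEFAULT_SKIP_INFER_TABLE_TYPES = ("pdf", "jpg", "png", "heic")


def should_infer_tables(
    file_type: str,
    skip_infer_table_types: Optional[list[str]] = None,
) -> bool:
    """Hypothetical helper: return True if tables should be extracted."""
    # Fall back to the default skip list when the caller passes nothing.
    skip = (
        DEFAULT_SKIP_INFER_TABLE_TYPES
        if skip_infer_table_types is None
        else skip_infer_table_types
    )
    # Table extraction runs only for file types that are not skipped.
    return file_type.lower() not in skip
```

Under these assumptions, `should_infer_tables("pdf")` is `False` with the new default, while passing an empty list (`should_infer_tables("pdf", [])`) opts back in to table extraction.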

MthwRobinson and others added 3 commits May 16, 2024 11:57
…images <- Ingest test fixtures update (#3036)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
@MthwRobinson MthwRobinson requested review from qued and scanny May 16, 2024 19:25
@MthwRobinson MthwRobinson marked this pull request as ready for review May 16, 2024 19:26
Contributor

@scanny scanny left a comment

Hi Matt, just the one comment but I think it's worth reading and removing Excel types from the list.

Approving in advance, you can make whatever changes you think appropriate based on my remarks before you merge :)

…images <- Ingest test fixtures update (#3046)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
@MthwRobinson MthwRobinson added this pull request to the merge queue May 17, 2024
Merged via the queue into main with commit ec987dc May 17, 2024
@MthwRobinson MthwRobinson deleted the 3021/default-table-off branch May 17, 2024 16:04
Comment on lines 143 to +144
  headers: dict[str, str] = {},
- skip_infer_table_types: list[str] = [],
+ skip_infer_table_types: list[str] = ["pdf", "jpg", "png", "heic"],
Contributor

You should really never do this. Giving a parameter such as skip_infer_table_types a mutable default value (e.g., a list or dict) can lead to unexpected behavior. The default object is created once, when the function is defined, and is shared across all calls to the function, so any modification made to it within one call will affect subsequent calls. This can introduce subtle bugs that are difficult to trace and debug. Instead, use None as the default and set the actual default value inside the function when the argument is None.
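
A minimal sketch of the pitfall described in this comment and of the suggested None-based fix. The function names `add_type_shared` and `add_type_safe` are hypothetical and exist only for illustration; they are not part of the code under review.

```python
from typing import Optional


def add_type_shared(
    extra: str,
    skip_infer_table_types: list[str] = ["pdf", "jpg", "png", "heic"],
) -> list[str]:
    # The default list is created once, at definition time, so this
    # append mutates the same object on every call that uses the default.
    skip_infer_table_types.append(extra)
    return skip_infer_table_types


def add_type_safe(
    extra: str,
    skip_infer_table_types: Optional[list[str]] = None,
) -> list[str]:
    # Build a fresh default list on each call when none is supplied.
    if skip_infer_table_types is None:
        skip_infer_table_types = ["pdf", "jpg", "png", "heic"]
    skip_infer_table_types.append(extra)
    return skip_infer_table_types
```

Calling `add_type_shared("tiff")` and then `add_type_shared("bmp")` returns a list that still contains `"tiff"` on the second call, because both calls share one default object. The `add_type_safe` variant starts from a fresh list each time.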

Contributor Author

Thanks for the suggestion, @MalteHB! Opened #3063 to address.

github-merge-queue bot pushed a commit that referenced this pull request May 23, 2024
This PR adds backward compatibility for the deprecated
`pdf_infer_table_structure` parameter. This was a missing part of
#3035, which turned table extraction for PDFs and images off by default
after it had been turned on in #2588.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
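
One common way to keep a deprecated parameter working is to accept it alongside its replacement and emit a warning when it is used. The sketch below shows that general pattern only; the helper `resolve_infer_table_structure` is hypothetical, the replacement parameter name `infer_table_structure` is an assumption, and none of this is claimed to match the referenced commit.

```python
import warnings
from typing import Optional


def resolve_infer_table_structure(
    infer_table_structure: bool = False,
    pdf_infer_table_structure: Optional[bool] = None,
) -> bool:
    """Hypothetical helper: merge the deprecated flag into the new one."""
    if pdf_infer_table_structure is not None:
        # Tell callers the old name still works but should be replaced.
        warnings.warn(
            "pdf_infer_table_structure is deprecated; "
            "use infer_table_structure instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Either flag being set turns table extraction on.
        return infer_table_structure or bool(pdf_infer_table_structure)
    # With no deprecated flag given, the default stays off.
    return infer_table_structure
```

With this shape, existing callers that pass `pdf_infer_table_structure=True` still get table extraction, while callers that pass nothing get the off-by-default behavior.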
Development

Successfully merging this pull request may close these issues.

Turn table extraction off by default for PDFs and images

6 participants