text_x_tolerance ignored by extract_table? #176

Closed
jsfenfen opened this issue Feb 3, 2020 · 3 comments

@jsfenfen
Contributor

jsfenfen commented Feb 3, 2020

I noticed that text_tolerance and text_x_tolerance were not behaving the way I expected in the context of table extraction.

I was working on an example here--(how does one link to the middle of a notebook on github?) it's essentially at the bottom.

Extracting text from the relatively small font requires a text tolerance of 1 or 2 to get word separation, but the same tolerance appears to have no effect in a table extraction context. Possibly I'm not understanding something important.

jsvine added the bug label Feb 3, 2020
@jsvine
Owner

jsvine commented Feb 3, 2020

Thanks for flagging, @jsfenfen. That's a good catch and, indeed, a bug. Table.extract(...) accepts x_tolerance and y_tolerance as parameters, but Page.extract_tables(...) does not pass those parameters to that method.

In the meantime, until I've pushed a fix for the bug, you can replace this in your notebook:

result = pg.crop(under_headers_region).extract_tables(config)

... with this:

result = [
    table.extract(x_tolerance=1)
    for table in pg.crop(under_headers_region).find_tables(config)
]

Does that do it for you?
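
For readers who want to see why the workaround helps, here is a minimal sketch of the pattern described above: a page-level wrapper that never forwards the tolerance to the table-level method. ToyTable, ToyPage, join_chars and the default of 3 are hypothetical stand-ins invented for illustration; this is not pdfplumber's actual code, only a toy model of the same kind of bug.

```python
# Toy stand-ins, NOT pdfplumber's real classes. They only model the
# "wrapper forgets to forward a keyword argument" pattern.

DEFAULT_X_TOLERANCE = 3  # assumed default for this sketch


def join_chars(chars, x_tolerance):
    """Merge (x0, x1, text) chars into words.

    A horizontal gap wider than x_tolerance starts a new word.
    """
    words = []
    prev_x1 = None
    for x0, x1, text in sorted(chars):
        if prev_x1 is None or x0 - prev_x1 > x_tolerance:
            words.append(text)
        else:
            words[-1] += text
        prev_x1 = x1
    return " ".join(words)


class ToyTable:
    def __init__(self, rows):
        self.rows = rows  # list of rows; each cell is a list of chars

    def extract(self, x_tolerance=DEFAULT_X_TOLERANCE):
        return [[join_chars(cell, x_tolerance) for cell in row]
                for row in self.rows]


class ToyPage:
    def __init__(self, tables):
        self._tables = tables

    def find_tables(self, settings=None):
        return list(self._tables)

    def extract_tables(self, settings=None):
        # The bug being illustrated: settings["text_x_tolerance"] is
        # never passed on, so extract() falls back to its default.
        return [table.extract() for table in self.find_tables(settings)]


# Small font: letters 0.5 apart inside a word, words 2 apart.
cell = [(0, 4, "a"), (4.5, 8.5, "b"), (10.5, 14.5, "c"), (15, 19, "d")]
page = ToyPage([ToyTable([[cell]])])

# The tolerance in the settings dict is silently ignored.
ignored = page.extract_tables({"text_x_tolerance": 1})

# Workaround: find the tables first, then pass the tolerance directly.
workaround = [t.extract(x_tolerance=1) for t in page.find_tables()]
```

In this toy model the wrapper call merges the two words into one, while the find_tables-then-extract route keeps them apart, which is the same shape as the replacement suggested above.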

@jsfenfen
Contributor Author

jsfenfen commented Feb 4, 2020

Hey @jsvine that totally works for my purposes, thanks a ton!

@jsvine
Owner

jsvine commented May 28, 2020

(Fixed in b498df2)
