Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request: Have .extract_text() return an empty string ('') instead of None in the case of no text found in a PDF #482

Closed
tungph opened this issue Jul 23, 2021 · 4 comments
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

Comments

@tungph
Copy link

tungph commented Jul 23, 2021

chars = to_list(chars)
if len(chars) == 0:
return None

I would like to make a small suggestion here to avoid for None checking in the calling code, extract_text utils must return an empty string '' instead of None in case of no text found in a pdf.

@tungph tungph added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Jul 23, 2021
@jsvine
Copy link
Owner

jsvine commented Aug 12, 2021

Thank you for this proposal, @tungph. I think it makes sense, but I want to be careful that we're not overlooking important use-cases. Especially: Are there instances where it would be important to distinguish between '' and None? Any thoughts on this, @samkit-jain?

@samkit-jain
Copy link
Collaborator

My preference would be to keep the existing None behaviour because None and '' can mean 2 different things. Also, this should be considered as a breaking change as there will be existing workflows that might be relying on the method returning a None.

An important question should also be whether pdfplumber can even return an empty string? I think so yes as there exists a Unicode code for an empty string and if a PDF consists of that single character, an empty string will be returned. It would be ambiguous as it can mean both that no character exists or a null character exists.

@jsvine
Copy link
Owner

jsvine commented Aug 12, 2021

Thanks @samkit-jain. That (needing to distinguish between '' and None) had also been my initial instinct — and, if I recall correctly, why I had initially implemented it that way. I'm starting, however, to change my mind, for a couple of reasons:

  • Is there a meaningful, end-user-relevant difference between a page (or table cell, etc.) that contains no characters vs. one that contains only null/'' characters? In that vast majority of cases, I think not; what's relevant to the user is just that there is no text there.

  • There may be some instances where that distinction is important, but I think that a simple examination of page.chars is better suited to the detection of null/'' characters that .extract_text(). One reason: .extract_text() (in its current implementation) still condenses multiple adjacent null/'' characters into a single ''.

Still, I'm not quite 100% there. Is '' a more Pythonic/philosophically-accurate way than None of representing a page/cell/etc. with no characters in it? I'd be interested for more pdfplumber users/watchers to weigh in.

And, as you point out: This, if implemented, would constitute a breaking change for many workflows. It would probably have to wait for 0.6.0.

@jsvine jsvine changed the title extract_text utils must return an empty string '' instead of None in case of no text found in a pdf Request: Have .extract_text() return an empty string ('') instead of None in the case of no text found in a PDF Aug 12, 2021
@jsvine
Copy link
Owner

jsvine commented Dec 24, 2021

Hi @tungph, and thanks again for this suggestion. This is now the default behavior for .extract_text(...) as of v0.6.0. (See cb9900b and CHANGELOG.md, with an h/t to you.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"
Projects
None yet
Development

No branches or pull requests

3 participants