Request: Have `.extract_text()` return an empty string (`''`) instead of `None` in the case of no text found in a PDF #482

tungph · 2021-07-23T17:38:18Z

Lines 372 to 374 in 002500a

    
    chars = to_list(chars)
    if len(chars) == 0:
        return None

I would like to make a small suggestion here: to avoid None checks in the calling code, the extract_text utility should return an empty string '' instead of None when no text is found in a PDF.
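To make the request concrete, here is a minimal sketch of how the two return conventions affect calling code. The functions `extract_text_or_none`, `extract_text_or_empty`, and the two `word_count_*` helpers are hypothetical stand-ins written for illustration, not pdfplumber's implementation; the only assumption is that each character is a dict with a `"text"` key.

```python
from typing import Dict, List, Optional


def extract_text_or_none(chars: List[Dict[str, str]]) -> Optional[str]:
    """Hypothetical stand-in for the behaviour quoted above: None when empty."""
    if len(chars) == 0:
        return None
    return "".join(c["text"] for c in chars)


def extract_text_or_empty(chars: List[Dict[str, str]]) -> str:
    """Hypothetical stand-in for the requested behaviour: '' when empty."""
    if len(chars) == 0:
        return ""
    return "".join(c["text"] for c in chars)


def word_count_with_guard(chars: List[Dict[str, str]]) -> int:
    # The caller has to check for None before using any str method.
    text = extract_text_or_none(chars)
    return len(text.split()) if text is not None else 0


def word_count_no_guard(chars: List[Dict[str, str]]) -> int:
    # With '' as the "no text" value, str methods can be called directly.
    return len(extract_text_or_empty(chars).split())
```

Both callers give the same answer; the difference is only whether the guard has to be written at every call site.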


jsvine · 2021-08-12T03:13:04Z

Thank you for this proposal, @tungph. I think it makes sense, but I want to be careful that we're not overlooking important use-cases. Especially: Are there instances where it would be important to distinguish between '' and None? Any thoughts on this, @samkit-jain?

samkit-jain · 2021-08-12T08:55:20Z

My preference would be to keep the existing None behaviour because None and '' can mean 2 different things. Also, this should be considered as a breaking change as there will be existing workflows that might be relying on the method returning a None.

Another important question is whether pdfplumber can even return an empty string. I think so, yes: there exists a Unicode code for an empty string, and if a PDF consists of that single character, an empty string will be returned. That would be ambiguous, as it could mean either that no character exists or that a null character exists.

jsvine · 2021-08-12T13:39:17Z

Thanks @samkit-jain. That (needing to distinguish between '' and None) had also been my initial instinct — and, if I recall correctly, why I had initially implemented it that way. I'm starting, however, to change my mind, for a couple of reasons:

Is there a meaningful, end-user-relevant difference between a page (or table cell, etc.) that contains no characters vs. one that contains only null/'' characters? In the vast majority of cases, I think not; what's relevant to the user is just that there is no text there.
There may be some instances where that distinction is important, but I think that a simple examination of page.chars is better suited to the detection of null/'' characters than .extract_text(). One reason: .extract_text() (in its current implementation) still condenses multiple adjacent null/'' characters into a single ''.
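The suggestion to inspect the characters directly, rather than the extracted text, could be sketched roughly as follows. `classify_chars` is a hypothetical helper, not part of pdfplumber; it assumes the input is a list of dicts with a `"text"` key (the shape of `page.chars`) and treats both `''` and `'\x00'` as "null" text, which is an assumption about what a null character looks like in practice.

```python
from typing import Dict, List

# Assumption: these are the "empty" values worth flagging.
NULL_TEXTS = ("", "\x00")


def classify_chars(chars: List[Dict[str, str]]) -> str:
    """Distinguish 'no characters at all' from 'only null/empty characters'."""
    if not chars:
        return "no_chars"
    if all(c.get("text", "") in NULL_TEXTS for c in chars):
        return "only_null_chars"
    return "has_text"
```

In real use the argument would be something like `page.chars`; the point is that this check keeps the distinction available even if the text-extraction method returns `''` in both cases.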

Still, I'm not quite 100% there. Is '' a more Pythonic/philosophically-accurate way than None of representing a page/cell/etc. with no characters in it? I'd be interested for more pdfplumber users/watchers to weigh in.

And, as you point out: This, if implemented, would constitute a breaking change for many workflows. It would probably have to wait for 0.6.0.

jsvine · 2021-12-24T02:57:59Z

Hi @tungph, and thanks again for this suggestion. This is now the default behavior for .extract_text(...) as of v0.6.0. (See cb9900b and CHANGELOG.md, with an h/t to you.)
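For code that has to run against versions on both sides of this change, a small normalizing wrapper is one option. This is only a sketch: `text_or_empty` is a hypothetical helper name, and it relies solely on what this thread states, namely that the return value was `None` before v0.6.0 and is `''` from v0.6.0 onward when no text is found.

```python
from typing import Optional


def text_or_empty(value: Optional[str]) -> str:
    """Map None (older behaviour) to '' (newer behaviour); pass strings through."""
    return value if value is not None else ""


# Intended use, assuming `page` is a pdfplumber page object:
#     text = text_or_empty(page.extract_text())
```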

tungph added the feature-request label Jul 23, 2021

jsvine changed the title ~~extract_text utils must return an empty string '' instead of None in case of no text found in a pdf~~ Request: Have .extract_text() return an empty string ('') instead of None in the case of no text found in a PDF Aug 12, 2021

jsvine closed this as completed Dec 24, 2021

mehaase mentioned this issue Feb 7, 2022

Text cannot be extracted from some PDFs center-for-threat-informed-defense/tram#137

Closed
