Make whitespace checking more robust #88

OisinMoran · 2018-10-19T23:53:57Z

The whitespace check currently only looks for the literal " " and fails on other whitespace characters such as "\xa0" (non-breaking space). As a result, extract_words() does not split words on these characters and also picks up stray whitespace as words. Python's built-in string method isspace() is a good fit for this.
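To illustrate the difference, here is a minimal standalone sketch. It is not pdfplumber's actual implementation: `split_words` is a hypothetical helper written only for this example, and it splits a plain string rather than PDF character objects. It compares a literal " " check against str.isspace() on text containing a non-breaking space.

```python
def split_words(text, is_space):
    """Split text into words, using is_space(char) to detect separators.

    Hypothetical helper for illustration only; not part of pdfplumber.
    """
    words, current = [], ""
    for char in text:
        if is_space(char):
            # A separator ends the current word, if there is one.
            if current:
                words.append(current)
            current = ""
        else:
            current += char
    if current:
        words.append(current)
    return words


# "\xa0" is a non-breaking space between "non" and "breaking".
text = "non\xa0breaking space"

# Literal check: only " " counts as a separator.
literal = split_words(text, lambda char: char == " ")

# Robust check: any character for which str.isspace() is true.
robust = split_words(text, str.isspace)

print(literal)
print(robust)
```

With the literal check, the non-breaking space stays inside a word; with isspace(), it acts as a separator like any other whitespace character.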

I've never actually written a test before, but from looking at some of yours, this might do the trick. I've attached an example document below. Currently len(words) == 32, but from looking at the document you can see it should be 25, and it is 25 when using isspace().

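import pdfplumber

# Example document attached to this pull request
filename = "whitespace_test.pdf"
with pdfplumber.open(filename) as pdf:
    first_page = pdf.pages[0]
    words = first_page.extract_words()
    # The document has 25 words; the literal " " check yields 32
    assert len(words) == 25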

whitespace_test.pdf

Make whitespace checking more robust

588b3f8

Currently only looks for the literal " " and fails on other whitespace characters like "\xa0" (non-breaking space). Python's builtin string method "isspace()" is perfect for this.

jsvine merged commit 588b3f8 into jsvine:master Nov 14, 2018
