Is there a way to get something similar to pdftotext's layout? #10
Comments
I was able to do something like this using an alternate utils.collate_line_layout.
Output is below (I used the default of 90 chars wide). Not as pretty -- and you probably would wanna remove the left margin. How do you think this should work? One could, of course, just use pdftotext -layout, but I've come across stuff where just being able to grab a region and deal with it in a way that spaces matter seems like a good idea...
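To make the idea concrete, here is a minimal sketch of one way to map word positions onto a fixed-width character grid. It is not the alternate utils.collate_line_layout mentioned above, and it is not pdfplumber's own code: the function name words_to_layout_sketch, its parameters, and the word dictionaries (with "text", "x0", and "top" keys) are assumptions made for illustration only.

```python
def words_to_layout_sketch(words, page_width, cols=90, y_tolerance=3):
    """Render words as text, using x-positions to pick character columns.

    Hypothetical helper: `words` is a list of dicts with "text", "x0"
    (left edge) and "top" (vertical position) keys, in page units.
    """
    # Group words into lines: a word joins the current line when its
    # "top" is within y_tolerance of that line's first word.
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    rendered = []
    for line in lines:
        row = ""
        for word in sorted(line, key=lambda w: w["x0"]):
            # Scale the x-coordinate to a column on a `cols`-wide grid.
            col = int(word["x0"] / page_width * cols)
            # Never overwrite earlier text; keep at least one space.
            if row:
                col = max(col, len(row) + 1)
            row = row.ljust(col) + word["text"]
        rendered.append(row)
    return "\n".join(rendered)


# Made-up sample words, only to show the call shape.
sample_words = [
    {"text": "Name", "x0": 0, "top": 10},
    {"text": "Total", "x0": 300, "top": 10},
    {"text": "Alice", "x0": 0, "top": 25},
    {"text": "42", "x0": 300, "top": 26},
]
print(words_to_layout_sketch(sample_words, page_width=600, cols=20))
```

Scaling x0 by cols / page_width is what keeps columns aligned across lines. Removing the left margin would amount to subtracting the smallest x0 from every word before scaling.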
It would be more consistent if one used
🎉 This seems like a great feature to have/add. And seems quite doable. There are (at least) two options for implementation:
My slight preference is for the first option, since that seems like the logical place where people would look. Thoughts?
And, yep @jsfenfen, seems like
@dannguyen: Are you looking for something similar to
Should be able to knock this out some evening or weekend soon.
@jsvine Uh... I was only interested in it on the assumption that it was just some "standard", but you have to re-implement it yourself from scratch? I don't know why I assumed that; maybe I was just really optimistic and intoxicated at the time. Mostly, I was hoping for a cross-platform way to parse PDFs to text with the layout, just for regex exercises that didn't require installing poppler. But the layout option was never perfect and would sometimes mangle lines anyway.
Intoxicated Optimism will be the name of my next band. But to your point:
Did anyone get the time to work on this feature? Having
I am also interested in this. Did you give this a try, @jsvine?
I've sketched out some possible implementations, but haven't made the time yet to code/debug them. Thank you for noting your interest, however; it's useful information.
Can you please explain how I can use this function for my use case?
See the docstring in utils.words_to_layout for details on the implementation. Addresses issue #10 and related issues.
More than five years after this issue was first opened (thank you @dannguyen), ... and more extensively in the code itself: pdfplumber/pdfplumber/utils.py Lines 345 to 382 in 95c049d
Is there an option similar to pdftotext's -layout flag, which "maintain[s] original physical layout"? I understand that's probably up to the pdfminer engine... Here's what I mean:
Original PDF
http://www.supremecourt.gov/oral_arguments/argument_transcripts/07-1315.pdf
pdftotext with -layout
Output:
pdfminer
Output: