extracting text from a two columns page #244
Comments
Hi @fdq09eca, thanks for showing interest in the library. This is not possible out of the box because, when extracting text, pdfplumber goes top to bottom and left to right. To get the text in the order you want, you would have to come up with your own logic. One possible workaround (assuming that the 2-column section is a table) could be to extract the table and then read the text column by column.
Another possible solution, specific to page 63, could be:
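A minimal sketch of one page-specific approach, assuming the page holds equal-width columns that meet at evenly spaced vertical lines: crop each column and extract its text separately. The helper names `split_bboxes` and `extract_columns` and the file name in the usage comment are hypothetical; `bbox`, `crop`, and `extract_text` are pdfplumber page members. This is not necessarily the code from the original comment.

```python
def split_bboxes(page_bbox, n_columns=2):
    """Split an (x0, top, x1, bottom) box into equal-width column boxes, left to right."""
    x0, top, x1, bottom = page_bbox
    col_width = (x1 - x0) / n_columns
    # Reuse x1 for the last edge so rounding never pushes a box past the page.
    edges = [x0 + i * col_width for i in range(n_columns)] + [x1]
    return [(edges[i], top, edges[i + 1], bottom) for i in range(n_columns)]


def extract_columns(page, n_columns=2):
    """Extract text column by column from a pdfplumber page (or a look-alike object)."""
    parts = []
    for bbox in split_bboxes(page.bbox, n_columns):
        # An empty crop may yield an empty or missing string; normalise it to "".
        parts.append(page.crop(bbox).extract_text() or "")
    return "\n".join(parts)


# Usage (hypothetical file name; pages are 0-indexed, so page 63 is index 62):
# import pdfplumber
# with pdfplumber.open("report.pdf") as pdf:
#     print(extract_columns(pdf.pages[62]))
```

If the columns are not equal in width, the split positions would have to be tuned by hand for the page in question.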
@samkit-jain excellent work-around!
Is there no general solution to this problem though? I have a dataset of 100 or so scientific papers I'm trying to parse, and having text extraction just smush the columns together is a bit problematic when I'm trying to see certain parts in context.
@cmicek1 Could you please share a sample PDF that you are dealing with and elaborate on the problem you are facing, along with reproducible code and the result you are getting?
This StackOverflow page provides some interesting insight into this very question: https://stackoverflow.com/questions/22675690/if-identifying-text-structure-in-pdf-documents-is-so-difficult-how-do-pdf-reade
@samkit-jain How might I do this conditionally, i.e. only when a page actually has column text?
@danielbellhv It would depend on the PDFs you are dealing with. A sophisticated solution might be to use a layout analysis algorithm to identify whether a page is multi-column or not. A simpler solution could be to crop the page, keeping only the middle 5% of its width, and run text extraction on that strip to see whether it contains any text. If there is no text, the page could have 2 columns. Of course, you will have to tweak the 5% and see what best fits your needs. This also assumes that the full page is in a 2-column layout.
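The middle-strip check described above could be sketched roughly as follows. The function names `middle_strip_bbox` and `looks_two_column` are hypothetical, the 5% default is only the starting value suggested in the comment, and the check assumes the gutter sits at the horizontal centre of the page.

```python
def middle_strip_bbox(page_bbox, fraction=0.05):
    """Return the bbox of a vertical strip, centred on the page, `fraction` of its width."""
    x0, top, x1, bottom = page_bbox
    strip_width = (x1 - x0) * fraction
    centre = (x0 + x1) / 2
    return (centre - strip_width / 2, top, centre + strip_width / 2, bottom)


def looks_two_column(page, fraction=0.05):
    """Heuristic: True if the central strip of a pdfplumber page contains no text."""
    strip = page.crop(middle_strip_bbox(page.bbox, fraction))
    text = strip.extract_text() or ""
    return text.strip() == ""


# Usage (hypothetical file name):
# import pdfplumber
# with pdfplumber.open("paper.pdf") as pdf:
#     for page in pdf.pages:
#         print(page.page_number, looks_two_column(page))
```

A blank page would also pass this check, so it may be worth confirming that the page has some text before trusting a `True` result.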
Would you mind copying and pasting that as a comment under my SO post, please? I will try the middle 5% :)
I have since solved my own sub-issue using pdfminer. Answered here. Thanks for your input.
You can use PyPDF2 instead of pdfplumber; it reads the left side of the PDF first and then the right side, just like humans do. Hence, if you use PyPDF2, there is no need to split the page.
@ameymn Can you post a link to example code for that specific PyPDF2 usage?
You can read this documentation if you need any help |
Did you get any solution for this? I need to extract text from a PDF with a 2-column format.
Hi @cmicek1, did you get any solution for this?
I extract the text of the following page:

I used the following code:
It produces this:

I want to turn the two-column ending section (see below) into two rows

such that it is
I am not sure whether this is possible. It would be great if there were a function that returns a boolean indicating whether the ending section is in two columns. Any suggestion will be appreciated!
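As far as this thread shows, pdfplumber has no such built-in function, but a rough heuristic can be sketched on top of `extract_words()`, whose word dictionaries carry `x0`, `x1`, and `top` coordinates. The names `has_centre_gutter` and `ending_is_two_column`, the 30% "ending" cutoff, and the 5% gutter width are all assumptions to be tuned per document.

```python
def has_centre_gutter(words, x_left, x_right, gutter_fraction=0.05):
    """True if words sit on both sides of a central strip and none overlaps it."""
    centre = (x_left + x_right) / 2
    half = (x_right - x_left) * gutter_fraction / 2
    g0, g1 = centre - half, centre + half
    crossing = any(w["x0"] < g1 and w["x1"] > g0 for w in words)
    on_left = any(w["x1"] <= g0 for w in words)
    on_right = any(w["x0"] >= g1 for w in words)
    return on_left and on_right and not crossing


def ending_is_two_column(page, tail_fraction=0.3, gutter_fraction=0.05):
    """Heuristic: does the bottom `tail_fraction` of the page look like two columns?"""
    x0, top, x1, bottom = page.bbox
    cutoff = bottom - (bottom - top) * tail_fraction
    # Keep only the words that start inside the bottom part of the page.
    tail_words = [w for w in page.extract_words() if w["top"] >= cutoff]
    return has_centre_gutter(tail_words, x0, x1, gutter_fraction)
```

When this returns `True`, the bottom region could be cropped into a left and a right half and each half extracted separately, so the two columns come out one after the other.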