problem extracting text on a two columns layout #112

Closed
manentai opened this issue Dec 26, 2023 · 6 comments

Comments

@manentai

I have opened a question here: https://discourse.julialang.org/t/how-to-extract-data-from-pdf-with-two-columns/108008
but maybe this is the right place...

from this slide

using this:

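using PDFIO

# src is the path to the PDF file
doc = pdDocOpen(src)
docinfo = pdDocGetInfo(doc)
npage = pdDocGetPageCount(doc)
io = IOBuffer()
# Append the extracted text of every page to the same buffer
for i = 1:npage
    page = pdDocGetPage(doc, i)
    pdPageExtractText(io, page)
end
pdDocClose(doc)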

I get this:
gives a poor performance: ● SAM: The solution ○ EU incubators / accelerators: ~1200 market size ○ EU VC: ~500 ● SOM: ○ IT incubators / ● TAM:○ 1.35M tech startups accelerators: ~250, worldwide ○ IT VCs: ~60 ○ ~7000 incubators / accelerators worldwide, ○ proxies to enter and counting 5000 startups ○ ~2500 VC firms, half of them on growth giano.rocks

which more or less ignores the two-column layout. I am trying to work out from the API whether there is a way to extract all the objects on the page and then extract the text from those objects.
What am I missing?

@sambitdash
Owner

#17 and #2 are enhancements recorded on similar requirements. Hence, closing this issue as a duplicate.

@manentai
Author

manentai commented Dec 26, 2023

Thanks for the pointers. Since pdPageEvalContent from #17 does not seem to exist anymore, is pdPageGetContentObjects the right function to get the structure now? If so, how do I iterate over the PDPageObjectGroup it returns?

@sambitdash
Owner

sambitdash commented Dec 26, 2023

pdPageEvalContent is exported and is very much available. However, this method is a callback and depends on the PDF page object type. It is not documented and can only be used by someone who understands PDF specifications well enough to work with the callback.

@manentai
Author

manentai commented Dec 26, 2023

sorry, I cannot find it here: https://docs.juliahub.com/PDFIO/cmOJE/0.1.14/, and when I try to use it I get:

UndefVarError: `pdPageEvalContent` not defined

Stacktrace:
[1] top-level scope

@manentai
Author

> pdPageEvalContent is exported and is very much available. However, this method is a callback and depends on the PDF page object type. It is not documented and can only be used by someone who understands PDF specifications well enough to work with the callback.

ok fair enough I guess...

@sambitdash
Owner

function pdPageExtractText(io::IO, page::PDPage)
    state = pdPageEvalContent(page)
    show_text_layout!(io, state)
    return io
end

pdPageExtractText calls pdPageEvalContent, which populates the content objects in the state object. show_text_layout! does the actual layout translation. It has no understanding of multi-column text; it just maps the text from left to right and top to bottom. PDF as a format has no notion of columns. However, you can take the content objects and render the text objects using your own heuristics. That is the suggestion given in #17.
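
For anyone landing here, a rough sketch of what such a heuristic could look like. It does not touch PDFIO at all: TextFragment, guess_split and read_by_columns are hypothetical names made up for this example, and the coordinates in the sample data are invented. It assumes you have already obtained each piece of text together with its position from the content objects (how to do that through the state returned by pdPageEvalContent is not shown, since that part is undocumented) and that the page has at most two columns.

```julia
# Hypothetical record for a piece of positioned text; NOT a PDFIO type.
struct TextFragment
    x::Float64   # left edge, in PDF user-space units
    y::Float64   # baseline; PDF y grows upward from the bottom of the page
    text::String
end

# Guess the column boundary as the midpoint of the widest horizontal gap
# between the distinct left edges (an assumption that suits simple slides).
function guess_split(frags::Vector{TextFragment})
    xs = sort(unique(f.x for f in frags))
    length(xs) < 2 && return Inf   # only one x position: treat as one column
    gaps = diff(xs)
    i = argmax(gaps)
    return (xs[i] + xs[i + 1]) / 2
end

# Read the whole left column first, then the whole right column.
function read_by_columns(frags::Vector{TextFragment};
                         split_x::Real = guess_split(frags))
    # Top of the page first (larger y), then left to right within a line.
    reading_order(fs) = sort(fs; by = f -> (-f.y, f.x))
    left  = reading_order(filter(f -> f.x <  split_x, frags))
    right = reading_order(filter(f -> f.x >= split_x, frags))
    return join((f.text for f in vcat(left, right)), "\n")
end

# Invented positions, loosely modelled on the slide above.
frags = [
    TextFragment(400.0, 700.0, "SAM:"),
    TextFragment(50.0,  700.0, "TAM:"),
    TextFragment(420.0, 680.0, "EU VC: ~500"),
    TextFragment(70.0,  680.0, "1.35M tech startups worldwide"),
]
println(read_by_columns(frags))
```

The widest-gap rule is only one possible choice; for pages with more than two columns, or with text that spans the full width, passing split_x by hand or clustering the x positions would likely work better.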
