problem extracting text on a two columns layout #112

manentai · 2023-12-26T09:56:56Z

I have opened a question here: https://discourse.julialang.org/t/how-to-extract-data-from-pdf-with-two-columns/108008
but maybe this is the right place...

from this slide

using this:

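using PDFIO

# src is the path to the PDF file
doc = pdDocOpen(src)
docinfo = pdDocGetInfo(doc)
npage = pdDocGetPageCount(doc)
io = IOBuffer()
# Append the extracted text of every page to the buffer
for i = 1:npage
    page = pdDocGetPage(doc, i)
    pdPageExtractText(io, page)
end
pdDocClose(doc)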

I get this:
gives a poor performance: ● SAM: The solution ○ EU incubators / accelerators: ~1200 market size ○ EU VC: ~500 ● SOM: ○ IT incubators / ● TAM:○ 1.35M tech startups accelerators: ~250, worldwide ○ IT VCs: ~60 ○ ~7000 incubators / accelerators worldwide, ○ proxies to enter and counting 5000 startups ○ ~2500 VC ﬁrms, half of them on growth giano.rocks

which ignores the two-column layout. I am trying to understand from the API whether there is a way to extract all the objects in the page and then extract the text from those objects.
What am I missing?


sambitdash · 2023-12-26T16:07:56Z

#17 and #2 are enhancements recorded on similar requirements. Hence, closing this issue as a duplicate.

manentai · 2023-12-26T18:01:11Z

Thanks for the pointers. Since pdPageEvalContent from #17 does not seem to exist anymore, is pdPageGetContentObjects the right one to get the structure now? If so, how do I iterate over the PDPageObjectGroup it returns?

sambitdash · 2023-12-26T18:29:44Z

pdPageEvalContent (PDFIO.jl/src/PDPage.jl, line 11 in 826552f) is exported and is very much available. However, this method is a callback and depends on the PDF page object type. It is not documented and can only be used by someone who understands PDF specifications well enough to work with the callback.

manentai · 2023-12-26T18:34:18Z

Sorry, I cannot find it here: https://docs.juliahub.com/PDFIO/cmOJE/0.1.14/, and when I try to use it I get:

UndefVarError: `pdPageEvalContent` not defined

Stacktrace:
[1] top-level scope

manentai · 2023-12-26T18:37:16Z

> pdPageEvalContent (PDFIO.jl/src/PDPage.jl, line 11 in 826552f) is exported and is very much available. However, this method is a callback and depends on the PDF page object type. It is not documented and can only be used by someone who understands PDF specifications well enough to work with the callback.

ok fair enough I guess...

sambitdash · 2023-12-26T18:54:15Z

function pdPageExtractText(io::IO, page::PDPage)
    state = pdPageEvalContent(page)
    show_text_layout!(io, state)
    return io
end

pdPageExtractText calls pdPageEvalContent, which populates the content objects in the state object. show_text_layout! then does the actual layout translation. It has no understanding of multi-column text: it just maps the text from left to right and top to bottom. PDF as a format has no notion of columns. However, you can take the content objects and render the text objects using your own heuristics. That is the suggestion given in #17.
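
To make the "your own heuristics" part concrete, here is a minimal sketch of one possible column heuristic. It does not use PDFIO at all: `TextFragment`, `column_split` and `reading_order` are hypothetical names made up for illustration, and the sketch assumes you have already recovered the position and string of each text run from the page's content objects (that step is not shown, since it depends on the undocumented callback mentioned above). The coordinates in the example are invented. The idea is to look for the widest horizontal gap between the left edges of the fragments, treat it as the column boundary, and then read each column top to bottom.

```julia
# Hypothetical fragment type (not part of PDFIO): one positioned run of text.
struct TextFragment
    x::Float64    # left edge in PDF user space (origin at bottom-left)
    y::Float64    # baseline; a larger y is higher on the page
    text::String
end

# Guess a column boundary: the midpoint of the widest gap between fragment
# left edges. Returns nothing when no gap is at least `min_gap` wide.
function column_split(frags::Vector{TextFragment}; min_gap::Float64 = 50.0)
    xs = sort(unique(f.x for f in frags))
    length(xs) < 2 && return nothing
    gap, i = findmax(diff(xs))
    return gap >= min_gap ? (xs[i] + xs[i + 1]) / 2 : nothing
end

# Order fragments column by column; fall back to plain top-to-bottom,
# left-to-right order when no column boundary is found.
function reading_order(frags::Vector{TextFragment}; min_gap::Float64 = 50.0)
    boundary = column_split(frags; min_gap = min_gap)
    # Top to bottom means descending y; ties are broken left to right.
    by_position(fs) = sort(fs; by = f -> (-f.y, f.x))
    boundary === nothing && return by_position(frags)
    left = filter(f -> f.x < boundary, frags)
    right = filter(f -> f.x >= boundary, frags)
    return vcat(by_position(left), by_position(right))
end

# Invented coordinates for a two-column slide.
frags = [
    TextFragment(300.0, 700.0, "SAM:"),
    TextFragment(40.0, 700.0, "TAM:"),
    TextFragment(320.0, 680.0, "EU VC: ~500"),
    TextFragment(60.0, 680.0, "1.35M tech startups"),
]
ordered = reading_order(frags)
println(join((f.text for f in ordered), "\n"))
```

With these made-up positions the left column ("TAM:" and its bullet) is printed before the right column ("SAM:" and its bullet). Using only left edges is a crude choice: a real page would likely need fragment widths, a tunable `min_gap`, and handling for titles that span both columns.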

sambitdash closed this as completed Dec 26, 2023

sambitdash added the duplicate label Dec 26, 2023
