Extracting text with a specific font with a rectangular region as selection area #55

kskyten · 2019-04-24T09:05:39Z

It is common to use different fonts to denote semantic meaning (e.g. italics for emphasis or a larger font size for section titles). Is it possible to extract text that is in a specific font and size? Also, is it possible to specify a region to extract from? I would like to be able to, for example, extract all the italic text inside a region.

sambitdash · 2019-04-24T09:14:27Z

Not as an API. However, it's not hard to implement or extend pdPageExtractText for these purposes. If you plan to submit a PR, please feel free to do so.

kskyten · 2019-04-24T09:20:19Z

Can you give some hints on how to implement this?

sambitdash · 2019-04-24T09:32:57Z

pdPageEvalContent is essentially the method that evaluates the content stream and populates intermediate values into the graphics state stack. This stack / state is called GState.

You have to pass a rectangular selection area, a font name, or a font attribute as a parameter on the GState. Internally, GState is a stack of Dictionary objects. In the evalContent!(tr::PDPageTextRun, state::GState) method, filter out the text that does not fit your selection criteria and keep only the TextRuns that are relevant. Once that's done, show_text_layout! will sort the relevant text areas and show only the selected text, which is held in GState[end][:text_layout].
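The filter-then-sort idea described above can be sketched without PDFIO at all. Everything in this block (Rect, Run, overlaps, select_runs, reading_order) is a hypothetical stand-in invented for illustration, not a PDFIO type or function; it only shows the shape of the logic, assuming each text run carries a bounding box and an italic flag.

```julia
# Hypothetical stand-ins; these are NOT PDFIO types.
struct Rect
    xmin::Float32
    ymin::Float32
    xmax::Float32
    ymax::Float32
end

struct Run
    text::String
    box::Rect
    italic::Bool
end

# Two axis-aligned rectangles overlap when they overlap on both axes.
overlaps(a::Rect, b::Rect) =
    a.xmin <= b.xmax && b.xmin <= a.xmax &&
    a.ymin <= b.ymax && b.ymin <= a.ymax

# Keep only the runs that satisfy both the font and the region criteria.
select_runs(runs, region::Rect; italic::Bool=true) =
    [r for r in runs if r.italic == italic && overlaps(r.box, region)]

# Order top-to-bottom, then left-to-right (PDF y grows upward).
reading_order(runs) = sort(runs; by = r -> (-r.box.ymax, r.box.xmin))

runs = [
    Run("Title",    Rect(50, 700, 200, 720), false),
    Run("emphasis", Rect(120, 600, 180, 612), true),
    Run("see",      Rect(50, 600, 70, 612), true),
    Run("footnote", Rect(50, 40, 120, 50), true),
]
region = Rect(0, 500, 300, 800)
selected = reading_order(select_runs(runs, region))
println(join((r.text for r in selected), " "))
```

In PDFIO the equivalent decision would be made at the point where a TextLayout is pushed onto the :text_layout heap, so that the later sorting step only ever sees the selected runs.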

kskyten · 2019-04-24T17:30:42Z

I made some progress with your advice, but then I got stuck. I added a query font to the state and I check whether it matches. The problem is that now the following text commands for non-matching fonts will not work, so I also added a variable to the state to indicate whether the current font matches the query. This did not work. I get a string "\0\0\0\0...\0" back. The code is also not very elegant, but I just wanted to see if I could get it working first. I mostly just used the existing functions, except for adding the state variables.

Here is my attempt:

import PDFIO.PD: evalContent!, GState,
       show_text_layout!, PDXObject, get_font, get_TextBox,
       TextLayout, offset_text_pos!

import Base.==

function (==)(a::PDPageElement, b::PDPageElement)
    ret = true
    for f in fieldnames(typeof(a))
        if getproperty(a, f) != getproperty(b, f)
            ret = false
        end
    end
    return ret
end

@inline function evalContent!(pdo::PDPageElement{:Tf}, state::GState)
    src = get(state, :source, Union{PDPage, PDXObject})
    fontname = pdo.operands[1]
    font = get_font(src, fontname)
    query = get(state, :query, PDPageElement{:Tf})

    if (font === CosNull) || (font != query)
        state[:matching_font] = false
        return state
    end

    state[:matching_font] = true
    state[:font] = (fontname, font)
    fontsize = get(pdo.operands[2])
    # PDF Spec expects any number so better to standardize to Float32
    state[:fontsize] = Float32(fontsize)
    return state
end

@inline function evalContent!(tr::PDPageTextRun, state::GState)
    if get(state, :matching_font, Bool)
        evalContent!(tr.elem, state)
        tfs = get(state, :fontsize, 0f0)
        th  = get(state, :Tz, Float32)/100f0
        ts  = get(state, :Ts, Float32)
        tc  = get(state, :Tc, Float32)
        tw  = get(state, :Tw, Float32)
        tm  = get(state, :Tm, Matrix{Float32})
        ctm = get(state, :CTM, Matrix{Float32})
        trm = tm*ctm

        (fontname, font) = get(state, :font,
                               (cn"", CosNull),
                               Tuple{CosName, PDFont})
        heap = get(state, :text_layout, Vector{TextLayout})
        text, w, h = get_TextBox(tr.ss, font, tfs, tc, tw, th)
        d = get(state, :h_profile, Dict{Int, Int})
        ih = round(Int, h*10)
        d[ih] = get(d, ih, 0) + length(text)
        tb = [0f0 0f0 1f0; w 0f0 1f0; w h 1f0; 0f0 h 1f0]*trm
        if !get(state, :in_artifact, false)
            tl = TextLayout(tb[1,1], tb[1,2], tb[2,1], tb[2,2],
                            tb[3,1], tb[3,2], tb[4,1], tb[4,2],
                            text, fontname, font.flags)
            push!(heap, tl)
        end
        offset_text_pos!(w, 0f0, state)
        return state
    else
        return state
    end
end

@inline function evalContent!(pdo::PDPageElement{:TD}, state::GState)
    if get(state, :matching_font, Bool)
        tx = Float32(get(pdo.operands[1]))
        ty = Float32(get(pdo.operands[2]))

        state[:TL] = -ty
        set_text_pos!(tx, ty, state)
    else
        return state
    end
end

function evaluate(src, objs, query)
    state = GState{:PDFIO}()
    state[:source] = src
    state[:query] = query
    state[:matching_font] = false

    for o in objs
        evalContent!(o, state)
    end

    io = IOBuffer()
    show_text_layout!(io, state)
    String(io.data)
end

sambitdash · 2019-04-24T18:17:11Z

You can initialize :clipping_rect in pdPageExtractText

You can go to this location:

PDFIO.jl/src/PDPageElement.jl

Line 653 in 95000b6

if !get(state, :in_artifact, false)

This code will, for example, exclude all italic fonts.

    if !get(state, :in_artifact, false) && !pdFontIsItalic(font)
        tl = TextLayout(tb[1,1], tb[1,2], tb[2,1], tb[2,2],
                        tb[3,1], tb[3,2], tb[4,1], tb[4,2],
                        text, fontname, font.flags)
        r = CDRect(tl)
        if intersects(r, get(state, :clipping_rect, CDRect{Float32}))
            push!(heap, tl)
        end
    end

To my belief, you do not need to override any other method. Overriding them affects the PDF graphics state, and the transformation matrices may be seriously affected, which in turn affects the renderer logic. Without understanding how the PDF specification handles page rendering, any such changes can be detrimental to the output. I recommend reading the chapter on PDF text rendering.
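A toy model may help show why skipping state updates for non-matching fonts goes wrong. The types and the collect_runs function below are hypothetical and far simpler than a real content stream (one horizontal position instead of text and transformation matrices); the sketch only assumes that every shown string advances the text position, whether or not it is kept.

```julia
# Hypothetical miniature of a content stream; NOT PDFIO types.
abstract type Op end
struct SetFont <: Op; name::String; end
struct Move <: Op; dx::Float32; end
struct Show <: Op; text::String; width::Float32; end

# Collect (x, text) for runs in `want_font`.
# update_always=true : position is advanced for every op (filter at the end).
# update_always=false: position is advanced only while the font matches.
function collect_runs(ops; want_font::String, update_always::Bool)
    x = 0f0
    font = ""
    out = Tuple{Float32,String}[]
    for op in ops
        if op isa SetFont
            font = op.name
        elseif op isa Move
            (update_always || font == want_font) && (x += op.dx)
        elseif op isa Show
            font == want_font && push!(out, (x, op.text))
            (update_always || font == want_font) && (x += op.width)
        end
    end
    return out
end

ops = Op[SetFont("Regular"), Show("Hello ", 30),
         SetFont("Italic"),  Show("world", 25)]

correct = collect_runs(ops; want_font="Italic", update_always=true)
broken  = collect_runs(ops; want_font="Italic", update_always=false)
println(correct)  # "world" placed after the skipped "Hello "
println(broken)   # "world" placed at the origin
```

Filtering only at the point where a run is recorded keeps the positions of the surviving runs intact, which is the reason for changing the single condition rather than the operator handlers.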

kskyten · 2019-04-24T19:20:34Z

Thank you! It works great.

sambitdash · 2019-04-24T20:06:47Z

Now that it works, you can modify pdPageExtractText so that it takes a clipping rectangle or certain font characteristics as input parameters, and submit a PR.

Nosferican · 2020-11-09T22:07:07Z

Is there an example of how to extract all bold text in a pdf page?

sambitdash · 2020-11-11T06:35:41Z

You can initialize :clipping_rect in pdPageExtractText

You can go to this location:

PDFIO.jl/src/PDPageElement.jl

Line 653 in 95000b6

if !get(state, :in_artifact, false)

This code will, for example, exclude all italic fonts.

    if !get(state, :in_artifact, false) && !pdFontIsItalic(font)
        tl = TextLayout(tb[1,1], tb[1,2], tb[2,1], tb[2,2],
                        tb[3,1], tb[3,2], tb[4,1], tb[4,2],
                        text, fontname, font.flags)
        r = CDRect(tl)
        if intersects(r, get(state, :clipping_rect, CDRect{Float32}))
            push!(heap, tl)
        end
    end

To my belief, you do not need to override any other method. Overriding them affects the PDF graphics state, and the transformation matrices may be seriously affected, which in turn affects the renderer logic. Without understanding how the PDF specification handles page rendering, any such changes can be detrimental to the output. I recommend reading the chapter on PDF text rendering.

Use pdFontIsBold instead of pdFontIsItalic in the code.

Nosferican · 2020-11-12T18:56:35Z

I ended up working with the text itself but it would be nice to have a code snippet / example from the PDF page to running the custom pdPageExtractText.
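A possible starting point for such a snippet, sketched with the public PDFIO calls pdDocOpen, pdDocGetPage, pdPageExtractText and pdDocClose. The helper name extract_page_text is hypothetical, and the font or region filter is assumed to live in a locally modified copy of PDFIO (for example the pdFontIsBold condition discussed in this thread); with an unmodified package the same call returns all of the text on the page.

```julia
using PDFIO

# Hypothetical helper: returns the extracted text of one page as a String.
# Any font/region filtering is assumed to happen inside a patched PDFIO.
function extract_page_text(path::AbstractString, pageno::Integer)
    doc = pdDocOpen(path)
    try
        page = pdDocGetPage(doc, pageno)
        io = IOBuffer()
        pdPageExtractText(io, page)
        # take! returns only the bytes actually written to the buffer
        return String(take!(io))
    finally
        # close the document even if extraction throws
        pdDocClose(doc)
    end
end
```

Reading the buffer with take!(io) rather than io.data avoids picking up unused bytes from the buffer's backing array.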

sambitdash changed the title ~~Extracting text with a specific font~~ Extracting text with a specific font with a rectangular region as selection area Apr 24, 2019

sambitdash added the enhancement label Apr 24, 2019

sambitdash mentioned this issue Apr 24, 2019

Extracting boxes #57

Closed
