Extracting text with a specific font with a rectangular region as selection area #55
Comments
Not as an API. However, it's not hard to implement or extend |
Can you give some hints on how to implement this? |
You have to pass a rectangular selection area or font name or attribute as a parameter on the |
I made some progress with your advice, but then I got stuck. I added a query font to the state and I check whether it matches. The problem is that now the following text commands for non-matching fonts will not work, so I also added a variable to the state to indicate whether the current font matches the query. This did not work. I get a string
Here is my attempt:

    import PDFIO.PD: evalContent!, GState,
        show_text_layout!, PDXObject, get_font, get_TextBox,
        TextLayout, offset_text_pos!
    import Base.==

    function (==)(a::PDPageElement, b::PDPageElement)
        ret = true
        for f in fieldnames(typeof(a))
            if getproperty(a, f) != getproperty(b, f)
                ret = false
            end
        end
        return ret
    end

    @inline function evalContent!(pdo::PDPageElement{:Tf}, state::GState)
        src = get(state, :source, Union{PDPage, PDXObject})
        fontname = pdo.operands[1]
        font = get_font(src, fontname)
        query = get(state, :query, PDPageElement{:Tf})
        if (font === CosNull) || (font != query)
            state[:matching_font] = false
            return state
        end
        state[:matching_font] = true
        state[:font] = (fontname, font)
        fontsize = get(pdo.operands[2])
        # PDF Spec expects any number so better to standardize to Float32
        state[:fontsize] = Float32(fontsize)
        return state
    end

    @inline function evalContent!(tr::PDPageTextRun, state::GState)
        if get(state, :matching_font, Bool)
            evalContent!(tr.elem, state)
            tfs = get(state, :fontsize, 0f0)
            th = get(state, :Tz, Float32)/100f0
            ts = get(state, :Ts, Float32)
            tc = get(state, :Tc, Float32)
            tw = get(state, :Tw, Float32)
            tm = get(state, :Tm, Matrix{Float32})
            ctm = get(state, :CTM, Matrix{Float32})
            trm = tm*ctm
            (fontname, font) = get(state, :font,
                                   (cn"", CosNull),
                                   Tuple{CosName, PDFont})
            heap = get(state, :text_layout, Vector{TextLayout})
            text, w, h = get_TextBox(tr.ss, font, tfs, tc, tw, th)
            d = get(state, :h_profile, Dict{Int, Int})
            ih = round(Int, h*10)
            d[ih] = get(d, ih, 0) + length(text)
            tb = [0f0 0f0 1f0; w 0f0 1f0; w h 1f0; 0f0 h 1f0]*trm
            if !get(state, :in_artifact, false)
                tl = TextLayout(tb[1,1], tb[1,2], tb[2,1], tb[2,2],
                                tb[3,1], tb[3,2], tb[4,1], tb[4,2],
                                text, fontname, font.flags)
                push!(heap, tl)
            end
            offset_text_pos!(w, 0f0, state)
            return state
        else
            return state
        end
    end

    @inline function evalContent!(pdo::PDPageElement{:TD}, state::GState)
        if get(state, :matching_font, Bool)
            tx = Float32(get(pdo.operands[1]))
            ty = Float32(get(pdo.operands[2]))
            state[:TL] = -ty
            set_text_pos!(tx, ty, state)
        else
            return state
        end
    end

    function evaluate(src, objs, query)
        state = GState{:PDFIO}()
        state[:source] = src
        state[:query] = query
        state[:matching_font] = false
        for o in objs
            evalContent!(o, state)
        end
        io = IOBuffer()
        show_text_layout!(io, state)
        String(io.data)
    end

You can initialize You can go to this location: Line 653 in 95000b6
This code will for example exclude all Italic fonts.
To my knowledge, you do not need to override any other method. Overriding them affects the PDF graphics state, and the transformation matrices may be seriously disturbed, which in turn breaks the renderer logic. Without understanding how the PDF specification describes page rendering, any such change can be detrimental to the output. I recommend reading the chapter on PDF text rendering in the specification. |
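To illustrate the idea of filtering on font attributes rather than touching the graphics state, here is a minimal, self-contained sketch. It is not PDFIO code: the `Run` struct and `filter_runs` function are hypothetical stand-ins for whatever the text-run evaluation has at hand. Only the bit position of the Italic flag is taken from the font descriptor flags defined in the PDF specification.

```julia
# Italic is bit 7 of the font descriptor flags in the PDF specification.
# Bit positions are 1-based there, so bit 7 corresponds to 1 << 6.
const FLAG_ITALIC = UInt32(1) << 6

# Hypothetical stand-in for a decoded text run; this is not a PDFIO type.
struct Run
    text::String
    fontname::String
    flags::UInt32
end

is_italic(r::Run) = (r.flags & FLAG_ITALIC) != 0

# Keep only the runs that satisfy `pred`, e.g. `!is_italic` to drop italics.
filter_runs(pred, runs) = [r for r in runs if pred(r)]

# Made-up sample data: 0x22 = Serif + Nonsymbolic, 0x62 adds Italic.
runs = [Run("Hello ", "Times-Roman", 0x22),
        Run("world", "Times-Italic", 0x62),
        Run("!", "Times-Roman", 0x22)]

kept = join(r.text for r in filter_runs(!is_italic, runs))
println(kept)
```

Because the filter only decides whether a run is collected, every run is still evaluated, so the text position keeps advancing for the runs that are dropped.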
Thank you! It works great. |
Now that it worked, you can make a modification to the |
Is there an example of how to extract all bold text in a pdf page? |
Use |
I ended up working with the text itself but it would be nice to have a code snippet / example from the PDF page to running the custom |
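For the bold-text question, one possible approach is sketched below under stated assumptions. The PDF specification only defines a ForceBold flag (bit 19 of the font descriptor flags), which many files leave unset, so the sketch also falls back to a name-based heuristic. `looks_bold` and the sample triples are hypothetical and not part of PDFIO.

```julia
# ForceBold is bit 19 of the font descriptor flags in the PDF specification.
const FLAG_FORCE_BOLD = UInt32(1) << 18

# Heuristic (an assumption, not a PDFIO function): treat a font as bold if
# ForceBold is set or its PostScript name mentions "bold".
function looks_bold(fontname::AbstractString, flags::Integer)
    (flags & FLAG_FORCE_BOLD) != 0 && return true
    # Drop a subset prefix such as "ABCDEF+" before inspecting the name.
    base = last(split(fontname, '+'))
    return occursin("bold", lowercase(base))
end

# Made-up (text, fontname, flags) triples standing in for extracted runs.
runs = [("Title", "ABCDEF+Arial-BoldMT", 0x20),
        ("body text", "Arial", 0x20),
        ("forced", "Custom", 0x40020)]

bold_text = [t for (t, name, flags) in runs if looks_bold(name, flags)]
println(bold_text)
```

The name check is only a heuristic; fonts with non-descriptive names and no ForceBold flag will be missed.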
It is common to use different fonts to denote semantic meaning (e.g., italics for emphasis or a larger font size for section titles). Is it possible to extract text that is in a specific font and size? Also, is it possible to specify a region to extract from? I would like to be able to, for example, extract all the italic text inside a region.
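The request can be made concrete with a small, self-contained sketch: given text boxes with corner coordinates and font flags, keep those that are italic and lie entirely inside a rectangle. `Rect`, `Box` and `select_italic_in` are hypothetical names for illustration only and are not PDFIO types or functions; the coordinates are made up.

```julia
# Axis-aligned selection rectangle in page coordinates (hypothetical type).
struct Rect
    xmin::Float32
    ymin::Float32
    xmax::Float32
    ymax::Float32
end

# Hypothetical text box: four corner points plus the text and font flags.
struct Box
    xs::NTuple{4,Float32}
    ys::NTuple{4,Float32}
    text::String
    flags::UInt32
end

# Italic is bit 7 of the font descriptor flags in the PDF specification.
const FLAG_ITALIC = UInt32(1) << 6

# A box is selected only if all four corners fall inside the rectangle.
inside(b::Box, r::Rect) =
    all(r.xmin <= x <= r.xmax for x in b.xs) &&
    all(r.ymin <= y <= r.ymax for y in b.ys)

italic(b::Box) = (b.flags & FLAG_ITALIC) != 0

select_italic_in(boxes, r::Rect) =
    [b.text for b in boxes if inside(b, r) && italic(b)]

boxes = [Box((10, 50, 50, 10), (700, 700, 712, 712), "emphasis", 0x62),
         Box((10, 50, 50, 10), (100, 100, 112, 112), "footer", 0x62),
         Box((60, 90, 90, 60), (700, 700, 712, 712), "plain", 0x22)]

selected = select_italic_in(boxes, Rect(0, 650, 300, 800))
println(selected)
```

Requiring all four corners to be inside is a design choice; testing for overlap instead would also pick up runs that straddle the edge of the region.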