word-level font names and heights #28
Comments
I guess with word heights I'm going back and forth on averaging them or taking the mode; left the latter in for the moment. |
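To make the trade-off concrete, here is a minimal sketch of the two options on plain char dicts that carry a `height` key. `word_height` is a hypothetical helper written for this illustration, not the code in the linked commit.

```python
from statistics import mean, mode


def word_height(chars, method="mode"):
    """Collapse per-character heights into a single value for the word.

    Hypothetical helper: `chars` is assumed to be a list of dicts with a
    numeric "height" key.
    """
    # Round first so tiny float differences do not split the mode.
    heights = [round(c["height"], 2) for c in chars]
    if method == "mode":
        return mode(heights)
    return mean(heights)


# Three full-height letters followed by one smaller character.
chars = [
    {"text": "H", "height": 12.0},
    {"text": "e", "height": 12.0},
    {"text": "y", "height": 12.0},
    {"text": "2", "height": 8.0},
]

print(word_height(chars, "mode"))
print(word_height(chars, "mean"))
```

The mode ignores the single odd character, while the mean is pulled toward it, which is the argument for keeping the mode as the default.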
Thanks! I like this. For testing's sake: Do you have shareable examples of PDFs where chars that should belong to the same word either have different heights or fontnames? |
So I still haven't heard back about the files that originally required this. I could pretty easily just make up a sample pdf that failed the font height test, though obviously having an example would be better... The other time this stuff (can) come up is when the word tolerance is set too high and words run together inadvertently--though only if adjacent cells have different fonts. Will look around a bit. |
No worries. Thinking through this a bit. I'm tempted to, by default, group words by fonts, size, and color. (Yes, upcoming versions of

def extract_words(chars,
    x_tolerance=DEFAULT_X_TOLERANCE,
    y_tolerance=DEFAULT_Y_TOLERANCE,
    keep_blank_chars=False,
    match_fontsize=True,
    match_fontcolor=True,
    match_fontname=True
)

That'd mean losing some of the flexibility of, e.g.,

page.extract_words()

.... might return ...

[ {
    "text": "Hello",
    "fontsize": 12,
    "fontname": "ArialBold",
    "fontcolor": "#000000"
} ]

... while ...

page.extract_words(match_fontsize=False)

.... would return ...

[ {
    "text": "Hello",
    "fontname": "ArialBold",
    "fontcolor": "#000000"
} ]

What do you think? Too inflexible? |
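The grouping idea in this proposal can be sketched in plain Python. This is an illustration only, under several assumptions: chars are dicts with `text`, `x0`, `x1`, `fontname`, and `size` keys, all on a single line; `group_chars` and its `match_attrs` parameter are hypothetical names, not the signature proposed above or pdfplumber's actual implementation.

```python
def group_chars(chars, x_tolerance=3, match_attrs=("fontname", "size")):
    """Group the chars of one text line into words.

    A new word starts at a blank char, at a horizontal gap larger than
    x_tolerance, or when any attribute named in match_attrs changes.
    """
    words, current = [], []

    def flush():
        # Emit the word collected so far, tagged with the matched attributes.
        if current:
            word = {"text": "".join(c["text"] for c in current)}
            for attr in match_attrs:
                word[attr] = current[0][attr]
            words.append(word)
            current.clear()

    for char in sorted(chars, key=lambda c: c["x0"]):
        if char["text"].isspace():
            flush()
            continue
        if current:
            prev = current[-1]
            gap = char["x0"] - prev["x1"]
            changed = any(char[a] != prev[a] for a in match_attrs)
            if gap > x_tolerance or changed:
                flush()
        current.append(char)
    flush()
    return words


# Two adjacent runs with no gap between them but different fonts.
chars = [
    {"text": "H", "x0": 0, "x1": 6, "fontname": "ArialBold", "size": 12},
    {"text": "i", "x0": 6, "x1": 9, "fontname": "ArialBold", "size": 12},
    {"text": "o", "x0": 9, "x1": 15, "fontname": "Arial", "size": 12},
    {"text": "k", "x0": 15, "x1": 21, "fontname": "Arial", "size": 12},
]

print(group_chars(chars))
print(group_chars(chars, match_attrs=()))
```

With the default attributes the font change splits the run into "Hi" and "ok"; with matching turned off the chars merge into "Hiok" and the word dict carries no font keys, mirroring the behavior described above for `match_fontsize=False`.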
I think that's great! Also, I think whatever adjustments might be needed will become more obvious the more pdfs we trawl through... |
I got a different sample of the docs with the font height thing! Going through them, uh, soonish. |
Ok, I have this working in the word_fonts branch here using made up pdfs as tests. Trying to dig up the sample observed in the wild. Am doing this with a custom WordFontError subclassed from RuntimeError, but am open to suggestions... No idea if this will be at all helpful ahead of 0.60 rewrite, but... |
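The thread does not show how the branch uses the exception, so the following is only a guess at the general shape: a `WordFontError` subclassed from `RuntimeError`, raised when the chars of one word disagree on font name. `word_fontname` is a hypothetical helper, and chars are assumed to be dicts with a `fontname` key.

```python
class WordFontError(RuntimeError):
    """Raised when the chars of one word do not share a single font name."""


def word_fontname(chars):
    """Return the one font name shared by all chars, or raise WordFontError."""
    names = {c["fontname"] for c in chars}
    if len(names) != 1:
        raise WordFontError(f"expected one font name, found {sorted(names)}")
    return names.pop()


same = [{"text": "H", "fontname": "Arial"}, {"text": "i", "fontname": "Arial"}]
mixed = [{"text": "H", "fontname": "Arial"}, {"text": "i", "fontname": "Times"}]

print(word_fontname(same))
try:
    word_fontname(mixed)
except WordFontError as err:
    print("mixed fonts:", err)
```

Subclassing `RuntimeError` keeps the error catchable by existing broad handlers while still letting callers catch the font case specifically.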
Ooh, thanks! Will definitely aim to incorporate this (or something close to it) into the next big release. |
Is this in the current version? I am looking for font name and font size per word and not per letter. |
hey @krishnakt031990 I don't think so, though the version I did of it is still here: https://github.com/jsfenfen/pdfplumber/tree/master . I guess there's a minor release that's been added since, I will update when I've got a sec. |
Works perfectly! thanks @jsfenfen. Just have another question regarding the document. Did you try to reverse engineer to build a pdf out of the extracted properties of text? Just wanted some tips to create one if you did look into doing it. |
"Did you try to reverse engineer to build a pdf out of the extracted properties of text?" |
@krishnakt031990 is this a pdf that's been OCR'ed? Fonts aren't very reliable in most of the OCR I've seen--could this have been set there? Also possible this is a pdfminer thing? Can you share a doc that does this? |
@jsvine Is this issue resolved and the functionality added? |
This functionality has not yet been added. I'm certainly open to adding it, but haven't had the time quite yet. |
I wanted this functionality in one of my projects. I have made some changes to the repo code to support it. Should I push them to a branch and create a pull request, so that we can discuss and add it? |
Thanks, @Saqhas! It's definitely worth a discussion and opening a pull request. I'm not certain I'll use your code, but it could definitely be helpful inspiration and I would certainly credit you for that. |
Can we capture words based on font size? E.g., if the font size is 12, I need only the words set in that size. |
@ibrahimshuail See my response to the separate issue you opened, #234 |
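Independently of that answer (which is not reproduced here), once each word carries a font size the filtering itself is simple. A minimal sketch, assuming word dicts with a `fontsize` key as in the proposal earlier in this thread; `words_with_size` is a hypothetical helper, and the key name in a released version may differ.

```python
def words_with_size(words, size, tolerance=0.1):
    """Keep only the word dicts whose font size is within tolerance of size."""
    return [w for w in words if abs(w["fontsize"] - size) <= tolerance]


# Assumed input shape: one dict per word, each with "text" and "fontsize".
words = [
    {"text": "Title", "fontsize": 18},
    {"text": "Hello", "fontsize": 12},
    {"text": "world", "fontsize": 12.0},
]

body_text = [w["text"] for w in words_with_size(words, 12)]
print(body_text)
```

The tolerance is there because extracted sizes are often floats that differ slightly from the nominal point size.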
Closing this now-done issue. Per merged PR above, this feature was added last year! 🎉 |
Having a font for an entire word helps parsing. A lot. Height also helps some.
I took a crack at this here, with some settings. Defaults also may need adjustment.
If you've got thoughts, @jsvine, lemme know and I can clean this up into a pr. Haven't gotten the testing set up yet.
jsfenfen@847a3bb