-
Notifications
You must be signed in to change notification settings - Fork 1.5k
ENH: Add all font metrics for base 14 Type 1 PDF fonts. #3363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
[about failing tests] EDIT 2 The essentials in my view: having more complete font metrics available so that we can properly generate right-aligned and centered output. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3363 +/- ##
==========================================
+ Coverage 96.98% 96.99% +0.01%
==========================================
Files 54 56 +2
Lines 9347 9384 +37
Branches 1713 1714 +1
==========================================
+ Hits 9065 9102 +37
Misses 168 168
Partials 114 114 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
e2658b8 to
8cb825d
Compare
|
Thanks for the further changes. There still are some further changes I would like to see:
|
No problem!
I think about 20%? The resulting formatting is still very like pdfminer (although their original script did not produce unicode codes but ints). EDIT:
This is how it is in the original AFM files:
and
So, the same copyright is in two locations, one after Comment and one after Notice. Only the second, however, includes mention of the trademark. And do notice the lack of a space between "Reserved." and "Helvetica". So, with all that, you indeed get a rather strange copyright detection logic!
OK. Then I think I need more information about what that would entail. What I think it does:
Is that correct? If not, perhaps it might be easier to reach out via discord, where I use the same user name. |
|
In the first step, it would be sufficient for me to just create a new dataclass with the properties outlined previously. Merging it with the existing class can always happen later, but I would like to see real structured data for now. We basically need to define the new dataclass and adapt the generator script to call the dataclass constructor accordingly. Regarding the pdfminer.six code and the copyrights, I will need to have another look at it in the next days. |
|
I'm afraid that means I'm officially out of my depth, I think I just lack the prerequisite knowledge to understand what you intend the generator script to do. If possible, please contact me on Discord one of these days so that we can discuss. And as always, thanks for your comments! |
|
P.S., this is what Google Gemini thinks: |
|
I have just used your script and adapted it to show what I mean: https://gist.github.com/stefan6419846/3d368b26ee5260a7886657909f26ca15 The This will generate the code to create instances of the dataclass shown in #3363 (comment) together with the copyrights. The only things are are currently missing are the dataclass definition itself and the necessary imports as well as the mapping in the footer, but this should be easy enough to add to the script. By the way: We should probably keep the script outside of the original code, but run some linting and testing on it nevertheless to ensure it matches our standards and does not break for some reason. |
|
@stefan6419846 OK, now I understand better! Basically, you want the script to immediately produce the specific Font instances, instead of ending up in an intermediate form. With regard to linting / typing, as far as I can tell, I can just run ruff and mypy locally where it can find the script...? At least, that seems to work here. I understand also, that using this, I can do something like: and then do stuff like: One question; Why do you only add 255 widths? The AFMs contain about 314 widths. Second; so, this really collects a lot of information in the Font class. In my local patches, I need the widths (and I expect also some other metrics) for text wrapping a text stream when flattening an annotation. With my original patches, I changed build_font_width_map to include the character widths that are now in the Font instances in FONT_METRICS. However, the new Font class would not include character widths that are available for embedded fonts. So, how would you proceed with this? I see two ways forward:
EDIT
|
Yes, although for the repository and its CI/CD, we might need to extend the current configuration. I am open to help with this if desired/required.
I used a mix of the original pdfminer.six code and your code for writing my script. The limitation is from pdfminer.six. This seems to mostly eliminate the characters with the code
I would split this into two separate topics. This PR should focus on the new container and retrieving the data from the AFM files to use them where appropriate. Possibly unifying this with handling embedded fonts could/should be a separate step in a PR afterwards. This way, we avoid too large PRs which simplifies reviewing the changes from my side. |
|
I used your script and tried to improve it, and I added your font class. To be fair, this PR more and more looks like you answering my bug report instead of me trying to contribute new code; please adjust copyright and attribution accordingly if and when you pull this. I added the font class to pypdf/_font.py. I noticed that other files in this directory have a copyright notice and attribution, please advise, and add your name if you needed! I just tested locally, and the following works: I noticed that codecov would like a test. I may be able to cobble up something, and otherwise it will be in a month or so. |
I am helping with getting the changes integrated. Without you doing the initial work, I would not have looked into this myself. Doing the changes to the parser has been a little side project, allowing me some insights into AFM files.
Your initial PR integrated the new functionality into the existing code and adapted some tests. Wouldn't this be sufficient or does this have any side effects? Codecov is correctly complaining because the new code is never executed. |
I wrote a very small test file now - tests/test_font.py: If you prefer, I'll add something to the writer test. |
|
I am okay with a simple test covering one of the basic entries and possibly one of the explicitly mapped ones as well. |
Sounds like what I just pasted, right? |
|
I redid the PR. I changed the dataclass name to FontDescriptor. I also added some code to actually use the fore font metrics in practice, by adding a class method to initialise FontDescriptor from a font resource in a PDF document. The current PR is fully object oriented. I hope you like this version! It would offer me a good basis to implement text wrapping, text centering and right alignment in _writer.gen_appearance_stream(). For the future, it might be possible to:
If I understand correctly, this would remove some code duplication and be a bit more object oriented. |
73da1bf to
c02e4f8
Compare
|
@stefan6419846 Thanks for your previous review and especially for asking about the defaults. It's clear to me now that most defaults offered by Google Gemini were quite alright, but the whole flags thing was very probably wrong. I should have been more critical myself. Interestingly, the original code that set the font flags to 64 for Courier (as copied from pdfminer.six) appears wrong as well; it tests for the IsFixed value, but that corresponds to the first bit of the flags (in other words, 1, not 64). I don't think the flags are really important, otherwise the people at pdfminer.six would have noticed. I do think that I now have the correct values, having looked at Section 9.8.2 Font descriptor flags of the pdf specification. |
|
Thanks for looking into this again. I will have to delay the next proper review of this larger set of changes until September as I am on the move in the next weeks. |
|
Enjoy your travels! |
|
@stefan6419846 I implemented the change you requested! |
This patch adds a new FontDescriptor dataclass. Its initial use, for now, is to act as a dataclass for the font metrics of the 14 Adobe core fonts. These fonts are usually not embedded in PDF documents, while PDF readers are expected to carry that information themselves.
This patch adds a new file with the font metrics for the 14 core Type 1 pdf fonts. The file was inspired by the pdfminer.six project, where a very similar one is called fontmetrics.py. The information itself is generated by a separate file added with this patch: resources/afm_to_dataclass.py The PDF specification expects a pdf reader to include these font metrics.
Version 1.7 of the PDF reference lists various alternatives names as accepted for the 14 core fonts, such as Arial for Helvetica and CourierNew for Courier. Add these alternative names to the font metrics.
This patch introduces some defaults for the FontDescriptor. This might be useful if we would like to migrate existing code for collecting font widths to the class, while not having all code in place to also collect all associated other information about the font. The defaults I took from Google Gemini...
This patch adds a class method to initialise FontDescriptor using a font resource dictionary. For now, this returns either a font name and FontDescriptor defaults, or, in case of one of the 14 core fonts, the associated font metrics. For future development, this class can be extended with code that collects character widths, which is now present in multple places, such as: - pypdf/_text_extraction/_layout_mode/_font.py - pypdf/_cmap.py ... and code that collects font descriptor information, which is now present nowhere yet, but would be useful for generating text streams.
This patch adds a little test for the FontDescriptor dataclass to make codecov happy.
stefan6419846
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience.
Same to you! |
## What's new ### New Features (ENH) - Enhance XMP metadata handling with creation and setter methods (#3410) by @Arya-A-Nair - Add all font metrics for base 14 Type 1 PDF fonts (#3363) by @PJBrs - Allow deleting embedded files (#3461) by @stefan6419846 - Add support for Python in FIPS mode for document identifier (#3438) by @xnox ### Bug Fixes (BUG) - Fix handling of UTF-16 encoded destination titles (#3463) by @stefan6419846 - Guard empty input to prevent IndexError (#3448) by @KyleJung0828 ### Developer Experience (DEV) - Fix type hint for XMP metadata setter to add bytes type (#3464) by @stefan6419846 [Full Changelog](6.0.0...6.1.0)
## What's new ### New Features (ENH) - Enhance XMP metadata handling with creation and setter methods (py-pdf#3410) by @Arya-A-Nair - Add all font metrics for base 14 Type 1 PDF fonts (py-pdf#3363) by @PJBrs - Allow deleting embedded files (py-pdf#3461) by @stefan6419846 - Add support for Python in FIPS mode for document identifier (py-pdf#3438) by @xnox ### Bug Fixes (BUG) - Fix handling of UTF-16 encoded destination titles (py-pdf#3463) by @stefan6419846 - Guard empty input to prevent IndexError (py-pdf#3448) by @KyleJung0828 ### Developer Experience (DEV) - Fix type hint for XMP metadata setter to add bytes type (py-pdf#3464) by @stefan6419846 [Full Changelog](py-pdf/pypdf@6.0.0...6.1.0)
This patch includes font metrics for the standard 14 fonts. This is intended to be useful for generating a text appearance stream, especially if you want to take into account right-aligned or centred text. (I have some other patches that include this as well as text wrapping.)
Note that some of this information was already included in _font_widths.py, but that information is incomplete. I thought it better to copy this information from pdfminer.six and be able to potentially benefit from their work later on, than to improve on what already was included here.
The first three patches introduce the new functionality. The last three patches are for moving the Font class to the new font metrics information and removing the old _font_widths.py file.
This is what the spec has to say about it: