ENH: Added command update-offsets to adjust offsets and lengths. #15
Two hours ago I edited a larger PDF file created by
I remembered this old PR and used update_offsets.py to fix the offsets. I added
to fix a bug occurring if there are several streams.
I think this would be a great addition to Are you still willing to work on this @srogmann? 🙂
@Lucas-C As mentioned above, in May 2024 I remembered pdfly and used update-offsets to correct a manually edited PDF file. In my view, the PR was also ready in May 2024. An example is in the attached file-in.pdf, which I used to test the text extraction of documents with Tm operators. With update-offsets, the xref section
will be converted into
The "target audience" for update_offsets is simple PDF documents that have been created manually in a text editor. It is not suitable for complex or obfuscated PDFs.
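For readers unfamiliar with what "fixing the xref section" involves, here is a minimal sketch of the idea, not the code from this PR. It assumes an ASCII-only PDF with a single revision, `\n` line endings, objects numbered contiguously from 1, and every `N G obj` header at the start of a line. The names `OBJ_RE` and `build_xref` are hypothetical.

```python
import re

# Hypothetical helper, not the PR's implementation.
# Matches object headers such as "12 0 obj" at the start of a line.
OBJ_RE = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)


def build_xref(pdf: bytes) -> bytes:
    """Build a fresh xref section from the byte offsets of all objects."""
    # Map object number -> (byte offset, generation number).
    found = {
        int(m.group(1)): (m.start(), int(m.group(2)))
        for m in OBJ_RE.finditer(pdf)
    }
    if not found:
        raise ValueError("no objects found")
    size = max(found) + 1
    if sorted(found) != list(range(1, size)):
        # Assumption of this sketch: no gaps in the object numbering.
        raise ValueError("object numbers must be contiguous from 1")

    # Entry 0 is the head of the free list; each entry is 20 bytes with EOL.
    lines = [b"xref", b"0 %d" % size, b"0000000000 65535 f "]
    for num in range(1, size):
        offset, gen = found[num]
        lines.append(b"%010d %05d n " % (offset, gen))
    return b"\n".join(lines) + b"\n"
```

A complete tool would also rewrite the number after `startxref` so that it points at the byte offset of the `xref` keyword, and it would have to adjust the `/Length` entries first, because changing them shifts every later offset.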
Thank you for your detailed answer @srogmann 👍 I'll be happy to review & merge this PR, but could you rebase it and solve the minor merge conflict, please?
Could you please:
- add a mention of the new command in README.md
- add some unit tests in tests/test_update_offsets.py
3f6b60e to 2f4e11d
@Lucas-C During testing, I noticed an issue (specifically with pytest on Unix). In the
Thank you for reporting this problem 👍 Could you please fix this as part of this PR?
PS: I myself wrote a similar script some time ago: https://github.com/Lucas-C/dotfiles_and_notes/blob/master/languages/python/set_pdf_xref.py I'm really happy to include this feature in
This command adjusts /Length-entries of stream objects and the xref-offsets in simple PDF files (ASCII only, one xref section only).
e0e405e to 3429c2f
The GitHub Actions pipeline is currently failing due to
Co-authored-by: Lucas Cimon <925560+Lucas-C@users.noreply.github.com>
@Lucas-C In the tests, I have commented out four PDF documents that the current implementation cannot process correctly. The current implementation is quite simple and relies on regular expressions; it was originally intended for fixing up PDF documents edited by hand in a text editor. The more accurately the script is supposed to work, the more appropriate it would be to parse the tokens according to chapter 3 of the PDF specification. Technically, this is possible, but it would far exceed the original goal of my implementation.
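To illustrate the kind of regex-based approach described here, and why it only suits simple files, the following is a minimal sketch and not the PR's actual code. It assumes each stream dictionary holds a direct integer `/Length` (no indirect reference such as `5 0 R`) and that the stream data never contains the word `endstream`. The names `STREAM_RE` and `fix_lengths` are hypothetical.

```python
import re

# Hypothetical helper, not the PR's implementation.
# Groups: 1 = "/Length ", 2 = old length, 3 = rest of dict up to "stream" + EOL,
#         4 = stream data, 5 = EOL + "endstream".
STREAM_RE = re.compile(
    rb"(/Length )(\d+)(.*?>>\s*stream\r?\n)(.*?)(\r?\nendstream)",
    re.DOTALL,
)


def fix_lengths(pdf: bytes) -> bytes:
    """Rewrite every direct /Length entry to the real stream data size."""

    def repl(m: re.Match) -> bytes:
        new_length = str(len(m.group(4))).encode("ascii")
        return m.group(1) + new_length + m.group(3) + m.group(4) + m.group(5)

    # The non-greedy groups keep each match confined to one stream,
    # so files with several streams are handled one after another.
    return STREAM_RE.sub(repl, pdf)
```

A pattern like this breaks as soon as the file departs from the assumptions, for example binary stream data or an indirect `/Length`, which is where a real tokenizer becomes necessary.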
Awesome! Good job 🙂
That's fine really. I added a commit on the branch to fix some minor typing-related issues.
b62f298 to 5032317
5032317 to fc42eb4
## What's new

### New Features (ENH)
- New `booklet` command to adjust offsets and lengths ([PR #77](#77))
- New `uncompress` command ([PR #75](#75))
- New `update-offsets` command to adjust offsets and lengths ([PR #15](#15))
- New `rm` command ([PR #59](#59))
- `metadata`: now also displaying CreationDate, Creator, Keywords & Subject ([PR #73](#73))
- Add warning for out-of-bounds page range in pdfly `cat` command ([PR #58](#58))

### Bug Fixes (BUG)
- `2-up` command, that only showed one page per sheet, on the left side, with blank space on the right ([PR #78](#78))

[Full Changelog](0.3.3...0.4.0)
This has been released in version
This command adjusts /Length-entries of stream objects and the xref-offsets
in simple PDF files (ASCII only, one xref section only) to support writing PDF
files by means of a text editor.
I replaced the camelCase variables with snake_case variables.