Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Added command update-offsets to adjust offsets and lengths. #15

Merged
merged 22 commits into from
Nov 7, 2024

Conversation

srogmann
Copy link
Contributor

This command adjusts /Length-entries of stream objects and the xref-offsets
in simple PDF files (ASCII only, one xref section only) to support writing PDF
files by means of a text editor.

I replaced the camelCase-variables by snake-case variables.

@srogmann
Copy link
Contributor Author

Two hours ago I edited a larger pdf-file created by

qpdf --stream-data=uncompress

I remembered this old PR and used update_offsets.py to fix the offsets. I added

len_stream = None

to fix a bug occuring if there are several streams.

@Lucas-C
Copy link
Member

Lucas-C commented Oct 29, 2024

I think this would be a great addition to pdfly 👍

Are you still willing to work on this @srogmann? 🙂

@srogmann
Copy link
Contributor Author

@Lucas-C
The essential work on update_offsets.py was completed in August 2022. As of August 2022, I was able to automatically supplement offsets in simple PDFs. I don't know why the PR was not merged at the time; perhaps I should have made it clearer that, in my view, it was ready.

As mentioned above, in May 2024, I remembered pdfly and used update-offsets to correct a manually edited PDF file. In my view, the PR was also ready in May 2024.

An example is in the attached file-in.pdf , I used it to test the text-extraction of documents with Tm operators.

By update-offsets the XREF-section

xref
0 7
0000000000 65535 f 
0000000015 00000 n 
0000000015 00000 n 
0000000015 00000 n 
0000000015 00000 n 
0000000015 00000 n 
0000000015 00000 n 
trailer << /Size 7 /Info 2 0 R /Root 1 0 R >>
startxref
000000
%%EOF

will be converted into

xref
0 7
0000000000 65535 f 
0000000015 00000 n 
0000000081 00000 n 
0000000158 00000 n 
0000000217 00000 n 
0000000380 00000 n 
0000001005 00000 n 
trailer << /Size 7 /Info 2 0 R /Root 1 0 R >>
startxref
1086
%%EOF

The "target audience" for update_offsets are simple PDF documents that have been manually created using an editor. It is not suitable for complex or obfuscated PDFs.

@Lucas-C
Copy link
Member

Lucas-C commented Nov 1, 2024

Thank you for your detailed answer @srogmann 👍

I'll be happy to review & merge this PR, but could you rebase it and solve the minor merge conflict, please?

Copy link
Member

@Lucas-C Lucas-C left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please:

  • add a mention of the new command in README.md
  • add some unit tests in tests/test_update_offsets.py

pdfly/cli.py Outdated Show resolved Hide resolved
pdfly/update_offsets.py Outdated Show resolved Hide resolved
pdfly/update_offsets.py Outdated Show resolved Hide resolved
@srogmann
Copy link
Contributor Author

srogmann commented Nov 3, 2024

@Lucas-C
I updated this PR.

During testing, I noticed an issue (specifically with pytest on Unix). In the Makefile, the test directory is written as 'Tests' with a capital 'T', but in the filesystem, it is written in lowercase.

pdfly/cli.py Outdated Show resolved Hide resolved
pdfly/update_offsets.py Outdated Show resolved Hide resolved
@Lucas-C
Copy link
Member

Lucas-C commented Nov 4, 2024

During testing, I noticed an issue (specifically with pytest on Unix). In the Makefile, the test directory is written as 'Tests' with a capital 'T', but in the filesystem, it is written in lowercase.

Thank you for notifying this problem 👍

Could you please fix this as part of this PR?

@Lucas-C
Copy link
Member

Lucas-C commented Nov 4, 2024

PS: I myself wrote a similar script some time ago: https://github.com/Lucas-C/dotfiles_and_notes/blob/master/languages/python/set_pdf_xref.py

I'm really happy to include this feature in pdfly, it's a very good idea and would come handy to many people 👍 🙂

@Lucas-C
Copy link
Member

Lucas-C commented Nov 4, 2024

The GitHub Actions pipeline is currently failing due to black not being applied on some files, to ensure a consistent code style:

$ black --extend-exclude sample-files .
would reformat /home/runner/work/pdfly/pdfly/pdfly/cli.py
would reformat /home/runner/work/pdfly/pdfly/tests/test_update_offsets.py
would reformat /home/runner/work/pdfly/pdfly/pdfly/update_offsets.py

@srogmann
Copy link
Contributor Author

srogmann commented Nov 6, 2024

@Lucas-C
I have slightly revised the implementation of update_offsets.py. The lines are now read in binary mode so that line breaks are not distorted. This is important when reading binary data in streams. As a result, a PDF file like 002-trivial-libre-office-writer.pdf now matches the original byte for byte after processing.

In the tests, I have commented out four PDF documents that cannot be correctly processed with the current implementation. The current implementation is quite simple and works with regular expressions; it was originally intended to revise hand-edited PDF documents via an editor. The more accurately the script should work, the more it would be appropriate to parse the tokens according to chapter 3 of the PDF specification. Technically, this is possible, but it would far exceed the original goal of my implementation.

@Lucas-C
Copy link
Member

Lucas-C commented Nov 7, 2024

I have slightly revised the implementation of update_offsets.py. The lines are now read in binary mode so that line breaks are not distorted. This is important when reading binary data in streams. As a result, a PDF file like 002-trivial-libre-office-writer.pdf now matches the original byte for byte after processing.

Awesome! Good job 🙂

In the tests, I have commented out four PDF documents that cannot be correctly processed with the current implementation. The current implementation is quite simple and works with regular expressions; it was originally intended to revise hand-edited PDF documents via an editor. The more accurately the script should work, the more it would be appropriate to parse the tokens according to chapter 3 of the PDF specification. Technically, this is possible, but it would far exceed the original goal of my implementation.

That's fine really.
You did an excellent job on this, thank you for your contribution! 👍

I added a commit on the branch to fix some minor typing related issues.

@Lucas-C Lucas-C force-pushed the update_offsets branch 3 times, most recently from b62f298 to 5032317 Compare November 7, 2024 16:13
@Lucas-C Lucas-C merged commit da75816 into py-pdf:main Nov 7, 2024
6 checks passed
MartinThoma added a commit that referenced this pull request Dec 8, 2024
## What's new

### New Features (ENH)
- New `booklet` command to adjust offsets and lengths ([PR #77](#77))
- New `uncompress` command ([PR #75](#75))
- New `update-offsets` command to adjust offsets and lengths ([PR #15](#15))
- New `rm` command ([PR #59](#59))
- `metadata`: now also displaying CreationDate, Creator, Keywords & Subject ([PR #73](#73))
- Add warning for out-of-bounds page range in pdfly `cat` command ([PR #58](#58))

### Bug Fixes (BUG)
- `2-up` command, that only showed one page per sheet, on the left side, with blank space on the right ([PR #78](#78))

[Full Changelog](0.3.3...0.4.0)
@Lucas-C
Copy link
Member

Lucas-C commented Dec 8, 2024

This has been released in version 0.4.0: https://pypi.org/project/pdfly/#history

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants