Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Non-deterministic accidental object reuse #1995

Closed
wants to merge 3 commits into from

Conversation

sjoerdjob
Copy link
Contributor

@sjoerdjob sjoerdjob commented Jul 21, 2023

The following section of code could sometimes have unintended effects where pages of an earlier integrated PDF file were used instead.

writer = PdfWriter()

reader1 = PdfReader(some_file)
id_reader1 = id(reader1)
writer.add_page(reader1.pages[0])
del reader1

reader2 = PdfReader(other_file)
id_reader2 = id(reader2)
writer.add_page(reader2.pages[0])
del reader2

writer.write(third_file)

because the reader1 is no longer in memory when reader2 gets initialized, the area in memory is free, so id_reader1 and id_reader2 might end up having the same value.

Due to PyPDF using id(reader) internally for an object-cache, it sometimes happened that writer.add_page(reader2.pages[0]) would result in duplicating reader1.pages[0] instead.

fixes #1788 .

The following section of code could sometimes have unintended effects
where pages of an earlier integrated PDF file were used instead.

```
writer = PdfWriter()

reader1 = PdfReader(some_file)
id_reader1 = id(reader1)
writer.add_page(reader1.pages[0])
del reader1

reader2 = PdfReader(other_file)
id_reader2 = id(reader2)
writer.add_page(reader2.pages[0])
del reader2

writer.write(third_file)
```

because the `reader1` is no longer in memory when `reader2` gets
initialized, the area in memory is free, so `id_reader1` and
`id_reader2` might end up having the same value.

Due to PyPDF using `id(reader)` internally for an object-cache, it
sometimes happened that `writer.add_page(reader2.pages[0])` would result
in duplicating `reader1.pages[0]` instead.
Using a WeakKeyDictionary was an implementation detail that does not
have to be matched in the type.
@MartinThoma
Copy link
Member

@sjoerdjob Your PR cannot be merged like this as the CI fails:

pypdf/_protocols.py:68: error: Name "WeakKeyDictionary" is not defined

There are now also a couple of merge conflicts.

@pubpub-zz
Copy link
Collaborator

@MartinThoma this looks similar (different approach) to a PR you've already about random error. not sure this is still required

@MartinThoma
Copy link
Member

#1788 was fixed via #1841. test_merging_many_temporary_files succeeds as well.

MartinThoma added a commit that referenced this pull request Oct 8, 2023
Full credit to sjoerdjob for this contribution via #1995

See #1788

Co-authored-by: Sjoerd Job Postmus <sjoerdjob@sjec.nl>
MartinThoma added a commit that referenced this pull request Oct 8, 2023
#2244)

Full credit to sjoerdjob for this contribution via #1995

See #1788

Co-authored-by: Sjoerd Job Postmus <sjoerdjob@sjec.nl>
@MartinThoma
Copy link
Member

We didn't have a test that could detect this type of issue, hence I added the test from this PR via #2244

@sjoerdjob Sorry that it took so long. I'm closing this PR now as the problem was fixed via #1841 and your test was added via #2244

@MartinThoma MartinThoma closed this Oct 8, 2023
@MartinThoma
Copy link
Member

Thank you for your support 🤗 If you want, I'll add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Random bug with add_page
3 participants