Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent endless loops during Page->getText() because of faulty recursionStack #457

Merged
merged 14 commits into from
Sep 23, 2021

Conversation

Nickmanbear
Copy link
Contributor

@Nickmanbear Nickmanbear commented Sep 3, 2021

The $recursionStack used in the getText() function of PDFObject to prevent "Do" commands from causing endless loops removes items from it's stack prematurely during the parsing of pages. Meaning endless loops can and are still being caused by the Do command.

Only reseting the recursionStack at the end of the getText() function for every page prevents this bug.

@amooij
Copy link
Collaborator

amooij commented Sep 6, 2021

This great, @Nickmanbear. This solves headaches for many people, great work!

@Nickmanbear
Copy link
Contributor Author

@amooij Thanks!

@k00ni
Copy link
Collaborator

k00ni commented Sep 6, 2021

@Nickmanbear thank you for your pull request!

Is there a way to "demonstrate" your change? Like a test which triggers an endless loop?

@Nickmanbear
Copy link
Contributor Author

@k00ni These are a few of the files we have that cause endless loops. The first two will get stuck when you call getText() on the first page and the third one when you call getText() on the second page.

20210514101853.pdf
2021081795357.pdf
202108201456185.pdf

With my fix these files are all able to be parsed just fine. I could add one of these files to the Page integration tests to prove it now works and make sure it continues to work in the future. But I'm not sure how I would write a meaningful test to trigger this endless loop when this fix should prevent it from occurring.

@k00ni
Copy link
Collaborator

k00ni commented Sep 6, 2021

But I'm not sure how I would write a meaningful test to trigger this endless loop when this fix should prevent it from occurring.

Its a good start to have such a test, so one can deactivate the fix and see if such a loop gets triggered.

Nickmanbear and others added 2 commits September 6, 2021 13:33
@Nickmanbear
Copy link
Contributor Author

@k00ni Are there any more changes you think I should make?

@k00ni
Copy link
Collaborator

k00ni commented Sep 10, 2021

I plan to get back to you next week. For now all I can say is that it is looking good.

@k00ni
Copy link
Collaborator

k00ni commented Sep 20, 2021

I am still super busy and try to find some time to finalize my review. Hope its not urgent to merge it.

@j0k3r @smalot feel free to take over and get this one going.

@Nickmanbear
Copy link
Contributor Author

@k00ni We have switched to using this fork for our production environment. So merging this is not that urgent to us. Of course we would still like it to be in the next release, so we can switch back to that.

Co-authored-by: Jérémy Benoist <j0k3r@users.noreply.github.com>
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for you work!

@k00ni k00ni added fix and removed enhancement labels Sep 23, 2021
@k00ni k00ni merged commit 1b3b6eb into smalot:master Sep 23, 2021
senasi pushed a commit to senasi/pdfparser that referenced this pull request Oct 17, 2021
…ionStack (smalot#457)

* Reset recursionStack when getting text

* Don't pop the recursionStack during text parse

* Move recursionStack cleanup to after page getText()

* Missing semicolon

* Add file containing containing loop for testing

* Add test for issue fixed by Pull Request 457

* Update tests/Integration/PageTest.php

Co-authored-by: Konrad Abicht <hi@inspirito.de>

* Add comment to recursionStack fix

* Update src/Smalot/PdfParser/Page.php

Co-authored-by: Konrad Abicht <hi@inspirito.de>

* Clear memory to not influence other tests

* Call gc_collect_cycles() after testGetTextPullRequest457()

* Set a baseline memory level before memory usage test

* use baseline as minimum memory as well

* Update src/Smalot/PdfParser/Page.php

Co-authored-by: Jérémy Benoist <j0k3r@users.noreply.github.com>

Co-authored-by: Konrad Abicht <hi@inspirito.de>
Co-authored-by: Kees Vaes <kees@superscanner.nl>
Co-authored-by: Jérémy Benoist <j0k3r@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants