-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent endless loops during Page->getText() because of faulty recursionStack #457
Conversation
This great, @Nickmanbear. This solves headaches for many people, great work! |
@amooij Thanks! |
@Nickmanbear thank you for your pull request! Is there a way to "demonstrate" your change? Like a test which triggers an endless loop? |
@k00ni These are a few of the files we have that cause endless loops. The first two will get stuck when you call getText() on the first page and the third one when you call getText() on the second page. 20210514101853.pdf With my fix these files are all able to be parsed just fine. I could add one of these files to the Page integration tests to prove it now works and make sure it continues to work in the future. But I'm not sure how I would write a meaningful test to trigger this endless loop when this fix should prevent it from occurring. |
Its a good start to have such a test, so one can deactivate the fix and see if such a loop gets triggered. |
Co-authored-by: Konrad Abicht <hi@inspirito.de>
Co-authored-by: Konrad Abicht <hi@inspirito.de>
@k00ni Are there any more changes you think I should make? |
I plan to get back to you next week. For now all I can say is that it is looking good. |
@k00ni We have switched to using this fork for our production environment. So merging this is not that urgent to us. Of course we would still like it to be in the next release, so we can switch back to that. |
Co-authored-by: Jérémy Benoist <j0k3r@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for you work!
…ionStack (smalot#457) * Reset recursionStack when getting text * Don't pop the recursionStack during text parse * Move recursionStack cleanup to after page getText() * Missing semicolon * Add file containing containing loop for testing * Add test for issue fixed by Pull Request 457 * Update tests/Integration/PageTest.php Co-authored-by: Konrad Abicht <hi@inspirito.de> * Add comment to recursionStack fix * Update src/Smalot/PdfParser/Page.php Co-authored-by: Konrad Abicht <hi@inspirito.de> * Clear memory to not influence other tests * Call gc_collect_cycles() after testGetTextPullRequest457() * Set a baseline memory level before memory usage test * use baseline as minimum memory as well * Update src/Smalot/PdfParser/Page.php Co-authored-by: Jérémy Benoist <j0k3r@users.noreply.github.com> Co-authored-by: Konrad Abicht <hi@inspirito.de> Co-authored-by: Kees Vaes <kees@superscanner.nl> Co-authored-by: Jérémy Benoist <j0k3r@users.noreply.github.com>
The $recursionStack used in the getText() function of PDFObject to prevent "Do" commands from causing endless loops removes items from it's stack prematurely during the parsing of pages. Meaning endless loops can and are still being caused by the Do command.
Only reseting the recursionStack at the end of the getText() function for every page prevents this bug.