Crashes and timeouts on bug tracker corpus files #14303
Comments
Interesting; thank you for doing this! We'll see what we can improve from this report.
The vast majority of the referenced PDF documents have either been fixed or work acceptably in the viewer now (note that Node.js lacks web-worker support). There's really only one kind of problem remaining, specifically regarding PDF documents that are very large (e.g. hundreds of megabytes or even gigabytes). [1] Please keep in mind that having a bunch of subtly different problems reported in the same issue often makes it quite difficult to know when an issue is actually fixed.
Y, got it. Thank you for all of your work on this!
I recently ran pdf.js via node against our bug tracker corpus files described by Peter Wyatt here and here.
The full corpus is available here, and a prepackaged subset of PDFs is available here.
I'm running a slight modification of your getinfo.js example (code) with the latest release, pdf.js-2.11.338. The environment is specified via Docker. I should open separate tickets, but I don't want to spam your issue tracker.
Please let me know how I can help. Many thanks for pdf.js!
Some of these files are HUGE and some are corrupt (fuzzed).
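For anyone who wants to reproduce this kind of triage, here is a minimal sketch of a per-file runner that records an exit value or a timeout. It is not the harness actually used for this report. The function name runOnce and the result shape are made up for illustration, and it assumes the per-file script can be started as a separate process (for example node getinfo.js <file>).

```javascript
"use strict";
const { spawnSync } = require("node:child_process");
const os = require("node:os");

// Two minutes, matching the timeout mentioned in this report.
const TIMEOUT_MS = 2 * 60 * 1000;

/**
 * Runs one command and summarizes how it ended.
 * `command` and `args` are placeholders for whatever starts the
 * per-file script, e.g. runOnce("node", ["getinfo.js", pdfPath]).
 */
function runOnce(command, args, timeoutMs = TIMEOUT_MS) {
  const result = spawnSync(command, args, {
    timeout: timeoutMs,
    killSignal: "SIGKILL", // make sure a stuck child really goes away
    stdio: "ignore",
  });
  if (result.error && result.error.code === "ETIMEDOUT") {
    return { outcome: "timeout", exitValue: null };
  }
  if (result.error) {
    // The command could not be started at all.
    return { outcome: "spawn-error", exitValue: null };
  }
  if (result.signal) {
    // Report signals the way a shell would: 128 + signal number.
    const exitValue = 128 + os.constants.signals[result.signal];
    return { outcome: "signal", exitValue };
  }
  const outcome = result.status === 0 ? "ok" : "error";
  return { outcome, exitValue: result.status };
}
```

Running each file in its own process keeps one crash or hang from taking down the whole batch, at the cost of paying the Node.js startup time once per file.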
ExitValue=1
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-91414-0.zip-2.gz-53.pdf - Fixed by #14312
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-91414-0.zip-2.gz-54.pdf - Fixed by #14312
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-91414-1.zip-2.gz-53.pdf - Identical to poppler-91414-0.zip-2.gz-53.pdf
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-91414-1.zip-2.gz-54.pdf - Identical to poppler-91414-0.zip-2.gz-54.pdf
ExitValue=134
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-67295-0.pdf - Fixed by #14311
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-85140-0.pdf - Fixed by #14311
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler-gitlab/poppler-878-0.gz-0.pdf - Tracked in bug 1611202
ExitValue=137
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler-gitlab/poppler-878-1.gz-0.pdf - Duplicate of poppler-878-0.gz-0.pdf
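As a reading aid for the ExitValue headings above: assuming these numbers follow the common shell convention in which a process terminated by a signal is reported as 128 plus the signal number, they can be decoded roughly as in the sketch below. The helper name describeExitValue and the small signal table are illustrative only, and the report itself does not say how the values were collected.

```javascript
"use strict";

// Signal numbers that are the same on Linux and macOS.
const COMMON_SIGNALS = { 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM" };

// Assumption: values above 128 mean "terminated by signal (value - 128)".
function describeExitValue(exitValue) {
  if (exitValue === 0) return "success";
  if (exitValue > 128) {
    const signalNumber = exitValue - 128;
    const name = COMMON_SIGNALS[signalNumber] || `signal ${signalNumber}`;
    return `terminated by ${name}`;
  }
  return "error reported by the program itself";
}

for (const value of [1, 134, 137]) {
  console.log(`ExitValue=${value}: ${describeExitValue(value)}`);
}
```

Under that assumption, 134 corresponds to SIGABRT and 137 to SIGKILL. With very large inputs a SIGKILL is often the result of the process running out of memory, but nothing in this report confirms that for these files.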
Timeouts at 2 minutes (sorted by size, ascending; the small ones are likely very problematic)
https://corpora.tika.apache.org/base/docs/bug_trackers/REDHAT/1525652-1549079/REDHAT-1531897-0.pdf (871b) - Fixed by #14310
https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-4352-0.pdf (1k) - Fixed by #14304
These should be ignored...they are just enormous...
https://corpora.tika.apache.org/base/docs/bug_trackers/GHOSTSCRIPT/694748-703060/GHOSTSCRIPT-700953-0.pdf (10MB) - WFM, when using the viewer [2]
https://corpora.tika.apache.org/base/docs/bug_trackers/LIBRE_OFFICE/58331-70624/LIBRE_OFFICE-59360-1.bz2-0.pdf (19MB) - WFM, when using the viewer [1]
https://corpora.tika.apache.org/base/docs/bug_trackers/GHOSTSCRIPT/226943-694743/GHOSTSCRIPT-693101-0.zip-0.pdf (19MB) - WFM, although the initialization takes a little time, when using the viewer [2]
https://corpora.tika.apache.org/base/docs/bug_trackers/GHOSTSCRIPT/226943-694743/GHOSTSCRIPT-688926-0.bz2-0.pdf (22MB) - WFM, although the initialization takes a little time, when using the viewer [2]
https://corpora.tika.apache.org/base/docs/bug_trackers/pdf.js/pdf.js-LINK-5586-1.pdf (30MB) - WFM, when using the viewer [1]
https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-4319-0.zip-0.pdf (36MB) - WFM, although the initialization takes a little time, when using the viewer [2]
https://corpora.tika.apache.org/base/docs/bug_trackers/pdf.js/pdf.js-LINK-5890-0.pdf (39MB) - WFM, when using the viewer [1]
https://corpora.tika.apache.org/base/docs/bug_trackers/sumatrapdf/sumatrapdf-LINK-150-0.pdf (39MB) - Identical to pdf.js-LINK-5890-0.pdf
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler-gitlab/poppler-878-2.gz-0.pdf (100MB) - Duplicate of poppler-878-0.gz-0.pdf
https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-1226-0.7z-0.pdf (400MB)
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-44085-1.xz-0.pdf (6GB)
[1] The file size, the number of pages, and the lack of Worker support in Node.js mean that 2 minutes (most likely) just isn't enough time to parse all of the pages in the document.
[2] Tested using PDF.js 3.7.95 [cbc4b20] in Firefox Nightly 115.0a1 on Windows 11. Given the large number of pages, the viewer forces page-scrolling to prevent issues with too many DOM-elements.
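Related to footnote [1]: if a batch run only needs a sample of each document, one option is to stop visiting pages once a time budget is spent rather than trying to parse every page. The sketch below is only an illustration and is not part of getinfo.js. It assumes a document object shaped like the one PDF.js resolves to (a numPages count and an async, 1-based getPage); visitPagesWithinBudget, visitPage and the injectable now clock are hypothetical names.

```javascript
"use strict";

/**
 * Visits pages in order until the time budget is used up.
 * `doc` is assumed to expose `numPages` and an async `getPage(n)`
 * with 1-based page numbers. `visitPage` is supplied by the caller.
 */
async function visitPagesWithinBudget(doc, visitPage, budgetMs, now = Date.now) {
  const deadline = now() + budgetMs;
  let visited = 0;
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    // The budget is checked between pages, so a single slow page
    // can still overrun it.
    if (now() >= deadline) break;
    const page = await doc.getPage(pageNumber);
    await visitPage(page, pageNumber);
    visited++;
  }
  return { visited, total: doc.numPages, complete: visited === doc.numPages };
}
```

Because the check happens only between pages, this bounds the work on documents with many pages but does not help when one page by itself takes longer than the budget; a process-level timeout would still be needed for that case.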