Crashes and timeouts on bug tracker corpus files #14303

Closed
tballison opened this issue Nov 24, 2021 · 4 comments

@tballison

tballison commented Nov 24, 2021

I recently ran pdf.js via node against our bug tracker corpus files described by Peter Wyatt here and here.

The full corpus is available here, and a prepackaged subset of PDFs is available here.

I'm running a slight modification of your getinfo.js example (code) with the latest release, pdf.js-2.11.338. The environment is specified via Docker.

I should open separate tickets, but I don't want to spam your issue tracker.

Please let me know how I can help. Many thanks for pdf.js!

Some of these files are HUGE and some are corrupt (fuzzed).
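
A run like the one described above could be driven by a small Node.js harness that starts one child process per PDF, enforces a time limit, and records the exit value. The following is only a sketch under stated assumptions: `runWithTimeout` and its return shape are hypothetical, and the paths in the usage comment are placeholders rather than the actual modified getinfo.js or corpus layout.

```javascript
const { spawnSync } = require("child_process");

// Hypothetical helper (not part of pdf.js): run `node <args>` in a child
// process and report how it ended.
function runWithTimeout(args, timeoutMs) {
  const result = spawnSync(process.execPath, args, {
    timeout: timeoutMs, // the child is killed once this many ms have elapsed
    killSignal: "SIGKILL",
    encoding: "utf8",
  });
  // spawnSync reports an expired timeout as an error with code ETIMEDOUT.
  const timedOut = Boolean(result.error && result.error.code === "ETIMEDOUT");
  return {
    exitValue: result.status, // null if the child was terminated by a signal
    signal: result.signal,
    timedOut,
  };
}

// Usage with placeholder paths and a 2-minute limit:
// const outcome = runWithTimeout(["getinfo.js", "some.pdf"], 2 * 60 * 1000);
```

Running each file in its own process keeps a crash or hang in one document from taking down the whole batch.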

ExitValue=1
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-91414-0.zip-2.gz-53.pdf Fixed by #14312
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-91414-0.zip-2.gz-54.pdf Fixed by #14312
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-91414-1.zip-2.gz-53.pdf Identical to poppler-91414-0.zip-2.gz-53.pdf
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-91414-1.zip-2.gz-54.pdf Identical to poppler-91414-0.zip-2.gz-54.pdf

ExitValue=134
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-67295-0.pdf Fixed by #14311
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-85140-0.pdf Fixed by #14311
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler-gitlab/poppler-878-0.gz-0.pdf Tracked in bug 1611202

ExitValue=137
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler-gitlab/poppler-878-1.gz-0.pdf Duplicate of poppler-878-0.gz-0.pdf
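
For readers unfamiliar with the exit values used as group headings above: assuming they follow the common convention of 128 plus the signal number for a process terminated by a signal, 134 corresponds to SIGABRT (6) and 137 to SIGKILL (9). A minimal sketch of that decoding follows; `describeExitValue` is a hypothetical helper, and the signal table is deliberately limited to a few common Linux signal numbers.

```javascript
// Signal numbers as used on Linux; only a few common ones are listed here.
const SIGNAL_NAMES = { 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM" };

// Hypothetical helper: interpret an exit value under the 128 + signal convention.
function describeExitValue(exitValue) {
  if (exitValue === 0) return "success";
  if (exitValue > 128) {
    const signalNumber = exitValue - 128;
    const name = SIGNAL_NAMES[signalNumber] || `signal ${signalNumber}`;
    return `terminated by ${name}`;
  }
  return `exited with error code ${exitValue}`;
}

console.log(describeExitValue(134)); // terminated by SIGABRT
console.log(describeExitValue(137)); // terminated by SIGKILL
```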

Timeouts at 2 minutes (sorted by ascending size; the small ones are likely very problematic)
https://corpora.tika.apache.org/base/docs/bug_trackers/REDHAT/1525652-1549079/REDHAT-1531897-0.pdf (871b) Fixed by #14310
https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-4352-0.pdf (1k) Fixed by #14304

These should be ignored...they are just enormous...
https://corpora.tika.apache.org/base/docs/bug_trackers/GHOSTSCRIPT/694748-703060/GHOSTSCRIPT-700953-0.pdf (10MB) WFM, when using the viewer[2]
https://corpora.tika.apache.org/base/docs/bug_trackers/LIBRE_OFFICE/58331-70624/LIBRE_OFFICE-59360-1.bz2-0.pdf (19MB) WFM, when using the viewer[1]
https://corpora.tika.apache.org/base/docs/bug_trackers/GHOSTSCRIPT/226943-694743/GHOSTSCRIPT-693101-0.zip-0.pdf (19MB) WFM, although the initialization takes a little time, when using the viewer[2]
https://corpora.tika.apache.org/base/docs/bug_trackers/GHOSTSCRIPT/226943-694743/GHOSTSCRIPT-688926-0.bz2-0.pdf (22MB) WFM, although the initialization takes a little time, when using the viewer[2]
https://corpora.tika.apache.org/base/docs/bug_trackers/pdf.js/pdf.js-LINK-5586-1.pdf (30MB) WFM, when using the viewer[1]
https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-4319-0.zip-0.pdf (36MB) WFM, although the initialization takes a little time, when using the viewer[2]
https://corpora.tika.apache.org/base/docs/bug_trackers/pdf.js/pdf.js-LINK-5890-0.pdf (39MB) WFM, when using the viewer[1]
https://corpora.tika.apache.org/base/docs/bug_trackers/sumatrapdf/sumatrapdf-LINK-150-0.pdf (39MB) Identical to pdf.js-LINK-5890-0.pdf
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler-gitlab/poppler-878-2.gz-0.pdf (100MB) Duplicate of poppler-878-0.gz-0.pdf
https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-1226-0.7z-0.pdf (400MB)
https://corpora.tika.apache.org/base/docs/bug_trackers/poppler/poppler-44085-1.xz-0.pdf (6GB)


[1] The file size, the number of pages, and the lack of Worker-support in Node.js means that 2 minutes (most likely) just isn't enough time to parse all of the pages in the document.

[2] Tested using PDF.js 3.7.95 [cbc4b20] in Firefox Nightly 115.0a1 on Windows 11. Given the large number of pages, the viewer forces page-scrolling to prevent issues with too many DOM-elements.

@timvandermeij
Contributor

Interesting; thank you for doing this! We'll see what we can improve from this report.

@tballison
Author

tballison commented Dec 1, 2021

These are some newer files that were generated by fuzzing the original bugtracker corpus files and running against the latest master branch. The files are named for the triggering file and the shasum of the fuzzed version.
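
The naming scheme can be illustrated as follows. This sketch assumes the checksum is SHA-256 (the file names in this comment carry 64 hexadecimal digits, which is consistent with that, but the comment itself only says "shasum"); `fuzzedFileName` is a hypothetical helper, not the tool actually used.

```javascript
const crypto = require("crypto");

// Hypothetical helper: build "<seed name>-<hex digest of fuzzed bytes>.pdf".
// SHA-256 is an assumption; the original comment does not name the algorithm.
function fuzzedFileName(seedName, fuzzedBytes) {
  const digest = crypto.createHash("sha256").update(fuzzedBytes).digest("hex");
  return `${seedName}-${digest}.pdf`;
}

// Usage: fuzzedFileName("poppler-742-0.pdf", fs.readFileSync(pathToFuzzedFile))
```

Naming by content hash means byte-identical fuzzed outputs map to the same name, while the seed prefix keeps the triggering file visible.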

ExitValue=1
poppler-742-0.pdf-e651252b0d4fb2556c957c13844f5630b32352b0483fb2cc1592cd778290a01a.pdf Fixed by #14333

ExitValue=134
poppler-937-0.pdf-732311f7b606c6360121644b9b7709dc8d369f37157704dc16e6791a4ca6e680.pdf Fixed by #14333

ExitValue=137 (couldn't reproduce 137 post-hoc but got it during the main run because of non-reproducible memory-pressure in the container given other threads/processes, etc.)
poppler-395-0.pdf-e2ec919e787c0069948598f9c9d7a39787acde602aacd2bb263298a85e2e0c64.pdf Fixed by #14335
GHOSTSCRIPT-698804-1.pdf-62e518ca7894e8b8b8ed9927da3007cdc5e9ccc5b111bbb22d6101d4a8d7490e.pdf Fixed by #14335

Timeouts (sorted in ascending order of size, sampled from a much larger batch, with preference for seed diversity)

These documents are effectively duplicates of the "poppler-395-0.pdf..." and/or "GHOSTSCRIPT-698804-1.pdf..." documents above.

poppler-327-0.zip-0.pdf-4db8992a5117f499dd67cc3f784d32cfdd197fba3d1f6f9ad072ff7e15fbfcc4.pdf
poppler-751-0.tgz-0.pdf-d4e90fa14fd1706dc9bfc22664285664b8ee31371b61510463dfa513ae9385e3.pdf
poppler-670-0.pdf-50244a524ec0f2632de41904ffd56d241464bb63dd24caee4aa032c10c337ead.pdf
poppler-365-0.pdf-07bc652377d9bd1647500ae82c4ac6d30fd470481bd2e376e8c0bc86a6baf51b.pdf
poppler-814-0.tgz-4.pdf-37a03b2478284994417ae48e7d6eba475564667b56baa2a07d1b5b9b8799bb12.pdf
poppler-776-0.pdf-5825fe4c5b59161e424ca002de858a019bb90c3bf89ffbf79dcefc9584dbf307.pdf
GHOSTSCRIPT-699652-0.pdf-609198627e39958a502dac545b49420f749657cb45a8824d287892277a0a97c7.pdf
poppler-51-0.pdf-b4fa01a131c754f7fb012f39c7518f8937fb863d8009f652a13cb4030596aa8e.pdf
poppler-624-0.pdf-823851837af1d72b01be6ba4194fc11df42d6b8e6ecb473967b43025b67483f8.pdf
poppler-327-0.zip-0.pdf-183e6697314f9ee2ebdcc7323a16b23fe039c1ac82240784c738ec7362b3a5d1.pdf
poppler-976-0.pdf-d79dc1e10f120602c3e5fee79613411fe618dc143eb6de9766033c486779bc65.pdf
GHOSTSCRIPT-691679-0.pdf-62acf4de26a6ccd48929db986543532297d823ffcb5fc1559f80f8af8bcdab89.pdf
GHOSTSCRIPT-692307-1.zip-6.pdf-f9289915eeeaba748214ff6d9faf651f8573fea177d02ca527ab93d48ac9a8d0.pdf
PDFBOX-4878-0.pdf-4ed6a546a8bf536122f88de93459efe23988594f853f7e97c9cc9e7b85f92327.pdf

@Snuffleupagus
Collaborator

The vast majority of the referenced PDF documents have either been fixed or work acceptably in the viewer now (note that Node.js is lacking web-worker support).

There's really only one kind of problem remaining, specifically regarding PDF documents that are very large (e.g. hundreds of megabytes or even gigabytes).
Given that those aren't particularly common, and that there should be existing reports about such cases (e.g. in Bugzilla), let's close this issue now since it's mostly fixed. Any further problems would be best filed separately.[1]


[1] Please keep in mind that having a bunch of subtly different problems reported in the same issue often makes it quite difficult to know when an issue is actually fixed.

@tballison
Author

Y, got it. Thank you for all of your work on this!
