GROBID 0.7.0 100% CPU loop/hang/timeout with some PDFs #867
Comments
Here is a reference to regex implementation changes in Java 9, related to slow patterns and the "regex denial of service" issue. I'm not sure if they apply here or not.
Here is the |
Hi @bnewbold! Thank you so much for the detailed issue with the example document, this is super clear and helpful. As you found, this is a catastrophic backtracking problem with this regex (and with another similar one). The fact that it works with JDK 11 is related to their new safety feature for regex, which cancels suspicious regex applications to improve stability (this is a very nice circuit breaker feature!). With JDK 8, the regex will run as long as necessary, so 10 minutes or a few years :) This part is indeed something I changed in 0.7.0 to improve citation callout recognition (and it did improve it, despite this bug :) ). It seems that there are always possible pathological input strings that can cause this kind of problem for non-trivial regexes. Here we are not in the easy case of nested regex expression:
After reading some more regex documentation, it appears that this can be fixed (without the JDK 11 solution) by using atomic grouping in the regex: PR to follow.
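To illustrate what atomic grouping changes, here is a minimal sketch. The patterns are hypothetical textbook examples, not the GROBID regexes, and `AtomicGroupDemo`, `CountingSequence`, and `reads` are names made up for this illustration. It wraps the input in a `CharSequence` that counts `charAt` calls, as a rough proxy for how much work the regex engine does on a non-matching input.

```java
import java.util.regex.Pattern;

class AtomicGroupDemo {
    /** Wraps a String and counts charAt calls, a rough proxy for regex engine work. */
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;

        CountingSequence(String text) { this.text = text; }

        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public int length() { return text.length(); }
        @Override public CharSequence subSequence(int start, int end) {
            return new CountingSequence(text.substring(start, end));
        }
        @Override public String toString() { return text; }
    }

    // Hypothetical patterns for illustration only (NOT taken from GROBID).
    // Nested quantifiers: the engine may retry many ways of splitting the run of 'a'.
    static final Pattern BACKTRACKING = Pattern.compile("(a+)+b");
    // Atomic group (?>...): once a+ has matched, its characters are never given back.
    static final Pattern ATOMIC = Pattern.compile("(?>a+)+b");

    /** Number of characters the engine reads while trying to match the whole input. */
    static long reads(Pattern pattern, String input) {
        CountingSequence seq = new CountingSequence(input);
        pattern.matcher(seq).matches();
        return seq.reads;
    }

    /** Java 8 compatible replacement for String.repeat. */
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) sb.append(c);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Kept short on purpose: on older JVMs the first pattern grows exponentially.
        String noMatch = repeat('a', 16);
        System.out.println("backtracking reads: " + reads(BACKTRACKING, noMatch));
        System.out.println("atomic reads:       " + reads(ATOMIC, noMatch));
    }
}
```

The exact counts depend on the JVM version, so none are shown here. Note that an atomic group can change what a pattern matches in general; in this toy case both patterns accept and reject the same strings.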
Great! I gave PR #868 a try with several dozen PDFs which were causing problems before, and all were processed successfully.
Thanks a lot for the feedback!
In our production systems at archive.org, I upgraded from GROBID ~0.5.5 to ~0.7.0 (git commit beebd9a6b, to be exact) a few weeks ago. Since then, when processing hundreds of thousands of PDFs, I have regularly experienced GROBID "looping", with a couple of threads consuming 100% CPU on individual cores. The symptoms are consistent with a small fraction of PDFs causing this behavior repeatably, while most PDFs continue to be processed successfully. If we push a large number of PDFs through in parallel, GROBID eventually ends up spinning all the cores. The mitigation has been to restart GROBID every few hours.
It has been a struggle to get this issue to reproduce reliably in a minimal test case, even after a few hours of debugging, and I apologize for any mistakes in my debugging or for any lack of clarity in this write-up.
My best hypothesis right now is that there is an issue with a specific regex when running GROBID with openjdk-8-jre (JVM 8). With openjdk-11-jre (JVM 11), the issue does not reproduce.
I have used a couple of different clients and PDFs. A simple combination that has been reproducing the problem for me is the file sha1:5fa184d7eee96a504bae646e6c045699530bc023, an exact copy of which you can find at https://web.archive.org/web/20180719070056_/https://www.dovepress.com/getfile.php?fileID=14985. The API endpoint is processFulltextDocument, with consolidation disabled, raw citations and affiliations included, a number of TEI coordinates enabled, and sentence segmentation on. I haven't experimented with removing most of these options. Here is an example command using the httpie CLI client:
This command times out after two minutes, and the threads continue to spin for at least 10 minutes on the server. In other cases the CPU seems to continue to spin for hours, though I haven't isolated and reproduced exactly that behavior.
The specific regex is PARENTHESIS_NUMBER_PATTERN, which is called on line 66 of CalloutAnalyzer.java:
grobid/grobid-core/src/main/java/org/grobid/core/engines/citations/CalloutAnalyzer.java, lines 65 to 68 in beebd9a
To debug, I submit a couple of parallel requests for the same PDF file, then send signal 3 (SIGQUIT) to a high-CPU process, which causes Java to dump a thread trace. These dumps are long and include many threads, but a common pattern is a trace like:
The frame org.grobid.core.engines.citations.CalloutAnalyzer.getCalloutType(CalloutAnalyzer.java:66) is common across a few repeats with different files, different requests, and multiple threads stuck in the same regex at the same point in time. This is what makes me suspect the regex linked above.
When I run GROBID on my laptop, using the same artifacts and configuration except for lower concurrency, I cannot reproduce the issue. I noted that my laptop (running Debian Linux on an Intel processor) has JDK 11, while the server (Ubuntu focal on an Intel processor) has JDK 8 (which is what GROBID supports, as per the documentation). I tried installing JDK 11 on the server instead, and could no longer reproduce the problem.
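The thread-dump inspection described above can also be approximated from inside a JVM. The following is only a sketch under assumptions: `RegexThreadFinder` and `threadsInRegex` are hypothetical names, and it relies solely on the standard `Thread.getAllStackTraces()` call. It lists threads whose stack contains `java.util.regex` frames, together with the first non-regex caller, which is roughly what one looks for by eye in a SIGQUIT dump.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RegexThreadFinder {
    /**
     * Returns one line per thread whose stack contains java.util.regex frames,
     * naming the first non-regex frame below them (the code that invoked the regex).
     */
    static List<String> threadsInRegex(Map<Thread, StackTraceElement[]> dump) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            boolean inRegex = false;
            String caller = null;
            // Frames are ordered from the innermost call outwards.
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().startsWith("java.util.regex.")) {
                    inRegex = true;
                } else if (inRegex) {
                    caller = frame.toString();
                    break;
                }
            }
            if (inRegex) {
                hits.add(entry.getKey().getName() + " in regex, called from "
                        + (caller == null ? "(unknown)" : caller));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Snapshot of all live threads in this JVM, comparable to a SIGQUIT dump.
        for (String line : threadsInRegex(Thread.getAllStackTraces())) {
            System.out.println(line);
        }
    }
}
```

A single snapshot only shows where threads are at one instant. As in the report above, taking several snapshots and seeing the same caller each time is what points to a stuck regex.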
I haven't really worked through what the specific root cause is... difficult regex for some body text on some platforms in a specific version of the JVM? But figured I would share what I have found so far here.
Thanks, as always, for maintaining GROBID, and apologies this bug report runs so long.