Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GROBID 0.7.0 100% CPU loop/hang/timeout with some PDFs #867

Closed
bnewbold opened this issue Dec 4, 2021 · 4 comments
Closed

GROBID 0.7.0 100% CPU loop/hang/timeout with some PDFs #867

bnewbold opened this issue Dec 4, 2021 · 4 comments
Assignees
Labels
bug From Hemiptera and especially its suborder Heteroptera implemented The issue has been implemented

Comments

@bnewbold
Copy link
Contributor

bnewbold commented Dec 4, 2021

In our production systems at archive.org, I upgraded from GROBID ~0.5.5 to ~0.7.0 (git commit beebd9a6b to be exact) a few weeks ago. Since then, when processing hundreds of thousands of PDFs, I have regularly experienced GROBID "looping" with a couple threads consuming 100% CPU on individual cores. The symptoms are consistent with a small fraction of PDFs causing this behavior repeatably, while most PDFs continue to processed successfully. If we push a large number of PDFs through in parallel, eventually GROBID ends up spinning all the cores. The mitigation has been to restart GROBID every few hours.

It has been a struggle to get this issue to reproduce reliably in a minimum test case, even after a few hours debugging, and I apologize for any mistakes I have in debugging, or for lack of clarity in this issue write-up.

My best hypothesis right now is that there is an issue with a specific regex, when running GROBID with openjdk-8-jre (JVM 8). With openjdk-11-jre (JVM 11), the issue does not reproduce.

I have used a couple different clients and PDFs. A simple combination that has been reproducing the problem for me is the file sha1:5fa184d7eee96a504bae646e6c045699530bc023, which you can find an exact copy of at https://web.archive.org/web/20180719070056_/https://www.dovepress.com/getfile.php?fileID=14985. The API endpoint is processFulltextDocument, with consolidation disabled, raw citations and affiliations included, a number of TEI coordinates enabled, and sentance segmentation on. I haven't experimented with removing most of these options. Here is an example command using the httpie CLI client:

http --form --timeout 120 post http://localhost:8070/api/processFulltextDocument consolidateHeader=0 consolidateCitations=0 includeRawCitations=1 includeRawAffiliations=1 segmentSentences=1 teiCoordinates=ref teiCoordinates=figure teiCoordinates=biblStruct input@5fa184d7eee96a504bae646e6c045699530bc023.pdf

This command times out after two minutes, and the threads continue to spin for at least 10 minutes on the server. In other cases the CPU seems to continue to spin for hours, though I haven't isolated and reproduced exactly that behavior.

The specific regex is PARENTHESIS_NUMBER_PATTERN, which is called on line 66 of CalloutAnalyzer.java:

matcher = PARENTHESIS_NUMBER_PATTERN.matcher(calloutString);
if (matcher.find()) {
return MarkerType.PARENTHESIS_NUMBER;
}

To debug, I submit a couple parallel requests of the same PDF file, then send signal 3 to a high-CPU process, which causes java to dump a thread trace. These are long, and include many threads, but a common pattern is a trace like:

Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: "dw-35 - POST /api/processFulltextDocument" #35 prio=5 os_prio=0 tid=0x00007f534c001000 nid=0xc239e runnable [0x00007f53b828e000]
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]:    java.lang.Thread.State: RUNNABLE
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match(Pattern.java:4241)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Branch.match(Pattern.java:4618)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Loop.match(Pattern.java:4799)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupTail.match(Pattern.java:4731)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$BranchConn.match(Pattern.java:4582)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupTail.match(Pattern.java:4731)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Ques.match(Pattern.java:4196)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match0(Pattern.java:4286)

[... many java.util.regex lines ...]

Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match0(Pattern.java:4286)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match(Pattern.java:4248)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Branch.match(Pattern.java:4618)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4815)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Prolog.match(Pattern.java:4755)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Start.match(Pattern.java:3475)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Matcher.search(Matcher.java:1248)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Matcher.find(Matcher.java:637)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.citations.CalloutAnalyzer.getCalloutType(CalloutAnalyzer.java:66)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.FullTextParser.postProcessCallout(FullTextParser.java:2359)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:298)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:110)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:507)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:497)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:207)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:268)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:220)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

[... many service lines ...]

Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.eclipse.jetty.io.ManagedSelector$$Lambda$101/448084064.run(Unknown Source)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.lang.Thread.run(Thread.java:748)

The org.grobid.core.engines.citations.CalloutAnalyzer.getCalloutType(CalloutAnalyzer.java:66) is common across a few repeats with different files, different requests, and multiple threads stuck in the same regex at the same point in time. This is what makes me suspect the regex linked above.

When I run GROBID on my laptop, using the same artifacts and configuration, except lower concurrency, I can not reproduce the issue. I noted that my laptop (running Debian Linux, on an intel processor) has JDK 11, while the server (Ubuntu focal on intel processor) has JDK 8 (which is what GROBID supports, as per documentation). I tried installing JDK 11 on the server instead, and could no longer reproduce the problem.

I haven't really worked through what the specific root cause is... difficult regex for some body text on some platforms in a specific version of the JVM? But figured I would share what I have found so far here.

Thanks, as always, for maintaining GROBID, and apologies this bug report runs so long.

@bnewbold
Copy link
Contributor Author

bnewbold commented Dec 4, 2021

Here is a reference to regex implementation changes in Java 9, related to slow patterns and the "regex denial of service" issue. I'm not sure if they apply here or not.

Here is the PARENTHESIS_NUMBER_PATTERN regex itself: \(((\d+[a-f]?)|[,-;•])+\).

@kermitt2 kermitt2 self-assigned this Dec 4, 2021
@kermitt2 kermitt2 added the bug From Hemiptera and especially its suborder Heteroptera label Dec 4, 2021
@kermitt2
Copy link
Owner

kermitt2 commented Dec 4, 2021

Hi @bnewbold !

Thank you so much for the detailed issue with the example document, this is super clear and helpful.

So as you have found, this is a catastrophic backtracking problem with this regex (and another one similar). The fact that it works with JDK 11 is related to their new safety feature for regex, which cancels suspicious regex applications to improve stability (this is a very nice circuit breaker feature !). With JDK 8, the regex will run as long as necessary, so 10 minutes or a few years :)

This part is indeed something I changed with 0.7.0 to improve citation callout recognition (and it improved it ! despite this bug :) ). It seems that there are always possible pathological input strings that can cause this kind of problem for non trivial regex. Here we are not in the easy case of nested regex expression:

  • input string (callout candidate) from the example: (1915-1919,1920-1924,1925-1929,1930-1934,and1935-1939) (the space character is removed before applying the regex)

  • the regex you pointed will cause catastrophic backtracking in this case. Even a simpler, well-bounded, nothing nested, regex like this:
    \((\d{1,4}[a-f]?[\-;•,]?)+\)
    will cause a catastrophic backtracking with this input string !

After reading some more regex doc, it appears that this can be fixed (without the JDK 11 solution) using atomic grouping in the regex: \((?>[0-9]{1,4}[a-f]?[\-;•,]?)+\). The ?> indicates that no backtracking should take place for the defined token in this group, so we will save a few century of computation in the pathological cases.

PR to follow.

@bnewbold
Copy link
Contributor Author

bnewbold commented Dec 7, 2021

Great!

I gave the PR #868 a try with several dozen PDFs which were causing problems before, and all were processed successfully.

@kermitt2
Copy link
Owner

kermitt2 commented Dec 7, 2021

Thanks a lot for the feedback !
I merge the PR.

@kermitt2 kermitt2 added the implemented The issue has been implemented label Dec 7, 2021
kermitt2 added a commit that referenced this issue Dec 7, 2021
@bnewbold bnewbold closed this as completed Dec 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug From Hemiptera and especially its suborder Heteroptera implemented The issue has been implemented
Projects
None yet
Development

No branches or pull requests

2 participants