GROBID 0.7.0 100% CPU loop/hang/timeout with some PDFs #867

bnewbold · 2021-12-04T03:03:35Z

In our production systems at archive.org, I upgraded from GROBID ~0.5.5 to ~0.7.0 (git commit beebd9a6b to be exact) a few weeks ago. Since then, when processing hundreds of thousands of PDFs, I have regularly experienced GROBID "looping" with a couple threads consuming 100% CPU on individual cores. The symptoms are consistent with a small fraction of PDFs causing this behavior repeatably, while most PDFs continue to processed successfully. If we push a large number of PDFs through in parallel, eventually GROBID ends up spinning all the cores. The mitigation has been to restart GROBID every few hours.

It has been a struggle to get this issue to reproduce reliably in a minimum test case, even after a few hours debugging, and I apologize for any mistakes I have in debugging, or for lack of clarity in this issue write-up.

My best hypothesis right now is that there is an issue with a specific regex, when running GROBID with openjdk-8-jre (JVM 8). With openjdk-11-jre (JVM 11), the issue does not reproduce.

I have used a couple different clients and PDFs. A simple combination that has been reproducing the problem for me is the file sha1:5fa184d7eee96a504bae646e6c045699530bc023, which you can find an exact copy of at https://web.archive.org/web/20180719070056_/https://www.dovepress.com/getfile.php?fileID=14985. The API endpoint is processFulltextDocument, with consolidation disabled, raw citations and affiliations included, a number of TEI coordinates enabled, and sentance segmentation on. I haven't experimented with removing most of these options. Here is an example command using the httpie CLI client:

http --form --timeout 120 post http://localhost:8070/api/processFulltextDocument consolidateHeader=0 consolidateCitations=0 includeRawCitations=1 includeRawAffiliations=1 segmentSentences=1 teiCoordinates=ref teiCoordinates=figure teiCoordinates=biblStruct input@5fa184d7eee96a504bae646e6c045699530bc023.pdf

This command times out after two minutes, and the threads continue to spin for at least 10 minutes on the server. In other cases the CPU seems to continue to spin for hours, though I haven't isolated and reproduced exactly that behavior.

The specific regex is PARENTHESIS_NUMBER_PATTERN, which is called on line 66 of CalloutAnalyzer.java:

grobid/grobid-core/src/main/java/org/grobid/core/engines/citations/CalloutAnalyzer.java

Lines 65 to 68 in beebd9a

    
           matcher = PARENTHESIS_NUMBER_PATTERN.matcher(calloutString); 
        
           if (matcher.find()) { 
        
               return MarkerType.PARENTHESIS_NUMBER; 
        
           }

To debug, I submit a couple parallel requests of the same PDF file, then send signal 3 to a high-CPU process, which causes java to dump a thread trace. These are long, and include many threads, but a common pattern is a trace like:

Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: "dw-35 - POST /api/processFulltextDocument" #35 prio=5 os_prio=0 tid=0x00007f534c001000 nid=0xc239e runnable [0x00007f53b828e000]
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]:    java.lang.Thread.State: RUNNABLE
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match(Pattern.java:4241)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Branch.match(Pattern.java:4618)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Loop.match(Pattern.java:4799)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupTail.match(Pattern.java:4731)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$BranchConn.match(Pattern.java:4582)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupTail.match(Pattern.java:4731)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Ques.match(Pattern.java:4196)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match0(Pattern.java:4286)

[... many java.util.regex lines ...]

Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match0(Pattern.java:4286)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match(Pattern.java:4248)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Branch.match(Pattern.java:4618)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4815)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Prolog.match(Pattern.java:4755)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Start.match(Pattern.java:3475)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Matcher.search(Matcher.java:1248)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Matcher.find(Matcher.java:637)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.citations.CalloutAnalyzer.getCalloutType(CalloutAnalyzer.java:66)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.FullTextParser.postProcessCallout(FullTextParser.java:2359)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:298)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:110)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:507)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:497)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:207)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:268)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:220)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

[... many service lines ...]

Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.eclipse.jetty.io.ManagedSelector$$Lambda$101/448084064.run(Unknown Source)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
Dec  4 02:18:11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.lang.Thread.run(Thread.java:748)

The org.grobid.core.engines.citations.CalloutAnalyzer.getCalloutType(CalloutAnalyzer.java:66) is common across a few repeats with different files, different requests, and multiple threads stuck in the same regex at the same point in time. This is what makes me suspect the regex linked above.

When I run GROBID on my laptop, using the same artifacts and configuration, except lower concurrency, I can not reproduce the issue. I noted that my laptop (running Debian Linux, on an intel processor) has JDK 11, while the server (Ubuntu focal on intel processor) has JDK 8 (which is what GROBID supports, as per documentation). I tried installing JDK 11 on the server instead, and could no longer reproduce the problem.

I haven't really worked through what the specific root cause is... difficult regex for some body text on some platforms in a specific version of the JVM? But figured I would share what I have found so far here.

Thanks, as always, for maintaining GROBID, and apologies this bug report runs so long.

The text was updated successfully, but these errors were encountered:

bnewbold · 2021-12-04T03:11:34Z

Here is a reference to regex implementation changes in Java 9, related to slow patterns and the "regex denial of service" issue. I'm not sure if they apply here or not.

Here is the PARENTHESIS_NUMBER_PATTERN regex itself: \(((\d+[a-f]?)|[,-;•])+\).

kermitt2 · 2021-12-04T08:13:58Z

Hi @bnewbold !

Thank you so much for the detailed issue with the example document, this is super clear and helpful.

So as you have found, this is a catastrophic backtracking problem with this regex (and another one similar). The fact that it works with JDK 11 is related to their new safety feature for regex, which cancels suspicious regex applications to improve stability (this is a very nice circuit breaker feature !). With JDK 8, the regex will run as long as necessary, so 10 minutes or a few years :)

This part is indeed something I changed with 0.7.0 to improve citation callout recognition (and it improved it ! despite this bug :) ). It seems that there are always possible pathological input strings that can cause this kind of problem for non trivial regex. Here we are not in the easy case of nested regex expression:

input string (callout candidate) from the example: (1915-1919,1920-1924,1925-1929,1930-1934,and1935-1939) (the space character is removed before applying the regex)
the regex you pointed will cause catastrophic backtracking in this case. Even a simpler, well-bounded, nothing nested, regex like this:
\((\d{1,4}[a-f]?[\-;•,]?)+\)
will cause a catastrophic backtracking with this input string !

After reading some more regex doc, it appears that this can be fixed (without the JDK 11 solution) using atomic grouping in the regex: \((?>[0-9]{1,4}[a-f]?[\-;•,]?)+\). The ?> indicates that no backtracking should take place for the defined token in this group, so we will save a few century of computation in the pathological cases.

PR to follow.

bnewbold · 2021-12-07T02:55:44Z

Great!

I gave the PR #868 a try with several dozen PDFs which were causing problems before, and all were processed successfully.

kermitt2 · 2021-12-07T04:58:26Z

Thanks a lot for the feedback !
I merge the PR.

Fix regex, #867

kermitt2 self-assigned this Dec 4, 2021

kermitt2 added the bug From Hemiptera and especially its suborder Heteroptera label Dec 4, 2021

kermitt2 mentioned this issue Dec 4, 2021

Fix regex, #867 #868

Merged

kermitt2 added the implemented The issue has been implemented label Dec 7, 2021

kermitt2 added a commit that referenced this issue Dec 7, 2021

Merge pull request #868 from kermitt2/fix-867

6e5236e

Fix regex, #867

bnewbold closed this as completed Dec 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GROBID 0.7.0 100% CPU loop/hang/timeout with some PDFs #867

GROBID 0.7.0 100% CPU loop/hang/timeout with some PDFs #867

bnewbold commented Dec 4, 2021

bnewbold commented Dec 4, 2021

kermitt2 commented Dec 4, 2021

bnewbold commented Dec 7, 2021

kermitt2 commented Dec 7, 2021

GROBID 0.7.0 100% CPU loop/hang/timeout with some PDFs #867

GROBID 0.7.0 100% CPU loop/hang/timeout with some PDFs #867

Comments

bnewbold commented Dec 4, 2021

bnewbold commented Dec 4, 2021

kermitt2 commented Dec 4, 2021

bnewbold commented Dec 7, 2021

kermitt2 commented Dec 7, 2021