CDX indexer endless loop and memleak for certain warc files #402

vitezg · 2019-07-22T10:15:56Z

We have a few problematic WARC files, created by Heritrix 3, on which the CDX indexer seems to get stuck: CPU usage goes up to 100%, memory usage climbs to the -Xmx limit (16 GB), and the indexer stops producing output (for testing we just let it write to standard output, and once stuck it stops writing). The last lines of the output are consistent: the indexer always gets stuck at the same place for the same file.

The catch is that the problematic WARC file is about 9 GB and should not be made public; I'm only authorized to send its URL by email.

Any chance you could take a look?

ato · 2019-07-22T10:34:31Z

A thread dump might give a rough indication of where the problem is. Try pressing Ctrl+\ (Linux/Mac) or Ctrl+Break (Windows) in the terminal running the indexer. Although if it is stuck trying to allocate that might not necessarily indicate the source of the leak.
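If the terminal shortcut is not available (for example when the indexer runs under a wrapper script), a comparable dump can be produced from inside the JVM. The following is only a minimal sketch using the standard `Thread.getAllStackTraces()` call; the class name `ThreadDumpSketch` is hypothetical and not part of OpenWayback, and the output format only approximates what the JVM prints on Ctrl+\.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of OpenWayback.
final class ThreadDumpSketch {

    /** Formats the current stack of every live thread in this JVM. */
    static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            sb.append('"').append(thread.getName()).append('"')
              .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print to stderr so the dump does not mix with CDX lines on stdout.
        System.err.print(dump());
    }
}
```

Taking two dumps a few seconds apart and comparing the top frames of the "main" thread helps tell a tight loop (frames keep changing within the same few methods) from a blocked call (frames stay identical).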

vitezg · 2019-07-22T11:01:35Z

The first thread dump was taken right after the indexer got stuck; the second one is from a bit later. Both show the main thread only. If you need the data from the GC and other threads, I can send those too.

"main" prio=10 tid=0x00007fa7cc00c000 nid=0x5208 runnable [0x00007fa7d3faf000]
   java.lang.Thread.State: RUNNABLE
    at org.htmlparser.nodes.TagNode.getTagName(TagNode.java:398)
    at org.archive.wayback.util.htmllex.NodeUtils.isCloseTagNodeNamed(NodeUtils.java:72)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:87)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

Second one:

"main" prio=10 tid=0x00007fa7cc00c000 nid=0x5208 runnable [0x00007fa7d3faf000]
   java.lang.Thread.State: RUNNABLE
    at org.htmlparser.lexer.InputStreamSource.fill(InputStreamSource.java:337)
    at org.htmlparser.lexer.InputStreamSource.read(InputStreamSource.java:396)
    at org.htmlparser.lexer.Page.getCharacter(Page.java:705)
    at org.htmlparser.lexer.Lexer.parseString(Lexer.java:735)
    at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:398)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:72)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

anjackson · 2019-07-22T11:09:10Z

Just noting this related issue: #162

The indexer only does this parsing to poke around for robots.txt assertions that most (any?) of us don't make use of. Maybe we should modify things so it's optional/off-by-default?

openwayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/HTTPRecordAnnotater.java

Lines 137 to 143 in c49f8e7

    
    // Now the sticky part: If it looks like an HTML document, look for
    // robot meta tags:
    if(isHTML(mimeType)) {
        String fileContext = result.getFile() + ":" + result.getOffset();
        annotateHTMLContent(is, encoding, fileContext, result);
    }
    robotFlags.apply(result);
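To make the "optional/off-by-default" idea concrete, here is a rough sketch of guarding the HTML scan behind a boolean. Everything in it is an assumption for illustration: the class `OptionalHtmlParsingSketch`, the `parseHtml` field, and the simplified `isHtml` check are hypothetical stand-ins and do not reflect the actual HTTPRecordAnnotater code or whatever change is eventually made.

```java
import java.util.Locale;

// Hypothetical stand-in for the annotater; names and structure are assumptions.
final class OptionalHtmlParsingSketch {

    private final boolean parseHtml; // false would be the proposed default
    private int htmlParses = 0;      // counts how often the HTML scan ran

    OptionalHtmlParsingSketch(boolean parseHtml) {
        this.parseHtml = parseHtml;
    }

    /** Simplified MIME check; the real isHTML may differ. */
    static boolean isHtml(String mimeType) {
        return mimeType != null
                && mimeType.toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    /** Runs the robots-meta scan only when parsing is enabled and the record is HTML. */
    void annotate(String mimeType, String body) {
        if (parseHtml && isHtml(mimeType)) {
            htmlParses++; // placeholder for the real annotateHTMLContent call
        }
        // Non-HTML work (e.g. applying already-known flags) would continue here.
    }

    int htmlParses() {
        return htmlParses;
    }

    public static void main(String[] args) {
        OptionalHtmlParsingSketch off = new OptionalHtmlParsingSketch(false);
        off.annotate("text/html", "<html></html>");
        OptionalHtmlParsingSketch on = new OptionalHtmlParsingSketch(true);
        on.annotate("text/html", "<html></html>");
        System.out.println("off=" + off.htmlParses() + " on=" + on.htmlParses());
    }
}
```

With the flag off, the lexer is never invoked, so records that trip the parser are indexed like any other record, at the cost of not recording robot meta tags.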

vitezg · 2019-07-22T11:28:05Z

Just for testing, I put a return; at the start of annotateHTTPContent, before robotFlags.reset(); (openwayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/HTTPRecordAnnotater.java, line 88 in c49f8e7), and now it finishes as expected: no endless loop, no memory leak.

The HTML parser can go into an infinite loop (#402, #162). Since robotflags are not used by most users, let's disable it by default to make indexing more reliable. Adds a -parse-html option to the cdx-indexer CLI tool to re-enable it.

ldko · 2019-07-22T22:16:18Z

I am closing this issue now that #403 has been merged, which allows the endless loop to be avoided. #162 remains open to address the remaining potential for an infinite loop on some WARCs, and it references this issue.

ato added a commit that referenced this issue Jul 22, 2019

By default disable html parsing during indexing

b94cd83

The HTML parser can go into an infinite loop (#402, #162). Since robotflags are not used by most users let's disable it by default to make indexing more reliable.

ato mentioned this issue Jul 22, 2019

By default disable html parsing during indexing #403

Merged

ldko closed this as completed Jul 22, 2019
