CDX indexer endless loop and memleak for certain warc files #402

Closed
vitezg opened this issue Jul 22, 2019 · 5 comments

@vitezg

vitezg commented Jul 22, 2019

We have a few problematic WARC files, created by Heritrix 3, on which the CDX indexer gets stuck: CPU usage goes up to 100%, memory usage climbs to the -Xmx limit (16 GB), and the indexer stops producing output. (For testing we just let it write to standard output; once stuck, it stops writing.) The behaviour is reproducible: for a given file the indexer always gets stuck at the same place, and the last lines of output are the same each time.

The catch is that the problematic WARC file is about 9 GB and should not be made public; I'm only authorized to send its URL by email.

Any chance you could take a look?

@ato
Member

ato commented Jul 22, 2019

A thread dump might give a rough indication of where the problem is. Try pressing Ctrl+\ (Linux/Mac) or Ctrl+Break (Windows) in the terminal running the indexer. If it is stuck trying to allocate memory, though, the dump might not point at the source of the leak.
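
If sending the signal from a terminal is not convenient, a similar dump can be produced from inside the JVM. The following is only a minimal sketch using the standard Thread.getAllStackTraces() call; the class name ThreadDumpSketch and the output layout are made up for illustration and are not part of the indexer.

```java
import java.util.Map;

// Hypothetical helper, not part of the indexer: prints the stack of every
// live thread, similar in spirit to the dump produced by Ctrl+\.
class ThreadDumpSketch {

    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            // Header line: thread name and current state.
            sb.append('"').append(thread.getName()).append('"')
              .append(" state=").append(thread.getState()).append('\n');
            // One line per stack frame, innermost frame first.
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

Taking two dumps a few seconds apart and comparing the top frames of the main thread is usually enough to tell a tight loop from a blocked read.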

@vitezg
Author

vitezg commented Jul 22, 2019

The first thread dump was taken right after the indexer got stuck, the second one a bit later. Both show the main thread only; if you need the data from the GC and other threads I can send those too.

"main" prio=10 tid=0x00007fa7cc00c000 nid=0x5208 runnable [0x00007fa7d3faf000]
   java.lang.Thread.State: RUNNABLE
    at org.htmlparser.nodes.TagNode.getTagName(TagNode.java:398)
    at org.archive.wayback.util.htmllex.NodeUtils.isCloseTagNodeNamed(NodeUtils.java:72)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:87)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

Second one:

"main" prio=10 tid=0x00007fa7cc00c000 nid=0x5208 runnable [0x00007fa7d3faf000]
   java.lang.Thread.State: RUNNABLE
    at org.htmlparser.lexer.InputStreamSource.fill(InputStreamSource.java:337)
    at org.htmlparser.lexer.InputStreamSource.read(InputStreamSource.java:396)
    at org.htmlparser.lexer.Page.getCharacter(Page.java:705)
    at org.htmlparser.lexer.Lexer.parseString(Lexer.java:735)
    at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:398)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:72)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

@anjackson
Member

Just noting this related issue: #162

The indexer only does this parsing to poke around for robots.txt assertions that most (any?) of us don't make use of. Maybe we should modify things so it's optional/off-by-default?

    // Now the sticky part: If it looks like an HTML document, look for
    // robot meta tags:
    if(isHTML(mimeType)) {
        String fileContext = result.getFile() + ":" + result.getOffset();
        annotateHTMLContent(is, encoding, fileContext, result);
    }
    robotFlags.apply(result);
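
To make the "optional/off-by-default" idea concrete, here is a rough sketch of guarding the HTML parse behind a flag. Everything in it is hypothetical: the class name, the parseHtml field, the simplified isHTML check and the counter that stands in for the real annotateHTMLContent call are assumptions for illustration, not the actual HTTPRecordAnnotater code.

```java
import java.util.Locale;

// Hypothetical sketch, not the real HTTPRecordAnnotater.
class OptionalHtmlAnnotationSketch {

    // Assumed flag: off by default, so the HTML parser is never invoked
    // unless the operator explicitly asks for robot meta tag detection.
    private boolean parseHtml = false;

    // Stand-in for the real parser call; counts how often it would run.
    int htmlParseCount = 0;

    void setParseHtml(boolean parseHtml) {
        this.parseHtml = parseHtml;
    }

    // Simplified stand-in for the real isHTML check.
    static boolean isHTML(String mimeType) {
        return mimeType != null
                && mimeType.toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    void annotate(String mimeType) {
        // Only reach the (potentially looping) parser when it is enabled
        // and the record looks like HTML.
        if (parseHtml && isHTML(mimeType)) {
            htmlParseCount++;
        }
    }

    public static void main(String[] args) {
        OptionalHtmlAnnotationSketch annotater = new OptionalHtmlAnnotationSketch();
        annotater.annotate("text/html");   // skipped: flag is off by default
        annotater.setParseHtml(true);
        annotater.annotate("text/html");   // parsed: flag is on
        System.out.println(annotater.htmlParseCount);
    }
}
```

With the flag off, a record that would trigger the parser loop is indexed without its content ever reaching the HTML lexer, at the cost of not recording robot meta flags.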

@vitezg
Author

vitezg commented Jul 22, 2019

Just for testing I put a return; at the front of annotateHTTPContent, before robotFlags.reset();, and now it finishes as expected: no endless loop, no memory leak.

ato added a commit that referenced this issue Jul 22, 2019
The HTML parser can go into an infinite loop (#402, #162). Since robotflags
are not used by most users let's disable it by default to make indexing
more reliable.

Adds a -parse-html option to the cdx-indexer CLI tool to re-enable it.

@ldko
Member

ldko commented Jul 22, 2019

I am closing this issue now that #403 has been merged, which allows the endless loop to be avoided. #162 remains open to address the remaining potential for an infinite loop on some WARCs, and it references this issue.

@ldko ldko closed this as completed Jul 22, 2019