CDX indexer endless loop and memleak for certain warc files #402
Comments
A thread dump might give a rough indication of where the problem is. Try pressing Ctrl+\ (Linux/Mac) or Ctrl+Break (Windows) in the terminal running the indexer. That said, if it is stuck trying to allocate memory, the dump will not necessarily point to the source of the leak.
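If the key combination is not practical (for example, when the indexer runs without an attached terminal), running the JDK's `jstack` tool against the process ID prints the same kind of dump. A dump can also be produced from inside the JVM. The following is only a minimal sketch using the standard `java.lang.management` API; the class name `ThreadDumpSketch` and the idea of calling it from a watchdog thread are illustrative assumptions, not part of the indexer.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of the indexer: formats a dump of all live threads.
public class ThreadDumpSketch {

    /** Returns the name, state, and stack trace of every live thread as text. */
    public static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip locked monitor and synchronizer details to keep the dump cheap.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" state=").append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Taking two or three dumps a few seconds apart and comparing the main thread's frames usually shows whether it is looping in one method or slowly making progress.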
The first thread dump was taken right after the indexer got stuck; the second one is from a bit later. Both show the main thread only. If you need the data from the GC and other threads, I can send those too.
Second one:
Just noting this related issue: #162. The indexer only does this parsing to poke around for Lines 137 to 143 in c49f8e7
Just for testing I put a Line 88 in c49f8e7
We have a few problematic WARC files, created by Heritrix 3, on which the CDX indexer gets stuck: CPU usage goes to 100%, memory usage climbs to the -Xmx limit (16 GB), and the indexer stops producing output. (For testing we let it write to standard output; once stuck, it stops writing.) The last lines of output are consistent: for a given file, the indexer always gets stuck at the same place.
The catch is that the problematic WARC file is about 9 GB and should not be made public; I'm only authorized to send its URL by email.
Any chance you could take a look?