JRuby XPATH Memory Usage #1749
Comments
Hi, thanks for reporting this. The issue template that you deleted when filing this had a few key questions that will help us reproduce and diagnose this issue: What's the output from Can you provide a self-contained script that reproduces what you're seeing? Based on what you've written, I'm assuming we'll be able to reproduce it, but helping us out means we're likely to get to it sooner, and we'll have a test case to work against.
A couple of things that a working example would clarify: the document structure (are the tags intended to be unclosed, or self-closing, or ...), the query (is the xpath really
Here's my attempt to reproduce:
on MRI with this config:
memory usage stabilizes at:
on JRuby with this config:
the memory usage is much larger, stabilizing around:
Does this match what you're seeing?
(I'll note that just the JRuby interpreter is rather large, around
)
Thanks for taking a look at this, and sorry for not including some vital pieces of information. I wasn't sure this wouldn't be chalked up to the general terribleness of xpath on Java (à la #741). I'll have to collect the details tomorrow, but in my case I was using jruby-9.1.14.0 and nokogiri 1.8.2 on Java 8. I was also looking at memory dumps to get the exact size details, because even with 2GB of heap space, 1 or 2 documents were enough to run out of space. And of course there were XML namespaces involved, so it'll take a bit of work to narrow down a representative example.
Here's a representative script to demonstrate the problem. This script will blow a 2GB heap in a few seconds.
Nokogiri (1.8.2)
We are also running out of memory on 8GB of heap and it seems to be this issue: a heap dump shows XPathContexts with gigs of int[]s tied back to
I'm just back from vacation and it will take a few days to catch up on everything. Thanks for your patience.
Here is a modified reproducible example that is closer to the issue we're having. It uses HTML and uses On a 2GB heap this will crash in a few seconds.
@flavorjones we are open to trying to fix this ourselves, but are unfamiliar with the source. Do you have any idea where to even start looking here? I would really appreciate any help you can provide, as this is causing our production server to crash on a regular basis. We've tried to work around this issue by never doing nested xpath/css, but perhaps we missed something, or a third-party lib (e.g. readability) is doing it, because it keeps running out of memory (even with a 12GB heap).
I also noticed increased memory usage under JRuby from around 1.6. I thought maybe some of my changes were to blame, but there never was a production leak, even with heavy xpath use on 1-2M XMLs. @jeremyhaile you simply need to try understanding the Xalan library used under the hood and why it's keeping the internal state around; maybe it's xpath cache related. Just some minor hints: it's definitely resolvable if enough quality time is put in (no clear guess really, as some internal pieces get tricky).
I noticed that the class that is taking up all of the heap is actually in the nokogiri source, despite being in the Is this class overriding behavior from Xalan?
@kares I noticed you wrote a lot of the code in
@jeremyhaile can you try the branch in #1792 and let me know if it fixes your issue |
@jeremyhaile no ideas atm. I would need to dive deep on this one to really understand what's going on.
@kares The branch from @jvshahid fixes the memory issue. However, as I outlined along with a reproducible test case, there is still a huge performance penalty incurred by nested xpath queries, and the penalty seems to grow exponentially with the number of elements being searched. Here is my relevant comment on the PR:
When using Nokogiri on JRuby with a nested XPath loop, the document memory footprint explodes in size.
For example, with a document like
<items> <item> <value1> <value2> <item>
... for many thousands of items
And attempting to use (for example) an xpath like: doc.xpath('items').each { |node| node.xpath('value1') }
You'll wind up with a document that could be hundreds of megabytes large due to caching in the CACHED_XPATH_CONTEXT layer. Specifically the nokogiri.internals.XalanDTMManagerPatch winds up with thousands of values in "m_dtm". I'm not an expert in this area and am unclear what that terminology is referencing. In my case I had a document with 4000 items taking 4GB of memory of cached xpath. And there appears to be no way to clear that specific cache.
This behavior is not present in the MRI Ruby version.
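For anyone trying to reproduce this, here is a minimal sketch of the access pattern described above. The document shape (well-formed, closed tags), the `/items/item` paths, the item count, and the helper names `build_items_xml`, `nested_value1_texts`, and `flat_value1_texts` are all assumptions made for illustration; this is not the script from the comments. The flat variant mirrors the workaround of avoiding nested xpath calls, and whether it actually sidesteps the cache growth on a given JRuby/Nokogiri combination would need to be measured.

```ruby
# Hypothetical reproduction sketch; names and document shape are assumptions.
begin
  require "nokogiri"
rescue LoadError
  warn "nokogiri is not installed; only the document builder is usable"
end

# Build a synthetic document: <items><item><value1/><value2/></item>...</items>
def build_items_xml(count)
  items = Array.new(count) do |i|
    "<item><value1>#{i}</value1><value2>#{i * 2}</value2></item>"
  end
  "<items>#{items.join}</items>"
end

# Nested pattern: one outer query, then one more query per matched node.
# This is the shape that the report associates with the memory growth on JRuby.
def nested_value1_texts(doc)
  doc.xpath("/items/item").flat_map do |item|
    item.xpath("value1").map(&:text)
  end
end

# Flat pattern: a single query evaluated once against the document.
def flat_value1_texts(doc)
  doc.xpath("/items/item/value1").map(&:text)
end

if defined?(Nokogiri)
  item_count = 4_000 # assumed size, loosely based on the "4000 items" in the report
  doc = Nokogiri::XML(build_items_xml(item_count))
  nested = nested_value1_texts(doc)
  flat = flat_value1_texts(doc)
  puts "nested and flat results match: #{nested == flat}"
end
```

Running the same file under MRI and under JRuby with a fixed heap (for example via `-J-Xmx2g` on JRuby) and a larger item count should make it possible to compare the two patterns.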