Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JRuby XPATH Memory Usage #1749

Closed
peoplesmeat opened this issue Apr 3, 2018 · 16 comments · Fixed by #1792
Closed

JRuby XPATH Memory Usage #1749

peoplesmeat opened this issue Apr 3, 2018 · 16 comments · Fixed by #1792

Comments

@peoplesmeat
Copy link

peoplesmeat commented Apr 3, 2018

When using Nokogiri on jruby with a nested XPATH loop, the document memory footpring explodes in size.

For example, with a document like
<items> <item> <value1> <value2> <item>
... for many thousands of items

And attempting to use (for example) an xpath like: doc.xpath('items').each { |node| node.xpath('value1') }

You'll wind up with a document that could be hundreds of megabytes large due to caching in the CACHED_XPATH_CONTEXT layer. Specifically the nokogiri.internals.XalanDTMManagerPatch winds up with thousands of values in "m_dtm". I'm not an expert in this area and am unclear what that terminology is referencing. In my case I had a document with 4000 items taking 4GB of memory of cached xpath. And there appears to be no way to clear that specific cache.

This behavior is not present in the mri ruby version.

@flavorjones
Copy link
Member

Hi,

Thanks for reporting this. The issue template that you deleted when filing this had a few key questions that will help us reproduce and diagnose this issue:

What's the output from nokogiri -v?

Can you provide a self-contained script that reproduces what you're seeing?

Based on what you've written, I'm assuming we'll be able to reproduce it, but helping us out means we're likely to get to it sooner, and we'll have a test case to work against.

@flavorjones
Copy link
Member

A couple of things that a working example would clarify: the document structure (are the tags intended to be unclosed, or self-closing, or ...), the query (is the xpath really value1 or did you mean //value1, or ...)

@flavorjones
Copy link
Member

Here's my attempt to reproduce:

#!/usr/bin/env ruby

require 'nokogiri'

xml = '<items>' + '<item><value1/><value2/></item>'*4000

doc = Nokogiri::XML xml

puts "pid is #{$$}"

loop do
  doc.xpath('items').each { |node| node.xpath('//value1') }
  system "cat /proc/#{$$}/status | egrep 'VmSize|VmRSS'"
end

on MRI with this config:

# Nokogiri (1.8.2)
    ---
    warnings: []
    nokogiri: 1.8.2
    ruby:
      version: 2.4.1
      platform: x86_64-linux
      description: ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/flavorjones/.rvm/gems/ruby-2.4.1/gems/nokogiri-1.8.2/ports/x86_64-pc-linux-gnu/libxml2/2.9.7"
      libxslt_path: "/home/flavorjones/.rvm/gems/ruby-2.4.1/gems/nokogiri-1.8.2/ports/x86_64-pc-linux-gnu/libxslt/1.1.32"
      libxml2_patches: []
      libxslt_patches: []
      compiled: 2.9.7
      loaded: 2.9.7

memory usage stabilizes at:

VmSize:	   75316 kB
VmRSS:	   26968 kB

on JRuby with this config:

# Nokogiri (1.8.2)
    ---
    warnings: []
    nokogiri: 1.8.2
    ruby:
      version: 2.3.3
      platform: java
      description: jruby 9.1.15.0 (2.3.3) 2017-12-07 929fde8 OpenJDK 64-Bit Server VM
        25.162-b12 on 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12 [linux-x86_64]
      engine: jruby
      jruby: 9.1.15.0
    xerces: Xerces-J 2.11.0
    nekohtml: NekoHTML 1.9.21

the memory usage is much larger, stabilizing around:

VmSize:	 3068068 kB
VmRSS:	  401372 kB

does this match what you're seeing?

@flavorjones
Copy link
Member

(I'll note that just the JRuby interpreter is rather large, around

VmSize:	 3065972 kB
VmRSS:	  168000 kB

)

@peoplesmeat
Copy link
Author

Thanks for taking a look at this, sorry for not including some vital pieces of information. I wasn't sure this wouldn't be chalked up to general terribleness of xpath on java (ala #741 ). I'll have to collect the details tomorrow, but in my case I was using jruby-9.1.14.0 and nokogiri 1.8.2 on java8. I was also looking at memory dumps to get the exact size details because even with 2GB heap space 1 or 2 documents was running out of space. And of course there were XML namespaces involved so it'll take a bit of work to narrow down a representative example.

@peoplesmeat
Copy link
Author

Here's a representative script to demonstrate the problem. This script will blow a 2GB heap in a few seconds.

#!/usr/bin/env ruby
require 'nokogiri'

items = %Q(
<net:Item>
  	<net:InformationBlock>
		derp
  	</net:InformationBlock>
</net:Item>
) * 5000

xml = %Q(
<net:Information xmlns:net="urn:payload:information">
  <net:Items>
	#{items}
  </net:Items>
</net:Information>
)
doc = Nokogiri::XML xml
puts "pid is #{$$}"
doc.xpath('./net:Information/net:Items/net:Item').each do |node| 
  node.xpath('./net:InformationBlock')
end

Nokogiri (1.8.2)
---
warnings: []
nokogiri: 1.8.2
ruby:
version: 2.3.3
platform: java
description: jruby 9.1.14.0 (2.3.3) 2017-11-08 2176f24 Java HotSpot(TM) 64-Bit Server
VM 25.91-b14 on 1.8.0_91-b14 +jit [darwin-x86_64]
engine: jruby
jruby: 9.1.14.0
xerces: Xerces-J 2.11.0
nekohtml: NekoHTML 1.9.21

@jeremyhaile
Copy link

We are also running out of memory on 8GB of heap and it seems to be this issue – heap dump shows XPathContexts with gigs of int[]s tied back to m_dtms.

@flavorjones
Copy link
Member

I'm just back from vacation and it will take a few days to catch up on everything. Thanks for your patience.

@jeremyhaile
Copy link

jeremyhaile commented Aug 30, 2018

Here is a modified reproducible example that is closer to the issue we're having. It uses HTML and uses css instead of xpath to select (which under the hood are the same). Also the .each from the original example is unnecessary.

On a 2GB heap this will crash in a few seconds.

require 'nokogiri'

items = %Q(
<div class="test">
  <a href="something">derp</a>
</div>
) * 5000

html = %Q(
<html>
  <body>
    #{items}
  </body>
</html>
)
doc = Nokogiri::HTML html
puts "pid is #{$$}"
links = doc.css('html body .test a')
puts "Without nesting got #{links.length} links" # This will return immediately

links = doc.css('html body .test').css('a')
puts "With nesting got #{links.length} links" # This will crash a 2GB heap and never output

@jeremyhaile
Copy link

@flavorjones we are open to trying to fix this ourselves, but are unfamiliar with the source. Do you have any idea where to even start looking here?

I would really appreciate any help you can provide as this is causing our production server to crash on a regular basis. We've tried to workaround this issue by never doing nested xpath/css, but perhaps we missed something or a third-party lib (e.g. readability) is doing it because it keeps running out of memory. (even with a 12GB heap)

The heap dump shows nokogiri objects filling up the heap:
image

@kares
Copy link
Contributor

kares commented Sep 1, 2018

also noticed increased memory usage under JRuby from ~ 1.6 - thought maybe some of my changes are to blame but there never was a production leak even with heavy xpath use on 1-2M xmls.

@jeremyhaile you simply need to try understanding the xalan library used under the hood and why its keeping the internal state around, maybe it's xpath cache related. just some minor hints - its definitely resolvable if enough quality time is put in (no clear guess really as some internal pieces get tricky).

@jeremyhaile
Copy link

I noticed that the class that is taking up all of the heap is actually in the nokogiri source, despite being in the org.apache namespace: https://github.com/sparklemotion/nokogiri/blob/master/ext/java/org/apache/xml/dtm/ref/dom2dtm/DOM2DTMExt.java

Is this class overriding behavior from Xalan?

@jeremyhaile
Copy link

@kares I noticed you wrote a lot of the code in XalanDTMManagerPatch and DOM2DTMExt. Do you have any ideas on where this might be occurring? I notice that the getDTM method was overridden to return Dom2DTMExt objects, and those are then added via addDTM. I'm not sure whether there should be less DTM nodes or whether it's a problem within the DTM nodes that causes it to retain too many objects. Would really appreciate some help!

jvshahid added a commit that referenced this issue Sep 5, 2018
Looks like we replaced DOM2DTM with DOM2DTMExt when we fixed
#1320 but forgot to replace it in
the DOM2DTM manager

fixes #1749
@jvshahid
Copy link
Member

jvshahid commented Sep 5, 2018

@jeremyhaile can you try the branch in #1792 and let me know if it fixes your issue

@kares
Copy link
Contributor

kares commented Sep 5, 2018

@jeremyhaile no ideas atm - would need to dive deep on this one to really understand what's going on
hopefully John's fix does resolve it - seems like there's some left-over to be handled ... if not get in touch.

@jeremyhaile
Copy link

jeremyhaile commented Sep 5, 2018

@kares The branch from @jvshahid fixes the memory issue. However, as I outlined along with a reproducible test case – there is still a huge performance penalty incurred by nested xpath queries, and the penalty seems to grow exponentially based on the number of elements being searched.

Here is my relevant comment on the PR:
#1792 (comment)

flavorjones pushed a commit that referenced this issue Dec 1, 2018
Looks like we replaced DOM2DTM with DOM2DTMExt when we fixed
#1320 but forgot to replace it in
the DOM2DTM manager

fixes #1749
flavorjones pushed a commit that referenced this issue Dec 1, 2018
Looks like we replaced DOM2DTM with DOM2DTMExt when we fixed
#1320 but forgot to replace it in
the DOM2DTM manager

fixes #1749
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants