
Improve each_row_streaming memory usage #436

Closed

Conversation

will89
Contributor

will89 commented Jul 14, 2018

I ran into issue #179 while working with an Excel file that had 180k rows, where the XML for the sheet file was about 386MB on disk. The discussion on that issue narrowed the problem down to doc.xpath forcibly loading the whole XML document into memory. This PR changes extract_hyperlinks to use Roo::Utils.each_element so the XML file is streamed instead.
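For reference, the streaming idea can be sketched with Ruby's stdlib REXML stream listener. This is a simplified stand-in for the Nokogiri::XML::Reader loop that Roo::Utils.each_element actually uses, and the element/attribute names below are simplified, not the exact xlsx schema:

```ruby
require "rexml/parsers/streamparser"
require "rexml/streamlistener"

# Illustrative sketch (not Roo's actual code): a stream listener that
# collects hyperlink ref/id pairs without building a full DOM tree,
# so memory stays proportional to the collected links, not the file size.
class HyperlinkListener
  include REXML::StreamListener
  attr_reader :links

  def initialize
    @links = {}
  end

  # Called once per start tag; only <hyperlink> elements are inspected.
  def tag_start(name, attrs)
    @links[attrs["ref"]] = attrs["id"] if name == "hyperlink"
  end
end

xml = <<~XML
  <worksheet>
    <hyperlinks>
      <hyperlink ref="A1" id="rId1"/>
      <hyperlink ref="B2" id="rId2"/>
    </hyperlinks>
  </worksheet>
XML

listener = HyperlinkListener.new
REXML::Parsers::StreamParser.new(xml, listener).parse
p listener.links
```

By contrast, doc.xpath first parses the entire sheet XML into a Nokogiri document before the query runs, which is where the memory spike comes from.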

@coveralls

coveralls commented Jul 14, 2018

Coverage Status

Coverage increased (+0.02%) to 94.045% when pulling 903dc65 on will89:issue/extract-hyperlinks-mem-consumption into 782420b on roo-rb:master.

@chopraanmol1
Member

@will89 thank you for opening this PR. Can you provide memory-profiling and benchmark scripts along with a sample file containing masked data?

@will89
Contributor Author

will89 commented Sep 11, 2018

@chopraanmol1 Thanks for taking a look at this! I will work on providing a benchmark script with a sample file. Is there a preferred way to generate the memory profile?

@chopraanmol1
Member

@will89 you can use the memory_profiler gem

chopraanmol1 self-requested a review September 12, 2018 04:51
@tgturner
Contributor

@will89 is it possible for you to provide an Excel spreadsheet/case that demonstrates the benefit? I ran the following test case with a spreadsheet I had on hand, and these were the results:

# frozen_string_literal: true

begin
  require "bundler/inline"
rescue LoadError => e
  $stderr.puts "Bundler version 1.10 or later is required. Please update your Bundler"
  raise e
end

gemfile(true) do
  source "https://rubygems.org"

  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  gem "roo", path: "/Users/gturner/development/roo"
  gem "minitest"
  gem "memory_profiler"
end

require "roo"
require "minitest/autorun"
require "memory_profiler"

class BugTest < Minitest::Test
  def test_stuff
    # add your test here
    report = MemoryProfiler.report do
      sheet = Roo::Spreadsheet.open("/Users/gturner/Downloads/abm_codes.xlsx")
      sheet.each_row_streaming do |row|
        row[1]
      end
    end
    report.pretty_print
  end
end
# Master

Total allocated: 570191 bytes (4982 objects)
Total retained:  30925 bytes (569 objects)

allocated memory by gem
-----------------------------------
    264131  roo/lib
    161445  rubyzip-1.2.2
    143182  nokogiri-1.8.4
      1433  lib

# With PR

Total allocated: 577552 bytes (5492 objects)
Total retained:  30266 bytes (564 objects)

allocated memory by gem
-----------------------------------
    276874  roo/lib
    159463  rubyzip-1.2.2
    139782  nokogiri-1.8.4
      1433  lib

As you can see, no gain in allocated memory, and very little in retained.

@will89
Contributor Author

will89 commented Sep 15, 2018

Yeah, I'm trying to generate one in my free time. The only file I have immediately on hand contains customer data. @tgturner

@will89
Contributor Author

will89 commented Sep 16, 2018

I believe this file reproduces the issue: foo.xlsx. @tgturner

@will89
Contributor Author

will89 commented Sep 16, 2018

I modified the test a bit. I think MemoryProfiler might have trouble completing, since my process memory grew completely out of control with it enabled.

I used the code snippet from https://stackoverflow.com/questions/7220896/get-current-ruby-process-memory-usage to print memory usage at certain points, and I modified Roo::Excelx::SheetDoc#hyperlinks to print the memory usage before and after calling extract_hyperlinks.

require "roo"
require "minitest/autorun"
require "memory_profiler"

class BugTest < Minitest::Test
  def test_stuff
    Roo::Excelx::SheetDoc.class_eval do
      def hyperlinks(relationships)
        @hyperlinks ||= begin
          pid, size = `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i)
          puts "Before Extract #{pid} - #{size}"
          res = extract_hyperlinks(relationships)
          pid, size = `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i)
          puts "After Extract #{pid} - #{size}"
          res
        end
      end
    end
    # add your test here
    sheet = Roo::Spreadsheet.open("foo.xlsx")
    sheet.each_row_streaming do |row|
      row[1]
      break
    end
    pid, size = `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i)
    puts "#{pid} - #{size}"
  end
end

This is what I got when running with the file above.

# Master
Before Extract 28164 - 279,744
After Extract 28164 - 2,883,688
28164 - 2,888,068

# With PR
Before Extract 27936 - 177,476
After Extract 27936 - 274,868
27936 - 274,948

Oddly, other very large Excel files did not trigger this memory issue, so I'm not sure how to fully explain it. Still, it seems that using xpath on a potentially large XML file is probably inadvisable.

@chopraanmol1
Member

@will89 The results do look good, but I'm not sure xpath is causing the issue. Could you test another variant of extract_hyperlinks that uses xpath just like master, but does not create an intermediate array while building the Hash?

So instead of the Hash[hyperlinks.map { ... [key, value] }] pattern, use the hyperlinks.each { ... hsh[key] = value } pattern (as used in your patch).

Comparing the results of all 3 variants would tell us whether or not xpath is the problem.
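The two Hash-building patterns being compared can be illustrated with throwaway data (the values below are hypothetical, not Roo's actual code):

```ruby
# Hypothetical data standing in for the parsed <hyperlink> nodes.
pairs = [["A1", "http://a.example"], ["B2", "http://b.example"]]

# Variant used on master: map allocates one two-element array per entry
# plus the outer array, all before Hash[] ever runs.
via_map = Hash[pairs.map { |ref, url| [ref, url] }]

# Variant suggested above: accumulate into the Hash directly,
# skipping the intermediate array of pairs.
via_each = {}
pairs.each { |ref, url| via_each[ref] = url }

via_map == via_each # both build the same Hash; only the allocations differ
```

Both variants produce an identical Hash, so any memory difference between them isolates the cost of the intermediate arrays rather than of xpath itself.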

@will89
Contributor Author

will89 commented Sep 16, 2018

Modified extract_hyperlinks to

    def extract_hyperlinks(relationships)
      return {} unless (hyperlinks = doc.xpath('/worksheet/hyperlinks/hyperlink'))
      result = {}

      hyperlinks.each do |hyperlink|
        if hyperlink.attribute('id') && (relationship = relationships[hyperlink.attribute('id').text])
          result[::Roo::Utils.ref_to_key(hyperlink.attributes['ref'].to_s)] = relationship.attribute('Target').text
        end
      end

      result
    end

Got

Before Extract 64671 - 176,592
After Extract 64671 - 3,535,648
64671 - 3,536,064

@chopraanmol1
Member

@will89 Thank you for posting the memory results. I've also verified this and observed similar results, but the new implementation is terribly slow (about 3x slower). I personally do like this solution, but I can't merge it in its current state given the performance slowdown.

I've also reviewed the sample xlsx file you provided. If it is representative of the file you're actually using (no hyperlinks in the file, or your application doesn't care about hyperlinks), then I'd suggest using the no_hyperlinks option for now:

Roo::Spreadsheet.open("foo.xlsx", no_hyperlinks: true)

I'm not closing this PR, just in case you want to improve its performance as well. I'll also look into improving the performance of the Roo::Utils.each_element implementation.

@will89
Contributor Author

will89 commented Sep 17, 2018

Thanks for pointing out that option! Is there more info on what it does? Is there an easy way to determine whether I rely on hyperlinks in my application (e.g. Roo methods that use them)?

Regarding this branch: is the whole branch 3x slower, or just extract_hyperlinks? I just merged the latest master, which I see has some new performance enhancements. Can you share the benchmarks?

@chopraanmol1
Member

chopraanmol1 commented Sep 17, 2018

Sample xlsx file which includes hyperlinks: https://github.com/roo-rb/roo/blob/master/test/files/link.xlsx
You can check that file to see an example of a value with a hyperlink.

When Roo finds a hyperlinked cell, it wraps the cell's value in Roo::Link, which is a subclass of String.

The Roo::Link class implements 3 methods: url, href, and to_uri. Check your application for usage of these 3 methods on cell values, or for checks that a value is a Roo::Link; if there are none, it is safe to use the no_hyperlinks option.
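To illustrate the String-subclass behavior described above, here is a minimal stand-in written for this sketch (FakeLink is not the real Roo::Link class, which also implements url and to_uri):

```ruby
# A String subclass lets a hyperlinked cell behave like a plain value
# everywhere while still carrying its URL for code that asks for it.
class FakeLink < String
  attr_reader :href

  def initialize(value, href)
    super(value)    # the cell's displayed text is the String itself
    @href = href    # the hyperlink target rides along
  end
end

cell = FakeLink.new("Docs", "https://example.com")
cell.is_a?(String)  # true, so ordinary string code keeps working
cell.href           # "https://example.com"
```

This is why no_hyperlinks is safe when nothing in your application ever calls the link accessors or checks a value's class: with the option set, cells simply come back as plain Strings.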

Benchmark Script:

require "roo"
require "minitest/autorun"
require "benchmark"

class BugTest < Minitest::Test
  def test_stuff
    Roo::Excelx::SheetDoc.class_eval do
      def hyperlinks(relationships)
        @hyperlinks ||= begin
          res = nil
          puts Benchmark.measure{
            res = extract_hyperlinks(relationships)
          }
          res
        end
      end
    end
    # add your test here
    sheet = Roo::Spreadsheet.open("foo.xlsx")
    sheet.each_row_streaming do |row|
      row[1]
      break
    end
    pid, size = `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i)
    puts "#{pid} - #{size}"
  end
end

@will89
Contributor Author

will89 commented Sep 17, 2018

I think you were right that there were performance issues in Roo::Utils.each_element. The way I called it was generating a new array for each node element in the tree. I modified it in 903dc65 to pull the array generation outside the loop. Could you please try the benchmark with the latest version of this branch?

@chopraanmol1
Member

The approximate benchmark result for extract_hyperlinks when I originally tested was ~14 sec for master and ~42 sec for your original implementation. Accounting for noise, your original implementation is 2.8-3.0x slower than master.

When I was benchmarking this I also tried tweaking Roo::Utils.each_element a bit, including wrapping elements in an array outside the block. Even though I don't remember the exact result, it was still too slow.

@will89
Contributor Author

will89 commented Sep 17, 2018

Interesting, your computer must be significantly better than mine :)
Running that test against the file I linked above I got the following:

gem "roo", github: 'will89/roo', branch: 'issue/extract-hyperlinks-mem-consumption'

 50.088052   4.120130  54.208182 ( 59.366028)
10988 - 287764

Finished in 235.568521s, 0.0042 runs/s, 0.0000 assertions/s.

gem "roo", github: 'roo-rb/roo', branch: 'master'

 28.649450  12.480391  41.129841 ( 45.337095)
11292 - 3446960

Finished in 220.953135s, 0.0045 runs/s, 0.0000 assertions/s.

When you tested this branch, did it include commit 5283e01? I think that commit brought in a lot of the performance increases you added to this gem recently; thanks for doing that!

@will89
Contributor Author

will89 commented Sep 17, 2018

Ah, I see now. I was looking at the wrong times: I was comparing the "Finished in" totals from the test case instead of the Benchmark.measure output. 45s on master vs 59s on this branch.

@will89
Contributor Author

will89 commented Sep 19, 2018

Do you know whether using Nokogiri's SAX parser interface, https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SAX/Parser, to visit the XML would be faster than the Nokogiri::XML::Reader that https://github.com/roo-rb/roo/blob/master/lib/roo/utils.rb#L87 uses?

@chopraanmol1
Member

I've not looked into it yet. If you have benchmark and memory comparisons for both approaches, that would be a great place to start.

@will89
Contributor Author

will89 commented Oct 12, 2018

Sorry for leaving this neglected for so long.

I have been unable to make each_element fast enough for this implementation of extract_hyperlinks to avoid being a performance regression. However, I think the no_hyperlinks option is probably a good enough workaround for the memory problem.

Is the modification to lib/roo/utils.rb worth keeping? Otherwise I'm happy to close this PR. Thoughts, @chopraanmol1?

@chopraanmol1
Member

@will89 let's close this PR for now. Since there is still scope for improvement, you can open a new PR focused on improving each_row_streaming's memory allocation and performance. A few of the things you could do in that PR:

  1. The modification you made to lib/roo/utils.rb in this PR.
  2. Lazily create the array
    excelx_type = [:numeric_or_formula, format.to_s]
    only when required (maybe create a method named numeric_or_formula_or(format) and call it a bit later, when required). You could also memoize the frozen array (assuming format values are repetitive).
  3. Profile and find more such optimization opportunities.

If all of the above together make a good performance and memory-usage improvement, we could go ahead with it.

I'd suggest benchmarking and profiling the each_row_streaming method with the no_hyperlinks option.
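Point 2 could be sketched like this (numeric_or_formula_or is the hypothetical method name from the comment above, and TypeCache is invented for this sketch; neither exists in Roo):

```ruby
# Memoize one frozen [:numeric_or_formula, format] array per distinct
# format string, instead of allocating a fresh array for every cell.
class TypeCache
  def initialize
    @cache = {}
  end

  # ||= returns the cached frozen array on every repeat lookup,
  # so repeated formats cost one allocation total.
  def numeric_or_formula_or(format)
    @cache[format] ||= [:numeric_or_formula, format.to_s].freeze
  end
end

cache = TypeCache.new
a = cache.numeric_or_formula_or("0.00")
b = cache.numeric_or_formula_or("0.00")
a.equal?(b) # same frozen object both times
```

Since spreadsheets tend to reuse a handful of number formats across many thousands of cells, this trades a tiny Hash for a per-cell array allocation.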

@chopraanmol1
Member

@will89 let me know whether you're interested (or not) in opening another PR.

@will89
Contributor Author

will89 commented Oct 12, 2018

@chopraanmol1 Yeah, I'll try to make a follow on PR to address these things.

chopraanmol1 added a commit to chopraanmol1/roo that referenced this pull request Jan 22, 2019
Revisiting roo-rb#436 .

This Patch uses relationships data to determine if a sheet includes hyperlinks or not.

As extract_hyperlinks loads the whole document in memory it is quite problematic for each_row_streaming. This patch tries to skip extract_hyperlinks when not required.
@chopraanmol1
Member

@will89 I've revisited one of the issues raised here in #488.

With the linked patch you will no longer be required to set the no_hyperlinks option.

chopraanmol1 added a commit to chopraanmol1/roo that referenced this pull request Feb 1, 2019
aravindm pushed a commit to chobiwa/roo that referenced this pull request Jun 18, 2019