
Fixed performance bug in large directories #426

Merged
merged 1 commit into from
Aug 28, 2021

Conversation

jpivarski
Member

After the release, I made a first performance test of writing histograms. Here's the script:

scaling-histograms-uproot.py

import time

import numpy as np
import uproot
import hist

with uproot.recreate("output.root", compression=None) as file:
    for nbins in 5, 5000, 10, 2000, 20, 1000, 50, 500, 100, 200:
        starttime = time.time()

        for i in range(1000):
            h = hist.Hist.new.Reg(nbins, -5, 5).Weight()
            file["h_%05d_%05d" % (nbins, i)] = h

        print(f"{nbins = }, time = {(time.time() - starttime)}")

The performance was terrible.

% python scaling-histograms-uproot.py 
nbins = 5, time = 2.9465341567993164
nbins = 5000, time = 7.75829005241394
nbins = 10, time = 12.314059019088745
nbins = 2000, time = 17.434439659118652
nbins = 20, time = 22.311167001724243
nbins = 1000, time = 27.74405813217163
nbins = 50, time = 33.42725396156311
nbins = 500, time = 38.9810357093811
nbins = 100, time = 45.44138026237488
nbins = 200, time = 51.959704875946045
% rm output.root 
rm: remove regular file 'output.root'? y

Notice that the number of bins in each test is not monotonically increasing, but the time to complete is. I thought there might have been something wrong in the cascade-writing (in which a change in state invokes a walk over changed tree nodes, writing the changes exactly once), but that wasn't it. It was purely algorithmic: the DirectoryData maintained a single list of keys, every search iterated through all of them, and adding one key meant serializing the whole DirectoryData, which grows with each addition. Each of these operations is individually O(n), so doing them n times, as in this test, makes the total running time O(n²). Classic algorithms stuff!
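The O(n²) pattern can be sketched in a few lines (with hypothetical class and method names, not Uproot's actual internals): each add does an O(n) duplicate check over a flat list and then reserializes every key.

```python
# Minimal sketch of the quadratic pattern described above.
# NaiveDirectoryData is a made-up name, not Uproot's real class.

class NaiveDirectoryData:
    def __init__(self):
        self.keys = []                      # flat list of (name, payload)

    def add(self, name, payload):
        for existing, _ in self.keys:       # O(n) linear search
            if existing == name:
                raise KeyError(name)
        self.keys.append((name, payload))
        return self.serialize_all()         # O(n) rewrite of every key

    def serialize_all(self):
        return b"".join(p for _, p in self.keys)

directory = NaiveDirectoryData()
total_bytes = 0
for i in range(1000):
    total_bytes += len(directory.add(f"h_{i:05d}", b"x" * 8))

# n adds, each doing O(n) work: with 1000 keys of 8 bytes, the
# cumulative serialized output is 8 * (1 + 2 + ... + 1000) bytes.
print(total_bytes)  # 8 * 1000 * 1001 // 2 == 4004000
```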

So I replaced the list with data structures optimized for searching (O(1) per search, rather than O(n)) and I replaced the "all keys" serialization with "only new keys" when keys have only been appended at the end. Other cases (replacing or deleting a key, which are rarer) still rewrite the whole set of keys. Although I'm optimizing the code for a specific performance test, "write a bunch of histograms into a directory" is a very common use case, and the extra complexity in the DirectoryData class is worth it.
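The shape of the fix can be sketched like this (again with hypothetical names, not Uproot's actual implementation): a dict gives O(1) membership checks, and if keys have only ever been appended, serialization emits just the new ones, falling back to a full rewrite on replace/delete.

```python
# Minimal sketch of the fix: O(1) key lookup plus append-only
# serialization. IndexedDirectoryData is a made-up name.

class IndexedDirectoryData:
    def __init__(self):
        self.keys = []            # ordered (name, payload) pairs
        self.index = {}           # name -> position, for O(1) lookup
        self.num_written = 0      # keys already serialized to disk

    def add(self, name, payload):
        if name in self.index:    # O(1) instead of scanning the list
            raise KeyError(name)
        self.index[name] = len(self.keys)
        self.keys.append((name, payload))

    def serialize(self):
        # Append-only case: emit only keys added since the last call.
        new = self.keys[self.num_written:]
        self.num_written = len(self.keys)
        return b"".join(p for _, p in new)

    def replace(self, name, payload):
        # Rarer path: mutate in place and force a full rewrite next time.
        self.keys[self.index[name]] = (name, payload)
        self.num_written = 0

directory = IndexedDirectoryData()
total_bytes = 0
for i in range(1000):
    directory.add(f"h_{i:05d}", b"x" * 8)
    total_bytes += len(directory.serialize())

# Each 8-byte key is serialized exactly once: O(n) total, not O(n^2).
print(total_bytes)  # 8 * 1000 == 8000
```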

Particularly because these are the updated performance results with that one fix:

% python scaling-histograms-uproot.py 
nbins = 5, time = 0.6220302581787109
nbins = 5000, time = 0.8085470199584961
nbins = 10, time = 0.6134929656982422
nbins = 2000, time = 0.6801645755767822
nbins = 20, time = 0.6404979228973389
nbins = 1000, time = 0.6516284942626953
nbins = 50, time = 0.6299173831939697
nbins = 500, time = 0.6578629016876221
nbins = 100, time = 0.6180548667907715
nbins = 200, time = 0.6521506309509277
% rm output.root 
rm: remove regular file 'output.root'? y

In the end, after having written 10000 histograms, it's 80× faster, and that factor would continue to grow with the number of histograms written.

It also compares favorably with PyROOT:

scaling-histograms-pyroot.py

import time

import ROOT

file = ROOT.TFile("output.root", "RECREATE")
file.SetCompressionLevel(0)

for nbins in 5, 5000, 10, 2000, 20, 1000, 50, 500, 100, 200:
    starttime = time.time()

    for i in range(1000):
        h = ROOT.TH1D("h_%05d_%05d" % (nbins, i), "", nbins, -5, 5)
        h.Sumw2()
        file.Write()

    print(f"{nbins = }, time = {(time.time() - starttime)}")

% python scaling-histograms-pyroot.py 
nbins = 5, time = 4.220697402954102
nbins = 5000, time = 5.634121656417847
nbins = 10, time = 5.7256810665130615
nbins = 2000, time = 6.573720455169678
nbins = 20, time = 6.748169183731079
nbins = 1000, time = 6.8290839195251465
nbins = 50, time = 7.278932809829712
nbins = 500, time = 6.5945963859558105
nbins = 100, time = 6.932183027267456
nbins = 200, time = 7.021673679351807
% rm output.root 
rm: remove regular file 'output.root'? y

I don't know what's going on in PyROOT, since most of the work should be happening on the C++ side. This also grows with time, rather than with number of bins, so there might be some similar story: a registry that gets filled up with objects to manage or something.

Anyway, the C++ code is much faster, so most of this overhead must be in the Python-C++ bindings:

scaling-histograms.cpp

#include <stdio.h>
#include <iostream>
#include <chrono>

#include "TFile.h"
#include "TH1D.h"

int main(int argc, char** argv) {
  auto starttime = std::chrono::high_resolution_clock::now();

  auto file = new TFile("output.root", "RECREATE");
  file->SetCompressionLevel(0);

  int nbins_array[10] = {5, 5000, 10, 2000, 20, 1000, 50, 500, 100, 200};
  char name[14];

  for (int j = 0;  j < 10;  j++) {
    int nbins = nbins_array[j];

    for (int i = 0;  i < 1000;  i++) {
      sprintf(name, "h_%05d_%05d", nbins, i);
      auto h = new TH1D(name, "", nbins, -5, 5);
      h->Sumw2();
      h->Write();
    }

    auto stoptime = std::chrono::high_resolution_clock::now();

    std::cout << "nbins = " << nbins << ", "
              << "time = "
              << std::chrono::duration_cast<std::chrono::microseconds>(
                   stoptime - starttime
                 ).count() / 1e6
              << std::endl;
  }

  file->Close();
}
% g++ -O3 scaling-histograms.cpp `root-config --cflags --libs` -o scaling-histograms
% ./scaling-histograms 
nbins = 5, time = 0.170805
nbins = 5000, time = 0.301064
nbins = 10, time = 0.312349
nbins = 2000, time = 0.361317
nbins = 20, time = 0.377874
nbins = 1000, time = 0.411764
nbins = 50, time = 0.425719
nbins = 500, time = 0.444574
nbins = 100, time = 0.456827
nbins = 200, time = 0.471508
% rm output.root 
rm: remove regular file 'output.root'? y

There is a time-dependent increase in C++ as well, which might be a global registry of histogram objects, but the effect in C++ is much smaller than in PyROOT, so something significant is happening in the bindings.

Admittedly, I got to change my code to fit this performance test, so on that level the comparison isn't fair. But it also shows that the fastest possible time is probably around 0.17 seconds per 1000 histograms, since that's ROOT's writing rate before the O(n²) effect kicks in, and I'm just happy that Uproot's 0.65 seconds is reasonably close to that.

…aced all-keys rewriting with just-new-keys rewriting.
@jpivarski jpivarski enabled auto-merge (squash) August 28, 2021 01:32
@jpivarski jpivarski merged commit faefa30 into main Aug 28, 2021
@jpivarski jpivarski deleted the jpivarski/performance-bug-in-large-directories branch August 28, 2021 01:51
kratsg pushed a commit to scikit-hep/pyhf that referenced this pull request Sep 14, 2021
* Add large speedup of workspace conversion from JSON to XML + ROOT with uproot v4.1.1 to pyhf v0.6.3 release notes
   - c.f. scikit-hep/uproot5#426