-
Just an update: we can now create ROOT files with nested subdirectories (and no other kinds of objects!). Most importantly, ROOT can open these files and put objects in the subdirectories Uproot made, with Uproot → ROOT → Uproot round trips and no corruption. I'm fairly confident that the seek positions and map of free segments are being updated correctly because even adding a subdirectory requires significant movement of data as directory key-lists outgrow their original allocation. Also, ROOT has thankfully been verbose about any mistakes, even recoverable ones.

PR #322 includes a sample test.

Next, I'll work on defining streamers, which will make it possible to store actual objects (starting with …
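To make the "map of free segments" idea concrete, here is a minimal sketch of what happens when a key-list outgrows its allocation: a larger region is claimed and the old region is handed back. This is a toy model, not Uproot's or ROOT's implementation; the names `allocate` and `release`, the first-fit policy, and the half-open `(start, stop)` tuples are all assumptions made for illustration.

```python
def allocate(free, nbytes):
    """First-fit: claim nbytes from the first free segment that is big enough."""
    for i, (start, stop) in enumerate(free):
        if stop - start >= nbytes:
            if stop - start == nbytes:
                del free[i]                        # segment is used up entirely
            else:
                free[i] = (start + nbytes, stop)   # shrink it from the front
            return start
    raise MemoryError("no free segment is large enough")


def release(free, start, nbytes):
    """Give a byte range back to the free list, merging touching segments."""
    free.append((start, start + nbytes))
    free.sort()
    merged = [free[0]]
    for seg_start, seg_stop in free[1:]:
        last_start, last_stop = merged[-1]
        if seg_start <= last_stop:
            merged[-1] = (last_start, max(last_stop, seg_stop))
        else:
            merged.append((seg_start, seg_stop))
    free[:] = merged


# A hypothetical key-list of 50 bytes grows to 80 bytes and must move.
free = [(100, 1000)]
old_position = allocate(free, 50)   # original allocation
new_position = allocate(free, 80)   # larger allocation for the grown key-list
release(free, old_position, 50)     # the old location becomes a free segment
```

After the move, the free list has a hole where the key-list used to be, which is why every pointer to the old seek position has to be updated as well.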
-
As of PR #329, updating files can go full circle: files made by ROOT can be updated by Uproot, the result of which can be updated by ROOT, etc. Here, ROOT made the initial file containing a string, Uproot added a subdirectory, ROOT put another string inside that subdirectory, and Uproot-reading shows that everything's there in the end.
-
It's very interesting to see such low-level interaction with the ROOT format. Thanks for sharing this!
-
Just to keep this thread updated, I managed to get back to Uproot file-writing and added TStreamerInfo handling in PR #341. It is still the case that you can't write any objects to the file (other than subdirectories of subdirectories), but TStreamerInfo is a prerequisite for objects because a type's serialization must be described for it to be readable. As usual, everything is being tested in "volleys": a file is updated by ROOT, then by Uproot, then by ROOT again, etc., to ensure that Uproot is giving ROOT data that it can not only understand, but manipulate.
-
Today's update: Uproot can now copy objects from one ROOT file into another, taking all the TStreamerInfos that would be necessary to interpret those objects. This was the first step because it's a literal byte-for-byte copy of the original data, without interpretation (other than getting its classname to make a dependency tree of all the TStreamerInfos, but the classname is in the TKey, so the object data do not need to be interpreted). In ROOT-speak, this is a "fast copy," because it doesn't require any decompression and recompression.

The destination_file["name"] = object_to_write syntax of Uproot 3 will be reimplemented, but it requires knowledge of how to serialize the …

We'll probably want something similar for the categorical-axis histograms in Coffea and Boost-Histogram/Hist, so that they can be translated into a directory of conventional histograms without reallocating the directory as it grows in size.
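As an illustration of the "dependency tree of all the TStreamerInfos" step, the sketch below walks from one classname to everything it transitively depends on. It is not Uproot's code: the function name `needed_streamers` and the `depends_on` table are hypothetical, and the table is a simplified, hand-written subset of ROOT's class hierarchy.

```python
def needed_streamers(classname, depends_on):
    """Collect classname plus every class it transitively depends on."""
    needed, stack = [], [classname]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.append(name)
            stack.extend(depends_on.get(name, []))
    return needed


# Hypothetical, simplified dependency table for illustration only.
depends_on = {
    "TObjString": ["TObject"],
    "TH1F": ["TH1", "TArrayF"],
    "TH1": ["TNamed", "TAttLine", "TAttFill", "TAttMarker"],
    "TNamed": ["TObject"],
}

for_string = needed_streamers("TObjString", depends_on)
for_histogram = needed_streamers("TH1F", depends_on)
```

Only the classname is needed to start the walk, which is consistent with the point above that the object bytes themselves never have to be interpreted for a fast copy.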
-
Today's update: PR #349 adds the ability to write TObjStrings from scratch (they don't need to come from another file), assigning the TStreamerInfo appropriately. With this update, ROOT files can be used as persistent dicts of strings.

Next is histograms.
-
After 3 months of not being able to get to this, I managed to start working on ROOT file writing in Uproot again. PR #405 writes a histogram. It will have to be generalized beyond TH1F version 3, but now I have a method for filling in the missing serialization code.

import os

import pytest
import ROOT
import uproot


def test(tmp_path):
    original = os.path.join(tmp_path, "original.root")
    newfile = os.path.join(tmp_path, "newfile.root")

    # Make a histogram with ROOT (bins 0 and 9 are underflow and overflow).
    f1 = ROOT.TFile(original, "recreate")
    h1 = ROOT.TH1F("h1", "title", 8, -3.14, 2.71)
    h1.SetBinContent(0, 0.0)
    h1.SetBinContent(1, 1.1)
    h1.SetBinContent(2, 2.2)
    h1.SetBinContent(3, 3.3)
    h1.SetBinContent(4, 4.4)
    h1.SetBinContent(5, 5.5)
    h1.SetBinContent(6, 6.6)
    h1.SetBinContent(7, 7.7)
    h1.SetBinContent(8, 8.8)
    h1.SetBinContent(9, 9.9)
    h1.Write()
    f1.Close()

    # Read it with Uproot and write it into a new file.
    with uproot.open(original) as fin:
        h2 = fin["h1"]

        with uproot.recreate(newfile) as fout:
            fout["h1"] = h2

    # Check the new file with ROOT.
    f3 = ROOT.TFile(newfile)
    h3 = f3.Get("h1")
    assert h3.GetBinContent(0) == pytest.approx(0.0)
    assert h3.GetBinContent(1) == pytest.approx(1.1)
    assert h3.GetBinContent(2) == pytest.approx(2.2)
    assert h3.GetBinContent(3) == pytest.approx(3.3)
    assert h3.GetBinContent(4) == pytest.approx(4.4)
    assert h3.GetBinContent(5) == pytest.approx(5.5)
    assert h3.GetBinContent(6) == pytest.approx(6.6)
    assert h3.GetBinContent(7) == pytest.approx(7.7)
    assert h3.GetBinContent(8) == pytest.approx(8.8)
    assert h3.GetBinContent(9) == pytest.approx(9.9)
-
Moving along, PR #406 creates an empty TTree (with a TBranch, but no TBasket data). This has to follow a different code path from histograms and other "normal" objects because TTrees can create and manage other objects (TBaskets), and they have to be replaceable if the number of TBaskets exceeds the originally allocated number. (A TTree has to communicate with the TDirectory that contains it to replace the TTree metadata with larger metadata. The TBaskets themselves never get copied or moved.)

with uproot.recreate(newfile, compression=None) as fout:
    # internal interface; not wrapped up in a high-level function yet
    tree = fout._cascading.add_tree(
        fout._file.sink, "t1", "title", {"branch1": np.float64}
    )

f2 = ROOT.TFile(newfile)
t2 = f2.Get("t1")
assert t2.GetName() == "t1"
assert t2.GetTitle() == "title"
assert t2.GetBranch("branch1").GetName() == "branch1"
assert t2.GetBranch("branch1").GetTitle() == "branch1/D"
assert t2.GetBranch("branch1").GetLeaf("branch1").GetName() == "branch1"
assert t2.GetBranch("branch1").GetLeaf("branch1").GetTitle() == "branch1"
assert t2.GetLeaf("branch1").GetName() == "branch1"
assert t2.GetLeaf("branch1").GetTitle() == "branch1"

The TTree above is byte-for-byte identical with one made this way:

f1 = ROOT.TFile(original, "recreate")
f1.SetCompressionLevel(0)
t1 = ROOT.TTree("t1", "title")
d1 = array.array("d", [0.0])
d2 = array.array("d", [0.0])
t1.Branch("branch1", d1, "branch1/D")
t1.Write()
f1.Close()
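The "replaceable metadata" requirement can be pictured with a toy model: the tree metadata reserves a fixed number of basket slots, and when they run out, a larger metadata record replaces the old one while the baskets stay where they are. Everything here is hypothetical, including the class name `ToyTreeMetadata`, the initial capacity of 10, and the doubling policy; the thread does not say how much larger the replacement metadata is.

```python
class ToyTreeMetadata:
    """Hypothetical stand-in for TTree metadata with preallocated basket slots."""

    def __init__(self, capacity=10):
        self.capacity = capacity    # number of basket slots currently reserved
        self.basket_seeks = []      # where each basket lives in the file
        self.num_rewrites = 0       # how often the metadata had to be replaced

    def add_basket(self, seek):
        if len(self.basket_seeks) == self.capacity:
            # Out of slots: replace the metadata with a larger version.
            # Doubling is an assumption made for this sketch.
            self.capacity *= 2
            self.num_rewrites += 1
        # The basket itself is only recorded; it is never copied or moved.
        self.basket_seeks.append(seek)


metadata = ToyTreeMetadata(capacity=10)
for i in range(25):
    metadata.add_basket(seek=1000 + 100 * i)
```

With geometric growth like this, the number of metadata replacements grows only logarithmically with the number of baskets, which is one plausible reason to overallocate.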
-
Uproot file-writing is in a usable state!

As of PR #409 (which is now merged into main), Uproot's file-writing code can be considered alpha quality: ready for testers, albeit with no documentation, and you'd be the first to test it.

Fortunately for documentation, there's not much to explain. The interface for writing a TTree is

with uproot.recreate("filename.root") as output_file:
    output_file["tree_name"] = {"branch1": array1, "branch2": array2, "branch3": array3}

where …

Once you've created a TTree, you can add more data to it with

output_file["tree_name"].extend({"branch1": array1, "branch2": array2, "branch3": array3})

The set of branch names needs to be the same (dtypes will be converted if they don't match the first batch), and again, the lengths of all of these arrays must be equal. The file must still be open. Each call to …

The interface for writing a histogram is

output_file["hist_name"] = np.histogram(some_data)

where the right-hand side is some "recognized type," a type that can be converted into a histogram. The first such type is a 2-tuple of NumPy arrays, what …

Fancy stuff:
Temporary limitations:
Permanent limitations:
Probably permanent limitation:
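The two rules stated for extending a tree (same set of branch names, equal array lengths) are easy to check before handing data to a writer. The helper below is a hypothetical pre-flight check written for this illustration; it is not part of Uproot, and Uproot's own error messages will differ.

```python
def check_batch(expected_branches, batch):
    """Validate one batch against the two rules and return its entry count."""
    if set(batch) != set(expected_branches):
        raise ValueError(
            "branch names differ: expected {0}, got {1}".format(
                sorted(expected_branches), sorted(batch)
            )
        )
    lengths = {name: len(array) for name, array in batch.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("arrays have unequal lengths: {0}".format(lengths))
    return next(iter(lengths.values()), 0)


branches = ["branch1", "branch2"]
num_entries = check_batch(branches, {"branch1": [1, 2, 3], "branch2": [1.1, 2.2, 3.3]})
```

Running a check like this on every batch keeps a long writing loop from failing halfway through a file.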
-
Update from PR #412: I fixed some of the "temporary limitations" above, including usability feedback from @alexander-held (on https://gitter.im/Scikit-HEP/uproot):
Now just for showing off, behold:

>>> import pandas as pd
>>> df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1.1, 2.2, 3.3, 4.4, 5.5]})
>>> df
   x    y
0  1  1.1
1  2  2.2
2  3  3.3
3  4  4.4
4  5  5.5
>>>
>>> import uproot
>>> with uproot.recreate("output.root") as rootfile:
...     rootfile["tree"] = df
...
>>>
>>> import ROOT
>>> f = ROOT.TFile("output.root")
>>> t = f.Get("tree")
>>> t.Scan()
************************************************
* Row * index.ind * x.x * y.y *
************************************************
* 0 * 0 * 1 * 1.1 *
* 1 * 1 * 2 * 2.2 *
* 2 * 2 * 3 * 3.3 *
* 3 * 3 * 4 * 4.4 *
* 4 * 4 * 5 * 5.5 *
************************************************
5
>>> t.Print()
******************************************************************************
*Tree :tree : *
*Entries : 5 : Total = 2133 bytes File Size = 2144 *
* : : Tree compression factor = 1.00 *
******************************************************************************
*Br 0 :index : index/L *
*Entries : 5 : Total Size= 609 bytes File Size = 112 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 1 :x : x/L *
*Entries : 5 : Total Size= 589 bytes File Size = 108 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 2 :y : y/D *
*Entries : 5 : Total Size= 589 bytes File Size = 108 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*

(The DataFrame index becomes a TBranch named "index," unless overwritten by a DataFrame column named "index.") Thus, a ROOT file is a handy way to store Pandas DataFrames.
-
PR #414 adds jagged array writing.

Jagged arrays present a problem of interface. Although we have a choice between writing them as dynamic length arrays (e.g. …

What do we name these extra branches? How do we keep them from colliding? How do we avoid creating a lot of them?

For the first two problems, I added …

If you need different names or you need to avoid a name collision, you can pass your own lambda functions.

For the last problem of avoiding many of them, I made sure that we can write Awkward record arrays as well as numerical or jagged-numerical arrays. This is a direct extension of the …

>>> import uproot
>>> import awkward as ak
>>> one = ak.Array([[1, 2, 3], [], [4, 5]])
>>> two = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> three = ak.Array([[100], [200, 200], [300, 300, 300]])
>>> with uproot.recreate("output.root", compression=None) as rootfile:
...     rootfile["tree1"] = ak.zip({"one": one, "two": two})
...     rootfile["tree2"] = {"A": ak.zip({"one": one, "two": two}), "B": three}
...
>>> import ROOT
>>> f = ROOT.TFile("output.root")
>>> tree1 = f.Get("tree1")
>>> tree2 = f.Get("tree2")

The first TTree, …

>>> tree1.Print()
******************************************************************************
*Tree :tree1 : *
*Entries : 3 : Total = 2145 bytes File Size = 2161 *
* : : Tree compression factor = 1.00 *
******************************************************************************
*Br 0 :N : N/I *
*Entries : 3 : Total Size= 550 bytes File Size = 81 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 1 :one : one/L *
*Entries : 3 : Total Size= 685 bytes File Size = 131 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 2 :two : two/D *
*Entries : 3 : Total Size= 685 bytes File Size = 131 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
>>> [x.N for x in tree1]
[3, 0, 2]
>>> [list(x.one) for x in tree1]
[[1, 2, 3], [], [4, 5]]
>>> [list(x.two) for x in tree1]
[[1.1, 2.2, 3.3], [], [4.4, 5.5]]

The second TTree, …

>>> tree2.Print()
******************************************************************************
*Tree :tree2 : *
*Entries : 3 : Total = 3349 bytes File Size = 3395 *
* : : Tree compression factor = 1.00 *
******************************************************************************
*Br 0 :NA : NA/I *
*Entries : 3 : Total Size= 555 bytes File Size = 82 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 1 :A_one : A_one/L *
*Entries : 3 : Total Size= 697 bytes File Size = 133 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 2 :A_two : A_two/D *
*Entries : 3 : Total Size= 697 bytes File Size = 133 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 3 :NB : NB/I *
*Entries : 3 : Total Size= 555 bytes File Size = 82 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
*Br 4 :B : B/L *
*Entries : 3 : Total Size= 685 bytes File Size = 137 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
>>> [x.NA for x in tree2]
[3, 0, 2]
>>> [list(x.A_one) for x in tree2]
[[1, 2, 3], [], [4, 5]]
>>> [list(x.A_two) for x in tree2]
[[1.1, 2.2, 3.3], [], [4.4, 5.5]]
>>>
>>> [x.NB for x in tree2]
[1, 2, 3]
>>> [list(x.B) for x in tree2]
[[100], [200, 200], [300, 300, 300]]

I know that TTrees can have nested branches and this could be a good application of nested branches (i.e. distinguish them as …

Now that this is done, all that's left is
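The branch names in the two `Print()` listings above follow a simple pattern, which can be reproduced with a pair of naming functions. The sketch below is a reconstruction from those listings only: the lambdas `counter_name` and `field_name` and the helper `branch_names` are assumptions for illustration, not necessarily the names or defaults that Uproot uses.

```python
# Assumed naming rules, inferred from the listings: counters get an "N"
# prefix, and record fields are joined to their group name with "_".
counter_name = lambda counted: "N" + counted
field_name = lambda outer, inner: inner if outer == "" else outer + "_" + inner


def branch_names(groups):
    """Map {group name: [record fields]} to the branch names that get written.

    An empty field list means the group is a plain jagged array, not a record.
    """
    names = []
    for outer, inner_fields in groups.items():
        names.append(counter_name(outer))          # one counter per group
        if inner_fields:
            names.extend(field_name(outer, inner) for inner in inner_fields)
        else:
            names.append(outer)
    return names


tree1_names = branch_names({"": ["one", "two"]})
tree2_names = branch_names({"A": ["one", "two"], "B": []})
```

Swapping in different lambdas changes every generated name at once, which is how a collision with an existing branch name could be avoided.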
-
PR #416 adds the ability to compress data written to ROOT files: objects like histograms and also TTree data.

The compression setting can be specified when creating the file:

with uproot.recreate("filename.root", compression=uproot.ZLIB(5)) as file:
    ...

or it can be assigned to a file later:

file.compression = uproot.LZMA(9)

or it can be assigned to a tree (and therefore be different from the rest of the file):

tree = file.mktree("tree", {"branch1": numpy_dtype1, "branch2": awkward_type2})
tree.compression = uproot.LZ4(1)

or it can be set on a per-branch level:

tree["branch1"].compression = uproot.ZSTD(4)

or

tree.compression = {"branch1": uproot.ZSTD(4), "branch2": uproot.LZ4(1)}

Since WritableFile, WritableDirectory (which just passes it to its file), and WritableTree are all mutable, you can set the compression to one value, write some data, set it to another value, and write some more data. The compression that gets used is the configuration at the time of writing. Thus, it's even possible for different baskets of the same branch to be compressed differently, though I can't see how that's a good idea. As far as I can tell, it is allowed by ROOT, since ROOT uses a specialized header at the beginning of a TObject or TBasket to determine how to decompress some data; it doesn't seem to ever use the file-level or tree-level …
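The "configuration at the time of writing" behavior can be mimicked with the standard library alone. In this sketch, `ToyTree`, `CODECS`, and the `(codec, level)` tuples are hypothetical stand-ins rather than Uproot classes; the point is only that the setting is looked up when each basket is written, so two baskets of one branch can end up with different codecs.

```python
import lzma
import zlib

# Toy codecs keyed by name; the levels are passed straight to the stdlib.
CODECS = {
    "zlib": lambda data, level: zlib.compress(data, level),
    "lzma": lambda data, level: lzma.compress(data, preset=level),
}


class ToyTree:
    """Hypothetical stand-in for a writable tree, not Uproot's WritableTree."""

    def __init__(self, compression):
        self.compression = compression  # tree-wide (codec, level) setting
        self.per_branch = {}            # optional per-branch overrides
        self.baskets = []               # (branch, codec, compressed bytes)

    def write_basket(self, branch, payload):
        # The setting is looked up now, at write time, not at creation time.
        codec, level = self.per_branch.get(branch, self.compression)
        self.baskets.append((branch, codec, CODECS[codec](payload, level)))


tree = ToyTree(("zlib", 5))
tree.write_basket("branch1", b"a" * 1000)
tree.compression = ("lzma", 6)             # change the setting mid-stream
tree.write_basket("branch1", b"b" * 1000)  # same branch, different codec
tree.per_branch["branch2"] = ("zlib", 1)   # a per-branch override wins
tree.write_basket("branch2", b"c" * 1000)
```

Reading such data back only works if each compressed block identifies its own codec, which matches the observation above that the header at the start of each block, not a file-level setting, drives decompression.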
-
Hi @pcanal! PR #420 adds PyROOT-Uproot interoperability:

The reason for these caveats is that the data conversion goes through TMessages: they're serialized to bytes and back, though that all happens in memory (no disk). The main motivation for this is to be able to write PyROOT objects to the same files you're writing hist histograms as TH* or Awkward Arrays as TTrees (without having to reopen the file with PyROOT and do it the conventional way; this is a convenience).

with uproot.recreate("filename.root") as file:
    file["hist"] = hist_object
    file["tree"] = {"branch1": jagged_array, "branch2": numpy_array}
    file["workspace"] = roofit_workspace

(When PyROOT objects are being written to files, they do not need to be readable by Uproot because the raw bytes from the TMessage are directly copied into the file, albeit with compression if you ask for it. Also, all of the appropriate TStreamerInfo is included in the output file, which we get from PyROOT by making a temporary TMemFile, once per class type. If you have to write a thousand histograms, only one TMemFile will be made for the TStreamerInfo, but a thousand TMessages will be made.)

The only things left on my to-do list are interpreting hist objects and documentation. @henryiii and @amangoel185, we should talk about the hist objects; I'll ping you on Slack. With the documentation finished, this will be released as 4.1.0.
-
As of PR #421, it is now documented. See the last third of the Getting Started Guide.
-
As I accidentally stated here, thank you, @jpivarski, for putting this together. This was a really essential project for the interoperability of ROOT, and we're all better off for your effort.
-
As a last update (in this sprint, at least), I've released Uproot 4.1.1 (PyPI) with several writing-related performance bugs and one bug-bug fixed, all in PRs #426 and #428. The GitHub pages have some measurements with the code used to make them.

A rule of thumb that comes out of this is that you want to call TTree.extend with no less than about 100 kB per array branch. At that point, the Pythonic overhead for writing TBasket metadata is in the noise. This is also the scale at which reading data becomes efficient, but in writing, you get to control it.

@masonproffitt's question is buried in a thread above, but writing of …
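The 100 kB rule of thumb translates directly into a minimum number of entries per `extend` call for fixed-size types. The helper below is a hypothetical convenience for that arithmetic, assuming 1 kB = 1000 bytes and flat (non-jagged) branches; for jagged branches, an average number of bytes per entry would have to be estimated instead.

```python
import math


def min_entries_per_extend(bytes_per_entry, target_bytes=100_000):
    """Smallest entry count that reaches target_bytes for one flat branch."""
    return math.ceil(target_bytes / bytes_per_entry)


entries_for_float64 = min_entries_per_extend(8)   # 8 bytes per float64
entries_for_float32 = min_entries_per_extend(4)   # 4 bytes per float32
```

When several branches of different widths are written together, the narrowest branch sets the batch size, since every array in a batch must have the same length.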
-
Hi, I was wondering what the status of Uproot writing is w.r.t. inferring the right counters from the input tree. For example, here I'm reading …

import uproot

fname = "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/TT_TuneCUETP8M1_13TeV-powheg-pythia8/cmsopendata2015_ttbar_19980_PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1_00000_0000.root"
in_tree, out_file = uproot.open(fname + ":Events"), uproot.recreate("out.root")

def get_uproot_type(interpretation):
    # Flat branches expose to_dtype directly; jagged ones wrap it in .content.
    try:
        return interpretation.to_dtype
    except AttributeError:
        return interpretation.content.to_dtype

branch_dict = {name: get_uproot_type(branch.interpretation) for name, branch in in_tree.items() if name == "Muon_pt" or name == "nMuon"}
out_tree = out_file.mktree("Events", branch_dict)
iterator = in_tree.iterate(branch_dict.keys(), step_size="1 MB")
batch = next(iterator)
contents = {name: batch[name] for name in branch_dict.keys()}
out_tree.extend(contents)

yields …

Question 2 is whether there is a better, built-in way to convert the input branch types to dtypes that works for ~all branches; I realize my …

(Question 3 is whether I'm completely misusing Uproot here 😄)

EDIT: …
-
Each TBranch is independently read from a ROOT file, producing an Awkward Array with independent offsets for each output array (i.e. each field of the output array of records):

>>> import uproot
>>> import awkward as ak
>>> in_tree = uproot.open("Run2012B_DoubleMuParked.root")["Events"]
>>> muon_pt_eta = in_tree.arrays(["Muon_pt", "Muon_eta"])
>>> muon_pt_eta.type.show(type=True)
type: 1000000 * {
    Muon_pt: var * float32,
    Muon_eta: var * float32
}

The key thing is that the type has …

Writing is different. We can't make variable array type data without also making their counters because there are ROOT functions that use these counters. So, given a jagged array, Uproot has to make two TBranches, one for the entire jagged array (including offsets) and another for the counter. The uproot.WritableDirectory.mktree function has some handles to control what the counter TBranch will be named, and the default is to put …

This becomes problematic if you have a lot of arrays that should share the same counter TBranch, such as …

>>> muons = ak.zip({"pt": muon_pt_eta["Muon_pt"], "eta": muon_pt_eta["Muon_eta"]})
>>> muons.type.show()
1000000 * var * {
    pt: float32,
    eta: float32
}

Now there's only one …

This is an Awkward type that Uproot recognizes, and upon seeing it, Uproot will make all of the output TBranches share a single counter TBranch because it knows that it can do that. Now, though, it also needs to generate names for TBranches from fields of a record array, which is another uproot.WritableDirectory.mktree argument, and the default is to use …

>>> out_file = uproot.recreate("/tmp/output.root")
>>> out_file["out_tree"] = {"Muon": muons}
>>> out_file["out_tree"].show()
name                 | typename                 | interpretation
---------------------+--------------------------+-------------------------------
nMuon                | int32_t                  | AsDtype('>i4')
Muon_pt              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_eta             | float[]                  | AsJagged(AsDtype('>f4'))

This asymmetry between reading and writing exists because the data model of Awkward Arrays doesn't perfectly match the data model of TTree. To make them match, reading could get more aggressive about zipping TBranches that share a counter, but that would surprise long-time users of Uproot, who expect 1 TBranch → 1 array. (Also, that's a nice simplifying assumption.) Writing can't get less aggressive about unzipping records into TBranches with a shared counter because somehow the knowledge of which TBranches should share a counter has to be communicated. It could have been done with arguments to the function, but then if somebody had data like …

And anyway, data structures like …
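Whether two jagged arrays can share one counter comes down to whether their per-entry lengths agree. A minimal sketch with plain Python lists (the function names `counts` and `can_share_counter` are made up for this illustration, and real code would use Awkward Arrays rather than lists):

```python
def counts(jagged):
    """Per-entry lengths: the values a counter branch would hold."""
    return [len(sublist) for sublist in jagged]


def can_share_counter(*jagged_arrays):
    """True if every array has the same number of items in every entry."""
    first = counts(jagged_arrays[0])
    return all(counts(other) == first for other in jagged_arrays[1:])


one = [[1, 2, 3], [], [4, 5]]
two = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
three = [[100], [200, 200], [300, 300, 300]]

shared = can_share_counter(one, two)        # same lengths in every entry
not_shared = can_share_counter(one, three)  # lengths differ
```

Zipping arrays into records states this agreement up front, so a writer does not have to compare lengths to discover it.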
-
Hi @jpivarski, and everyone, thanks for updating on this! I was just wondering, is there an example of how to use …
-
This thread is about writing files with Uproot 4. I've included here everyone who has asked about this, talked about this, or provided feedback on it in Uproot 3. You can "unwatch" this thread if you're not interested.
@reikdas @pcanal @sbinet @oshadura @nsmith- @goi42 @eduardo-rodrigues @chrisburr @hassec @dnadeau-lanl @lindsaybolz @atasattari @lkorley @bfis @masonproffitt @jonr667 @xaratustrah @HDembinski @kratsg @adamdddave @bixel @andrzejnovak @BrutalPosedon @VukanJ @marinang @douglasdavis @kgizdov @matthewfeickert @beojan @Justin-Tan @burneyy @anerokhi @lukasheinrich @henryiii
PR #320 was a first step—it only adds one test file:
https://github.com/scikit-hep/uproot4/blob/main/tests/test_0320-start-working-on-ROOT-writing.py
which doesn't actually use Uproot; it reads and writes to a file with basic Python commands (struct.pack and struct.unpack). It moves the root directory of the file from one seek position to another such that the file contents are effectively the same, but it has to update pointers and the list of free segments to do so (for three different files that started with different numbers of free segments). To be thorough, I put "X" chars over the original data, to be sure it can't be read anymore.

Then the same file can not only be read by ROOT, it can be updated by ROOT and the output of that second operation is still valid. Thus, it's a full cycle: the original file was created by ROOT → updated by Python code → read back and updated again by ROOT → read back again by ROOT and Uproot. The original contents and the newly added contents (a TObjString) are both readable in the final state.
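The move-and-repoint technique described above can be shown in miniature with struct alone. The layout below is a toy and deliberately not the real ROOT header: a 4-byte magic string followed by a single big-endian 32-bit seek position. The names `HEADER` and `move_record` are hypothetical, and the sketch assumes the old and new locations do not overlap.

```python
import struct

# Hypothetical toy layout, NOT the real ROOT header: a 4-byte magic string
# followed by one big-endian 32-bit seek position that points at a record.
HEADER = struct.Struct(">4si")


def move_record(data, length, new_pos):
    """Move a length-byte record to new_pos and update the header pointer."""
    magic, old_pos = HEADER.unpack_from(data, 0)
    record = bytes(data[old_pos:old_pos + length])
    data[new_pos:new_pos + length] = record          # copy to the new location
    data[old_pos:old_pos + length] = b"X" * length   # clobber the old copy
    HEADER.pack_into(data, 0, magic, new_pos)        # repoint the header


data = bytearray(32)
HEADER.pack_into(data, 0, b"toy!", 8)   # the record starts right after the header
data[8:13] = b"hello"
move_record(data, 5, 20)
magic, position = HEADER.unpack_from(data, 0)
```

Overwriting the old bytes with "X" makes the check meaningful: if any reader still finds the record, it can only be by following the updated pointer.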
My conclusion from this is that it will be possible to have an "update" function, not just "recreate", despite what I've said in the past about that being out of scope (e.g. scikit-hep/uproot3#381, scikit-hep/uproot3#460, scikit-hep/uproot3#530). I haven't written any of that new code, just determined that it will be possible.
I have all of @reikdas's Uproot 3 work to draw on, and I just came across this:
https://github.com/root-project/root/tree/master/io/doc/TFile
Many of the fundamental serialization types of ROOT are specified, including the free segments used above. I've gone through and verified these against what we use in Uproot—this is a fantastic resource I wish I'd known about before!