fix: RNTuple bug-fixing offset array concatenation, adding filter_name #1285

giedrius2020 · 2024-09-06T12:51:05Z

Changes made in RNTuple model:

Bug fix: when reading arrays from more than one cluster, the data was not identical to what TTree returns. The cause was improper concatenation of arrays that hold offset values. More information here: https://gist.github.com/giedrius2020/c1ca87784779418ee3e938aaca31ff75
Added a way to filter the wanted keys in keys().
Minimal change: renamed the existing "filter_names" variables to "filter_name", because that is what the variable is called in the TTree model.
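
For illustration, here is a minimal sketch of what name filtering in keys() could look like. This is not the code from this PR or from uproot: the helper `filter_keys` is hypothetical, and it assumes that `filter_name` may be None, a glob-style string, a list of such strings, or a callable.

```python
from fnmatch import fnmatchcase


def filter_keys(names, filter_name=None):
    """Return the names selected by filter_name (hypothetical helper).

    Assumed semantics: None keeps everything, a callable is used as a
    predicate, and a string or list of strings is matched as glob patterns.
    """
    if filter_name is None:
        return list(names)
    if callable(filter_name):
        return [name for name in names if filter_name(name)]
    # Normalize a single pattern to a list of patterns.
    patterns = [filter_name] if isinstance(filter_name, str) else list(filter_name)
    return [
        name
        for name in names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


names = ["Muon_pt", "Muon_eta", "Jet_pt"]
muon_keys = filter_keys(names, "Muon_*")
pt_keys = filter_keys(names, lambda name: name.endswith("_pt"))
```

Keeping the same parameter name in both models means a call written for one can be reused for the other without renaming arguments.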

src/uproot/models/RNTuple.py

ariostas · 2024-09-06T15:15:39Z

Thank you for catching this @giedrius2020! I think the issue could be solved in a simpler way by adjusting some of the logic here. I'll think back to what my reasoning was for this piece.

uproot5/src/uproot/models/RNTuple.py

Lines 511 to 521 in 012df94

    
cumsum = 0
for page_desc in pagelist:
    n_elements = page_desc.num_elements
    tracker_end = tracker + n_elements
    self.read_pagedesc(
        res[tracker:tracker_end], page_desc, dtype_str, dtype, nbits, split
    )
    if delta:
        res[tracker] -= cumsum
        cumsum += numpy.sum(res[tracker:tracker_end])
    tracker = tracker_end
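
The loop quoted above can be tried in isolation with a small standalone sketch. It rests on two assumptions that are not shown in this thread: each delta-encoded page stores its first value as an absolute number followed by differences, and a single cumulative sum is applied to the whole column afterward. The function `decode_delta_pages` is hypothetical and takes in-memory lists of non-empty pages instead of calling `read_pagedesc`.

```python
import numpy as np


def decode_delta_pages(pages):
    """Decode a list of non-empty delta-encoded pages (hypothetical helper).

    Assumption: in every page the first value is absolute and the
    remaining values are differences to the previous element.
    """
    res = np.concatenate([np.asarray(page, dtype=np.int64) for page in pages])
    tracker = 0
    cumsum = 0
    for page in pages:
        tracker_end = tracker + len(page)
        # Turn the page's absolute first value into a difference relative
        # to everything decoded so far (a no-op for the first page).
        res[tracker] -= cumsum
        cumsum += np.sum(res[tracker:tracker_end])
        tracker = tracker_end
    # Assumption: one cumulative sum over the column restores the values.
    return np.cumsum(res)


# Two pages encoding the values 3, 5, 6 and 10, 11, 15.
decoded = decode_delta_pages([[3, 2, 1], [10, 1, 4]])
```

Because `cumsum` is still zero when the first page is processed, the subtraction only has an effect from the second page onward.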

ariostas · 2024-09-06T15:29:41Z

Could you try checking if changing the last few lines to

if delta:
    if tracker > 0:
        res[tracker] -= cumsum
    cumsum += numpy.sum(res[tracker:tracker_end])
tracker = tracker_end

fixes the issue?

davidlange6 · 2024-09-06T22:12:07Z

perhaps best that we separate the api change to keys() from the bug fix?

otherwise, perhaps beyond this PR, is there some test in uproot intended to check the physics correctness of reading back files?

giedrius2020 · 2024-09-09T13:09:31Z

> Could you try checking if changing the last few lines to
>
> if delta:
>     if tracker > 0:
>         res[tracker] -= cumsum
>     cumsum += numpy.sum(res[tracker:tracker_end])
> tracker = tracker_end
>
> fixes the issue?

@ariostas, I tried your suggestion (with my changes disabled) and it did not help.

The result is still the same; the offset arrays for each cluster look like this:

Offset arrays before adjusting: 
Cluster 1: [    0     0     1 ... 22134 22134 22135]
Cluster 2: [    0     0     2 ... 35045 35046 35048]
Cluster 3: [    0     2     4 ... 34930 34931 34931]
...

They should look like this instead:

Offset arrays after adjusting: 
Cluster 1: [    0     0     1 ... 22134 22134 22135]
Cluster 2: [22135 22137 22138 ... 57180 57181 57183]
Cluster 3: [57185 57187 57187 ... 92113 92114 92114]
...
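
The adjustment shown above can be sketched as follows. This is an illustration, not the code from this PR: `concatenate_cluster_offsets` is a hypothetical helper, and it assumes that every cluster's offset array is non-empty and starts at 0 in cluster-local numbering.

```python
import numpy as np


def concatenate_cluster_offsets(cluster_offsets):
    """Merge cluster-local offset arrays into one global offset array.

    Assumption: each input array is non-empty and starts at 0.
    """
    pieces = []
    total = 0  # last global offset seen so far
    for i, local in enumerate(cluster_offsets):
        # Shift the local offsets so they continue where the previous
        # cluster ended.
        adjusted = np.asarray(local, dtype=np.int64) + total
        # After shifting, the leading entry repeats the previous cluster's
        # last offset, so keep it only for the first cluster.
        pieces.append(adjusted if i == 0 else adjusted[1:])
        total = adjusted[-1]
    return np.concatenate(pieces)


merged = concatenate_cluster_offsets([[0, 0, 1, 3], [0, 2, 3], [0, 1]])
```

Without the shift, the offsets restart at 0 in every cluster, which is the pattern in the "before adjusting" output.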

ariostas · 2024-09-09T13:23:30Z

> I tried your suggestion (while disabling my changes) and it did not help.

Ah okay, thanks

> perhaps best that we separate the api change to keys() from the bug fix?

Since the RNTuple stuff isn't stable yet, I wouldn't worry much about separating things

> otherwise, perhaps beyond this PR, is there some test in uproot intended to check the physics correctness of reading back files?

I've added some tests, but the issue was that I didn't have a test file that was large enough to have multiple clusters, so that's why I hadn't seen this bug

src/uproot/models/RNTuple.py

davidlange6 · 2024-09-10T17:39:45Z

can one control cluster size in rntuple writing? indeed, i guess one doesn't want a few 100 MB of test file around

ariostas · 2024-09-10T19:12:45Z

> can one control cluster size in rntuple writing?

I'm not sure, but that would be nice. I'll look into it

Co-authored-by: Andres Rios Tascon <ariostas@gmail.com>

ariostas

Thank you @giedrius2020, this looks great. The test that is failing is unrelated.

I'm going to try to generate a small RNTuple with multiple clusters so that we can add a test for this.

ariostas · 2024-09-11T18:05:08Z

I added a new test file in scikit-hep/scikit-hep-testdata#159. Let's add a new test once that gets merged and released.

jpivarski · 2024-09-12T18:56:36Z

That worked! I'm updating again to be sure that we get all of the 3.13 tests.

jpivarski · 2024-09-12T19:56:44Z

Now this should include the pyodide-build, and I'll enable auto-merge because I'm sure it will pass.

jpivarski · 2024-09-20T18:24:52Z

@all-contributors please add @giedrius2020 for code

allcontributors · 2024-09-20T18:25:02Z

@jpivarski

I've put up a pull request to add @giedrius2020! 🎉

* Fixed __len__ method
* Added a few more useful methods
* Use the right number in arrays method
* Updated to match spec and did some cleanup
* Fixed order of extra type information
* Extract column summary flags
* style: pre-commit fixes
* Fixed conflict resolution
* Fixed test
* Switched to using enums
* Fixed RNTuple anchor
* Updated locator types
* Removed UserMetadata envelope
* Started implementing new real32 types
* Updated sharded cluster to match spec
* Removed user metadata from footer
* Fixed ClusterSummaryReader
* Fix cascadentuple
* Introduced RNTupleField class
* Added test for #1285
* Fixed test
* Fix test (attempt 2)
* Finalized first version of RNTupleField
* Added tests for RNTupleField
* Implemented iterate method

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fixing cardinality cluster edges

4c17529

giedrius2020 changed the title ~~RNTuple: Bug-fixing offset array concatenation, adding filter_name~~ fix: RNTuple bug-fixing offset array concatenation, adding filter_name Sep 6, 2024

pre-commit-ci bot and others added 2 commits September 6, 2024 12:52

style: pre-commit fixes

980b279

Ruff CI error fix

d0ebe1f

ariostas reviewed Sep 6, 2024
