Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: RNTuple ug-fixing offset array concatenation, adding filter_name #1285

Merged

Conversation

giedrius2020
Copy link
Contributor

@giedrius2020 giedrius2020 commented Sep 6, 2024

Changes made in RNTuple model:

  • Bug-fix: when reading arrays from more than one cluster, data was not identical to TTree. The reason for this was improper array concatenation in case of offset values. More information in here: https://gist.github.com/giedrius2020/c1ca87784779418ee3e938aaca31ff75
  • Added a function to filter wanted key values in def keys()
  • Minimal change: changed existing "filter_names" variables to "filter_name", because that is how the variable was called in TTree model.

@giedrius2020 giedrius2020 changed the title RNTuple: Bug-fixing offset array concatenation, adding filter_name fix: RNTuple ug-fixing offset array concatenation, adding filter_name Sep 6, 2024
@ariostas
Copy link
Collaborator

ariostas commented Sep 6, 2024

Thank you for catching this @giedrius2020! I think the issue could be solved in a simpler way by adjusting some of the logic in here. I'll think back at what my reasoning was for this piece.

cumsum = 0
for page_desc in pagelist:
n_elements = page_desc.num_elements
tracker_end = tracker + n_elements
self.read_pagedesc(
res[tracker:tracker_end], page_desc, dtype_str, dtype, nbits, split
)
if delta:
res[tracker] -= cumsum
cumsum += numpy.sum(res[tracker:tracker_end])
tracker = tracker_end

@ariostas
Copy link
Collaborator

ariostas commented Sep 6, 2024

Could you try checking if changing the last few lines to

if delta:
    if tracker > 0:
        res[tracker] -= cumsum
    cumsum += numpy.sum(res[tracker:tracker_end])
tracker = tracker_end 

fixes the issue?

@davidlange6
Copy link

perhaps best that we separate the api change to keys() from the bug fix?

otherwise, perhaps beyond this PR, is there some test in uproot intended to check the physics correctness of reading back files?

@giedrius2020
Copy link
Contributor Author

giedrius2020 commented Sep 9, 2024

Could you try checking if changing the last few lines to

if delta:
    if tracker > 0:
        res[tracker] -= cumsum
    cumsum += numpy.sum(res[tracker:tracker_end])
tracker = tracker_end 

fixes the issue?

@ariostas , I tried your suggestion (while disabling my changes) and it did not help.

The result is still the same, arrays for all cluster look like this:

Offset arrays before adjusting: 
Cluster 1: [    0     0     1 ... 22134 22134 22135]
Cluster 2: [    0     0     2 ... 35045 35046 35048]
Cluster 3: [    0     2     4 ... 34930 34931 34931]
...

While it should be like this:

Offset arrays after adjusting: 
Cluster 1: [    0     0     1 ... 22134 22134 22135]
Cluster 2: [22135 22137 22138 ... 57180 57181 57183]
Cluster 3: [57185 57187 57187 ... 92113 92114 92114]
...

@ariostas
Copy link
Collaborator

ariostas commented Sep 9, 2024

I tried your suggestion (while disabling my changes) and it did not help.

Ah okay, thanks

perhaps best that we separate the api change to keys() from the bug fix?

Since the RNTuple stuff isn't stable yet, I wouldn't worry much about separating things

otherwise, perhaps beyond this PR, is there some test in uproot intended to check the physics correctness of reading back files?

I've added some tests, but the issue was that I didn't have a test file that was large enough to have multiple clusters, so that's why I hadn't seen this bug

@davidlange6
Copy link

can one control cluster size in rntuple writing? indeed, i guess one doesn't want a few 100 MB of test file around

@ariostas
Copy link
Collaborator

can one control cluster size in rntuple writing?

I'm not sure, but that would be nice. I'll look into it

giedrius2020 and others added 3 commits September 11, 2024 11:42
Co-authored-by: Andres Rios Tascon <ariostas@gmail.com>
Co-authored-by: Andres Rios Tascon <ariostas@gmail.com>
Copy link
Collaborator

@ariostas ariostas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @giedrius2020, this looks great. The test that is failing is unrelated.

I'm going to try to generate a small RNTuple with multiple clusters so that we can add a test for this.

@ariostas
Copy link
Collaborator

I added a new test file in scikit-hep/scikit-hep-testdata#159. Let's add a new test once that gets merged and released.

@jpivarski
Copy link
Member

That worked! I'm updating again to be sure that we get all of the 3.13 tests.

@jpivarski
Copy link
Member

Now this should include the pyodide-build, and I'll enable auto-merge because I'm sure it will pass.

@jpivarski jpivarski enabled auto-merge (squash) September 12, 2024 19:57
@jpivarski jpivarski merged commit dc19ce9 into scikit-hep:main Sep 12, 2024
25 checks passed
@jpivarski
Copy link
Member

@all-contributors please add @giedrius2020 for code

Copy link
Contributor

@jpivarski

I've put up a pull request to add @giedrius2020! 🎉

ariostas added a commit that referenced this pull request Oct 9, 2024
ariostas added a commit that referenced this pull request Nov 5, 2024
* Fixed __len__ method

* Added a few more useful methods

* Use the right number in arrays method

* Updated to match spec and did some cleanup

* Fixed order of extra type information

* Extract column summary flags

* style: pre-commit fixes

* Fixed conflict resolution

* Fixed test

* Switched to using enums

* Fixed RNTuple anchor

* Updated locator types

* Removed UserMetadata envelope

* Started implementing new real32 types

* Updated sharded cluster to match spec

* Removed user metadata from footer

* Fixed ClusterSummaryReader

* Fix cascadentuple

* Introduced RNTupleField class

* Added test for #1285

* Fixed test

* Fix test (attempt 2)

* Finalized first version of RNTupleField

* Added tests for RNTupleField

* Implemented iterate method

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants