Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Infrastructure for writing of RNTuple (incomplete functionality) #705

Merged
merged 53 commits into from
Jan 6, 2023

Conversation

Moelf
Copy link
Collaborator

@Moelf Moelf commented Sep 5, 2022

No description provided.

@Moelf Moelf changed the title Support writing of RNTuple [feat] Support writing of RNTuple Sep 5, 2022
@Moelf Moelf changed the title [feat] Support writing of RNTuple feat: Support writing of RNTuple Sep 5, 2022
@Moelf
Copy link
Collaborator Author

Moelf commented Sep 5, 2022

The bytes blob inside a RNTuple is pre-pended by not TKey but RBlob (and these 'RBlob's are not put in the TFile-wide directory):

In [34]: t = ROOT.TFile("/home/akako/Documents/github/scikit-hep-testdata/src/skhep_testdata/data/test_ntuple_int_10.root")

In [35]: t.Map()
20220628/101853  At:100    N=112       TFile                     
20220628/101853  At:212    N=405       StreamerInfo   CX =  2.73 
20220628/101853  At:617    N=120       ROOT::Experimental::RNTuple            
20220628/101853  At:737    N=95        KeysList                  
20220628/101853  At:832    N=167       RBlob          CX =  1.16 
20220628/101853  At:999    N=74        RBlob                     
20220628/101853  At:1073   N=94        RBlob                     
20220628/101853  At:1167   N=115       RBlob          CX =  1.20 
20220628/101853  At:1282   N=39        FreeSegments              
20220628/101853  At:1321   N=1         END 

In [49]: rn = up.open("/home/akako/Documents/github/scikit-hep-testdata/src/skhep_testdata/data/test_ntuple_int_10.root")["ntuple"]

In [50]: rn._members
Out[50]: 
{'fCheckSum': 1700499286,
 'fVersion': 0,
 'fSize': 48,
 'fSeekHeader': 866,
 'fNBytesHeader': 133,
 'fLenHeader': 159,
 'fSeekFooter': 1201,
 'fNBytesFooter': 81,
 'fLenFooter': 104,
 'fReserved': 0}

First RBlob is header:

In [51]: 866+133 == 832+167
Out[51]: True

Second RBlob is the column content

In [20]: link = rn.page_list_envelopes.pagelinklist[0][0]

In [21]: rn.pagelist(link)
Out[21]: [MetaData('PageDescription', num_elements=10, locator=MetaData('Locator', num_bytes=40, offset=1033))]

In [23]: 1033+40 == 1073
Out[23]: True

Third RBlob is one of the page list:

In [56]: 1073+94
Out[56]: 1167

In [62]: rn.page_list_envelopes.pagelinklist[0][0].chunk
Out[62]: <Chunk 1107-1167>

Fourth (last) RBlob is footer:

In [52]: 1201+81 == 1167+115
Out[52]: True

@Moelf
Copy link
Collaborator Author

Moelf commented Sep 5, 2022

In [20]: akform = ak._v2.from_iter([{"one": 1, "two": 2.0}]).layout.form

In [21]: with up.recreate("/tmp/test.root") as file:
    ...:     file.mktree("Events", {"one":"int"})
    ...:     file.mkntuple("ntuple", akform)

In [23]: up.open("/tmp/test.root")["ntuple"]._members
Out[23]: 
{'fCheckSum': 1700499286,
 'fVersion': 0,
 'fSize': 48,
 'fSeekHeader': 866,
 'fNBytesHeader': 133,
 'fLenHeader': 159,
 'fSeekFooter': 1201,
 'fNBytesFooter': 81,
 'fLenFooter': 104,
 'fReserved': 0}

In [22]: rn = up.open("/home/akako/Documents/github/scikit-hep-testdata/src/skhep_testdata/data/test_ntuple_int_10.root")["ntuple"]

In [24]: rn._members
Out[24]: 
{'fCheckSum': 1700499286,
 'fVersion': 0,
 'fSize': 48,
 'fSeekHeader': 866,
 'fNBytesHeader': 133,
 'fLenHeader': 159,
 'fSeekFooter': 1201,
 'fNBytesFooter': 81,
 'fLenFooter': 104,
 'fReserved': 0}

@Moelf
Copy link
Collaborator Author

Moelf commented Sep 5, 2022

in a different file, the stl_container test file, we have:

  • one header RBlob
  • one footer RBlob
  • one chunk for pagelinklist (the 30 entries share same chunk just at different offset)
In [48]: rn.page_list_envelopes.pagelinklist[0][0].chunk
Out[48]: <Chunk 0-1104>
  • 30 page data blobs:
In [49]: len(rn.page_list_envelopes.pagelinklist[0])
Out[49]: 30

@Moelf
Copy link
Collaborator Author

Moelf commented Sep 6, 2022

In [227]: with up.recreate("/tmp/test.root") as file:
     ...:     file.mkntuple("ntuple", akform)

In [228]: up.open("/tmp/test.root")["ntuple"].header
Out[228]: MetaData('HeaderReader', env_header={'env_version': 1, 'min_version': 1}, feature_flag=0, rc_tag=1, name='ntuple', ntuple_description='', writer_identifier='uproot 5.0.0rc2', field_records=[MetaData('FieldRecordFrame', field_version=0, type_version=0, parent_field_id=0, struct_role=0, flags=0, field_name='one_integers', type_name='std::int32_t', type_alias='', field_desc='')], column_records=[MetaData('ColumnRecordFrame', type=11, nbits=32, field_id=0, flags=0)], alias_columns=[], extra_type_infos=[], crc32=3689245460)

@Moelf
Copy link
Collaborator Author

Moelf commented Sep 7, 2022

In [444]: with up.recreate("/tmp/test.root") as file:
     ...:     akform =  ak._v2.forms.RecordForm([ak._v2.forms.NumpyForm('float64'), ak._v2.forms.NumpyForm('int32'), ak._v2.forms.Numpy
     ...: Form('bool')], ['one', 'two', 'three'])
     ...:     file.mkntuple("ntuple", akform)
up
In [445]: up.open("/tmp/test.root")["ntuple"].header.field_records
Out[445]: 
[MetaData('FieldRecordFrame', field_version=0, type_version=0, parent_field_id=0, struct_role=0, flags=0, field_name='one', type_name='double', type_alias='', field_desc=''),
 MetaData('FieldRecordFrame', field_version=0, type_version=0, parent_field_id=1, struct_role=0, flags=0, field_name='two', type_name='std::int32_t', type_alias='', field_desc=''),
 MetaData('FieldRecordFrame', field_version=0, type_version=0, parent_field_id=2, struct_role=0, flags=0, field_name='three', type_name='bit', type_alias='', field_desc='')]

In [446]: up.open("/tmp/test.root")["ntuple"].header.column_records
Out[446]: 
[MetaData('ColumnRecordFrame', type=7, nbits=64, field_id=0, flags=0),
 MetaData('ColumnRecordFrame', type=11, nbits=32, field_id=1, flags=0),
 MetaData('ColumnRecordFrame', type=6, nbits=1, field_id=2, flags=0)]

@Moelf
Copy link
Collaborator Author

Moelf commented Sep 16, 2022

In [149]: array
Out[149]: <Array [{one_integers: 9}, {...}, ..., {...}] type='10 * {one_integers: int32}'>

In [150]: with up.recreate("/tmp/test.root") as file:
     ...:     akform =  ak._v2.forms.RecordForm([ak._v2.forms.NumpyForm('int32')], ['one_integers'])
     ...:     file.mkntuple("ntuple", akform)
     ...:     a = file["ntuple"]
     ...:     a.extend(array)
     ...:     file.close()
     ...:     rn = up.open("/tmp/test.root")["ntuple"]
     ...:     assert ak.all(rn.arrays()["one_integers"]  == np.array([9,8,7,6,5,4,3,2,1,0]))

comment:

now we can do round trip with ourselves but ROOT doesn't like it, there are two visible problems:

  1. ROOT thinks the total # of entries is 0 for some reason
  2. https://github.com/root-project/root/blob/ef62da7335eecdea3df98269f2d0cccbda600b98/tree/ntuple/v7/src/RPageStorageFile.cxx#L383 is violated (when doing ntuple.LoadEntry(0)

for 1, I tried to look at the source code, and I found that https://github.com/root-project/root/blob/ef62da7335eecdea3df98269f2d0cccbda600b98/tree/ntuple/v7/inc/ROOT/RNTuple.hxx#L196, where the fSource is a pointer to a RPageSource , and that eventually calls https://github.com/root-project/root/blob/ef62da7335eecdea3df98269f2d0cccbda600b98/tree/ntuple/v7/src/RPageStorage.cxx#L94

the problem: I've never seen "descriptor" and I'm not sure what they are referring to / if it's part of the old world or not

for 2, I suspect it has something to do with 1, if ROOT doesn't detect our rntuple has X number of events, we probably have some metadata regarding clusters incorrect, because I would expect the total entry numbers is just a sum of cluster_summaries in the footer.

@jpivarski jpivarski added the inactive A pull request that hasn't been touched in a long time label Nov 28, 2022
@jpivarski
Copy link
Member

@Moelf, I'm marking this PR as "inactive." When the RNTuple-writing feature is eventually implemented, it might need to be overlaid on a more recent version of main (that is, by closing this and reopening it as a new PR), but I'd like this to remain in the list of PRs as a reminder to get back to it and a knowledge base of how to do it.

@Moelf
Copy link
Collaborator Author

Moelf commented Nov 28, 2022

looks like we can still just update? I will try to make test pass somewhat, but since it's not usable I won't try to advocate for a merge

@jpivarski
Copy link
Member

Got it, thanks! We might be doing a lot of bug-fixes in the coming weeks when a wave of non-early-adopter users get Uproot 5. It shouldn't impact the RNTuple submodule, but main will move.

@jpivarski
Copy link
Member

awkward._v2 is now just awkward. We couldn't find a clean way to make a fake submodule point at its parent. (You'd think you could, but there were hidden subtleties.) That's the reason for the first failure. The main branch takes care of that for most of the codebase, but if you have new code in this PR branch that assumes the existence of a _v2 submodule, you should be able to fix it by just removing that "._v2" string.

@jpivarski jpivarski removed the inactive A pull request that hasn't been touched in a long time label Jan 5, 2023
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of merging this PR as soon as the tests pass.

  • It includes important code for reading std::array<T>. We want to get all of the reading updates in.
  • We don't want the code for writing to get out of date with main, even if it is incomplete, so that it will be easier to finish later.
  • None of this should break things for regular Uproot users, since they'd have to go out of their way to even try to use this. At the very least, they'd have to get access to a file with an RNTuple in it, and those are well-advertised as being experimental.

@jpivarski
Copy link
Member

To add to the above, it only touches a few files used by the rest of Uproot:

  • src/uproot/const.py: adds a few new constants and a dependency on the struct Python standard library module
  • src/uproot/writing/_cascade.py: adds a new function, add_ntuple (maybe the name should be add_rntuple? There's already a TNTuple, and we don't mean that.)
  • src/uproot/writing/writable.py: adds new methods, a new class, a new (hidden) attribute, and an elif case to WritableDirectory._get.

None of this should cause any problems for current users of Uproot.

@Moelf
Copy link
Collaborator Author

Moelf commented Jan 6, 2023

seems weird

tests/test_0791-protect-uproot-project_columns-from-dask-node-names.py::test - AssertionError: assert 

only happens on >3.7

@jpivarski
Copy link
Member

dask-awkward only runs in Python 3.7+. However, it shouldn't fail. I'm going to run the tests in main: https://github.com/scikit-hep/uproot5/actions/runs/3851990477

We'll see if it's broken overall or only in this branch.

@Moelf
Copy link
Collaborator Author

Moelf commented Jan 6, 2023

looks like manually triggered CI also failed on main

@Moelf Moelf marked this pull request as ready for review January 6, 2023 19:36
@jpivarski jpivarski enabled auto-merge (squash) January 6, 2023 19:37
@jpivarski jpivarski merged commit 600005f into scikit-hep:main Jan 6, 2023
@Moelf Moelf changed the title feat: Support writing of RNTuple feat: Infrastructure for writing of RNTuple (incomplete functionality) Jan 6, 2023
@Moelf Moelf deleted the RNTuple branch November 12, 2023 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants