PHYSLITE schema and inconsistent amounts of data being read for the same task #1073
Comments
What I suspect: In the [...] I'm not quite sure how everything is wired up in dask mode now. I think that in the past the various caches prevented the branch from being read multiple times, but I'm not sure how this works now. To check the suspicion I ran your code through the debugger, inspecting the forms in coffea/src/coffea/nanoevents/schemas/physlite.py, lines 66 to 71 in 750b96d.
They look like the following:
ipdb> base_form
{'class': 'RecordArray', 'contents': [{'class': 'ListOffsetArray', 'offsets': 'i64', 'content': {'class': 'NumpyArray', 'primitive': 'float32', 'inner_shape': [], 'parameters': {'__doc__': 'AnalysisJetsAuxDyn.pt'}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load%2C%21content'}, 'parameters': {'__doc__': 'AnalysisJetsAuxDyn.pt'}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load'}], 'fields': ['AnalysisJetsAuxDyn.pt'], 'parameters': {'__doc__': 'CollectionTree', 'metadata': {'dataset': 'ttbar'}}, 'form_key': None}
ipdb> output
{'Jets': {'class': 'ListOffsetArray', 'offsets': 'i64', 'content': {'class': 'RecordArray', 'fields': ['pt', '_eventindex'], 'contents': [{'class': 'NumpyArray', 'primitive': 'float32', 'inner_shape': [], 'parameters': {'__doc__': 'AnalysisJetsAuxDyn.pt'}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load%2C%21content'}, {'class': 'NumpyArray', 'parameters': {}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load%2C%21eventindex%2C%21content', 'itemsize': 8, 'primitive': 'int64'}], 'form_key': '%21invalid%2CJets', 'parameters': {'__record__': 'Particle', 'collection_name': 'Jets'}}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load'}}
where one can see the column
The same two also occur in the transformed form (
These additional instructions (
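To make the difference between the two forms easier to see, here is a minimal sketch (not part of coffea) that walks both dictionaries, collects every `form_key`, and URL-decodes the keys that appear only in `output`. The dictionaries are copied from the debugger output above with the `parameters` entries trimmed; the helper name `collect_form_keys` is hypothetical and only used for illustration.

```python
from urllib.parse import unquote

# Forms copied from the ipdb output above; 'parameters' entries are trimmed
# for brevity, the 'form_key' values are unchanged.
base_form = {
    "class": "RecordArray",
    "contents": [{
        "class": "ListOffsetArray",
        "offsets": "i64",
        "content": {
            "class": "NumpyArray",
            "primitive": "float32",
            "form_key": "AnalysisJetsAuxDyn.pt%2C%21load%2C%21content",
        },
        "form_key": "AnalysisJetsAuxDyn.pt%2C%21load",
    }],
    "fields": ["AnalysisJetsAuxDyn.pt"],
    "form_key": None,
}

output = {
    "Jets": {
        "class": "ListOffsetArray",
        "offsets": "i64",
        "content": {
            "class": "RecordArray",
            "fields": ["pt", "_eventindex"],
            "contents": [
                {
                    "class": "NumpyArray",
                    "primitive": "float32",
                    "form_key": "AnalysisJetsAuxDyn.pt%2C%21load%2C%21content",
                },
                {
                    "class": "NumpyArray",
                    "primitive": "int64",
                    "form_key": "AnalysisJetsAuxDyn.pt%2C%21load%2C%21eventindex%2C%21content",
                },
            ],
            "form_key": "%21invalid%2CJets",
        },
        "form_key": "AnalysisJetsAuxDyn.pt%2C%21load",
    }
}


def collect_form_keys(form):
    """Recursively gather every string-valued 'form_key' in a nested form (hypothetical helper)."""
    keys = []
    if isinstance(form, dict):
        key = form.get("form_key")
        if isinstance(key, str):
            keys.append(key)
        for value in form.values():
            keys.extend(collect_form_keys(value))
    elif isinstance(form, list):
        for item in form:
            keys.extend(collect_form_keys(item))
    return keys


# Keys that exist only in the schema output, decoded for readability.
extra_keys = sorted(
    unquote(k) for k in set(collect_form_keys(output)) - set(collect_form_keys(base_form))
)
for key in extra_keys:
    print(key)
```

Decoded, the keys read as comma-separated lists of instructions. The `_eventindex` entry has a key that starts with the same `AnalysisJetsAuxDyn.pt,!load` prefix as the `pt` column itself, which would be consistent with the suspicion above, though this sketch only inspects the forms and does not measure any reads.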
@nikoladze - ok, if I understand this, this is really the same root cause as #1074. Is that right? I'm asking because that one seems tricky to solve - so it won't show up for a while.
Describe the bug
I am trying to track how much data exactly is getting read when reading PHYSLITE files. I am observing that this differs with the schema being used. In particular, I am observing for a test file (full reproducer below):
- coffea with PHYSLITESchema
- uproot.open (no schema)
- uproot.dask (no schema)
- coffea with BaseSchema
In all cases I request the same branch. Why does the report change with the schemas? Which extra information is being read, and why is that information needed?
I am also looking at the results of dak.report_necessary_columns, which only shows the specific branch I want to read anyway and does not show anything else in addition that might explain a discrepancy.
I am happy to test more things but am somewhat stuck trying to understand the behavior. I also tested a similar setup on a CMS Open Data NanoAOD file with the corresponding NanoAOD schema and do not observe a similar kind of discrepancy there.
cc @nikoladze as expert on the schema
To Reproduce
full reproducer is at https://gist.github.com/alexander-held/8af116d93e936c5930648f1dea4fb02b (includes optional download for ~200 MB input file)
Expected behavior
I expect the same amount of data to be read in all configurations.
Output
see gist
Desktop (please complete the following information):
Additional context
n/a