PHYSLITE schema and inconsistent amounts of data being read for the same task #1073

alexander-held · 2024-04-10T20:40:02Z

Describe the bug

I am trying to track how much data exactly is getting read when reading PHYSLITE files. I am observing that this differs with the schema being used. In particular, I am observing for a test file (full reproducer below):

4.4 MB read through coffea with PHYSLITESchema
1.9 MB read with simple uproot.open (no schema)
1.9 MB read with simple uproot.dask (no schema)
3.1 MB read through coffea with BaseSchema

In all cases I request the same branch. Why does the report change with the schemas? Which extra information is being read, and why is that information needed?
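To put the four figures listed above on a common scale, the following minimal sketch expresses each one relative to the plain `uproot.open` baseline. The dictionary labels are made up for illustration; only the MB values come from the measurements above.

```python
# Reported amounts of data read (in MB) for the same branch, copied from the list above.
# The dictionary keys are illustrative labels, not identifiers from any library.
reported_mb = {
    "coffea PHYSLITESchema": 4.4,
    "uproot.open (no schema)": 1.9,
    "uproot.dask (no schema)": 1.9,
    "coffea BaseSchema": 3.1,
}

baseline = reported_mb["uproot.open (no schema)"]

for name, mb in reported_mb.items():
    extra = mb - baseline   # additional MB read compared to plain uproot.open
    ratio = mb / baseline   # how many times the baseline amount was read
    print(f"{name:26s} {mb:4.1f} MB   extra {extra:+.1f} MB   ratio {ratio:.2f}x")
```

The two schema-based configurations come out at roughly 1.6x and 2.3x the baseline, so neither is an exact integer multiple of the schema-less read.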

I am also looking at the results of dak.report_necessary_columns, which shows only the specific branch I want to read and nothing else that might explain the discrepancy.

I am happy to test more things but am somewhat stuck trying to understand the behavior. I also tested a similar setup on a CMS Open Data NanoAOD file with the corresponding NanoAOD schema and do not observe a similar discrepancy there.

cc @nikoladze as expert on the schema

To Reproduce
full reproducer is at https://gist.github.com/alexander-held/8af116d93e936c5930648f1dea4fb02b (includes optional download for ~200 MB input file)

Expected behavior
I expect the same amount of data to be read in all configurations.

Output
see gist

Desktop (please complete the following information):

awkward: 2.6.2
dask-awkward: 2024.3.0
uproot: 5.3.2
coffea: 2024.3.0

Additional context
n/a


nikoladze · 2024-04-11T16:45:04Z

What I suspect: with PHYSLITESchema, AnalysisJetsAuxDyn.pt might be read 3 times: once for the offsets, once for the content, and once to produce the _eventindex field that is attached to every collection so that global indices for ElementLinks can be calculated dynamically (the global index is the event index + local index).
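As a rough illustration of why an extra per-element index column could be useful, here is a pure-Python sketch. It assumes that "event index" means the index of the event each flattened element belongs to, and that a global index is the target collection's start offset for that event plus the local index. The function names and the toy offsets are hypothetical and are not taken from coffea.

```python
def event_index(offsets):
    """For each element of the flattened content, return the index of its event.

    `offsets` follows the usual convention: event i owns elements
    offsets[i] .. offsets[i + 1] - 1 of the flattened content.
    """
    return [
        event
        for event in range(len(offsets) - 1)
        for _ in range(offsets[event + 1] - offsets[event])
    ]


def global_index(target_offsets, event, local):
    """Position in the flattened target collection of element `local` of `event`.

    Assumption for this sketch: global index = start offset of the event in the
    target collection + local index within that event.
    """
    return target_offsets[event] + local


# Toy jagged collection with 3 events holding 2, 0 and 3 elements.
jet_offsets = [0, 2, 2, 5]
per_element_event = event_index(jet_offsets)

# Toy target collection with 3, 1 and 2 elements per event.
target_offsets = [0, 3, 4, 6]
linked = global_index(target_offsets, event=2, local=1)
```

Producing such a column requires the offsets of the branch, which would fit the suspicion that building `_eventindex` triggers one more read of the same branch.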

The BaseSchema might still read the branch twice, one time for offsets, one time for content.

I'm not quite sure how everything is wired up in dask mode now. I think in the past reading the branch multiple times was avoided thanks to the various caches, but I'm not sure how this works now.

To check this suspicion I ran your code through the debugger and inspected the base_form and the rearranged form for PHYSLITE (output) in these lines of code:

coffea/src/coffea/nanoevents/schemas/physlite.py

Lines 66 to 71 in 750b96d

    
           def __init__(self, base_form, *args, **kwargs): 
        
               super().__init__(base_form) 
        
               form_dict = { 
        
                   key: form for key, form in zip(self._form["fields"], self._form["contents"]) 
        
               } 
        
               output = self._build_collections(form_dict)

They look like the following:

ipdb>  base_form
{'class': 'RecordArray', 'contents': [{'class': 'ListOffsetArray', 'offsets': 'i64', 'content': {'class': 'NumpyArray', 'primitive': 'float32', 'inner_shape': [], 'parameters': {'__doc__': 'AnalysisJetsAuxDyn.pt'}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load%2C%21content'}, 'parameters': {'__doc__': 'AnalysisJetsAuxDyn.pt'}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load'}], 'fields': ['AnalysisJetsAuxDyn.pt'], 'parameters': {'__doc__': 'CollectionTree', 'metadata': {'dataset': 'ttbar'}}, 'form_key': None}

ipdb>  output
{'Jets': {'class': 'ListOffsetArray', 'offsets': 'i64', 'content': {'class': 'RecordArray', 'fields': ['pt', '_eventindex'], 'contents': [{'class': 'NumpyArray', 'primitive': 'float32', 'inner_shape': [], 'parameters': {'__doc__': 'AnalysisJetsAuxDyn.pt'}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load%2C%21content'}, {'class': 'NumpyArray', 'parameters': {}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load%2C%21eventindex%2C%21content', 'itemsize': 8, 'primitive': 'int64'}], 'form_key': '%21invalid%2CJets', 'parameters': {'__record__': 'Particle', 'collection_name': 'Jets'}}, 'form_key': 'AnalysisJetsAuxDyn.pt%2C%21load'}}

where one can see the column AnalysisJetsAuxDyn.pt occurring twice in the form_key entries of base_form:

AnalysisJetsAuxDyn.pt%2C%21load%2C%21content for the content
AnalysisJetsAuxDyn.pt%2C%21load for the offsets (not quite sure anymore why there is no !offsets here)

The same two also occur in the transformed form (output - this is the actual PHYSLITE schema) and additionally there is

AnalysisJetsAuxDyn.pt%2C%21load%2C%21eventindex%2C%21content for creating the eventindex
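The counting done by eye above can be reproduced with a small recursive walk over the form dictionaries. This is only a sketch: the two dictionaries are trimmed copies of the ipdb output shown earlier (parameters and a few other fields dropped), and `collect_form_keys` and `branch_uses` are hypothetical helpers, not coffea functions.

```python
def collect_form_keys(form):
    """Recursively gather every non-None 'form_key' found in a nested form."""
    keys = []
    if isinstance(form, dict):
        if form.get("form_key") is not None:
            keys.append(form["form_key"])
        for value in form.values():
            keys.extend(collect_form_keys(value))
    elif isinstance(form, list):
        for item in form:
            keys.extend(collect_form_keys(item))
    return keys


def branch_uses(form, branch):
    """Form keys that start with the given branch name."""
    return [key for key in collect_form_keys(form) if key.startswith(branch)]


BRANCH = "AnalysisJetsAuxDyn.pt"

# Trimmed copy of base_form from the debugger session above.
base_form = {
    "class": "RecordArray",
    "contents": [{
        "class": "ListOffsetArray",
        "offsets": "i64",
        "content": {"class": "NumpyArray", "primitive": "float32",
                    "form_key": "AnalysisJetsAuxDyn.pt%2C%21load%2C%21content"},
        "form_key": "AnalysisJetsAuxDyn.pt%2C%21load",
    }],
    "fields": ["AnalysisJetsAuxDyn.pt"],
    "form_key": None,
}

# Trimmed copy of the rearranged PHYSLITE form (output).
output = {"Jets": {
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "fields": ["pt", "_eventindex"],
        "contents": [
            {"class": "NumpyArray", "primitive": "float32",
             "form_key": "AnalysisJetsAuxDyn.pt%2C%21load%2C%21content"},
            {"class": "NumpyArray", "primitive": "int64",
             "form_key": "AnalysisJetsAuxDyn.pt%2C%21load%2C%21eventindex%2C%21content"},
        ],
        "form_key": "%21invalid%2CJets",
    },
    "form_key": "AnalysisJetsAuxDyn.pt%2C%21load",
}}

base_uses = branch_uses(base_form, BRANCH)
output_uses = branch_uses(output, BRANCH)
print(len(base_uses), len(output_uses))
```

Counting form keys only shows how often the branch is referenced in each form. Whether each reference results in a separate read depends on caching, which this sketch does not model.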

These additional instructions (!load, !content, !eventindex, with ! urlencoded to %21) are a coffea-specific mini-language, with the transforms defined in src/coffea/nanoevents/transforms.py
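To make the form keys above easier to read, they can be URL-decoded with the standard library and split on commas. A minimal sketch: `decode_form_key` is a hypothetical helper, and treating the first token as the branch name and the rest as instructions is an assumption based only on the keys shown in this thread.

```python
from urllib.parse import unquote


def decode_form_key(form_key):
    """Split a URL-encoded form key into (branch, list of instructions).

    Assumes the first comma-separated token is the branch name and the
    remaining tokens are instructions, as in the keys shown in this thread.
    """
    decoded = unquote(form_key)  # %2C -> ",", %21 -> "!"
    branch, *instructions = decoded.split(",")
    return branch, instructions


for key in [
    "AnalysisJetsAuxDyn.pt%2C%21load",
    "AnalysisJetsAuxDyn.pt%2C%21load%2C%21content",
    "AnalysisJetsAuxDyn.pt%2C%21load%2C%21eventindex%2C%21content",
]:
    branch, instructions = decode_form_key(key)
    print(branch, instructions)
```

The sketch only decodes the strings; how each instruction is actually executed is defined by the transforms in coffea itself.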

gordonwatts · 2024-04-14T00:03:58Z

@nikoladze - ok, if I understand this, this is really the same root cause as #1074. Is that right? I'm asking because that one seems tricky to solve - so it won't show up for a while.

alexander-held added the bug (Something isn't working) label Apr 10, 2024

This was referenced Apr 10, 2024

Understand difference in num_requested_bytes between coffea + Dask setup and plain uproot.open iris-hep/idap-200gbps-atlas#27

Open

Lessons learned iris-hep/idap-200gbps-atlas#13

Open

nikoladze mentioned this issue Apr 12, 2024

PHYSLITE schema and EnergyPerSampling branch #1074

Open

matthewfeickert mentioned this issue Apr 16, 2024

G2.2: All core components of the Analysis Systems pipeline fully support distributed analysis iris-hep/analysis-systems-deliverables#2

Open

ekourlit mentioned this issue Jul 21, 2024

ATLAS PHYSLITESchema wish list #1135

Open
