Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a tool to merge several podio files into a single one #681

Merged
merged 14 commits into from
Nov 20, 2024

Conversation

jmcarcell
Copy link
Member

@jmcarcell jmcarcell commented Sep 19, 2024

BEGINRELEASENOTES

  • Add a tool to merge several podio files into a single one:
    • Metadata for every event can be saved or not
    • The same format as the first input file will be used (TTree or RNTuple)
    • Metadata about the input file names will be saved

ENDRELEASENOTES

Useful not to have to deal with many small files, even though the readers can read them fine.

Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does a lot of heavy lifting that might not be strictly necessary, as it unpacks and than repacks every collection in every frame, whereas a simple hadd would almost be enough for merging TTree based files. Additionally, it will do schema evolution (if applicable), so that there could be subtle changes to the produced output file. On the other hand, this will correctly handle input files with different schema versions, so that might be not too bad.

How much slower is this than a c++ implementation? There is #620 after all to solve a quite significant performance issue with podio-dump, and the difference is not only the long startup time in that case.

tools/podio-merge-files Outdated Show resolved Hide resolved
@jmcarcell
Copy link
Member Author

Ah I didn't think about hadd, this is a tool I have used a couple of times (well only the for loop) so I thought it would be nice to have. On small files (1 MB TTree files) it's much faster than in python, but that's the case with the worst relative overhead I would imagine. For RNTuples it crashes with the ones I tried.

@tmadlener
Copy link
Collaborator

For RNTuples it crashes with the ones I tried.

Can you specify what crashes in this case? hadd or the c++ version of this script?

@jmcarcell
Copy link
Member Author

hadd is crashing

@tmadlener
Copy link
Collaborator

Maybe add a bit of metadata that keeps track of which files the the merged files comes from.

@jmcarcell
Copy link
Member Author

This should be ready and now write some metadata with the names of the files as passed in the arguments (full paths being saved as full paths):

Key                           Value
--------------------------------------------------------------------------------
MergeInputFiles               [example_frame.root, example_frame2.root]

tools/podio-merge-files Outdated Show resolved Hide resolved
tools/podio-merge-files Outdated Show resolved Hide resolved
@tmadlener tmadlener merged commit 960795b into AIDASoft:master Nov 20, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants