Merging MC files #715

Open
mmkekic opened this issue Apr 6, 2020 · 13 comments

@mmkekic
Collaborator

mmkekic commented Apr 6, 2020

While dealing with #693 we ran into the problem of merging the configuration-table information from several nexus files. It seems to make sense to keep the configuration information (such as the Geometry/Physics used to generate the files) unique across all concatenated files. However, random_seed is per-file information that needs to be saved, and we are not sure what the best way to save it is. Maybe the best option is to have another table that matches event numbers to random seeds?

This issue is somewhat related to

  1. A future version of detsim (https://github.com/nextic/detsim.git) will be able to split long nexus events, so the nexus event_id will repeat and we will need a mapping between detsim_event_id and nexus_event_id (maybe random_seed can go into this table?). There is also the important issue of assigning detsim_event_id in a way that ensures it is unique per run.

  2. In general, how to deal with merging files with repeated event_id. This issue is certainly present when mixing several MC productions and we still don't have a good solution for it.
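
As a rough illustration of the mapping-table idea raised above, the sketch below keeps one row per detsim event and links it back to the nexus event and to the random seed of the file it came from. All names (`EventMapRow`, `build_event_map`, the column names) are hypothetical and not taken from IC or detsim.

```python
from typing import NamedTuple

class EventMapRow(NamedTuple):
    detsim_event_id: int   # unique within the merged output (hypothetical column)
    nexus_event_id: int    # original id, may repeat across files
    random_seed: int       # per-file seed carried along with each event

def build_event_map(files):
    """files: iterable of (random_seed, [nexus_event_id, ...]) pairs, one per input file."""
    rows = []
    next_id = 0
    for seed, nexus_ids in files:
        for nexus_id in nexus_ids:
            rows.append(EventMapRow(next_id, nexus_id, seed))
            next_id += 1
    return rows

# Two files that both contain nexus events 0 and 1.
event_map = build_event_map([(12345, [0, 1]), (67890, [0, 1])])
```

With a table like this the repeated nexus ids stay recoverable, while the merged file is indexed by an id that does not repeat.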

@gonzaponte
Collaborator

> In general, how to deal with merging files with repeated event_id. This issue is certainly present when mixing several MC productions and we still don't have a good solution for it.

How about adding a sub_event_id column? We would need to add some stuff to deal with MC and data separately and transparently, but it sounds feasible.

@jmalbos
Collaborator

jmalbos commented Apr 6, 2020

> While dealing with #693 we ran into the problem of merging the configuration-table information from several nexus files. It seems to make sense to keep the configuration information (such as the Geometry/Physics used to generate the files) unique across all concatenated files. However, random_seed is per-file information that needs to be saved, and we are not sure what the best way to save it is. Maybe the best option is to have another table that matches event numbers to random seeds?

There may be cases (e.g. mixing of background events from different sources) in which not only the random seed but also other configuration parameters could be different. It may be simpler to copy all the configuration tables, labelling them with a subrun tag (or something similar).
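
A minimal sketch of the "copy everything, label by subrun" idea, assuming each input file's configuration can be read as a simple key/value mapping. The function name, the `subrun` column and the parameter values are placeholders for illustration, not existing IC code.

```python
def merge_configurations(per_file_configs):
    """per_file_configs: list of dicts {param_key: param_value}, one per input file.

    Returns rows (subrun, param_key, param_value); nothing is dropped or deduplicated.
    """
    merged = []
    for subrun, config in enumerate(per_file_configs):
        for key, value in config.items():
            merged.append((subrun, key, value))
    return merged

# Placeholder configurations: same geometry, different seeds.
configs = [
    {"geometry": "detector_A", "random_seed": "1"},
    {"geometry": "detector_A", "random_seed": "2"},
]
merged_config = merge_configurations(configs)
```

The cost is some redundancy (identical parameters are stored once per subrun), in exchange for not having to decide in advance which parameters may differ between files.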

@andLaing
Collaborator

andLaing commented Apr 7, 2020

I think that this is something that needs to be solved in the medium term but I'd propose a staged approach since we really need to get PR #693 completed so that the integration of the detector simulation code isn't delayed too much.

I think we need to come up with a more general solution (please keep suggesting here) but I'd propose a minimal protection in PR #693 so that the configuration info isn't confusing/clashing (still not that simple, really) and that we deal with it in a more complete way in the PRs related to event splitting etc. I think that the favoured production paradigm needs to remain what we've used up until now in the short term -- processing single files per job.

A patch could be adding file number/name in the param_keys somewhere in the configuration information and adding a check for overlap in the merging of other MC tables.
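
The patch described above could look roughly like the following. This is only a sketch: `tag_config` and `check_no_overlap` are hypothetical helpers, and the `file_name/key` format is an assumption rather than anything defined in PR #693.

```python
def tag_config(config, file_name):
    """Prefix every param_key with the file it came from so merged keys cannot clash."""
    return {f"{file_name}/{key}": value for key, value in config.items()}

def check_no_overlap(existing_ids, new_ids):
    """Raise if any event id in new_ids is already present in existing_ids."""
    overlap = set(existing_ids) & set(new_ids)
    if overlap:
        raise ValueError(f"repeated event ids when merging: {sorted(overlap)}")

tagged = tag_config({"random_seed": "1"}, "file_0.h5")
check_no_overlap([0, 1, 2], [3, 4])   # disjoint ids, passes silently
```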

Thoughts?

@jmalbos
Collaborator

jmalbos commented Apr 7, 2020

> How about adding a sub_event_id column? We would need to add some stuff to deal with MC and data separately and transparently, but it sounds feasible.

This would handle well the event splitting in detsim, and events from the same MC production (which have by construction different event ids) could be processed with no issue as part of the same run.

A possible problem would be the event mixing from different MC productions, e.g. what was being done for the mixing of different background sources (@msorel, @paolafer: are we going to continue doing this?). Possible solutions include:

  • Mixing only events from a NEXUS production in which we've ensured that the event ids are unique.
  • Mixing the events outside IC, renumbering them as needed.
  • Handling in the processing the unique combination of run id, event id and subevent id. This would also allow merging files from different data runs.
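
To make the third option concrete, here is a small sketch of indexing by the combined (run id, event id, subevent id) key. `EventKey` and the dictionary standing in for the event store are illustrative assumptions only.

```python
from typing import NamedTuple

class EventKey(NamedTuple):
    run_id: int
    event_id: int
    subevent_id: int

# A plain dict stands in for whatever table or index the processing would use.
store = {}
store[EventKey(1, 10, 0)] = "first trigger of event 10, run 1"
store[EventKey(1, 10, 1)] = "second trigger of event 10, run 1"
store[EventKey(2, 10, 0)] = "event 10 from a different run"
```

Event id 10 appears three times, yet no entry overwrites another because the full tuple is what has to be unique.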

@msorel
Collaborator

msorel commented Apr 7, 2020

> How about adding a sub_event_id column? We would need to add some stuff to deal with MC and data separately and transparently, but it sounds feasible.
>
> This would handle well the event splitting in detsim, and events from the same MC production (which have by construction different event ids) could be processed with no issue as part of the same run.
>
> A possible problem would be the event mixing from different MC productions, e.g. what was being done for the mixing of different background sources (@msorel, @paolafer: are we going to continue doing this?). Possible solutions include:
>
>   • Mixing only events from a NEXUS production in which we've ensured that the event ids are unique.
>   • Mixing the events outside IC, renumbering them as needed.
>   • Handling in the processing the unique combination of run id, event id and subevent id. This would also allow merging files from different data runs.

We will continue mixing MC events, yes. In case it is relevant for this discussion: until now we have mixed events at the nexus level, and gone through all other processing steps only for the mixed files and not for the single-source files. We want to change this, allowing for mixing files at different stages of processing. We have not decided yet whether to mix post-irene, post-esmeralda or at some other stage.

Concerning your possible ways to tackle this, Justo: renumbering events would be fine (and could be done outside IC, as it still is right now) if information were not dropped from one processing step to another, but only added, so that effectively you would never need to go back to a previous processing step, as you would have all information available at the end. But this is not the case: we drop information. If we renumber events, we cannot relate events in different processing steps anymore. So at first thought I would vote against option 2.

Option 1 should work: some sort of script that runs over nexus output directories, and raises a flag if an event_id is repeated? Option 3 too, I guess, but sounds like it requires some more gymnastics? Beyond these three, perhaps there are more elegant ways to deal with this.
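
A script of the kind suggested for option 1 could be as small as the sketch below. It assumes the event ids of each file have already been read into memory; how they are read from nexus output is left out, and the function name is hypothetical.

```python
from collections import defaultdict

def find_repeated_event_ids(ids_per_file):
    """ids_per_file: dict {file_name: iterable of event ids}.

    Returns {event_id: [file names]} for every id seen more than once.
    A repeat inside a single file lists that file twice.
    """
    seen = defaultdict(list)
    for file_name, event_ids in ids_per_file.items():
        for event_id in event_ids:
            seen[event_id].append(file_name)
    return {eid: names for eid, names in seen.items() if len(names) > 1}

repeated = find_repeated_event_ids({"a.h5": [0, 1, 2], "b.h5": [2, 3]})
```

An empty result means the production is safe to mix; anything else tells you which files clash.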

@paolafer
Collaborator

paolafer commented Apr 7, 2020

> • Mixing only events from a NEXUS production in which we've ensured that the event ids are unique.

This is what we're doing now: we're doing the gymnastics of not having repeated event IDs across the full background production, is that right, @msorel?

About the configuration table when merging files, I agree with Justo that there may be more parameters that differ from one file to another and we may not be able to foresee them now. I understand that if we saved the configuration table of each file, we would need a column somewhere that relates a specific event ID to its correct table. Maybe it would be useful to add another table to the file that deals with all the information we need event by event. This table would also be useful to simplify the reading of event IDs in the MC readers (see #693).

@andLaing
Collaborator

I was thinking about the implementation necessary for the long event splitting and I kept hitting mental or physical blocks. The idea to put a sub-event number made sense but I hit a problem when I got to the output of irene where the pmaps are indexed according to event number. Without adding quite a lot of complexity I couldn't think of a way to get the output non-repeating.

The only other, suboptimal, idea I had was to make the event number a float (either for pmaps and above or in general) and make the sub-event be the first decimal.

@gonzaponte
Collaborator

> when I got to the output of irene where the pmaps are indexed according to event number

What should Irene do: merge subevents into a single event or store each subevent separately?

> Without adding quite a lot of complexity I couldn't think of a way to get the output non-repeating.

If Irene merges subevents into a single one this complexity should go away, I think.

> The only other, suboptimal, idea I had was to make the event number a float (either for pmaps and above or in general) and make the sub-event be the first decimal.

I also thought of that, it is not terrible, but I agree it is not optimal. I don't know if we can also encounter precision problems...
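
The precision concern can be checked with a quick experiment. The sketch below assumes the encoding `event_number + subevent / 10` (one possible reading of "sub-event as the first decimal") and compares 64-bit and 32-bit floats; the helper names are made up for the example.

```python
import struct

def encode(event_number, subevent):
    """Pack (event, subevent) into one float, sub-event as the first decimal (0-9 only)."""
    return event_number + subevent / 10

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

a = encode(20_000_000, 1)
b = encode(20_000_000, 2)

distinct_in_float64 = a != b                           # 64-bit keeps the decimal
distinct_in_float32 = to_float32(a) != to_float32(b)   # 32-bit cannot at this magnitude
```

A 32-bit float has a 24-bit significand, so above 2^24 (about 1.7e7) it cannot even represent every integer, let alone a decimal digit. A 64-bit float would hold up for realistic event numbers, but exact equality comparisons on floats remain fragile.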

@andLaing
Collaborator

andLaing commented Apr 24, 2020

> What should Irene do: merge subevents into a single event or store each subevent separately?

No, nexus simulates all the activity coming from, for example, a muon and records the times at which energy was deposited or sensors recorded photons. In detsim we want to be able to recognise events which would be two or more triggers in the detector and split them accordingly into subevents. These subevents need to be treated as independent entities by the processing, as would be the case in data. We run into difficulties with indexing though.
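
As a toy picture of this kind of splitting, the sketch below groups recorded times into subevents whenever two consecutive times are separated by more than some gap. The gap criterion, the function name and the numbers are assumptions made for illustration; the actual trigger-like logic in detsim may well differ.

```python
def split_into_subevents(times, max_gap):
    """Group times (sorted internally); a gap larger than max_gap starts a new subevent."""
    subevents = []
    for t in sorted(times):
        if subevents and t - subevents[-1][-1] <= max_gap:
            subevents[-1].append(t)    # still within the current subevent
        else:
            subevents.append([t])      # gap too large: open a new subevent
    return subevents

# One nexus event with prompt activity and a much later, separate burst.
subevents = split_into_subevents([0.0, 1.0, 2.5, 5000.0, 5001.0], max_gap=100.0)
```

Both groups come from the same nexus event id, which is exactly where the indexing difficulty appears.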

@andLaing
Collaborator

I started to have a look at a version of IC that could read events with (evt_number, subevt_number). It's a bit fiddly but it might be an ok starting point. Have a look if you can: https://github.com/andLaing/IC/tree/new-run-table

carmenromo pushed a commit that referenced this issue Jun 5, 2020
#722

[author: andLaing]

Adds a new io option to `rwf_io` which uses the basic `rwf_writer` and
other table writers to write all event info in one step. In this way
long MC events can be split into multiple trigger-like events in the
output file and the event numbers can be logged and mapped
accordingly.

Some issues remain for the logging portion that are under debate.

Addresses point 2 of issue #691

[reviewer: jmalbos]

This PR adds a new writer (and the corresponding test) to `rwf_io` that
can handle the splitting of long (MC) events into several subevents. A
new table (`MCEventMap`) is used to associate the new subevents to the
original event.

Nevertheless, this new writer has limited use until a decision is
taken regarding #715.
@andLaing
Collaborator

andLaing commented Oct 21, 2020

Hi everyone. I recently came back to thinking about this issue as I'm starting to hit some walls (semi)related to this in the analysis of cosmogenic backgrounds. The attempt I made to solve the problem (in the previous comment) involved a lot of changes to IC and was quite fiddly. I thought about some possible alternatives; it'd be good to have some comments on them or other suggestions which could be better. The two possible alternatives I came up with yesterday were:

  1. Keep the nexus event number and add a subevent number (basically what I tried above).
     • Pros: Implementation in detsim/bufferization is simple.
     • Cons: Doesn't necessarily solve all possible issues with merging files; requires a lot of underlying changes to IC.

  2. Generate a new event number in detsim/bufferization, structured, for example, as 'run code'0'file number'0'generator code'0'nexus event number'. The run and generator codes could come from some convention or an enum (for generator) or even an IC tag number or production date.
     • Pros: Basically all the complexity is taken by detsim/bufferization; with a bit of work the numbers should be close to unique.
     • Cons: Probably requires changing the MC tables' event number in detsim/bufferization (the event_mapping table could link backwards?); depending on the structure chosen for the number, overlap could still be possible.
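
A literal reading of the structure proposed in option 2 can be sketched as follows. The function name and the example codes are hypothetical; the point is only to show how the fields combine and why overlap is still possible when a field itself contains a zero.

```python
def structured_event_number(run_code, file_number, generator_code, nexus_event):
    """Join the four fields with '0' separators and read the result as an integer."""
    fields = (run_code, file_number, generator_code, nexus_event)
    return int("0".join(str(field) for field in fields))

n1 = structured_event_number(1, 2, 3, 45)     # digits: 1 0 2 0 3 0 45

# Two different field combinations that produce the same digit string.
clash_a = structured_event_number(1, 2, 3, 405)
clash_b = structured_event_number(1, 2, 304, 5)
```

Here `clash_a` and `clash_b` are equal, because a '0' inside a field is indistinguishable from a separator. Fixed-width, zero-padded fields would avoid that particular ambiguity, at the price of fixing a maximum size for each field.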

I currently favour option 2, but that could just be because it's newer and it should cause fewer upstream issues. Please comment @mmkekic, @paolafer, @jmalbos, @gonzaponte, @jjgomezcadenas and all.

@msorel
Collaborator

msorel commented Oct 21, 2020

Hi @andLaing, without having thought too hard about it, it also looks to me like option 2 is better, since code downstream is untouched. Basically anything that falls outside the event time window defined in detsim/bufferization gets a new event id.

I am not sure I understand what you mean by "structured as 000", though. Can you explain?

@andLaing
Collaborator

Sorry, the example didn't render, I've fixed it in the original comment.

carmenromo pushed a commit that referenced this issue Dec 21, 2020
#751

[author: andLaing]

Adds functions to generate unique event numbers for MC to allow for safe
processing of split nexus events and simplify MC event mixing.

In need of more tests and a file number reader but ready for discussion.

Discussion continues from issue #715

[reviewer: mmkekic]

This PR adds an event-splitting functionality keeping unique nexus event numbers
across files by assuming a constant maximum number of splits per event. A new
table that maps IC event number to original MC event number is added to Run
group ensuring the code is compatible with the old formats of MC production. The
code is documented and tested, good job!
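
One way to picture numbering under "a constant maximum number of splits per event" is the arithmetic below. This is a sketch under that stated assumption only: the value of `MAX_SPLITS` and both function names are invented here, and the actual scheme in the PR may differ.

```python
MAX_SPLITS = 10   # assumed upper bound on sub-events per nexus event (hypothetical value)

def ic_event_number(nexus_event, split_index, max_splits=MAX_SPLITS):
    """Reserve a block of max_splits consecutive numbers for each nexus event."""
    if not 0 <= split_index < max_splits:
        raise ValueError("split_index exceeds the assumed maximum number of splits")
    return nexus_event * max_splits + split_index

def original_event(ic_event, max_splits=MAX_SPLITS):
    """Recover (nexus_event, split_index) from an IC event number."""
    return divmod(ic_event, max_splits)

third_split_of_event_7 = ic_event_number(7, 3)
```

Because each nexus event owns its own block of numbers, unique nexus ids give unique split ids, and the mapping back to the original event is a single integer division.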