New tracking city #755
Force-pushed from a4ae48d to 9a8c154
This PR is almost 5 months old and has received no attention whatsoever. It relates to an issue (#749) which solicited the wisdom of: @pnovella @msorel @paolafer @ausonandres @Aretno @jahernando @jjgomezcadenas @gonzaponte

7 tests fail, most of them for fairly basic reasons. Please fix. Then rebase, so that the tests run on our new CI infrastructure (GitHub Actions). Please don't leave PRs with red crosses (failing tests) in the queue: a PR queue full of red crosses encourages people to ignore PRs.
Hi, sorry, it's my fault... I was trying to fix the failing tests, but some other urgent work that I had to attend to came up at that moment. After that, I completely forgot to come back to this PR. I'll do it as soon as I have some time.
When it is ready for review, please explicitly ask someone to review it, using the buttons/links near the top-right of the PR page. If it is not yet ready for review, please mark it as a draft, using the link at the bottom of the reviewers section: 'Still in progress? Convert to draft'.
Force-pushed from 49bea46 to be15740
Force-pushed from be15740 to 5d98b10
event_info = load_dst(path, 'Run', 'events')
for evt in dhits_df.event.unique():
    timestamp = event_info[event_info.evt_number==evt].timestamp.values[0]
    dhits     = hits_from_df(dhits_df.loc[dhits_df.event==evt])
This is not correct, since the DECO hits for a single event can contain multiple peaks. Since no peak information is included in the final track information output, we cannot directly check whether a multiple-track event comes from a real splitting or from a multiple-S2 event. This is not a relevant issue as long as the commonly used 1S2 and 1-track cuts are applied, but it is biasing the efficiencies.

This PR also modifies the search for track extremes in order to make it deterministic, but it does not adapt the tests; this also affects the esmeralda tests and makes the PR more difficult to review. I think these modifications should be a separate PR. I will open a different PR implementing the same functionality of this city, essentially running the paolina algorithm over deconvoluted hits, without modifying the track-extreme search.
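The 1S2 cut mentioned here can be sketched with pandas; the column names `event` and `npeak` are assumptions about the DECO hit-table layout, not the actual IC schema:

```python
import pandas as pd

# toy DECO-hit table: event 1 contains two S2 peaks, event 2 only one
dhits = pd.DataFrame({"event": [1, 1, 1, 2, 2],
                      "npeak": [0, 0, 1, 0, 0],
                      "E"    : [10., 5., 7., 3., 4.]})

# number of distinct peaks per event
peaks_per_event = dhits.groupby("event").npeak.nunique()

# keep only single-S2 events, where track multiplicity is unambiguous
single_s2 = peaks_per_event[peaks_per_event == 1].index
dhits_1s2 = dhits[dhits.event.isin(single_s2)]
```

With this toy input, only event 2 survives the cut.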
Sorry, I don't completely understand your plan. If you want to open a new PR that exclusively adds the city to run paolina over deconvoluted hits, what would you do with this PR? Just ignore it? Would you use it only for solving the extremes issue? There is another thing to take into account: apart from doing both things, this PR also moves some functions from Esmeralda to components.py; would you also add that functionality in your new PR?

In my opinion, the best approach would be to remove everything extremes-related from this PR, and then open the new PR you mentioned to take care of that. The goal of this PR would then remain adding the tracking city.
Yes, as you may see in #798, I just opened another PR, since I didn't aim to undo your changes related to the extremes finder. I didn't move the functions out of esmeralda.py yet, to ease the review, but I intend to do it once the PR is reviewed. The main difference with this PR is that I left the paolina functions unmodified, which in my opinion should be done, if ever, in a different PR. It also changes the source to solve the issue with multipeak events, such that those events are filtered out.
Force-pushed from 08444ee to b70a7ba
@@ -589,6 +595,31 @@ def MC_hits_from_files(files_in : List[str], rate: float) -> Generator:
                          timestamp = timestamp(evt))

def dhits_from_files(paths: List[str]) -> Iterator[Dict[str,Union[HitCollection, pd.DataFrame, MCInfo, int, float]]]:
It seems to me that hits_and_kdst_from_files, cdst_from_files, and now the newly added dhits_from_files are all doing almost the same thing! They all read hits (from different tables) and optionally kdst and summary_df. But do we really need 3 different readers? Can't we just have one reader with the hits table and group name as parameters, which optionally reads kdst and summary_df? @gonzaponte, @Aretno
Yes, all that needs some refactoring, but I would put it in another PR. There is dst_from_files, which is almost what you request, but there are some subtleties in some of those readers that might not be easy to implement in a general reader.
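As a rough illustration of the single parameterized reader being discussed, here is a sketch over in-memory tables; the function name, the `files` structure, and the `DST/Events` key are hypothetical stand-ins for IC's HDF5 readers, not the actual API:

```python
import pandas as pd
from typing import Dict, Iterator, List

# hypothetical unified reader: one function parameterized by the hits
# group/table names, with optional tables read only on demand; real IC
# readers would take file paths and read HDF5 nodes instead of dicts
def generic_hits_reader(files      : List[Dict[str, pd.DataFrame]],
                        hits_group : str,
                        hits_table : str,
                        read_kdst  : bool = False) -> Iterator[Dict[str, pd.DataFrame]]:
    key = f"{hits_group}/{hits_table}"
    for tables in files:
        out = {"hits": tables[key]}
        if read_kdst:                     # optional table, read on request
            out["kdst"] = tables["DST/Events"]
        yield out

# usage over a fake in-memory "file"
fake_file = {"DECO/Events": pd.DataFrame({"event": [1]}),
             "DST/Events" : pd.DataFrame({"nS2"  : [1]})}
out = list(generic_hits_reader([fake_file], "DECO", "Events", read_kdst=True))
```

The subtleties mentioned above (per-reader post-processing, differing schemas) are exactly what this naive sketch glosses over.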
return copy_energy_to_Ep_hit_attribute_

types_dict_summary = OrderedDict({'event' : np.int32 , 'evt_energy' : np.float64, 'evt_charge' : np.float64,
This is a bit cumbersome; maybe it is better to place types_dict_... in ic_types.py?
ok, I agree
write_summary         = fl.sink( summary_writer     (h5out=h5out)                 , args="event_info" )
write_topology_filter = fl.sink( event_filter_writer(h5out, "topology_select"), args=("event_number", "topology_passed"))

if hits_type:
Isn't it better to have write_hits_paolina be None or fl.branch(fl.sink(...)), and then filter the fn_list for Nones, i.e. pipe(*filter(None, fn_list))?
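The suggestion above can be illustrated with a plain-Python sketch; `pipe` here is a minimal stand-in for IC's dataflow `pipe`, and the stage names are made up:

```python
# minimal stand-in for a dataflow pipe: compose functions left to right
def pipe(*fns):
    def run(x):
        for f in fns:
            x = f(x)
        return x
    return run

double             = lambda x: 2 * x
write_hits_paolina = None                  # optional stage, disabled here
increment          = lambda x: x + 1

fn_list = (double, write_hits_paolina, increment)
run     = pipe(*filter(None, fn_list))     # drop the disabled (None) stages
```

Disabled stages simply vanish from the pipeline, so the remaining code needs no special-casing.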
We could also put this writer definition in esmeralda and pass it as an argument to this function, if we are sure we won't use it in other cities?
Ok, I'll try that. Regarding your second comment, my intention here was to generalize this part of the function to allow writing the deconvoluted hits in case this tracking city is run before Esmeralda's hit correction (maybe in the future). But I'll change it if you want.
write_topology_filter = fl.sink( event_filter_writer(h5out, "topology_select"), args=("event_number", "topology_passed"))

if hits_type:
    group_table = hits_type.split("/")[0]
hits_type is a very misleading name; isn't it better to simply have 2 parameters directly? It would also be less error prone, since you won't depend on the user passing a string with this explicit pattern.
As we have discussed offline, the possibility of saving the deconvoluted hits doesn't make a lot of sense, so I have removed the option to change the name of the table. Now there is just a flag called save_paolina_hits, which is True if the function is executed inside Esmeralda (saving the hits in the table CHITS/highTh) and False if the function is run inside Isaura.
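The two alternatives discussed here (one packed "group/table" string versus explicit parameters) can be contrasted in a small sketch; the function names are hypothetical:

```python
# fragile: both names packed into one string, relying on a "group/table"
# convention the caller has to know about
def table_from_string(hits_type: str):
    group, table = hits_type.split("/")   # ValueError if the "/" is missing
    return group, table

# safer: the signature itself documents what is required
def table_from_params(group: str, table: str):
    return group, table
```

With explicit parameters there is no string convention to get wrong, and typos surface as ordinary argument errors.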
                                   args="paolina_hits" )
else: write_hits_paolina = fl.sink(lambda _: None)

fn_list = (create_extract_track_blob_info ,
You should add the E or Ec to Ep copying here; it would simply need a hit_type as input and could be used in both cities. You should also add the hits filter (and filter writer) here, since it is also shared between the cities.
invisible_cities/io/hits_io.py
Outdated
Cluster(row.Q, xy(row.X, row.Y), xy(row.Xrms**2, row.Yrms**2),
        row.nsipm, row.Z, row.E,
        Qc = getattr(row, 'Qc', -1)),  # for backwards compatibility
Cluster(getattr(row, 'Q', row.E), xy(row.X, row.Y),
This is becoming a bit silly; we are missing half of the variables needed to construct our HitCollection type...
It is also not true in the case of Xrms and Yrms squared: here you are setting them to 0.
I am not saying you should deal with all the event model problems in this PR, but maybe do something a bit cleaner, like

for i, row in df.iterrows():
    Q = getattr(row, 'Q', -1)
    if skip_NN and Q == NN:
        continue
    if hasattr(row, 'Xrms'):
        Xrms  = row.Xrms
        Xrms2 = Xrms**2
    else:
        Xrms = Xrms2 = -1
    if hasattr(row, 'Yrms'):
        ....
    nsipm = getattr...
    Qc    = ...

This is still very ugly, but it avoids calling getattr for parameters that are used more than once, and it also makes it a bit clearer where we are setting this default -1 value.
Ok, I'll modify this to match your suggestion. It's completely true that this should be just a provisional solution: as you say, in the near future the event model must be revisited and changed in order to avoid these issues...
Ok, thanks, I've changed the function, trying to make it a bit clearer. Yes, I obviously agree with you that it makes no sense to use this HitCollection here; however, I'd say that issue is not related to this PR (since it's meant to adapt the tracking algorithm to the current code), so it should be mandatory to revisit the event_model in the near future.
Force-pushed from 60c82f7 to 52a03e0
Force-pushed from 52a03e0 to 9ad4059
Force-pushed from 86cd1f0 to dc00c6c
@@ -1002,3 +1037,267 @@ def calculate_and_save_buffers(buffer_length : float ,
    # Filter out order_sensors if it is not set
    buffer_definition = pipe(*filter(None, find_signal_and_write_buffers))
    return buffer_definition

def copy_E_or_Ec_to_Ep(energy_type: HitEnergy):
As @gonzaponte suggests, to keep our naming convention this function can be called Efield_copier and the inner one copy_Efield (this latter name can also be used in the naming of the pipe elements).
It sounds good, I've changed that.
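For illustration, the closure pattern with the suggested names might look like the following sketch; the dict-based hit model is a simplifying assumption, not IC's event model:

```python
# an outer factory `Efield_copier` returns an inner `copy_Efield` that
# copies the chosen energy field ("E" or "Ec") into "Ep" on every hit;
# hits are modeled as plain dicts here for brevity
def Efield_copier(energy_type: str):
    def copy_Efield(hits):
        for hit in hits:
            hit["Ep"] = hit[energy_type]
        return hits
    return copy_Efield
```

The factory is configured once when the pipeline is assembled, and the inner function is what gets mapped over events.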
Force-pushed from dc00c6c to c269a95
invisible_cities/cities/esmeralda.py
Outdated
@@ -429,37 +190,16 @@ def esmeralda(files_in, file_out, compression, event_range, print_mod,

 threshold_and_correct_hits_high = fl.map(hits_threshold_and_corrector(threshold_charge=cor_hits_params['threshold_charge_high'], **cor_hits_params_),
                                          args = 'hits',
-                                         out  = 'cor_high_th_hits')
+                                         out  = 'hits')
You can use item = 'hits'
Good to know, I didn't know about that option.
Force-pushed from 00adf17 to 1052d36
Force-pushed from 1052d36 to 7421845
Force-pushed from 4785e15 to 3e8742e
This PR adds a new city, Isaura, that applies the paolina tracking functions to the output of Beersheba. The functions shared between Esmeralda and Isaura are extracted into a single pipe, so no code is duplicated. The Esmeralda tests pass as before, and new tests are added for the Isaura output.
This city is a temporary step until we decide on the final reconstruction + analysis order (as discussed in issue #749).
Force-pushed from 3e8742e to 7dd0a19
According to issue #749, the aim of this PR is to add a city (to be run just after Beersheba, for the moment) that computes all the tracking information. Since the tracking-related functions will be shared between Esmeralda and this new city (so-called Isaura, credits to @jjgomezcadenas), they will be moved to other places (components.py, in most cases).

There are two issues that I faced while writing the city. The first one was already pointed out in #749, and it is related to the kdst information. It would be interesting for the output of this tracking city to contain kdst information (the nS2 variable is used for the posterior analysis); however, the current Beersheba version doesn't save this table. Therefore, for the time being, we can access this information only by opening the cdst files, although I believe that in the near future we should add this table in Beersheba and then in Isaura.

The second issue concerns the tracking algorithm. As you may know, it seems that Paolina, and particularly the reco/paolina_functions/find_extrema() function, is not deterministic. This is because in some events there is more than one possible pair of extreme voxels for a given track, and it appears that in those cases they are chosen randomly. The solution that I propose (and add here, in order to check that the output is correct, even though maybe this is not the place to do it) consists in saving all the possibilities, and then choosing the extremes whose summed energy takes the maximum value (in order to avoid, if possible, the usage of the paolina_functions/drop_end_point_voxels() function). There would be other solutions, but the choice of one or another will be a bit arbitrary in the end, in my opinion.
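The proposed deterministic tie-break (among candidate extreme pairs, keep the one with the largest summed energy) can be sketched as follows; the (label, energy) voxel representation and the function name are illustrative, not paolina's actual API:

```python
# voxels modeled as (label, energy) tuples; `candidates` holds the pairs
# of extreme voxels that are equally far apart, among which the original
# find_extrema would otherwise pick arbitrarily
def pick_extremes(candidates):
    # deterministic tie-break: take the pair with the largest summed energy
    return max(candidates, key=lambda pair: pair[0][1] + pair[1][1])

candidates = [(("v1", 1.0), ("v2", 2.0)),   # summed energy 3.0
              (("v3", 2.5), ("v4", 1.0))]   # summed energy 3.5
```

`max` always returns the first maximal element, so the result is reproducible run to run even when two pairs tie in energy.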