
Add NEXT100 Activity/Efficiency database #806

Merged · 4 commits merged into next-exp:master from activities_db on Nov 2, 2021

Conversation

@gondiaz (Collaborator) commented Oct 22, 2021

This PR adds a new database to IC, along the same lines as the existing detector databases.

The new database is called NEXT100ACTDB; a dummy version has already been placed on the MySQL server.

The database contains two tables, Activity and Efficiency, with the same column names. It will likely be updated in the near future; its purpose is to serve as input to the MC event mixer, which will be developed in subsequent PRs.

@jmbenlloch (Contributor) left a review comment


This PR adds a new database with the activity assumptions.

@jmbenlloch (Contributor)

Please DO NOT merge yet. After approving, I noticed that the automated tests related to the database are failing. I need to check what is happening.

@gonzaponte (Collaborator)

A few comments:

  • I think the tables shouldn't go into their own database, but rather inside the existing NEXT100DB/NEXT100DB database.
  • The two tables can be merged into one. I think it's a good idea because they are somewhat interconnected, but I don't have a strong opinion on this, particularly if it facilitates their usage in the code.
  • We need to think about and decide on the best way to version-control these tables; at the very least there should be a column that allows selecting different iterations of these values. The values may well be updated in the future and we don't want to remove the old ones, so an int column with a version number may be enough. Perhaps it is better to associate it with a run number or a timestamp, I don't know. It needs to be discussed with the relevant people, probably @msorel, @ausonandres and whoever is involved in producing and using these numbers.
  • How did you add the data to the lfs file? There is no download function, which needs to be added.

@gondiaz (Collaborator, Author) commented Oct 25, 2021

* How did you add the data to the lfs file? There is no download function, which needs to be added.

Not sure what you mean; the download.py script is modified to download this new database using the existing loadDB function.

@gonzaponte (Collaborator)

Yes, my bad, too little coffee in my system.

@gondiaz (Collaborator, Author) commented Oct 25, 2021

* I think the tables shouldn't go into their own database, but rather inside the existing NEXT100DB/NEXT100DB database.

* The two tables can be merged into one. I think it's a good idea because they are somewhat interconnected, but I don't have a strong opinion on this, particularly if it facilitates their usage in the code.

* We need to think about and decide on the best way to version-control these tables; at the very least there should be a column that allows selecting different iterations of these values. The values may well be updated in the future and we don't want to remove the old ones, so an int column with a version number may be enough. Perhaps it is better to associate it with a run number or a timestamp, I don't know. It needs to be discussed with the relevant people, probably @msorel, @ausonandres and whoever is involved in producing and using these numbers.

I agree that we should add a version number column, and we can think about moving the database to NEXT100DB. But whatever we decide does not affect this PR.

@gonzaponte (Collaborator)

But whatever we decide does not affect this PR.

Well, not really. It would certainly affect the download program, because the tables would then live in a different database, but more importantly you would need to modify this code again once we agree on the structure of the table, so why not do that already?

@gondiaz (Collaborator, Author) commented Oct 25, 2021

I would say that the database will be composed of two tables, Activity and Efficiency, and that won't change.

If I understand your point, what is still to be decided is:

  1. if we add a version number column to the tables (does not affect the PR)
  2. if we move NEXT100ACTDB into the NEXT100DB folder.

If I am not wrong, the second point does not affect the PR either, since NEXT100DB/NEXT100DB is loaded independently of being inside NEXT100DB folder.

@jmbenlloch (Contributor)

2. if we move NEXT100ACTDB into NEXT100DB folder.

If I am not wrong, the second point does not affect the PR either, since NEXT100DB/NEXT100DB is loaded independently of being inside NEXT100DB folder.

Regarding that point, let me clarify that those "folders" do not exist. You are getting that impression from looking at the databases in the web interface (phpMyAdmin): there, databases that start with a common name are grouped for readability, but such groups are completely fictitious.

You can have your tables either in NEXT100ACTDB or in NEXT100DB; that's it. There are no more levels in the structure, and each database is completely independent of the others.

@gondiaz (Collaborator, Author) commented Oct 25, 2021

1. if we add a version number in the tables (does not affect the PR)

Sorry, in fact this would affect the reader.

@gonzaponte (Collaborator)

As for 1), it doesn't necessarily need to affect this PR, but since you added a function to load the data within IC, I think it makes sense to fix it here already. And as I was typing this, you replied saying exactly that :)
Regarding 2), if you move the tables into NEXT100DB, the code used to download the data would change slightly. In that case, you would have something like

# one fresh table list per database, so that extending one entry does not modify the others
table_dict = {name: list(common_tables) for name in dbnames}
table_dict["NEXT100DB"].extend(["Activity", "Efficiency"])

rather than adding another db with just these two tables to be read.

@gondiaz (Collaborator, Author) commented Oct 25, 2021

Ok, I see. Let me state my opinion for the final implementation:

  1. Use NEXT100DB instead of NEXT100ACTDB; as @gonzaponte commented, the change would be minimal.
  2. Add a version number column for tagging. I don't think many version changes will be made, so a single column could be enough, instead of the (minVersion, maxVersion) pair used in the detector DBs to save space. The latter could be a bit painful for the person in charge of producing the csv file, since they will produce it from an Excel-like spreadsheet.

Could you give your opinions, @gonzaponte @jmbenlloch @msorel?

@gonzaponte (Collaborator)

The actual choice of a version number vs. minrun/maxrun vs. any other variant should, I think, be made by those who have used these data for various studies in the past. They should have a better understanding of how these tables will be updated and a more informed, relevant opinion on the best way to go.

@msorel (Collaborator) commented Oct 26, 2021

Hi, thanks for moving this forward. Some comments from my side:

  • Where to put activity-related tables: OK with me to put them in NEXT100DB. These tables are detector-dependent, and in principle they should have the same format for other detectors (e.g. NEXT-HD in the future), so this would be a good scheme.
  • How many activity-related tables: I would certainly split activity-related from efficiency-related tables, because they are normally updated at different moments: the latter require a new MC production, while the former can be updated whenever new screening measurements arrive or our understanding of the geometry is updated. In fact, you may consider splitting into three tables: i) a specific-activity table, with screening measurements; ii) a quantities table, with detector geometry information; iii) an efficiency table, with background acceptance factors from simulations.
  • How many rows: as discussed offline with @gondiaz , there should be one row per Geant4 volume. Each Geant4 volume can have activity contributions from many sub-components
  • How to tag: certainly not RunMin/RunMax, since this has nothing to do with changes in detector response (such as sensor calibration), but with improvements in our understanding at the analysis level, applicable to all runs. A single version number, as we have now for the activity assumptions in docdb, would be sufficient. I mean a single tag for all 2-3 tables we are proposing, and for all rows (each G4 volume) in each table. So in my opinion this does not marry well with the idea of adding a version column to each table; it is a global tag for the entire table.

@jmbenlloch (Contributor)

Where to put activity-related tables: OK with me to put it in NEXT100DB. These tables are detector-dependent, and in principle they should have the same format for other detectors (eg, NEXT-HD in the future). So, this would be a good scheme

I agree on that. It makes more sense to have all the information related to a particular detector in the same database.

How many activity-related tables: certainly I would split activity- from efficiency-related tables. The reason is that these are normally updated in different moments. The latter needs a new MC production made. The former can be updated when new screening measurements or a geometry understanding update is made.

I also agree on this. It's better to have different tables.

In fact, you may consider splitting into three tables: i) specific activity table, with screening measurements, ii) quantities table, with detector geometry information, iii) efficiency table, with background acceptance factors from simulations

This scheme would mirror the tables we have been using in the spreadsheet. I think the advantage is that all the information would be in the DB, but it may require more development on the software side. It's up to the developers/users to decide which scheme fits your needs better.

How to tag: certainly not RunMin/RunMax, since this has nothing to do with changes in detector response (as sensor calibration), but with improvements in our understanding at analysis level, applicable to all runs. A single version number as we have now for the activity assumptions in docdb would be sufficient.

Completely agree on that.

I mean a single tag for all 2-3 tables we are proposing, and for all rows (each G4 volume) in this table. So this does not marry well with the idea of adding a table column in each table in my opinion. It is a global tag for the entire table.

This is not that simple to do. You can basically do two things:

  1. Add an extra column with the version number for each row. The reading function would filter the relevant rows when you read them.
  2. Add a suffix or something to the table name and create a new table for each new version. The software would be able to construct the corresponding table name for each version. Disadvantage: table proliferation? At some point there might be too many tables, but I think it shouldn't create any technical problem.

Again, it's up to you which option suits your needs better.

@gonzaponte (Collaborator)

Thanks, @msorel. If you think three tables is more flexible, that might be the preferred solution. Regarding the versioning of the tables, I don't think I understand it well: you first said that these 2-3 tables are updated independently, so they would have different version numbers, but then you said we need to tag them together.

Technically, a single version number can be achieved in practice in a couple of different ways. First, use the version column, which is not a big deal; we are not talking about huge tables anyway. Then you can do one of two things:

  1. Repeat the values in the other tables with a new version number every time one of them is updated (a bit of a waste of space, but not a big deal), or
  2. Update only the table that changed, and make the selection query pick the latest version number <= the one requested.

I think the second one might be a good way to go, but I don't have that much experience with databases, so it might have some caveats.
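
A minimal pandas sketch of the second option at the table level, assuming the Version column discussed above (the function name is hypothetical):

import pandas as pd

def pick_table_version(table: pd.DataFrame, requested: int) -> pd.DataFrame:
    # keep the rows of the most recent version that is not newer than the requested one
    available = table.Version[table.Version <= requested]
    return table[table.Version == available.max()]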

@msorel (Collaborator) commented Oct 26, 2021

I mean a single tag for all 2-3 tables we are proposing, and for all rows (each G4 volume) in this table. So this does not marry well with the idea of adding a table column in each table in my opinion. It is a global tag for the entire table.

This is not that simple to do. You can basically do two things:

  1. Add an extra column with the version number for each row. The reading function would filter the relevant rows when you read them.
  2. Add a suffix or something to the table name and create a new table for each new version. The software would be able to construct the corresponding table name for each version. Disadvantage: table proliferation? At some point there might be too many tables, but I think it shouldn't create any technical problem.

Again, up to you which version suits better your needs.

Option 1 looks less ugly to me, so I would go with that. The operation could then work like this, if I understand correctly:

  • Say each table has n (a few tens of) rows, one for each (Geant4 volume, isotope) combination. The initial import has v1 as the version number for each of the n rows.
  • We need to update some rows, not necessarily all. We can update only those, giving them version number v2.
  • The algorithm then loops over all unique (G4 volume, isotope) combinations, picking up the latest version (v2) if it exists, otherwise the most recent version that does exist for that (volume, isotope) combination (v1, in this case). There is also the option of duplicating all rows every time we update some, but this is probably not needed.
  • If some time later we need to make a second change to a row that was not updated in the earlier change, we tag it with v3, hence changing that row from v1 to v3, skipping v2.
  • The largest version number in the table at any moment, in this case v3, corresponds to the global tag for the table

What do you think?
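
A minimal pandas sketch of the per-row selection described above (the Version, G4Volume and Isotope column names are assumptions taken from the discussion; the function name is hypothetical):

import pandas as pd

def select_rows(table: pd.DataFrame, requested: int) -> pd.DataFrame:
    # for each (G4Volume, Isotope) pair, keep the row with the highest
    # Version that does not exceed the requested one
    table = table[table.Version <= requested]
    idx   = table.groupby(["G4Volume", "Isotope"]).Version.idxmax()
    return table.loc[idx]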

@jmbenlloch (Contributor)

What do you think?

The solution you've described is the one I'd implement. Everything should work properly with that workflow, and it avoids data redundancy.

@msorel (Collaborator) commented Oct 26, 2021

Hi @gonzaponte @jmbenlloch, looks like we agree on how to version-control individual tables, then!

I have a slight preference for separating into three tables, yes, if @gondiaz @MiryamMV are fine with that.

You are right, @gonzaponte: the scheme I have proposed does not necessarily ensure the same version number for all 2-3 tables we are talking about (table version = highest version found when looping over the table rows). But thinking more about it, ensuring this is probably not needed. If the specific-activity, quantities and efficiency tables have different version numbers, that is probably OK.

@gondiaz (Collaborator, Author) commented Oct 26, 2021

Talking with @gonzaponte offline, we propose to have a single version number: the version number of the activity assumptions. An example might make this clearer:

  • v0: start with three tables, then save (v0, v0, v0)
  • v1: modify table 1, then save (v0, v1, v0)
  • v2: modify tables 1 and 2, then save (v0, v2, v2)
  • v3: modify table 0 , then save (v3, v2, v2)

Then, if we request for example activity-assumptions version v2, we will get (v0, v2, v2), or (v3, v2, v2) if we request the latest version, v3.

Since it is not necessary for the event mixer, I would skip the specific-activities table for the moment. It implies extra work, since its volumes are not necessarily the ones in the G4Volume list.

Do you all agree?
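
A small sketch of how the example above maps a requested global version onto per-table versions (the table names are just the indices used in the example):

# versions at which each table in the example was updated
history = {"table0": [0, 3], "table1": [0, 1, 2], "table2": [0, 2]}

def resolve(requested):
    # for each table, pick the most recent update not newer than the requested version
    return {table: max(v for v in versions if v <= requested)
            for table, versions in history.items()}

print(resolve(2))  # {'table0': 0, 'table1': 2, 'table2': 2}
print(resolve(3))  # {'table0': 3, 'table1': 2, 'table2': 2}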

@msorel (Collaborator) commented Oct 26, 2021

Hi @gondiaz, sounds good to me, if we want to keep a global version number encompassing all tables. However, I understand that we now have only two tables and not three, right?

Another comment: I checked the temporary table structure you put in MySQL. You may consider switching to a more flexible structure with four columns in each table.

Table Activity, with total activities (in mBq):

  • Version: int
  • G4Volume: string
  • Isotope: string (Bi214, Tl208, Co60, K40 -> please do not drop mass number)
  • TotalActivity: float

Table Efficiency, with dimensionless efficiencies:

  • Version: int
  • G4Volume: string
  • Isotope: string
  • Efficiency: float

so the first three columns are common. This structure is a little more flexible: if tomorrow we want to add a background isotope other than the four considered thus far, we won't need to change the structure of the tables.
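
A minimal sqlite3 sketch of the proposed schemas (the SQL types and the file name are assumptions, not the final IC definition):

import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS Activity (
    Version       INTEGER,
    G4Volume      TEXT,
    Isotope       TEXT,  -- e.g. Bi214, Tl208, Co60, K40
    TotalActivity REAL   -- mBq
);
CREATE TABLE IF NOT EXISTS Efficiency (
    Version    INTEGER,
    G4Volume   TEXT,
    Isotope    TEXT,
    Efficiency REAL      -- dimensionless
);
"""

with sqlite3.connect("localdb.NEXT100DB.sqlite3") as conn:
    conn.executescript(schema)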

@gondiaz (Collaborator, Author) commented Oct 26, 2021

Yes, let's start with the two tables and add the third one, which is not used directly by the mixer, in the future.

I agree with that: this structure is better suited to adding more backgrounds or components in future versions, although in my opinion it is slightly less convenient to work with in pandas; but that can be managed.
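
For reference, a pandas sketch of how the long format can be reshaped into a wide view when that is more convenient (the volume names and values are purely illustrative):

import pandas as pd

activity = pd.DataFrame({
    "Version"      : [0, 0, 0, 0],
    "G4Volume"     : ["VESSEL", "VESSEL", "FIELD_CAGE", "FIELD_CAGE"],
    "Isotope"      : ["Bi214", "Tl208", "Bi214", "Tl208"],
    "TotalActivity": [1.0, 0.5, 0.3, 0.1]})

# one row per volume, one column per isotope
wide = activity.pivot(index="G4Volume", columns="Isotope", values="TotalActivity")
print(wide)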

@MiryamMV (Collaborator)

Then, if we request for example activity-assumptions version v2, we will get (v0, v2, v2), or (v3, v2, v2) if we request the latest version, v3.

I agree with everything, but what's the purpose of keeping (v0, v2, v2) and (v3, v2, v2) at the same time?

@gondiaz changed the title from "Add NEXT100ACTDB" to "Add NEXT100 Activity/Efficiency database" on Oct 26, 2021
@gondiaz (Collaborator, Author) commented Oct 27, 2021

The test processing is failing at the new test function I have added. I think this is because the new Activity table in the database has not been updated on the test server. Could you take a look, @jmbenlloch? The PR should then be ready to review (again).

@jmbenlloch (Contributor)

The test processing is failing at the new test function I have added. I think this is because the new Activity table in the database has not been updated on the test server. Could you take a look, @jmbenlloch? The PR should then be ready to review (again).

You'll have to commit the localdb.NEXT100DB.sqlite3 file with the new tables you have created.
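
A hedged sketch of one way the local file could be refreshed from csv exports of the new tables (the csv file names are hypothetical, and the actual IC tooling may differ):

import sqlite3
import pandas as pd

with sqlite3.connect("localdb.NEXT100DB.sqlite3") as conn:
    for table, csv in [("Activity", "activity.csv"), ("Efficiency", "efficiency.csv")]:
        pd.read_csv(csv).to_sql(table, conn, if_exists="replace", index=False)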

@gondiaz (Collaborator, Author) commented Oct 27, 2021

Absolutely, my bad. The failing test should pass now.

@gonzaponte (Collaborator) left a review comment


First round. Please run pyflakes after you address these comments to check for unused variables.

@gonzaponte (Collaborator)

Side note:
I've noticed that some of the G4Volumes contain whitespace. Is this something we can avoid, or is it too late?

@msorel (Collaborator) commented Oct 27, 2021

Side note:
I've noticed that some of the G4Volumes contain whitespaces, is this something we can avoid, or is it too late?

Unless this has changed in nexus recently, all those whitespaces should be underscores. I think we are still in time to enforce underscores and avoid whitespace in the NEXT-100 geometry. Luckily, @gondiaz is precisely our man for that!

@gondiaz (Collaborator, Author) commented Oct 27, 2021

In fact, G4Volume names do not contain whitespace in nexus. The reason they appear that way in the database is that I took the data from an outdated spreadsheet, which did not have the correct G4Volume names from the latest nexus revision.
Both the Activity and Efficiency table values are dummy for now; I filled them that way to test the procedure for uploading values to the database.

Update NEXT100DB

Update NEXT100DB

Update NEXT100DB
@gondiaz (Collaborator, Author) commented Oct 30, 2021

Just some squashing and rewording.

Rewrite activity reader to be more flexible

Alignment
Rewrite database test

Fix typo in test

Rename df in test

Co-authored-by: Gonzalo <gonzaponte@gmail.com>

Add test for Efficiency table

Add test for version number request

Parametrize version in test

Avoid code repetition in test

Co-authored-by: Gonzalo <gonzaponte@gmail.com>

Improve readability
@gonzaponte (Collaborator) left a review comment


This PR adds new tables to the database and implements the functionality to load them within IC. It also provides a couple of tests to demonstrate and check the functionality.

Great job.

@MiryamMV merged commit 2e8bfc1 into next-exp:master on Nov 2, 2021
@gondiaz deleted the activities_db branch on November 2, 2021 at 17:34
MiryamMV added a commit that referenced this pull request on May 5, 2022 (#815):

[author: gondiaz]

Updates the activity and efficiency columns in the NEXT100 database with the latest activity values from measurements and the efficiencies computed through nexus simulations. Efficiencies assume that at least 2 MeV is deposited in the active volume.
Recall that the previous values were dummy, in order to test PR #806.

[reviewer: MiryamMV]

This PR adds the correct values of activities (v3) and efficiencies to
the database for NEXT100.