
Add NEXT100 Activity/Efficiency database #806

Merged · 4 commits merged into next-exp:master from activities_db on Nov 2, 2021

Conversation

@gondiaz (Collaborator) commented Oct 22, 2021

This PR adds a new database to IC, along the same lines as the existing detector databases.

The new database is called NEXT100ACTDB; a dummy version has already been placed on the MySQL server.

The database contains two tables, Activity and Efficiency, with the same column names. It will likely be updated in the near future; its purpose is to serve as input to the MC event mixer, which will be developed in subsequent PRs.

@jmbenlloch (Contributor) left a review comment


This PR adds a new database with the activity assumptions.

@jmbenlloch (Contributor)

Please DO NOT merge yet. After approving, I noticed that the automated tests related to the database are failing. I need to check what is happening.

@gonzaponte (Collaborator)

A few comments:

  • I think the tables shouldn't go into their own database, but rather inside the existing NEXT100DB/NEXT100DB database.
  • The two tables can be merged into one. I think it's a good idea because they are somewhat interconnected, but I don't have a strong opinion on this, particularly if it facilitates their usage in the code.
  • We need to think about and decide on the best way to version-control these tables; at the very least there should be a column that allows selecting different iterations of these values. The values may well be updated in the future and we don't want to remove the old ones, so an int column with a version number may be enough. Perhaps it is better to associate it with a run number or a timestamp, I don't know. It needs to be discussed with the relevant people, probably @msorel, @ausonandres and whoever is involved in producing and using these numbers.
  • How did you add the data to the lfs file? There is no download function, which needs to be added.

@gondiaz (Collaborator, Author) commented Oct 25, 2021

* How did you add the data to the lfs file? There is no download function, which needs to be added.

Not sure what you mean; the download.py script is modified to download this new database using the existing loadDB function.

@gonzaponte (Collaborator)

Yes, my bad, too little coffee in my system.

@gondiaz (Collaborator, Author) commented Oct 25, 2021

* I think the tables shouldn't go into their own database, but rather inside the existing NEXT100DB/NEXT100DB database.

* The two tables can be merged into one. I think it's a good idea because they are somewhat interconnected, but I don't have a strong opinion on this, particularly if it facilitates their usage in the code.

* We need to think about and decide on the best way to version-control these tables; at the very least there should be a column that allows selecting different iterations of these values. The values may well be updated in the future and we don't want to remove the old ones, so an int column with a version number may be enough. Perhaps it is better to associate it with a run number or a timestamp, I don't know. It needs to be discussed with the relevant people, probably @msorel, @ausonandres and whoever is involved in producing and using these numbers.

I agree that we should add a version number column, and we can think about moving the database to NEXT100DB. But whatever we decide does not affect this PR.

@gonzaponte (Collaborator)

But whatever we decide does not affect this PR.

Well, not really. It would certainly affect the download program, because the tables would then live in a different database, but more importantly you would need to modify this code again once we agree on the structure of the table, so why not do that already?

@gondiaz (Collaborator, Author) commented Oct 25, 2021

I would say that the database will be composed of two tables, Activity and Efficiency, and that won't change.

If I understand your point, what is still to be decided is:

  1. if we add a version number column to the tables (does not affect the PR)
  2. if we move NEXT100ACTDB into the NEXT100DB folder.

If I am not wrong, the second point does not affect the PR either, since NEXT100DB/NEXT100DB is loaded independently of being inside NEXT100DB folder.

@jmbenlloch (Contributor)

2. if we move NEXT100ACTDB into NEXT100DB folder.

If I am not wrong, the second point does not affect the PR either, since NEXT100DB/NEXT100DB is loaded independently of being inside NEXT100DB folder.

Regarding that point, let me clarify that those "folders" do not exist. You are getting that impression from looking at the databases in the web interface (phpMyAdmin): there, databases that start with a common name are grouped for readability, but such groups are completely fictitious.

You can have your tables either in NEXT100ACTDB or in NEXT100DB; that's it. There are no more levels in the structure, and each database is completely independent of the others.

@gondiaz (Collaborator, Author) commented Oct 25, 2021

1. if we add a version number in the tables (does not affect the PR)

Sorry, in fact this would affect the reader.

@gonzaponte (Collaborator)

As for 1), it doesn't necessarily need to affect this PR, but since you added a function to load the data within IC, I think it makes sense to fix it here already. And as I was typing this, you replied saying exactly that :)
Regarding 2), if you move the tables into NEXT100DB, the code used to download the data would change slightly. In that case, you would have something like

# one fresh table list per database, so that extending one entry does not modify the others
table_dict = {name: list(common_tables) for name in dbnames}
table_dict["NEXT100DB"].extend(["Activity", "Efficiency"])

rather than adding another db with just these two tables to be read.

@gondiaz (Collaborator, Author) commented Oct 25, 2021

Ok, I see. Let me state my opinion for the final implementation:

  1. Use NEXT100DB instead of NEXT100ACTDB; as @gonzaponte commented, the change would be minimal.
  2. Add a version number column for tagging. I don't think many version changes will be made, so a single column could be enough, instead of the (minVersion, maxVersion) pair used in the detector DBs to save space. The latter could be a bit painful for the person in charge of producing the csv file, since they will produce it from an Excel-like spreadsheet.

Could you give your opinions, @gonzaponte @jmbenlloch @msorel?

@gonzaponte (Collaborator)

The actual choice of a version number vs. minrun/maxrun vs. any other variant should, I think, be made by those who have used these data for various studies in the past. They should have a better understanding of how these tables will be updated and a more informed, relevant opinion on the best way to go.

@msorel (Collaborator) commented Oct 26, 2021

Hi, thanks for moving this forward. Some comments from my side:

  • Where to put activity-related tables: OK with me to put them in NEXT100DB. These tables are detector-dependent, and in principle they should have the same format for other detectors (e.g. NEXT-HD in the future), so this would be a good scheme.
  • How many activity-related tables: I would certainly split activity-related from efficiency-related tables, because they are normally updated at different moments: the latter require a new MC production, while the former can be updated whenever new screening measurements arrive or our understanding of the geometry is updated. In fact, you may consider splitting into three tables: i) a specific-activity table, with screening measurements; ii) a quantities table, with detector geometry information; iii) an efficiency table, with background acceptance factors from simulations.
  • How many rows: as discussed offline with @gondiaz , there should be one row per Geant4 volume. Each Geant4 volume can have activity contributions from many sub-components
  • How to tag: certainly not RunMin/RunMax, since this has nothing to do with changes in detector response (such as sensor calibration), but with improvements in our understanding at the analysis level, applicable to all runs. A single version number, as we have now for the activity assumptions in docdb, would be sufficient. I mean a single tag for all 2-3 tables we are proposing, and for all rows (each G4 volume) in each table. So in my opinion this does not marry well with the idea of adding a version column to each table; it is a global tag for the entire table.

@jmbenlloch (Contributor)

Where to put activity-related tables: OK with me to put it in NEXT100DB. These tables are detector-dependent, and in principle they should have the same format for other detectors (eg, NEXT-HD in the future). So, this would be a good scheme

I agree on that. It makes more sense to have all the information related to a particular detector in the same database.

How many activity-related tables: certainly I would split activity- from efficiency-related tables. The reason is that these are normally updated in different moments. The latter needs a new MC production made. The former can be updated when new screening measurements or a geometry understanding update is made.

I also agree on this. It's better to have different tables.

In fact, you may consider splitting into three tables: i) specific activity table, with screening measurements, ii) quantities table, with detector geometry information, iii) efficiency table, with background acceptance factors from simulations

This scheme would mirror the tables we have been using in the spreadsheet. I think the advantage is that all the information would be in the DB, but it may require more development on the software side. It's up to the developers/users to decide which scheme fits your needs better.

How to tag: certainly not RunMin/RunMax, since this has nothing to do with changes in detector response (as sensor calibration), but with improvements in our understanding at analysis level, applicable to all runs. A single version number as we have now for the activity assumptions in docdb would be sufficient.

Completely agree on that.

I mean a single tag for all 2-3 tables we are proposing, and for all rows (each G4 volume) in this table. So this does not marry well with the idea of adding a table column in each table in my opinion. It is a global tag for the entire table.

This is not that simple to do. You can basically do two things:

  1. Add an extra column with the version number for each row. The reading function would filter the relevant rows when you read them.
  2. Add a suffix or something to the table name and create a new table for each new version. The software would be able to construct the corresponding table name for each version. Disadvantage: table proliferation? At some point there might be too many tables, but I think it shouldn't create any technical problem.

Again, it's up to you which option suits your needs better.

@gonzaponte (Collaborator)

Thanks, @msorel. If you think three tables is more flexible, that might be the preferred solution. Regarding the versioning of the tables, I don't think I understand it well: you first said that these 2-3 tables are updated independently, so they would have different version numbers, but then you said we need to tag them together.

Technically, a single version number can be achieved in practice in a couple of different ways. First, use the version column, which is not a big deal; we are not talking about huge tables anyway. Then you can do one of two things:

  1. Repeat the values in the other tables with a new version number every time one of them is updated (a bit of a waste of space, but not a big deal), or
  2. Update only the table that changed, and make the selection query pick the latest version number <= the one requested.

I think the second one might be a good way to go, but I don't have that much experience with databases, so it might have some caveats.
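
A minimal pandas sketch of the second option at the table level, assuming the Version column discussed above (the function name is hypothetical):

import pandas as pd

def pick_table_version(table: pd.DataFrame, requested: int) -> pd.DataFrame:
    # keep the rows of the most recent version that is not newer than the requested one
    available = table.Version[table.Version <= requested]
    return table[table.Version == available.max()]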

@msorel (Collaborator) commented Oct 26, 2021

I mean a single tag for all 2-3 tables we are proposing, and for all rows (each G4 volume) in this table. So this does not marry well with the idea of adding a table column in each table in my opinion. It is a global tag for the entire table.

This is not that simple to do. You can basically do two things:

  1. Add an extra column with the version number for each row. The reading function would filter the relevant rows when you read them.
  2. Add a suffix or something to the table name and create a new table for each new version. The software would be able to construct the corresponding table name for each version. Disadvantage: table proliferation? At some point there might be too many tables, but I think it shouldn't create any technical problem.

Again, up to you which version suits better your needs.

Option 1 looks less ugly to me, so I would go with that. The operation could then work like this, if I understand correctly:

  • Say each table has n (a few tens of) rows, one for each (Geant4 volume, isotope) combination. The initial import has v1 as the version number for each of the n rows.
  • We need to update some rows, not necessarily all. We can update only those, giving them version number v2.
  • The algorithm then loops over all unique (G4 volume, isotope) combinations, picking up the latest version (v2) if it exists, otherwise the most recent version that does exist for that (volume, isotope) combination (v1, in this case). There is also the option of duplicating all rows every time we update some, but this is probably not needed.
  • If some time later we need to make a second change to a row that was not updated in the earlier change, we tag it with v3, hence changing that row from v1 to v3, skipping v2.
  • The largest version number in the table at any moment, in this case v3, corresponds to the global tag for the table

What do you think?
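
A minimal pandas sketch of the per-row selection described above (the Version, G4Volume and Isotope column names are assumptions taken from the discussion; the function name is hypothetical):

import pandas as pd

def select_rows(table: pd.DataFrame, requested: int) -> pd.DataFrame:
    # for each (G4Volume, Isotope) pair, keep the row with the highest
    # Version that does not exceed the requested one
    table = table[table.Version <= requested]
    idx   = table.groupby(["G4Volume", "Isotope"]).Version.idxmax()
    return table.loc[idx]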

@jmbenlloch (Contributor)

What do you think?

The solution you've described is the one I'd implement. Everything should work properly with that workflow, and it avoids data redundancy.

@msorel (Collaborator) commented Oct 26, 2021

Hi @gonzaponte @jmbenlloch, looks like we agree on how to version-control individual tables, then!

I have a slight preference for separating into three tables, yes, if @gondiaz @MiryamMV are fine with that.

You are right, @gonzaponte: the scheme I have proposed does not necessarily ensure the same version number for all 2-3 tables we are talking about (table version = highest version found when looping over the table rows). But thinking more about it, ensuring this is probably not needed. If the specific-activity, quantities and efficiency tables have different version numbers, that is probably OK.

@gondiaz (Collaborator, Author) commented Oct 26, 2021

Talking with @gonzaponte offline, we propose to have a single version number: the version number of the activity assumptions. An example might make this clearer:

  • v0: start with three tables, then save (v0, v0, v0)
  • v1: modify table 1, then save (v0, v1, v0)
  • v2: modify tables 1 and 2, then save (v0, v2, v2)
  • v3: modify table 0 , then save (v3, v2, v2)

Then, if we request for example activity-assumptions version v2, we will get (v0, v2, v2), or (v3, v2, v2) if we request the latest version, v3.

Since it is not necessary for the event mixer, I would skip the specific-activities table for the moment. It implies extra work, since its volumes are not necessarily the ones in the G4Volume list.

Do you all agree?
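
A small sketch of how the example above maps a requested global version onto per-table versions (the table names are just the indices used in the example):

# versions at which each table in the example was updated
history = {"table0": [0, 3], "table1": [0, 1, 2], "table2": [0, 2]}

def resolve(requested):
    # for each table, pick the most recent update not newer than the requested version
    return {table: max(v for v in versions if v <= requested)
            for table, versions in history.items()}

print(resolve(2))  # {'table0': 0, 'table1': 2, 'table2': 2}
print(resolve(3))  # {'table0': 3, 'table1': 2, 'table2': 2}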

@msorel (Collaborator) commented Oct 26, 2021

Hi @gondiaz, sounds good to me, if we want to keep a global version number encompassing all tables. However, I understand that we now have only two tables and not three, right?

Another comment: I checked the temporary table structure you put in MySQL. You may consider switching to a more flexible structure with four columns in each table.

Table Activity, with total activities (in mBq):

  • Version: int
  • G4Volume: string
  • Isotope: string (Bi214, Tl208, Co60, K40 -> please do not drop mass number)
  • TotalActivity: float

Table Efficiency, with dimensionless efficiencies:

  • Version: int
  • G4Volume: string
  • Isotope: string
  • Efficiency: float

so the first three columns are common. This structure is a little more flexible: if tomorrow we want to add a background isotope other than the four considered thus far, we won't need to change the structure of the tables.
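
A minimal sqlite3 sketch of the proposed schemas (the SQL types and the file name are assumptions, not the final IC definition):

import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS Activity (
    Version       INTEGER,
    G4Volume      TEXT,
    Isotope       TEXT,  -- e.g. Bi214, Tl208, Co60, K40
    TotalActivity REAL   -- mBq
);
CREATE TABLE IF NOT EXISTS Efficiency (
    Version    INTEGER,
    G4Volume   TEXT,
    Isotope    TEXT,
    Efficiency REAL      -- dimensionless
);
"""

with sqlite3.connect("localdb.NEXT100DB.sqlite3") as conn:
    conn.executescript(schema)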

@gondiaz (Collaborator, Author) commented Oct 26, 2021

Yes, let's start with the two tables and add the third one, which is not used directly by the mixer, in the future.

I agree with that: this structure is better suited to adding more backgrounds or components in future versions, although in my opinion it is slightly less convenient to work with in pandas; but that can be managed.
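
For reference, a pandas sketch of how the long format can be reshaped into a wide view when that is more convenient (the volume names and values are purely illustrative):

import pandas as pd

activity = pd.DataFrame({
    "Version"      : [0, 0, 0, 0],
    "G4Volume"     : ["VESSEL", "VESSEL", "FIELD_CAGE", "FIELD_CAGE"],
    "Isotope"      : ["Bi214", "Tl208", "Bi214", "Tl208"],
    "TotalActivity": [1.0, 0.5, 0.3, 0.1]})

# one row per volume, one column per isotope
wide = activity.pivot(index="G4Volume", columns="Isotope", values="TotalActivity")
print(wide)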

@MiryamMV (Collaborator)

Then, if we request for example activity-assumptions version v2, we will get (v0, v2, v2), or (v3, v2, v2) if we request the latest version, v3.

I agree with everything, but what's the purpose of keeping (v0, v2, v2) and (v3, v2, v2) at the same time?

@gondiaz changed the title from "Add NEXT100ACTDB" to "Add NEXT100 Activity/Efficiency database" on Oct 26, 2021
@gondiaz (Collaborator, Author) commented Oct 27, 2021

The test processing is failing at the new test function I have added. I think this is because the new Activity table in the database has not been updated on the test server. Could you take a look, @jmbenlloch? The PR should then be ready to review (again).

@jmbenlloch (Contributor)

The test processing is failing at the new test function I have added. I think this is because the new Activity table in the database has not been updated on the test server. Could you take a look, @jmbenlloch? The PR should then be ready to review (again).

You'll have to commit the localdb.NEXT100DB.sqlite3 file with the new tables you have created.
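
A hedged sketch of one way the local file could be refreshed from csv exports of the new tables (the csv file names are hypothetical, and the actual IC tooling may differ):

import sqlite3
import pandas as pd

with sqlite3.connect("localdb.NEXT100DB.sqlite3") as conn:
    for table, csv in [("Activity", "activity.csv"), ("Efficiency", "efficiency.csv")]:
        pd.read_csv(csv).to_sql(table, conn, if_exists="replace", index=False)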

@gondiaz (Collaborator, Author) commented Oct 27, 2021

Absolutely, my bad. The failing test should pass now.

@gonzaponte (Collaborator) left a review comment


First round. Please run pyflakes after you address these comments to check for unused variables.

@gonzaponte (Collaborator)

Side note:
I've noticed that some of the G4Volumes contain whitespace. Is this something we can avoid, or is it too late?

@msorel (Collaborator) commented Oct 27, 2021

Side note:
I've noticed that some of the G4Volumes contain whitespaces, is this something we can avoid, or is it too late?

Unless this has changed in nexus recently, all those whitespaces should be underscores. I think we are still in time to enforce underscores and avoid whitespace in the NEXT-100 geometry. Luckily, @gondiaz is precisely our man for that!

@gondiaz (Collaborator, Author) commented Oct 27, 2021

In fact, G4Volume names do not contain whitespace in nexus. The reason they appear that way in the database is that I took the data from an outdated spreadsheet, which did not have the correct G4Volume names from the latest nexus revision.
Both the Activity and Efficiency table values are dummy for now; I filled them that way to test the procedure for uploading values to the database.

Update NEXT100DB

Update NEXT100DB

Update NEXT100DB
@gondiaz (Collaborator, Author) commented Oct 30, 2021

Just some squashing and rewording.

Rewrite activity reader to be more flexible

Alignment
Rewrite database test

Fix typo in test

Rename df in test

Co-authored-by: Gonzalo <gonzaponte@gmail.com>

Add test for Efficiency table

Add test for version number request

Parametrize version in test

Avoid code repetition in test

Co-authored-by: Gonzalo <gonzaponte@gmail.com>

Improve readability
@gonzaponte (Collaborator) left a review comment


This PR adds new tables to the database and implements the functionality to load them within IC. It also provides a couple of tests to demonstrate and check the functionality.

Great job.

@MiryamMV merged commit 2e8bfc1 into next-exp:master on Nov 2, 2021
@gondiaz deleted the activities_db branch on November 2, 2021 at 17:34
MiryamMV added a commit that referenced this pull request on May 5, 2022 (#815):

[author: gondiaz]

Updates the activity and efficiency columns in the NEXT100 database with the latest activity values from measurements and the efficiencies computed through nexus simulations. Efficiencies assume that at least 2 MeV is deposited in the active volume.
Recall that the previous values were dummy, in order to test PR #806.

[reviewer: MiryamMV]

This PR adds the correct values of activities (v3) and efficiencies to
the database for NEXT100.