Bump ord-schema version #43
Conversation
FYI: I added a table of the datasets needing migration so we can claim them (feel free to edit the table).
Could you give a quick blurb on your workflow for this, please? I want to make sure I'm catching everything.
Sure. For the photodehalogenation dataset I looked at the template pbtxt and saw that I didn't need to change anything for the products, since the results were reported as …. I think for more complicated edits I'll have to update the template manually, re-enumerate the dataset, and then programmatically copy over the existing reaction IDs, provenance info, etc.
PS: the migration script seemed to be all that was needed for ….
(nit: the …)
- Note: The original dataset should have raw peak areas, which the schema now supports. However, I did not go back to the original data to fill in this detail. The automatic migration script was sufficient to convert the recorded yield values into the new schema.
- The previous version had "yield" information, but this was back-calculated from scaled-up reactions and isolated yields in a speculative way. The new data has been more appropriately labeled as a normalized AREA.
- Note that the original dataset contains information that might be slightly richer, but this dataset at least gives us the relative yields.
Migration script seemed to work great for …. I'm thinking the final dataset (Santanilla HTE) will be more complicated. Since I generated the pbtxt programmatically and I've updated the notebook with schema modifications (mostly), I'm thinking of just regenerating the pbtxt with reaction_ids copied over and updated provenance. I can then run the migration script afterwards to make sure. @skearnes @connorcoley thoughts?
I think that's a great idea. Otherwise, you'd need a very custom, dataset-specific migration script.
Update: Manual/programmatic editing and re-writing of the pbtxt complete and working. However, the migration script is failing on the re-generated pbtxt since the fields are already in the new schema format. Example error:
@skearnes @connorcoley thoughts on a course of action? We could just replace the old pbtxt with this new one, since it was generated with the new schema version anyway and thus could be considered "migrated". Copy-over of ….
Yes, that's what I thought you had in mind. I don't think we need to rely on the migration script at all for this example. If you're preserving the reaction_ids and dataset_id, then I think we should be all set.
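The copy-over step described above can be sketched roughly as follows. This is illustrative only: the real datasets are ord-schema protobuf messages loaded from pbtxt, and the plain dicts and field names here are simplified stand-ins matched by position within the dataset.

```python
# Illustrative sketch of the copy-over step: carry reaction_ids and
# provenance records from the old dataset into a re-enumerated one.
# The real data are ord-schema protobufs; plain dicts stand in here,
# and records are matched by position within the dataset.

def copy_identifiers(old_reactions, new_reactions):
    """Copy reaction_id and provenance from old records to new ones."""
    if len(old_reactions) != len(new_reactions):
        raise ValueError("datasets must have the same number of reactions")
    for old, new in zip(old_reactions, new_reactions):
        new["reaction_id"] = old["reaction_id"]
        # Preserve the original record_created/record_modified events so
        # the provenance history is not lost by re-enumeration.
        new["provenance"] = old["provenance"]
    return new_reactions

old = [{"reaction_id": "ord-123", "provenance": {"record_created": "2020-05-01"}}]
new = [{"reaction_id": "", "provenance": {}}]
copy_identifiers(old, new)
```

After this step, the migrated pbtxt can be validated against the new schema; since the file was regenerated with the new schema version, the migration script itself is not strictly needed.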
Great, thanks. Done!
Thanks, everyone! Let's get approvals and then merge.
I'll suggest that @connorcoley approve for Mike's datasets, @michaelmaser approve for mine, and I'll approve for Connor's. Just to make sure we're not losing any information.
@connorcoley can you comment on your note that "the original dataset contains information that might be slightly richer" in 48f0a76? AFAICT there was no loss of information in the migration? |
@michaelmaser please add a new ….
Are you referring to the …?
Yes.
@skearnes datasets look good to me.
General comment: products in the photodehalogenation dataset (b4) do not have is_desired_product set, and both datasets (b4 & 33) do not have products' reaction_role=PRODUCT. I see that this was true of the former version(s) as well. Do we plan to add these later or leave as entered?
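For reference, the cleanup being discussed would amount to something like the sketch below. Hedged: the field names mirror the ord-schema product fields mentioned above, but plain dicts stand in for the actual protobuf messages, and the enum value is a string stand-in.

```python
# Illustrative sketch of the cleanup under discussion: mark each product
# as the desired product and set its reaction_role to PRODUCT.
# Plain dicts stand in for ord-schema product messages.

PRODUCT = "PRODUCT"  # stand-in for the schema's ReactionRole enum value

def tag_products(reactions):
    """Set is_desired_product and reaction_role on every product."""
    for reaction in reactions:
        for product in reaction["products"]:
            product["is_desired_product"] = True
            product["reaction_role"] = PRODUCT
    return reactions

reactions = [{"products": [{"smiles": "c1ccccc1"}]}]
tag_products(reactions)
```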
I'm generally in favor of just making this PR about migration and not cleanup, but happy to add these if you think we should just take care of it now. WDYT?
Agreed. Was just curious.
Agreed. Let's merge.
* Bump ord-schema version
* Automatic update for ord_dataset-b440f8c90b6343189093770060fc4098
* Automatic update for ord_dataset-33320f511ffb4f89905448c7a5153111
* Updating Ahneman dataset automatically
  - Note: The original dataset should have raw peak areas, which the schema now supports. However, I did not go back to the original data to fill in this detail. The automatic migration script was sufficient to convert the recorded yield values into the new schema.
* Migrate Santanilla data subset (3 x 96 well experiment)
  - The previous version had "yield" information, but this was back-calculated from scaled-up reactions and isolated yields in a speculative way. Now, the new data has been more appropriately labeled as a normalized AREA
  - Note that the original dataset contains information that might be slightly richer, but this dataset at least gives us the relative yields
* Automatic update for ord_dataset-d319c2a22ecf4ce59db1a18ae71d529c
* Automatic update for ord_dataset-7d8f5fd922d4497d91cb81489b052746
* Re-enter github-actions record_modified in Santanilla dataset

Co-authored-by: Connor W. Coley <connor.coley@gmail.com>
Co-authored-by: Michael Maser <mmaser@caltech.edu>
ord_dataset-33320f511ffb4f89905448c7a5153111
ord_dataset-46ff9a32d9e04016b9380b1b1ef949c3
ord_dataset-7d8f5fd922d4497d91cb81489b052746
ord_dataset-b440f8c90b6343189093770060fc4098
ord_dataset-cbcc4048add7468e850b6ec42549c70d
ord_dataset-d319c2a22ecf4ce59db1a18ae71d529c