Bump ord-schema version #43

Merged
skearnes merged 8 commits into main from migration on Dec 23, 2020
Conversation

@skearnes (Contributor) commented Dec 15, 2020

| Dataset ID | Name | Owner | Status |
| --- | --- | --- | --- |
| ord_dataset-33320f511ffb4f89905448c7a5153111 | Nanomole flow | @skearnes | x |
| ord_dataset-46ff9a32d9e04016b9380b1b1ef949c3 | Ahneman | @connorcoley | x |
| ord_dataset-7d8f5fd922d4497d91cb81489b052746 | Santanilla | @michaelmaser | x |
| ord_dataset-b440f8c90b6343189093770060fc4098 | Photodehalogenation | @skearnes | x |
| ord_dataset-cbcc4048add7468e850b6ec42549c70d | Merck CN cross coupling | @connorcoley | x |
| ord_dataset-d319c2a22ecf4ce59db1a18ae71d529c | Suzuki informer | @michaelmaser | x |

@skearnes (Contributor, Author) commented Dec 17, 2020

FYI, I added a table of the datasets needing migration so we can claim them (feel free to edit the table).

@michaelmaser (Contributor)

Could you give a quick blurb on your workflow for this, please? I want to make sure I'm catching everything.

@skearnes (Contributor, Author)

> Could you give a quick blurb on your workflow for this, please? I want to make sure I'm catching everything.

Sure. For the photodehalogenation dataset, I looked at the template pbtxt and saw that I didn't need to change anything for the products, since the results were reported as conversion. I then just ran @connorcoley's migration script (open-reaction-database/ord-schema#522), after updating it so that the record_modified details referred to me.

I think for more complicated edits I'll have to update the template manually, re-enumerate the dataset, and then programmatically copy over the existing reaction IDs, provenance info, etc.
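
Roughly, I imagine that copy-over step looking something like the sketch below (filenames are placeholders, and it assumes re-enumeration emits the reactions in the same order as the originals):

```python
# Sketch: copy existing reaction IDs and provenance from the old dataset
# into a re-enumerated one. Assumes both files list reactions in the same order.
from ord_schema import message_helpers
from ord_schema.proto import dataset_pb2

old = message_helpers.load_message('old_dataset.pbtxt', dataset_pb2.Dataset)
new = message_helpers.load_message('reenumerated_dataset.pbtxt', dataset_pb2.Dataset)

new.dataset_id = old.dataset_id
for old_reaction, new_reaction in zip(old.reactions, new.reactions):
    new_reaction.reaction_id = old_reaction.reaction_id
    # Carry over record_created/record_modified and the rest of the provenance.
    new_reaction.provenance.CopyFrom(old_reaction.provenance)

message_helpers.write_message(new, 'reenumerated_dataset.pbtxt')
```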

@skearnes (Contributor, Author)

PS: the migration script seemed to be all that was needed for ord_dataset-33320f511ffb4f89905448c7a5153111 as well.

@skearnes (Contributor, Author)

(nit: the record_modified details should indicate v0.2.3, not v0.2.2)

connorcoley and others added 3 commits December 17, 2020 21:19
- Note: The original dataset should have raw peak areas, which
  the schema now supports. However, I did not go back to the original
  data to fill in this detail. The automatic migration script was
  sufficient to convert the recorded yield values into the new schema.
- The previous version had "yield" information, but this was
  back-calculated from scaled-up reactions and isolated yields
  in a speculative way. Now, the new data has been more
  appropriately labeled as a normalized AREA
- Note that the original dataset contains information that might
  be slightly richer, but this dataset at least gives us the
  relative yields
@michaelmaser (Contributor)

Migration script seemed to work great for ord_dataset-d319c2a22ecf4ce59db1a18ae71d529c.

I'm thinking the final dataset (Santanilla HTE) will be more complicated. Since I generated the pbtxt programmatically and have (mostly) updated the notebook with the schema modifications, I'm thinking of just regenerating the pbtxt with the reaction_id's copied over and the provenance updated. I can then run the migration script afterwards as a check. @skearnes @connorcoley thoughts?

@connorcoley (Contributor)

> I'm thinking the final dataset (Santanilla HTE) will be more complicated. Since I generated the pbtxt programmatically and have (mostly) updated the notebook with the schema modifications, I'm thinking of just regenerating the pbtxt with the reaction_id's copied over and the provenance updated. I can then run the migration script afterwards as a check. @skearnes @connorcoley thoughts?

I think that's a great idea. Otherwise, you'd need a very custom, dataset-specific migration script.

@michaelmaser (Contributor)

Update: manual/programmatic editing and re-writing of the pbtxt is complete and working. However, the migration script fails on the re-generated pbtxt, since the fields are already in the new schema format. Example error:

ValueError: error parsing data/7d/ord_dataset-7d8f5fd922d4497d91cb81489b052746.pbtxt: 20:9 : Message type "ord_old.Compound" has no field named "amount".

@skearnes @connorcoley thoughts on a course of action? We could just replace the old pbtxt with this new one, since it was generated with the new schema version anyway and could thus be considered "migrated".

The copy-over of reaction_id's, dataset_id, and all provenance info (the original record_created and the new record_modified) looks successful on a couple of visual checks, though I'd like you to take a look. I made all of the migration modifications in example_Santanilla.ipynb on a migration branch of my fork of ord-schema.
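
If it helps the review, a programmatic spot-check along these lines could back up the visual checks (just a sketch, with placeholder filenames; it assumes both pbtxts list the reactions in the same order):

```python
# Sketch: confirm that dataset_id, reaction_ids, and the original provenance
# survived the re-write of the Santanilla pbtxt.
from ord_schema import message_helpers
from ord_schema.proto import dataset_pb2

old = message_helpers.load_message('santanilla_old.pbtxt', dataset_pb2.Dataset)
new = message_helpers.load_message('santanilla_new.pbtxt', dataset_pb2.Dataset)

assert new.dataset_id == old.dataset_id
assert len(new.reactions) == len(old.reactions)
for old_reaction, new_reaction in zip(old.reactions, new.reactions):
    assert new_reaction.reaction_id == old_reaction.reaction_id
    # The original record_created should be unchanged; record_modified gains entries.
    assert new_reaction.provenance.record_created == old_reaction.provenance.record_created
```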

@connorcoley (Contributor)

> We could just replace the old pbtxt with this new one, since it was generated with the new schema version anyway and could thus be considered "migrated".

Yes, that's what I thought you had in mind. I don't think we need to rely on the migration script at all for this example. If you're preserving the reaction_ids and dataset_id, then I think we should be all set.

@michaelmaser (Contributor)

> Yes, that's what I thought you had in mind. I don't think we need to rely on the migration script at all for this example. If you're preserving the reaction_ids and dataset_id, then I think we should be all set.

Great, thanks. Done!

@skearnes (Contributor, Author)

Thanks everyone! process_dataset is failing because the main branch is using a different schema version. However, a look at the logs shows that the validations are passing for all six datasets.
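
(If anyone wants to re-run the checks locally, validating a single dataset looks roughly like this; a sketch that assumes the ord_schema validations API and one of the dataset paths in this repo:)

```python
# Sketch: validate one migrated dataset locally.
from ord_schema import message_helpers, validations
from ord_schema.proto import dataset_pb2

dataset = message_helpers.load_message(
    'data/b4/ord_dataset-b440f8c90b6343189093770060fc4098.pbtxt',
    dataset_pb2.Dataset)
# Expected to raise on validation errors and only report warnings otherwise.
validations.validate_message(dataset)
```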

Let's get everyone's approval and then merge.

@skearnes marked this pull request as ready for review December 21, 2020 17:16
@skearnes (Contributor, Author)

I'll suggest that @connorcoley approve for Mike's datasets, @michaelmaser approve for mine, and I'll approve for Connor's. Just to make sure we're not losing any information.

@skearnes (Contributor, Author)

@connorcoley can you comment on your note that "the original dataset contains information that might be slightly richer" in 48f0a76? AFAICT there was no loss of information in the migration?

@skearnes (Contributor, Author)

@michaelmaser please add a new record_modified entry on the Santanilla dataset instead of overwriting the existing one.

@michaelmaser (Contributor)

> @michaelmaser please add a new record_modified entry on the Santanilla dataset instead of overwriting the existing one.

Are you referring to the record_modified from "github-actions"?

@skearnes (Contributor, Author)

> @michaelmaser please add a new record_modified entry on the Santanilla dataset instead of overwriting the existing one.

> Are you referring to the record_modified from "github-actions"?

Yes.
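
Something along these lines is what I mean (a rough sketch; the person details and the field layout are assumptions based on the current RecordEvent message):

```python
# Sketch: append a new record_modified entry instead of overwriting the
# existing "github-actions" entry.
import datetime

from ord_schema import message_helpers
from ord_schema.proto import dataset_pb2

dataset = message_helpers.load_message(
    'data/7d/ord_dataset-7d8f5fd922d4497d91cb81489b052746.pbtxt',
    dataset_pb2.Dataset)
for reaction in dataset.reactions:
    event = reaction.provenance.record_modified.add()  # add(), not a replace
    event.time.value = datetime.datetime.utcnow().strftime('%m/%d/%Y, %H:%M:%S')
    event.person.username = 'michaelmaser'  # placeholder
    event.details = 'Migrated to ord-schema v0.2.3'
```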

@michaelmaser (Contributor) left a comment

@skearnes datasets look good to me.

General comment: products in the photodehalogenation dataset (b4) do not have is_desired_product set, and in both datasets (b4 & 33) the products do not have reaction_role=PRODUCT. I see that this was true of the former version(s) as well. Do we plan to add these later or leave them as entered?
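
(For reference, if we do a cleanup pass later, I'd expect it to look roughly like this; a sketch assuming the current proto field names, where `reaction` is one Reaction from the dataset:)

```python
# Sketch: set is_desired_product and reaction_role on products in a later
# cleanup pass (not part of this migration PR).
from ord_schema.proto import reaction_pb2

def label_products(reaction):
    for outcome in reaction.outcomes:
        for product in outcome.products:
            product.reaction_role = reaction_pb2.ReactionRole.PRODUCT
            # Only the intended product(s) should get this flag.
            product.is_desired_product = True
```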

@skearnes (Contributor, Author)

> @skearnes datasets look good to me.
>
> General comment: products in the photodehalogenation dataset (b4) do not have is_desired_product set, and in both datasets (b4 & 33) the products do not have reaction_role=PRODUCT. I see that this was true of the former version(s) as well. Do we plan to add these later or leave them as entered?

I'm generally in favor of keeping this PR about migration rather than cleanup, but I'm happy to add these if you think we should just take care of it now. WDYT?

@michaelmaser (Contributor)

> I'm generally in favor of keeping this PR about migration rather than cleanup, but I'm happy to add these if you think we should just take care of it now. WDYT?

Agreed. I was just curious.

@connorcoley (Contributor)

> I'm generally in favor of keeping this PR about migration rather than cleanup, but I'm happy to add these if you think we should just take care of it now. WDYT?

Agreed. Let's merge.

@skearnes merged commit 2050b73 into main Dec 23, 2020
@skearnes deleted the migration branch December 23, 2020 19:56
skearnes added a commit to skearnes/ord-data that referenced this pull request Feb 8, 2021
* Bump ord-schema version

* Automatic update for ord_dataset-b440f8c90b6343189093770060fc4098

* Automatic update for ord_dataset-33320f511ffb4f89905448c7a5153111

* Updating Ahneman dataset automatically
- Note: The original dataset should have raw peak areas, which
  the schema now supports. However, I did not go back to the original
  data to fill in this detail. The automatic migration script was
  sufficient to convert the recorded yield values into the new schema.

* Migrate Santanilla data subset (3 x 96 well experiment)
- The previous version had "yield" information, but this was
  back-calculated from scaled-up reactions and isolated yields
  in a speculative way. Now, the new data has been more
  appropriately labeled as a normalized AREA
- Note that the original dataset contains information that might
  be slightly richer, but this dataset at least gives us the
  relative yields

* Automatic update for ord_dataset-d319c2a22ecf4ce59db1a18ae71d529c

* Automatic update for ord_dataset-7d8f5fd922d4497d91cb81489b052746

* Re-enter github-actions record_modified in Santanilla dataset

Co-authored-by: Connor W. Coley <connor.coley@gmail.com>
Co-authored-by: Michael Maser <mmaser@caltech.edu>