Bump ord-schema version #43

Merged
skearnes merged 8 commits into main from migration on Dec 23, 2020
Conversation

@skearnes (Contributor) commented Dec 15, 2020

| Dataset ID | Name | Owner | Status |
| --- | --- | --- | --- |
| ord_dataset-33320f511ffb4f89905448c7a5153111 | Nanomole flow | @skearnes | x |
| ord_dataset-46ff9a32d9e04016b9380b1b1ef949c3 | Ahneman | @connorcoley | x |
| ord_dataset-7d8f5fd922d4497d91cb81489b052746 | Santanilla | @michaelmaser | x |
| ord_dataset-b440f8c90b6343189093770060fc4098 | Photodehalogenation | @skearnes | x |
| ord_dataset-cbcc4048add7468e850b6ec42549c70d | Merck CN cross coupling | @connorcoley | x |
| ord_dataset-d319c2a22ecf4ce59db1a18ae71d529c | Suzuki informer | @michaelmaser | x |

@skearnes (Contributor, Author) commented Dec 17, 2020

FYI, I added a table of the datasets needing migration so we can claim them (feel free to edit the table).

@michaelmaser (Contributor)

Could you give a quick blurb on your workflow for this, please? I want to make sure I'm catching everything.

@skearnes (Contributor, Author)

> Could you give a quick blurb on your workflow for this, please? I want to make sure I'm catching everything.

Sure. For the photodehalogenation dataset, I looked at the template pbtxt and saw that I didn't need to change anything for the products, since the results were reported as conversion. I then just ran @connorcoley's migration script (open-reaction-database/ord-schema#522), after updating it so that the record_modified details referred to me.

I think for more complicated edits I'll have to update the template manually, re-enumerate the dataset, and then programmatically copy over the existing reaction IDs, provenance info, etc.
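
Roughly, I imagine that copy-over step looking something like the sketch below (filenames are placeholders, and it assumes re-enumeration emits the reactions in the same order as the originals):

```python
# Sketch: copy existing reaction IDs and provenance from the old dataset
# into a re-enumerated one. Assumes both files list reactions in the same order.
from ord_schema import message_helpers
from ord_schema.proto import dataset_pb2

old = message_helpers.load_message('old_dataset.pbtxt', dataset_pb2.Dataset)
new = message_helpers.load_message('reenumerated_dataset.pbtxt', dataset_pb2.Dataset)

new.dataset_id = old.dataset_id
for old_reaction, new_reaction in zip(old.reactions, new.reactions):
    new_reaction.reaction_id = old_reaction.reaction_id
    # Carry over record_created/record_modified and the rest of the provenance.
    new_reaction.provenance.CopyFrom(old_reaction.provenance)

message_helpers.write_message(new, 'reenumerated_dataset.pbtxt')
```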

@skearnes (Contributor, Author)

PS: the migration script seemed to be all that was needed for ord_dataset-33320f511ffb4f89905448c7a5153111 as well.

@skearnes (Contributor, Author)

(nit: the record_modified details should indicate v0.2.3, not v0.2.2)

connorcoley and others added 3 commits December 17, 2020 21:19
- Note: The original dataset should have raw peak areas, which
  the schema now supports. However, I did not go back to the original
  data to fill in this detail. The automatic migration script was
  sufficient to convert the recorded yield values into the new schema.
- The previous version had "yield" information, but this was
  back-calculated from scaled-up reactions and isolated yields
  in a speculative way. Now, the new data has been more
  appropriately labeled as a normalized AREA
- Note that the original dataset contains information that might
  be slightly richer, but this dataset at least gives us the
  relative yields
@michaelmaser (Contributor)

Migration script seemed to work great for ord_dataset-d319c2a22ecf4ce59db1a18ae71d529c.

I'm thinking the final dataset (Santanilla HTE) will be more complicated. Since I generated the pbtxt programmatically and have (mostly) updated the notebook with the schema modifications, I'm thinking of just regenerating the pbtxt with the reaction_id's copied over and the provenance updated. I can then run the migration script afterwards as a check. @skearnes @connorcoley thoughts?

@connorcoley (Contributor)

> I'm thinking the final dataset (Santanilla HTE) will be more complicated. Since I generated the pbtxt programmatically and have (mostly) updated the notebook with the schema modifications, I'm thinking of just regenerating the pbtxt with the reaction_id's copied over and the provenance updated. I can then run the migration script afterwards as a check. @skearnes @connorcoley thoughts?

I think that's a great idea. Otherwise, you'd need a very custom, dataset-specific migration script.

@michaelmaser (Contributor)

Update: manual/programmatic editing and re-writing of the pbtxt is complete and working. However, the migration script fails on the re-generated pbtxt, since the fields are already in the new schema format. Example error:

ValueError: error parsing data/7d/ord_dataset-7d8f5fd922d4497d91cb81489b052746.pbtxt: 20:9 : Message type "ord_old.Compound" has no field named "amount".

@skearnes @connorcoley thoughts on a course of action? We could just replace the old pbtxt with this new one, since it was generated with the new schema version anyway and could thus be considered "migrated".

The copy-over of reaction_id's, dataset_id, and all provenance info (the original record_created and the new record_modified) looks successful on a couple of visual checks, though I'd like you to take a look. I made all of the migration modifications in example_Santanilla.ipynb on a migration branch of my fork of ord-schema.
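
If it helps the review, a programmatic spot-check along these lines could back up the visual checks (just a sketch, with placeholder filenames; it assumes both pbtxts list the reactions in the same order):

```python
# Sketch: confirm that dataset_id, reaction_ids, and the original provenance
# survived the re-write of the Santanilla pbtxt.
from ord_schema import message_helpers
from ord_schema.proto import dataset_pb2

old = message_helpers.load_message('santanilla_old.pbtxt', dataset_pb2.Dataset)
new = message_helpers.load_message('santanilla_new.pbtxt', dataset_pb2.Dataset)

assert new.dataset_id == old.dataset_id
assert len(new.reactions) == len(old.reactions)
for old_reaction, new_reaction in zip(old.reactions, new.reactions):
    assert new_reaction.reaction_id == old_reaction.reaction_id
    # The original record_created should be unchanged; record_modified gains entries.
    assert new_reaction.provenance.record_created == old_reaction.provenance.record_created
```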

@connorcoley (Contributor)

> We could just replace the old pbtxt with this new one, since it was generated with the new schema version anyway and could thus be considered "migrated".

Yes, that's what I thought you had in mind. I don't think we need to rely on the migration script at all for this example. If you're preserving the reaction_ids and dataset_id, then I think we should be all set.

@michaelmaser (Contributor)

> Yes, that's what I thought you had in mind. I don't think we need to rely on the migration script at all for this example. If you're preserving the reaction_ids and dataset_id, then I think we should be all set.

Great, thanks. Done!

@skearnes (Contributor, Author)

Thanks everyone! process_dataset is failing because the main branch is using a different schema version. However, a look at the logs shows that the validations are passing for all six datasets.
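
(If anyone wants to re-run the checks locally, validating a single dataset looks roughly like this; a sketch that assumes the ord_schema validations API and one of the dataset paths in this repo:)

```python
# Sketch: validate one migrated dataset locally.
from ord_schema import message_helpers, validations
from ord_schema.proto import dataset_pb2

dataset = message_helpers.load_message(
    'data/b4/ord_dataset-b440f8c90b6343189093770060fc4098.pbtxt',
    dataset_pb2.Dataset)
# Expected to raise on validation errors and only report warnings otherwise.
validations.validate_message(dataset)
```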

Let's get everyone's approval and then merge.

@skearnes marked this pull request as ready for review December 21, 2020 17:16
@skearnes (Contributor, Author)

I'll suggest that @connorcoley approve for Mike's datasets, @michaelmaser approve for mine, and I'll approve for Connor's. Just to make sure we're not losing any information.

@skearnes (Contributor, Author)

@connorcoley can you comment on your note that "the original dataset contains information that might be slightly richer" in 48f0a76? AFAICT there was no loss of information in the migration?

@skearnes (Contributor, Author)

@michaelmaser please add a new record_modified entry on the Santanilla dataset instead of overwriting the existing one.

@michaelmaser (Contributor)

> @michaelmaser please add a new record_modified entry on the Santanilla dataset instead of overwriting the existing one.

Are you referring to the record_modified from "github-actions"?

@skearnes (Contributor, Author)

> @michaelmaser please add a new record_modified entry on the Santanilla dataset instead of overwriting the existing one.

> Are you referring to the record_modified from "github-actions"?

Yes.
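
Something along these lines is what I mean (a rough sketch; the person details and the field layout are assumptions based on the current RecordEvent message):

```python
# Sketch: append a new record_modified entry instead of overwriting the
# existing "github-actions" entry.
import datetime

from ord_schema import message_helpers
from ord_schema.proto import dataset_pb2

dataset = message_helpers.load_message(
    'data/7d/ord_dataset-7d8f5fd922d4497d91cb81489b052746.pbtxt',
    dataset_pb2.Dataset)
for reaction in dataset.reactions:
    event = reaction.provenance.record_modified.add()  # add(), not a replace
    event.time.value = datetime.datetime.utcnow().strftime('%m/%d/%Y, %H:%M:%S')
    event.person.username = 'michaelmaser'  # placeholder
    event.details = 'Migrated to ord-schema v0.2.3'
```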

@michaelmaser (Contributor) left a comment

@skearnes datasets look good to me.

General comment: products in the photodehalogenation dataset (b4) do not have is_desired_product set, and in both datasets (b4 & 33) the products do not have reaction_role=PRODUCT. I see that this was true of the former version(s) as well. Do we plan to add these later or leave them as entered?
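
(For reference, if we do a cleanup pass later, I'd expect it to look roughly like this; a sketch assuming the current proto field names, where `reaction` is one Reaction from the dataset:)

```python
# Sketch: set is_desired_product and reaction_role on products in a later
# cleanup pass (not part of this migration PR).
from ord_schema.proto import reaction_pb2

def label_products(reaction):
    for outcome in reaction.outcomes:
        for product in outcome.products:
            product.reaction_role = reaction_pb2.ReactionRole.PRODUCT
            # Only the intended product(s) should get this flag.
            product.is_desired_product = True
```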

@skearnes (Contributor, Author)

> @skearnes datasets look good to me.
>
> General comment: products in the photodehalogenation dataset (b4) do not have is_desired_product set, and in both datasets (b4 & 33) the products do not have reaction_role=PRODUCT. I see that this was true of the former version(s) as well. Do we plan to add these later or leave them as entered?

I'm generally in favor of keeping this PR about migration rather than cleanup, but I'm happy to add these if you think we should just take care of it now. WDYT?

@michaelmaser (Contributor)

> I'm generally in favor of keeping this PR about migration rather than cleanup, but I'm happy to add these if you think we should just take care of it now. WDYT?

Agreed. I was just curious.

@connorcoley (Contributor)

> I'm generally in favor of keeping this PR about migration rather than cleanup, but I'm happy to add these if you think we should just take care of it now. WDYT?

Agreed. Let's merge.

@skearnes merged commit 2050b73 into main Dec 23, 2020
@skearnes deleted the migration branch December 23, 2020 19:56
skearnes added a commit to skearnes/ord-data that referenced this pull request Feb 8, 2021
* Bump ord-schema version

* Automatic update for ord_dataset-b440f8c90b6343189093770060fc4098

* Automatic update for ord_dataset-33320f511ffb4f89905448c7a5153111

* Updating Ahneman dataset automatically
- Note: The original dataset should have raw peak areas, which
  the schema now supports. However, I did not go back to the original
  data to fill in this detail. The automatic migration script was
  sufficient to convert the recorded yield values into the new schema.

* Migrate Santanilla data subset (3 x 96 well experiment)
- The previous version had "yield" information, but this was
  back-calculated from scaled-up reactions and isolated yields
  in a speculative way. Now, the new data has been more
  appropriately labeled as a normalized AREA
- Note that the original dataset contains information that might
  be slightly richer, but this dataset at least gives us the
  relative yields

* Automatic update for ord_dataset-d319c2a22ecf4ce59db1a18ae71d529c

* Automatic update for ord_dataset-7d8f5fd922d4497d91cb81489b052746

* Re-enter github-actions record_modified in Santanilla dataset

Co-authored-by: Connor W. Coley <connor.coley@gmail.com>
Co-authored-by: Michael Maser <mmaser@caltech.edu>