Add data quality control functions for dataset creators #162

Open
agitter opened this issue Jul 28, 2024 · 3 comments
Labels
feature Annotates any PR that adds new features; Used in the release process

Comments

agitter commented Jul 28, 2024

Is your feature request related to a problem? Please describe.

The DatasetFactory could have functions that support detecting the problems in polaris-hub/polaris-recipes#11, such as duplicate items or duplicates with conflicting labels. It could even canonicalize SMILES, because a common problem is that SMILES strings produced by different programs can look unique even when they describe the same molecule, so duplicates slip through when datasets are merged.
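
As a rough illustration of what I mean (this is not an existing DatasetFactory function; `find_duplicate_smiles` is just a hypothetical helper), RDKit canonicalization is enough to surface duplicates that only differ in how the SMILES string was written:

```python
# Sketch of the kind of check I have in mind (not an existing DatasetFactory API):
# canonicalize SMILES with RDKit, then flag entries that collapse to the same molecule.
from collections import defaultdict
from rdkit import Chem

def find_duplicate_smiles(smiles_list):
    """Group input SMILES by their RDKit canonical form and return groups with >1 entry."""
    groups = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            # Unparseable SMILES are themselves a data quality problem worth reporting.
            groups["<invalid>"].append(smi)
            continue
        groups[Chem.MolToSmiles(mol)].append(smi)
    return {canonical: originals for canonical, originals in groups.items() if len(originals) > 1}

# Example: two surface forms of ethanol written by different programs.
print(find_duplicate_smiles(["CCO", "OCC", "c1ccccc1"]))
# {'CCO': ['CCO', 'OCC']}
```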

Describe the solution you'd like

Consider ways to improve the quality of newly contributed datasets to support third-party contributors.

Describe alternatives you've considered

Manually reviewing everything?

Additional context

The problems in polaris-hub/polaris-recipes#11 could be detected automatically.

agitter added the feature label Jul 28, 2024

agitter commented Jul 29, 2024

Based on polaris-hub/polaris-recipes#11 and the linked curation notebook, I see that most or all of this may already exist in Auroris.

cwognum commented Jul 31, 2024

Hey @agitter , thank you for raising this issue! It's much appreciated and an important point.

The auroris package is indeed one way we're trying to improve dataset quality, by simplifying the technical side of data curation. I do think there is more we can do. For example:

  • Automated (sanity) checks on dataset upload (see the sketch after this list for the kind of issue such a check could catch).
  • For decisions that are not easy to automate, we could have a review system where datasets can be submitted for review.
  • Such a centralized review system won't scale, so we could also have a distributed, user-driven feedback system (e.g. a commenting feature) such that users have the autonomy to raise issues as you are doing.
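
For the first bullet, here is a minimal sketch of the kind of check I have in mind. It is hypothetical and not part of the hub's current upload pipeline, and it assumes the uploaded dataset arrives as a pandas DataFrame with a SMILES column and a single label column:

```python
# Hypothetical upload-time sanity check: flag molecules that appear more than once
# with conflicting labels, so a human can review them before the dataset is accepted.
import pandas as pd

def conflicting_label_report(df: pd.DataFrame, smiles_col: str, label_col: str) -> pd.DataFrame:
    """Return rows whose SMILES occurs multiple times with more than one distinct label."""
    label_counts = df.groupby(smiles_col)[label_col].nunique()
    conflicting = label_counts[label_counts > 1].index
    return df[df[smiles_col].isin(conflicting)].sort_values(smiles_col)

# Example usage with made-up data: the two CCO rows would be flagged for review.
data = pd.DataFrame({"smiles": ["CCO", "CCO", "CCN"], "activity": [1, 0, 1]})
print(conflicting_label_report(data, "smiles", "activity"))
```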

I would love to hear your thoughts on these features. More generally, I think community feedback on such features is important, because these are complex questions that require interdisciplinary expertise.

To make discussion easier, how do you feel about moving this conversation to a GitHub discussion in this repo?

agitter commented Aug 4, 2024

Based on what I'm seeing in the example notebooks, I think that the role of auroris can be to provide automated quality checks to help catch things for manual review when a user creates a dataset or a community member reviews a dataset. It shouldn't try to automate all of the data quality control though. That is best left to domain experts because the same criteria may not apply to every dataset. For instance, I could be convinced that for some datasets it is okay to have duplicate molecules with conflicting labels because that is reflective of experimental uncertainty.

A centralized review system for datasets could be appealing. That implies that datasets may be dynamic. For benchmarking purposes, if a dataset receives feedback during review and is updated, it would be important to track and report dataset versions. You may even want to annotate some datasets as "deprecated" to indicate they should no longer be included in benchmarks, for example if the contributors or Polaris maintainers determine they no longer meet certain quality standards.
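
Purely for illustration (none of these fields exist in Polaris today), the version and deprecation metadata I'm imagining could be as small as:

```python
# Hypothetical dataset metadata for version tracking and deprecation (not current Polaris fields).
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetRevision:
    name: str
    version: str                               # e.g. "1.2.0", bumped when review feedback changes the data
    deprecated: bool = False                   # set when the dataset no longer meets quality standards
    deprecation_reason: Optional[str] = None   # why benchmarks should stop using it

example = DatasetRevision(name="example-adme-dataset", version="1.0.0")
```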

Moving this to a discussion sounds good.
