Add data quality control functions for dataset creators #162

Open
agitter opened this issue Jul 28, 2024 · 3 comments
Labels
feature Annotates any PR that adds new features; Used in the release process

Comments

agitter commented Jul 28, 2024

Is your feature request related to a problem? Please describe.

The DatasetFactory could have functions that support detecting the problems in polaris-hub/polaris-recipes#11, such as duplicate items or duplicates with conflicting labels. It could even canonicalize SMILES, because a common problem is that SMILES strings produced by different programs can look unique even when they describe the same molecule, so duplicates slip through when datasets are merged.
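
As a rough illustration of what I mean (this is not an existing DatasetFactory function; `find_duplicate_smiles` is just a hypothetical helper), RDKit canonicalization is enough to surface duplicates that only differ in how the SMILES string was written:

```python
# Sketch of the kind of check I have in mind (not an existing DatasetFactory API):
# canonicalize SMILES with RDKit, then flag entries that collapse to the same molecule.
from collections import defaultdict
from rdkit import Chem

def find_duplicate_smiles(smiles_list):
    """Group input SMILES by their RDKit canonical form and return groups with >1 entry."""
    groups = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            # Unparseable SMILES are themselves a data quality problem worth reporting.
            groups["<invalid>"].append(smi)
            continue
        groups[Chem.MolToSmiles(mol)].append(smi)
    return {canonical: originals for canonical, originals in groups.items() if len(originals) > 1}

# Example: two surface forms of ethanol written by different programs.
print(find_duplicate_smiles(["CCO", "OCC", "c1ccccc1"]))
# {'CCO': ['CCO', 'OCC']}
```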

Describe the solution you'd like

Consider ways to improve the quality of newly contributed datasets to support third-party contributors.

Describe alternatives you've considered

Manually reviewing everything?

Additional context

The problems in polaris-hub/polaris-recipes#11 could be detected automatically.

agitter added the feature label Jul 28, 2024

agitter commented Jul 29, 2024

Based on polaris-hub/polaris-recipes#11 and the linked curation notebook, I see that most or all of this may already exist in Auroris.

cwognum commented Jul 31, 2024

Hey @agitter , thank you for raising this issue! It's much appreciated and an important point.

The auroris package is indeed one way we're trying to improve dataset quality, by simplifying the technical side of data curation. I do think there is more we can do. For example:

  • Automated (sanity) checks on dataset upload (see the sketch after this list for the kind of issue such a check could catch).
  • For decisions that are not easy to automate, we could have a review system where datasets can be submitted for review.
  • Such a centralized review system won't scale, so we could also have a distributed, user-driven feedback system (e.g. a commenting feature) such that users have the autonomy to raise issues as you are doing.
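
For the first bullet, here is a minimal sketch of the kind of check I have in mind. It is hypothetical and not part of the hub's current upload pipeline, and it assumes the uploaded dataset arrives as a pandas DataFrame with a SMILES column and a single label column:

```python
# Hypothetical upload-time sanity check: flag molecules that appear more than once
# with conflicting labels, so a human can review them before the dataset is accepted.
import pandas as pd

def conflicting_label_report(df: pd.DataFrame, smiles_col: str, label_col: str) -> pd.DataFrame:
    """Return rows whose SMILES occurs multiple times with more than one distinct label."""
    label_counts = df.groupby(smiles_col)[label_col].nunique()
    conflicting = label_counts[label_counts > 1].index
    return df[df[smiles_col].isin(conflicting)].sort_values(smiles_col)

# Example usage with made-up data: the two CCO rows would be flagged for review.
data = pd.DataFrame({"smiles": ["CCO", "CCO", "CCN"], "activity": [1, 0, 1]})
print(conflicting_label_report(data, "smiles", "activity"))
```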

I would love to hear your thoughts on these features. More generally, I think community feedback on such features is important, because these are complex questions that require interdisciplinary expertise.

To make discussion easier, how do you feel about moving this conversation to a GitHub discussion in this repo?

agitter commented Aug 4, 2024

Based on what I'm seeing in the example notebooks, I think that the role of auroris can be to provide automated quality checks to help catch things for manual review when a user creates a dataset or a community member reviews a dataset. It shouldn't try to automate all of the data quality control though. That is best left to domain experts because the same criteria may not apply to every dataset. For instance, I could be convinced that for some datasets it is okay to have duplicate molecules with conflicting labels because that is reflective of experimental uncertainty.

A centralized review system for datasets could be appealing. That implies that datasets may be dynamic. For benchmarking purposes, if a dataset receives feedback during review and is updated, it would be important to track and report dataset versions. You may even want to annotate some datasets as "deprecated" to indicate they should no longer be included in benchmarks, for example if the contributors or Polaris maintainers determine they no longer meet certain quality standards.
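
Purely for illustration (none of these fields exist in Polaris today), the version and deprecation metadata I'm imagining could be as small as:

```python
# Hypothetical dataset metadata for version tracking and deprecation (not current Polaris fields).
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetRevision:
    name: str
    version: str                               # e.g. "1.2.0", bumped when review feedback changes the data
    deprecated: bool = False                   # set when the dataset no longer meets quality standards
    deprecation_reason: Optional[str] = None   # why benchmarks should stop using it

example = DatasetRevision(name="example-adme-dataset", version="1.0.0")
```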

Moving this to a discussion sounds good.
