Add data quality control functions for dataset creators #162
Based on polaris-hub/polaris-recipes#11 and the linked curation notebook, I see that most or all of this may already exist in Auroris.
Hey @agitter, thank you for raising this issue! It's much appreciated and an important point.

I would love to hear your thoughts on these features. More generally, I think community feedback on such features is important, because these are complex questions that require interdisciplinary expertise. To make discussion easier, how do you feel about moving this conversation to a GitHub discussion in this repo?
Based on what I'm seeing in the example notebooks, a centralized review system for datasets could be appealing. That implies that datasets may be dynamic. For benchmarking purposes, if a dataset receives feedback during review and is updated, it would be important to track and report dataset versions. You may even want to annotate some datasets as "deprecated" if the contributors or Polaris maintainers determine they no longer meet certain quality standards, to indicate they should no longer be used in benchmarks.

Moving this to a discussion sounds good.
Is your feature request related to a problem? Please describe.
The DatasetFactory could have functions that support detecting the problems described in polaris-hub/polaris-recipes#11, such as duplicate items or duplicates with conflicting labels. It could even canonicalize SMILES, because a common problem is that SMILES can initially appear unique when they are created by different programs and then merged.

Describe the solution you'd like
Consider ways to improve the quality of newly contributed datasets to support third-party contributors.
Describe alternatives you've considered
Manually reviewing everything?
Additional context
The problems in polaris-hub/polaris-recipes#11 could be detected automatically.
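As a rough illustration of what such automatic detection could look like, here is a minimal sketch of a check for duplicate molecules with conflicting labels. The `canonicalize` helper is a hypothetical placeholder: a real implementation would call a cheminformatics toolkit such as RDKit (`Chem.CanonSmiles`) so that SMILES written differently by different programs collapse to a single key. This is not the actual Auroris or Polaris API.

```python
from collections import defaultdict

def canonicalize(smiles: str) -> str:
    # Placeholder for real canonicalization. In practice this would call
    # e.g. RDKit's Chem.CanonSmiles(smiles) so equivalent SMILES strings
    # produced by different programs map to the same canonical form.
    return smiles.strip()

def find_conflicts(records):
    """Group (smiles, label) records by canonical SMILES and return the
    entries whose duplicate labels disagree."""
    labels_by_mol = defaultdict(set)
    for smiles, label in records:
        labels_by_mol[canonicalize(smiles)].add(label)
    # Exact duplicates (same molecule, same label) collapse in the set and
    # are harmless; only molecules with more than one label are flagged.
    return {s: labels for s, labels in labels_by_mol.items() if len(labels) > 1}

records = [
    ("CCO", 1.2),
    ("CCO", 1.2),        # exact duplicate: not a conflict
    ("c1ccccc1", 0.5),
    ("c1ccccc1", 0.9),   # same molecule, conflicting labels
]
print(find_conflicts(records))  # {'c1ccccc1': {0.5, 0.9}}
```

A DatasetFactory-level check like this could run automatically at dataset creation time and either raise an error or prompt the contributor to resolve the conflicting labels (e.g. by averaging or discarding them).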