Add unique keyword option to all schemas and schema components #390
Comments
Hi, is this feature still something you would like and be open to a PR for, @cosmicBboy? I've done this manually quite a bit and it'd be nice to have it built in :) At the dataframe level, would you allow for a list of lists? Could one assume that the object will always be a pandas object and thus have access to dataframe/series-level attributes? Any other considerations?
Hi @fkrull8! Yes, a PR would be very much appreciated 🙂
The list of lists would be for the case of specifying multiple sets of columns that would each be considered unique? I think that's a good idea! Feel free to add support for that to your PR. We'll probably want to keep the Re: dataframe-level If you haven't already, check out the contributing page, and let me know if you have any questions!
Would love to see this feature! Definitely needed for a Hypothesis strategy. In the meantime I'll see if I can figure out a composite strategy or custom check.
Hi @cosmicBboy, apologies, I'm just getting around to this. I have the dataframe-level kwarg added and tested. However, adding the unique kwarg to the series object has proven to be more challenging. I'm not sure I entirely understand how the assignment works for the SeriesBaseModel, as my implementation is failing two tests that check that copies of the object are being made. What would be the best way to proceed? Should I submit the PR with the broken tests? If so, and if you could point me in the right direction, I could then resubmit with the corrections.
Hey @fkrull8, thanks for your efforts on this!
Yes, please create a PR against the
Fixed by #580
Is your feature request related to a problem? Please describe.
Currently, the allow_duplicates option enforces uniqueness if False, but it's only available in the Column schema component. Because value uniqueness is a fundamental data quality check and also has implications for the data synthesis strategies for a schema, it would make sense to (i) deprecate allow_duplicates in Columns and replace it with unique and (ii) add the unique option to all schemas and schema components.
Describe the solution you'd like
For Columns and Indexes, this can simply be True or False. However, at the dataframe-level, it would make sense to also accept List[str] representing the columns whose combined values should be unique.
Describe alternatives you've considered
The alternative, as described in #386 (comment), would be to have a built-in Check for uniqueness. This would be fine, except for the fact that it would complicate the data synthesis strategy for this validation check.
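To make the proposed dataframe-level semantics concrete, here is a minimal sketch of how a unique option could be interpreted when it is True/False, a list of column names, or (as raised in the comments) a list of lists. This is not pandera's implementation: normalize_unique and duplicated_rows are hypothetical helper names, and the table is a plain dict of column lists rather than a pandas object so the snippet runs without dependencies.

```python
from typing import Dict, List, Sequence, Union

# Hypothetical stand-in for a dataframe: column name -> equal-length list of values.
Table = Dict[str, list]
UniqueSpec = Union[bool, Sequence[str], Sequence[Sequence[str]]]


def normalize_unique(unique: UniqueSpec, columns: Sequence[str]) -> List[List[str]]:
    """Turn the `unique` option into a list of column sets to check."""
    if unique is True:
        # Assumption: True at the dataframe level means "all columns combined".
        return [list(columns)]
    if not unique:
        # False, None, or an empty list: nothing to check.
        return []
    unique = list(unique)
    if all(isinstance(item, str) for item in unique):
        # A flat list of names is a single combined-uniqueness constraint.
        return [unique]
    # A list of lists is several independent constraints.
    return [list(group) for group in unique]


def duplicated_rows(table: Table, unique: UniqueSpec) -> Dict[tuple, List[int]]:
    """Map each violated column set to the row positions that repeat an earlier row."""
    failures: Dict[tuple, List[int]] = {}
    n_rows = len(next(iter(table.values()), []))
    for group in normalize_unique(unique, list(table)):
        seen = set()
        repeats = []
        for i in range(n_rows):
            key = tuple(table[col][i] for col in group)
            if key in seen:
                repeats.append(i)
            seen.add(key)
        if repeats:
            failures[tuple(group)] = repeats
    return failures


table = {"a": [1, 1, 2], "b": ["x", "y", "x"], "c": [0, 0, 0]}
print(duplicated_rows(table, ["a", "b"]))           # {}
print(duplicated_rows(table, [["a"], ["b", "c"]]))  # {('a',): [1], ('b', 'c'): [2]}
```

With real pandas objects, the inner loop would more naturally be replaced by DataFrame.duplicated(subset=...), which flags rows that repeat an earlier row for the given columns.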