isDistinctValidation(_SeriesValidation) for a combination of columns i.e. as composite keys #2

Closed
diegoquintanav opened this issue Nov 10, 2017 · 8 comments

Comments

@diegoquintanav
Contributor

diegoquintanav commented Nov 10, 2017

Hi there.

Consider the following schema, with made-up column names:

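from pandas_schema import Column, Schema
from pandas_schema.validation import InRangeValidation, IsDistinctValidation

# Each column is validated independently of the others
schema = Schema([
    Column('year', [InRangeValidation(1900, 3000), IsDistinctValidation()]),
    Column('id', [IsDistinctValidation()])
])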

IsDistinctValidation works on top of the pandas Series.duplicated method:

    def validate(self, series: pd.Series) -> pd.Series:
        return ~series.duplicated(keep='first')

Since pandas also provides a duplicated method for DataFrames, would it be possible to define composite columns so that IsDistinctValidation() also checks combinations of columns? I am thinking of an additional parameter, such as columns, that takes a list of columns defined in the same schema and is passed to IsDistinctValidation().
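For illustration, the composite check described above can be sketched with plain pandas, using the `subset` argument of `DataFrame.duplicated`. This is only a sketch of the idea, not part of pandas_schema: the helper name `is_distinct_combination` and the sample data are assumptions made up for this example.

```python
import pandas as pd


def is_distinct_combination(df: pd.DataFrame, columns: list) -> pd.Series:
    """Hypothetical helper: True for rows whose combination of values in
    `columns` has not appeared in an earlier row."""
    # duplicated() marks every repeat after the first occurrence as True,
    # so negating it gives True for rows that pass the check.
    return ~df.duplicated(subset=columns, keep='first')


# Made-up sample data: the last row repeats the (year, id) pair of the first.
df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2000],
    'id': [1, 2, 1, 1],
})

mask = is_distinct_combination(df, ['year', 'id'])
print(df[~mask])  # rows that violate the composite-key constraint
```

Neither column is distinct on its own here, so the per-column IsDistinctValidation would flag rows that are perfectly valid under a composite key.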

What I do now is insert a temporary column of tuples built from the columns I want to check, i.e.

df.insert(loc=0, column='composite__year__id', value=list(zip(df.year, df.id)), allow_duplicates=False)

and then in the schema add the column as

Column('composite__year__id',[IsDistinctValidation()])

BTW nice job and thanks!

@diegoquintanav diegoquintanav changed the title isDistinctValida[tion(_SeriesValidation) for combination of columns i.e. as composite keys isDistinctValidation(_SeriesValidation) for combination of columns i.e. as composite keys Nov 10, 2017
@diegoquintanav diegoquintanav changed the title isDistinctValidation(_SeriesValidation) for combination of columns i.e. as composite keys isDistinctValidation(_SeriesValidation) for a combination of columns i.e. as composite keys Nov 10, 2017
@markusbaden

I'd be interested in something like this as well. In general it would be nice to have validation across columns. I'm not sure what the best way is to generalize the current schema, though, since it is centered on independent columns.

@TMiguelT have you got any suggestions?

@multimeric
Owner

Hmm. This seems like a useful validation to have. I'll have to think about how to handle DataFrame-level validations in terms of the interface.

@markusbaden

Another one we are using is something like "if col a has value x then col b needs to have a value in list c", so you would need some sort of constraint that works on the data frame itself. Something like SeriesValidation, but which accepts a DataFrame in validate.
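As a rough sketch of what such a DataFrame-level rule could compute, the conditional constraint above can be written as a boolean mask in plain pandas. The function name `validate_conditional`, its parameters, and the sample data are hypothetical and are not taken from pandas_schema.

```python
import pandas as pd


def validate_conditional(df: pd.DataFrame, if_col: str, if_value,
                         then_col: str, allowed: list) -> pd.Series:
    """Hypothetical rule: wherever df[if_col] == if_value, df[then_col]
    must hold one of the values in `allowed`. Returns True for valid rows."""
    applies = df[if_col] == if_value        # rows the rule is relevant to
    satisfied = df[then_col].isin(allowed)  # rows meeting the requirement
    # A row is valid if the rule does not apply to it, or if it is satisfied.
    return ~applies | satisfied


# Made-up sample data: the second row has a == 'x' but b outside the list.
df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [1, 5, 5]})

valid = validate_conditional(df, 'a', 'x', 'b', [1, 2, 3])
print(df[~valid])  # rows that break the rule
```

The function takes the whole DataFrame and returns one boolean per row, which mirrors the shape of the existing per-series validate method.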

@multimeric
Owner

Good point. There's probably a need for a generalised DataFrame-level validation.

@diegoquintanav
Contributor Author

(off-topic)
@TMiguelT are you expecting contributions? Perhaps a gitter chat?

@multimeric
Owner

I'm happy to have contributions for this or any other feature requests. I've commented on your other PR.

@quipa

quipa commented Jan 8, 2018

I am interested in this enhancement too. In my case I would be using it to check if a total count column is in fact equal to the total of several category count columns. Thanks!
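A minimal sketch of that totals check in plain pandas, assuming integer count columns. The helper `total_matches_parts` and the column names are invented for this example and are not part of pandas_schema.

```python
import pandas as pd


def total_matches_parts(df: pd.DataFrame, total_col: str,
                        part_cols: list) -> pd.Series:
    """Hypothetical check: True for rows where the category counts in
    `part_cols` add up to the value in `total_col`."""
    # Sum across the category columns row by row, then compare to the total.
    return df[part_cols].sum(axis=1) == df[total_col]


# Made-up sample data: the second row's total (6) does not equal 2 + 3.
df = pd.DataFrame({
    'cats': [1, 2],
    'dogs': [2, 3],
    'total': [3, 6],
})

ok = total_matches_parts(df, 'total', ['cats', 'dogs'])
print(df[~ok])  # rows whose total disagrees with the category counts
```

Exact equality is reasonable for integer counts; floating-point columns would need a tolerance-based comparison instead.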

@multimeric
Owner

Closing in favour of the more general #57 that I just opened.

4 participants