What do you think of cuallee providing Check suggestions? #298
Replies: 1 comment
-
Hi @ScottWilliamAnderson, thanks for opening a discussion on Check suggestions.
I think creating suggestions based on these predefined rules shouldn't be that difficult. I think this falls more into a sort of profiling. For small datasets this could be a good option; for large datasets I feel it is killing flies with a shotgun 🪰 🔫. In any case, I think the implementation for a profiler should be trivial, as the majority of the predicates for types, cardinality, date algebra and numeric operations are already available.
In terms of what is available:
Control Example
    from cuallee import Control
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)
    Control.percentage_fill(df)
Check Example
    from cuallee import Check
    check = Check().is_in("label", ["S", "M", "L", "XL"], pct=0.9).has_cardinality("label", 4)
    check.rules
    ---
    [Rule(method:is_contained_in, column:label, value:('S', 'M', 'L', 'XL'), data_type:CheckDataType.AGNOSTIC, coverage:0.9, status:None,
    Rule(method:has_cardinality, column:label, value:4, data_type:CheckDataType.AGNOSTIC, coverage:1.0, status:None]
B.T.W. thanks for the PR with the corrections! We need to renew our Snowflake account for testing; once we get it done, we will merge it.
Does this help to answer your question? Or would you prefer to actually move on with a sort of suggestion-based approach?
Regards,
-
Hi,
Very happy to have come across cuallee, it's very underrated! I agree with your opinions of Deequ :)
The one thing Deequ has which we find very useful is automatic constraint (Check) suggestions given a dataset.
Using PyDeequ as a reference, as I think it's the closer library to cuallee:
https://github.com/awslabs/python-deequ/blob/master/tutorials/suggestions.ipynb
Starting from a basic set of default checks could be an idea, though I was curious if you had spent any time thinking about these automated constraint suggestions?
To give you some more context, a use case we actually have for this is testing datasets that are generated or delivered on a recurring basis. Given one instance of such a tabular dataset (likely a pandas DataFrame, though I very much appreciate how many other data stores are supported!), a set of Checks could be suggested that the following iterations of the same type of data are validated against, without the need to either pick out the Checks by hand or write the custom SQL ones.
N.B. The automated SQL checks are likely very difficult and if implemented in an automated way, they're probably best converted to regular Check functions.
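To make the idea a bit more concrete, here is a rough sketch of what such a suggestion step might look like. It is only an illustration: `suggest_checks` and `max_categories` are hypothetical names, the heuristics are my own assumptions, and the output is plain tuples whose rule names are meant to echo Check methods such as `is_in` and `has_cardinality` rather than actual cuallee calls.

```python
def suggest_checks(rows, max_categories=10):
    """Profile a list of dict records and propose (rule, column, value) tuples.

    Hypothetical helper: the rule names only mirror Check method names,
    nothing here calls cuallee.
    """
    suggestions = []
    columns = sorted({key for row in rows for key in row})
    for column in columns:
        values = [row.get(column) for row in rows]
        present = [v for v in values if v is not None]
        # No missing values observed: suggest a completeness check.
        if len(present) == len(values):
            suggestions.append(("is_complete", column, None))
        distinct = set(present)
        if present and len(distinct) == len(present):
            # Every observed value is different: suggest uniqueness.
            suggestions.append(("is_unique", column, None))
        elif (
            distinct
            and len(distinct) <= max_categories
            and all(isinstance(v, str) for v in distinct)
        ):
            # Few distinct strings: treat the column as categorical.
            suggestions.append(("is_in", column, tuple(sorted(distinct))))
            suggestions.append(("has_cardinality", column, len(distinct)))
    return suggestions


# One delivered instance of the recurring dataset (made-up sample data).
sample = [
    {"id": 1, "label": "S"},
    {"id": 2, "label": "M"},
    {"id": 3, "label": "L"},
    {"id": 4, "label": "XL"},
    {"id": 5, "label": "M"},
]

for suggestion in suggest_checks(sample):
    print(suggestion)
```

The suggested tuples could then be reviewed by a person and translated into a real Check for the next deliveries, which is the part where the existing predicates would do the actual work.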
Perhaps I may have missed some kind of similar functionality that can achieve the same goal, happy to hear your thoughts!
Best
Scott