Enable Custom Analyzer in python by implementing pydeequ.analyzers._AnalyzerObject #186

Open
WiktorMadejski opened this issue Jan 6, 2024 · 0 comments
Labels: feature request

WiktorMadejski commented Jan 6, 2024

Is your feature request related to a problem? Please describe.
I would like PyDeequ to support extending pydeequ.analyzers._AnalyzerObject to define a custom Analyzer in Python, and to include an example showing how to do it.

Describe the solution you'd like
I would like to be able to implement something like:

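from typing import Optional

from pydeequ.analyzers import _AnalyzerObject


class MyCustomAnalyzer(_AnalyzerObject):
    """Compute a custom metric over a column."""

    def __init__(self, column: str, my_property: Optional[str] = None):
        """
        :param str column: column to analyze.
        :param str my_property: custom property
        """
        self.column = column
        self.my_property = my_property

    # Proposed hook, written as a plain method because a property cannot take arguments.
    # AnalyzerInput and AnalyzerOutput are placeholder types that PyDeequ would need to define.
    def _analyzer_jvm(self, foo: "AnalyzerInput") -> "AnalyzerOutput":
        # Custom transformation from a well-defined AnalyzerInput to an AnalyzerOutput.
        bar: "AnalyzerOutput" = ...
        return bar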

and then run it in a VerificationSuite, e.g.:

from pydeequ.anomaly_detection import OnlineNormalStrategy
from pydeequ.repository import ResultKey
from pydeequ.verification import VerificationSuite

results = (
    VerificationSuite(spark)
    .onData(df)
    .useRepository(repository)
    .saveOrAppendResult(ResultKey(spark, ResultKey.current_milli_time(), {'tag': 'my-tag'}))
    .addAnomalyCheck(
        OnlineNormalStrategy(
            lowerDeviationFactor=0.01,
            upperDeviationFactor=0.01,
            ignoreStartPercentage=0.1,
            ignoreAnomalies=False,
        ),
        MyCustomAnalyzer("column_name", my_property="yeey!"),
    )
    .run()
)

Describe alternatives you've considered
When computing anomalies, every time I have a custom metric (for concreteness, say Sum() / CountDistinct()) I build a temporary table with a single row, e.g.:

| value_unique_name                  |
|------------------------------------|
| <value of Sum() / CountDistinct()> |

and then run the anomaly check over pydeequ.analyzers.Sum (or Mean, i.e. any analyzer that acts as the identity on a single row). It is best to keep these custom metrics in a PyDeequ metrics repository separate from the one used for the source table.
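To see why Sum or Mean works as the pass-through analyzer in this workaround: over a one-row table, both return the single stored value unchanged. Below is a minimal plain-Python sketch of that idea, with no Spark or PyDeequ involved. The helper names (custom_metric, one_row_table) and the sample data are hypothetical and only mimic the shape of the workaround.

```python
from statistics import mean


def custom_metric(values):
    """Hypothetical custom metric: Sum() / CountDistinct() of a column."""
    return sum(values) / len(set(values))


def one_row_table(metric_value, name="value_unique_name"):
    """Mimic the temporary one-row table that holds the precomputed metric."""
    return [{name: metric_value}]


# Sample column values, made up for illustration.
column = [4.0, 4.0, 6.0, 10.0]
table = one_row_table(custom_metric(column))

# Over a single row, Sum and Mean both return the stored value unchanged,
# so either can carry the custom metric into an anomaly check.
stored = [row["value_unique_name"] for row in table]
assert sum(stored) == mean(stored) == custom_metric(column)
```

In the real workaround the one-row table would be a Spark DataFrame and the pass-through would be pydeequ.analyzers.Sum or Mean. The sketch only illustrates the identity property that makes the trick work.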

Additional context
If anybody has hacked this in a better way than the one described under "Describe alternatives you've considered", let us know in the comments!

WiktorMadejski changed the title from "Enable custom Analyzer in python by extending pydeequ.analyzers._AnalyzerObject" to "Enable custom Analyzer in python by implementing pydeequ.analyzers._AnalyzerObject" on Jan 6, 2024
WiktorMadejski changed the title from "Enable custom Analyzer in python by implementing pydeequ.analyzers._AnalyzerObject" to "Enable Custom Analyzer in python by implementing pydeequ.analyzers._AnalyzerObject" on Jan 6, 2024
chenliu0831 added the "feature request" label on Jan 8, 2024