Enable Custom Analyzer in python by implementing pydeequ.analyzers._AnalyzerObject #186

WiktorMadejski · 2024-01-06T14:23:04Z

Is your feature request related to a problem? Please describe.
I would like pydeequ to support extending pydeequ.analyzers._AnalyzerObject to define a custom Analyzer in Python, and to show an example of how to do it.

Describe the solution you'd like
Be able to implement:

class MyCustomAnalyzer(_AnalyzerObject):
    """Get the maximum of a numeric column."""

    def __init__(self, column, my_property: str = None):
        """
        :param str column: column to find the maximum.
        :param str my_property: custom property
        """
        self.column = column
        self.my_property = my_property

    @property
    def _analyzer_jvm(self, foo: AnalyzerInput) -> AnalyzerOutput:
        # My custom transformation from a well-defined AnalyzerInput to an AnalyzerOutput
        bar: AnalyzerOutput = ...
        return bar

and then run it in a VerificationSuite, e.g.:

results = (VerificationSuite(spark)
           .onData(df)
           .useRepository(repository)
           .saveOrAppendResult(ResultKey(spark, ResultKey.current_milli_time(), {'tag': 'my-tag'}))
           .addAnomalyCheck(
               OnlineNormalStrategy(
                   lowerDeviationFactor=0.01,
                   upperDeviationFactor=0.01,
                   ignoreStartPercentage=0.1,
                   ignoreAnomalies=False,
               ),
               MyCustomAnalyzer("column_name", my_property="yeey!"),
           )
           .run())

Describe alternatives you've considered
When calculating anomalies, every time I have a custom metric (for concreteness, let's say Sum() / CountDistinct()), I build a temporary table that has one row, e.g.:

| value_unique_name                  |
|------------------------------------|
| <value of Sum() / CountDistinct()> |

and then run the anomaly check over pydeequ.analyzers.Sum (or Mean, i.e. an analyzer that returns the value unchanged when there is only one row). It's best if those custom metrics have a pydeequ metrics repository separate from the one used for the source table.
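
A minimal sketch of this workaround, under stated assumptions: `sum_over_count_distinct` and `check_custom_metric` are hypothetical helper names, the metric is computed here on a plain Python list for illustration (in practice it would more likely come from Spark aggregations on the source DataFrame), and `spark` and `repository` are assumed to be an existing SparkSession and a pydeequ metrics repository dedicated to the custom metric.

```python
from typing import Iterable


def sum_over_count_distinct(values: Iterable[float]) -> float:
    """Illustrative custom metric: Sum() / CountDistinct() of a column."""
    vals = list(values)
    distinct = len(set(vals))
    if distinct == 0:
        raise ValueError("cannot compute the metric on an empty column")
    return sum(vals) / distinct


def check_custom_metric(spark, repository, metric_value: float, tags: dict):
    """Run an anomaly check on a precomputed metric via a one-row DataFrame."""
    # Imported lazily so the pure-Python helper above works without Spark/pydeequ.
    from pydeequ.analyzers import Sum
    from pydeequ.anomaly_detection import OnlineNormalStrategy
    from pydeequ.repository import ResultKey
    from pydeequ.verification import VerificationSuite

    # One row, one column: Sum over it returns the metric value unchanged.
    one_row_df = spark.createDataFrame(
        [(float(metric_value),)], ["value_unique_name"]
    )
    key = ResultKey(spark, ResultKey.current_milli_time(), tags)
    strategy = OnlineNormalStrategy(
        lowerDeviationFactor=0.01,
        upperDeviationFactor=0.01,
        ignoreStartPercentage=0.1,
        ignoreAnomalies=False,
    )
    return (VerificationSuite(spark)
            .onData(one_row_df)
            .useRepository(repository)  # separate repository for the custom metric
            .saveOrAppendResult(key)
            .addAnomalyCheck(strategy, Sum("value_unique_name"))
            .run())
```

Each run appends one data point for the custom metric to the repository, so the anomaly strategy compares the new value against the history of earlier runs rather than against the source table's own metrics.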

Additional context
If anybody has hacked this in a better way than the one described under "Describe alternatives you've considered", let us know in the comments!


WiktorMadejski changed the title ~~Enable custom Analyzer in python by extending pydeequ.analyzers._AnalyzerObject to~~ Enable custom Analyzer in python by implementing pydeequ.analyzers._AnalyzerObject Jan 6, 2024

WiktorMadejski changed the title ~~Enable custom Analyzer in python by implementing pydeequ.analyzers._AnalyzerObject~~ Enable Custom Analyzer in python by implementing pydeequ.analyzers._AnalyzerObject Jan 6, 2024

chenliu0831 added the "feature request" label Jan 8, 2024
