Hi,
Is it possible to pass a custom registered check name and a custom error message to the SchemaError / the dictionary containing the errors, for the pandera.pyspark implementation? Currently it returns the check name as "None" and the error as "Failed Validation None".
The current implementation makes it impossible to map errors back to registered custom checks, and thus to tell which check actually failed. This effectively makes custom checks useless, as you need to verify one by one (manually) which check failed.
Example:
```python
import pandera.pyspark as pa
import pyspark.sql.types as T

from decimal import Decimal

from pandera.extensions import register_check_method
from pandera.pyspark import DataFrameModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Generate dataset
data = [
    (5, "Bread", Decimal("44.4"), ["description of product"], {"product_category": "dairy"}),
    (15, "Butter", Decimal("99.0"), ["more details here"], {"product_category": "bakery"}),
]

spark_schema = T.StructType(
    [
        T.StructField("id", T.IntegerType(), False),
        T.StructField("product", T.StringType(), False),
        T.StructField("price", T.DecimalType(20, 5), False),
        T.StructField("description", T.ArrayType(T.StringType(), False), False),
        T.StructField("meta", T.MapType(T.StringType(), T.StringType(), False), False),
    ],
)

df = spark.createDataFrame(data, spark_schema)

# Register custom check
@register_check_method(statistics=["col"])
def new_pyspark3(pyspark_obj, *, col) -> bool:
    return pyspark_obj.dataframe.select(col).count() > 4

class Schema(DataFrameModel):
    """Schema"""

    product: T.StringType()
    price: T.DecimalType(20, 5) = pa.Field(new_pyspark3={"col": "price"})
```
Running `sdf_out = Schema.validate(df, lazy=False)` raises the SchemaError shown in the attached image. With lazy validation, the error dictionary can be inspected:
```python
sdf_out = Schema.validate(df, lazy=True)
sdf_out.pandera.errors
```
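For reference, the resulting dictionary looks roughly like this (an illustrative sketch reconstructed from the behaviour described above; the surrounding keys follow pandera.pyspark's error-report layout):

```python
# Illustrative sketch only: approximate contents of sdf_out.pandera.errors.
# The custom check is reported with check=None and a generic message, so the
# failure cannot be traced back to new_pyspark3.
{
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "Schema",
                "column": "price",
                "check": None,
                "error": "Failed Validation None",
            }
        ]
    }
}
```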
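What one would hope for is a way to attach an explicit name and error message to the registered check. Below is a minimal sketch of that idea using the object-based API, assuming the registered method forwards `Check` keyword arguments such as `error=` the way the pandas backend does (untested here, so this is an assumption, not a confirmed pandera.pyspark feature):

```python
# Hypothetical workaround sketch, not verified against pandera.pyspark:
# registered check methods become available on pa.Check, and in the pandas
# backend Check itself accepts an `error=` message.
schema = pa.DataFrameSchema(
    {
        "price": pa.Column(
            T.DecimalType(20, 5),
            checks=pa.Check.new_pyspark3(
                col="price",
                error="price must have more than 4 rows",  # custom error text
            ),
        ),
    }
)

out = schema.validate(df, lazy=True)
out.pandera.errors  # ideally this would now carry the custom error message
```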