
[ISSUE] Logstream is flooded with _get_default_value called with key "table", but it is not a known field messages #10897

Open
jschra opened this issue Jan 29, 2025 · 4 comments

jschra commented Jan 29, 2025

Describe the bug
Whenever I run expectations over my data, my logs get flooded with _get_default_value called with key "table", but it is not a known field messages (see the screenshot below). For reference, I am using an in-memory pandas DataFrame (via the pandas data source), which I pass to a checkpoint with one expectation suite.

[Screenshot: log stream repeating the _get_default_value INFO message]

To Reproduce
Below is a toy example that reproduces the issue. Make sure the logging level is set to INFO:

import logging

import great_expectations as gx
import pandas as pd

# -- Set GX constants for artifact creation
NAME_DATA_SOURCE = "pandas"
NAME_DATA_ASSET = "tutorial_data"
NAME_BATCH_DEF = "pandas_tutorial"
NAME_EXPECTATION_SUITE = "pandas_tutorial"
NAME_VALIDATION_DEF = "pandas_validation"
NAME_CHECKPOINT = "pandas"

FILE_CONFIGURE = "data/yellow_tripdata_2021-11.csv"

formatter = logging.Formatter(
    ("%(levelname)-8s [ %(asctime)s - %(module)s.%(funcName)s : %(lineno)d ] %(message)s"),
    datefmt="%Y-%m-%d %H:%M:%S",
)
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
# -- basicConfig attaches a StreamHandler to the root logger (if none exists
# -- yet); grab it so the custom formatter can be applied
logging.basicConfig(level=logging.INFO)
handler = root_logger.handlers[0]
handler.setFormatter(formatter)


# -- Load data for configuration
df_configure = pd.read_csv(FILE_CONFIGURE)

# -- 1. Initialize GX for configuration & set up in-memory source
context = gx.get_context(mode="file")

data_source = context.data_sources.add_pandas(name=NAME_DATA_SOURCE)
data_asset = data_source.add_dataframe_asset(name=NAME_DATA_ASSET)
batch_definition = data_asset.add_batch_definition_whole_dataframe(NAME_BATCH_DEF)

# -- 2. Configure expectation suite to be called over runtime data later
expectation_suite = gx.ExpectationSuite(name=NAME_EXPECTATION_SUITE)
expectation_suite = context.suites.add(expectation_suite)

# -- 2.1. Define table level expectations
columns = list(df_configure.columns)
expectation = gx.expectations.ExpectTableColumnsToMatchSet(column_set=columns)
expectation_suite.add_expectation(expectation)

# -- 2.2. Define column level expectations
# -- 2.2.1. Ensure vendor ID is either 1 or 2
expected_values = [1, 2]
expectation = gx.expectations.ExpectColumnValuesToBeInSet(
    column="VendorID",
    value_set=expected_values,
)
expectation_suite.add_expectation(expectation)

# -- 2.2.2. Validate that all columns have non-null values
for column in columns:
    expectation = gx.expectations.ExpectColumnValuesToNotBeNull(column=column)
    expectation_suite.add_expectation(expectation)

# -- 2.2.3. Validate that pickup and dropoff datetimes are in the correct format
date_columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
DATE_PATTERN = (
    r"^(?:19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]) (?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$"
)
for date_column in date_columns:
    expectation = gx.expectations.ExpectColumnValuesToMatchRegex(
        column=date_column,
        regex=DATE_PATTERN,
    )
    expectation_suite.add_expectation(expectation)

# -- 2.2.4. Validate non-zero columns
numeric_columns = [
    "passenger_count",
    "trip_distance",
    "tip_amount",
]
for numeric_column in numeric_columns:
    expectation = gx.expectations.ExpectColumnValuesToBeBetween(
        column=numeric_column,
        min_value=0,
    )
    expectation_suite.add_expectation(expectation)

# -- 2.3. Evaluate results on test dataset
batch_parameters = {"dataframe": df_configure}
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
validation_results = batch.validate(expectation_suite)

# -- 3. Bundle suite and batch into validation definition and checkpoint w/ bundled
# --    actions for easy execution later
validation_definition = gx.ValidationDefinition(
    data=batch_definition,
    suite=expectation_suite,
    name=NAME_VALIDATION_DEF,
)
_ = context.validation_definitions.add(validation_definition)

action_list = [
    gx.checkpoint.UpdateDataDocsAction(
        name="update_all_data_docs",
    ),
]
checkpoint = gx.Checkpoint(
    name=NAME_CHECKPOINT,
    validation_definitions=[validation_definition],
    actions=action_list,
    result_format={
        "result_format": "COMPLETE",
    },
)
_ = context.checkpoints.add(checkpoint)

# -- 4. Run checkpoint to validate if everything works properly
file_identifier = FILE_CONFIGURE.split("/")[-1]
runid = gx.RunIdentifier(run_name=f"Configuration run - {file_identifier}")
results = checkpoint.run(batch_parameters=batch_parameters, run_id=runid)

Apart from the code that adjusts the root logger, this is exactly the code you can find here: https://github.com/jschra/joriktech/tree/main/data_testing_gx_1. You can use that repository for full reproducibility (after adding the logger snippet), since the data is stored there as well.

Logs you'll get when you run it:

[Screenshot: the same _get_default_value INFO message repeated many times in the log output]

Expected behavior
I would expect my log stream (set to INFO) not to be flooded over and over again with the same message:

INFO [ 2025-01-29 13:50:08 - expectation._get_default_value : 1161 ] _get_default_value called with key "table", but it is not a known field

Environment (please complete the following information):

  • Operating System: macOS
  • Great Expectations Version: 1.3.3
  • Data Source: pandas
  • Cloud environment: Not relevant

Additional context
None needed, I think.

adeola-ak (Contributor) commented
thanks for reporting this and including the steps for reproduction! this seems to have also been previously reported by another user. I will be sure to share this with the team and follow up with you

adeola-ak (Contributor) commented
can you try to reset logging.basicConfig by removing all existing handlers and then calling basicConfig() again? A user has reported that this resolved the issue for them
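
A minimal sketch of what that reset could look like, assuming plain INFO logging is all you need afterwards (on Python 3.8+, basicConfig(force=True) does the handler removal for you):

import logging

root_logger = logging.getLogger()
# -- Drop every handler currently attached to the root logger
for handler in list(root_logger.handlers):
    root_logger.removeHandler(handler)

# -- Reconfigure from scratch; force=True (Python 3.8+) would perform the
# -- handler removal above automatically
logging.basicConfig(level=logging.INFO)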

jschra (Author) commented Feb 13, 2025

can you try to reset logging.basicConfig by removing all existing handlers and then calling basicConfig() again? A user has reported that this resolved the issue for them

Hi Adeola,

Thanks for your response. Although that approach might fix it, it is a very brute-force way to adjust the logging and might have unwanted side effects (e.g., also clearing preconfigured loggers and/or handlers that you would like to leave untouched). Also, it is not exactly a fix, as it merely silences the logs, whereas I would argue it is best to either adjust or remove this logging statement. For your reference, it is found in great_expectations/expectations/expectation.py at line 1211.
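
For illustration, the statement presumably has roughly the following shape (a hypothetical reconstruction from the emitted message, not the actual GX source); downgrading it from logger.info to logger.debug would already stop the flood at the INFO level:

# -- Hypothetical shape of the statement in expectation.py; the actual
# -- source may differ
logger.info(f'_get_default_value called with key "{key}", but it is not a known field')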

In any case, based on your suggestion I tried to figure out a more pinpointed way to disable these logs, and the following does work:

expectations_logger = logging.getLogger("great_expectations.expectations.expectation")
expectations_logger.setLevel(logging.CRITICAL)

However, this effectively disables all logs from this module, which again is at best a hotfix. So in my opinion it would be better to adjust or remove this logging statement altogether.
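
An even more targeted alternative (a sketch using only the standard library, assuming the message text stays stable) is a filter that drops just this one message while leaving the rest of the module's logs intact:

import logging

class DropDefaultValueNoise(logging.Filter):
    """Drop only the '_get_default_value ... not a known field' records."""

    def filter(self, record: logging.LogRecord) -> bool:
        # -- Return True to keep a record, False to discard it
        return "_get_default_value called with key" not in record.getMessage()

expectations_logger = logging.getLogger("great_expectations.expectations.expectation")
expectations_logger.addFilter(DropDefaultValueNoise())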

adeola-ak (Contributor) commented
hi @jschra, I completely agree with you. I should have also mentioned that your report has been shared with the team; I'd suggest trying the above in the meantime, until we remove the logging statement. I hope to have an update for you soon that it has been properly removed. Thanks!
