Multi index dataframes with datetime level cannot drop invalid rows #1469

Open
rorymcstay opened this issue Jan 24, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@rorymcstay

Describe the bug

  • [*] I have checked that this issue has not already been reported.
  • [*] I have confirmed this bug exists on the latest version of pandera.
  • [*] (optional) I have confirmed this bug exists on the master branch of pandera.


Code Sample, a copy-pastable example

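drop_invalid_rows uses a raw eval to interpret index tuples that have been serialised as strings. This fails when an index level holds datetime values, because they are serialised as Timestamp(...) in the string and the name Timestamp is not defined in the scope where eval runs.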

Traceback (most recent call last):
  File "/opt/atlassian/pipelines/agent/build/instrument-svc/instrument_svc/test/downloader_persistence/test_arctic_cache.py", line 265, in test_validate_views
    cli.validate(start_date=start_date, end_date=end_date,
  File "/opt/atlassian/pipelines/agent/build/instrument-svc/instrument_svc/downloader/cli.py", line 188, in validate
    good_data, report = view.validate(
  File "/opt/atlassian/pipelines/agent/build/instrument-svc/instrument_svc/downloader/view.py", line 303, in validate
    validated_df = schema.validate(data, lazy=True, inplace=True)
  File "/opt/atlassian/pipelines/agent/build/venv/lib/python3.8/site-packages/pandera/api/pandas/model.py", line 306, in validate
    cls.to_schema().validate(
  File "/opt/atlassian/pipelines/agent/build/venv/lib/python3.8/site-packages/pandera/api/pandas/container.py", line 375, in validate
    return self._validate(
  File "/opt/atlassian/pipelines/agent/build/venv/lib/python3.8/site-packages/pandera/api/pandas/container.py", line 404, in _validate
    return self.get_backend(check_obj).validate(
  File "/opt/atlassian/pipelines/agent/build/venv/lib/python3.8/site-packages/pandera/backends/pandas/container.py", line 112, in validate
    check_obj = self.drop_invalid_rows(check_obj, error_handler)
  File "/opt/atlassian/pipelines/agent/build/venv/lib/python3.8/site-packages/pandera/backends/pandas/base.py", line 170, in drop_invalid_rows
    index_tuples = err.failure_cases["index"].apply(eval)
  File "/opt/atlassian/pipelines/agent/build/venv/lib/python3.8/site-packages/pandas/core/series.py", line 4771, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "/opt/atlassian/pipelines/agent/build/venv/lib/python3.8/site-packages/pandas/core/apply.py", line 1123, in apply
    return self.apply_standard()
  File "/opt/atlassian/pipelines/agent/build/venv/lib/python3.8/site-packages/pandas/core/apply.py", line 1174, in apply_standard
    mapped = lib.map_infer(
  File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
  File "<string>", line 1, in <module>
NameError: name 'Timestamp' is not defined

An easy workaround would be to pass {'Timestamp': pd.Timestamp} as the globals argument to eval.
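For illustration, here is a minimal sketch of the failure and of the workaround, using pandas only. It assumes the failing index value is stored as the string form of the index tuple, which is what the traceback suggests; the frame, the column name, and the variable names are hypothetical and are not taken from pandera.

```python
import pandas as pd

# Hypothetical data: a MultiIndex whose first level is a datetime.
index = pd.MultiIndex.from_tuples(
    [(pd.Timestamp("2024-01-01"), "a"), (pd.Timestamp("2024-01-02"), "b")],
    names=["date", "symbol"],
)
df = pd.DataFrame({"price": [1.0, -1.0]}, index=index)

# Assumption: the index of a failing row is kept as the string form of its
# tuple, so the datetime level appears as "Timestamp('...')".
serialised = str(index[1])

# A bare eval cannot resolve the name Timestamp, matching the NameError above.
try:
    eval(serialised)
    bare_eval_error = None
except NameError as exc:
    bare_eval_error = exc

# Workaround: supply Timestamp through the globals mapping passed to eval.
restored = eval(serialised, {"Timestamp": pd.Timestamp})

# With the tuple restored, the invalid row can be removed from the frame.
cleaned = df[~df.index.isin([restored])]
```

Note that eval on stored strings executes arbitrary expressions, so this only narrows the problem; a fix that avoids serialising the index to strings in the first place would be more robust.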

Expected behavior

Invalid rows are dropped


@rorymcstay rorymcstay added the bug Something isn't working label Jan 24, 2024
@rorymcstay
Author

I have a fix for this and will raise a PR today.

@rorymcstay
Author

It would also be useful to have the SchemaErrors object in the case that there are invalid rows, and we opt to drop them.

I have a workflow whereby I run a validation, dropping invalid rows, and persist the error report of the invalid observations. This way I can track data quality over time.

I can add this after, or as part of, the fix for this issue.

@cosmicBboy
Collaborator

Sounds good! Were you thinking the errors object would be in the DataFrame.pandera accessor?

rorymcstay pushed a commit to rorymcstay/pandera that referenced this issue Aug 15, 2024
…ai-oss#1469

Signed-off-by: Rory <rory@rorymcstay.com>
Signed-off-by: Rory McStay <rory.mcstay@fulcrumasset.com>