Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: change the validation logic for python_date_format #25510

Merged
merged 21 commits into from
Jan 23, 2024

Conversation

mapledan
Copy link
Contributor

@mapledan mapledan commented Oct 4, 2023

SUMMARY

The PUT /api/v1/dataset/{id} endpoint includes the python_date_format validation.
A PR #24113 optimizes the validation, but there's still an issue with the '%Y%m%d' format.

According to the DATETIME FORMAT description, "If the timestamp format does not adhere to the ISO 8601 standard, you will need to define an expression and type for transforming the string into a date or timestamp.". Therefore, the validation should be more flexible.

In this PR, I have modified the logic to use the strftime function for validating the format string.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

  1. Edit the dataest.
  2. Go to the COLUMNS tab.
  3. Choose the date string column.
  4. Edit the DATETIME FORMAT to '%Y%m%d'.
  5. Save the dataset.
  6. The column can serve as the temporal filter, and the SQL expression can format the datetime to a date string using the python_date_format.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@sfirke
Copy link
Member

sfirke commented Oct 4, 2023

Thanks for submitting this! ping @Antonio-RiveroMartnez and @jfrag1 for review, they worked on that PR you linked.

@john-bodley
Copy link
Member

Thanks @mapledan for the PR. Would you mind completing the PR description to include testing instructions? Additionally adding unit tests would be highly beneficial.

cc @betodealmeida @jfrag1

@pull-request-size pull-request-size bot added size/M and removed size/S labels Oct 5, 2023
@mapledan
Copy link
Contributor Author

mapledan commented Oct 5, 2023

Sure, I have added the testing instructions and included unit tests as well.

if value in ("epoch_s", "epoch_ms"):
return
try:
datetime.now().strftime(value or "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mapledan would you mind using the dateutil.parser.isoparse method instead?

Copy link
Contributor Author

@mapledan mapledan Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can use the dateutil.parser.isoparse method to make it support a more accurate ISO 8601 format than the original regular expression validation.
However, let me confirm again, does python_date_format only support the ISO 8601 format?
Because based on the description I read, it is written as follows: "If the timestamp format does not adhere to the ISO 8601 standard, you will need to define an expression and type for transforming the string into a date or timestamp."
That's why I chose to use strftime for validation to handle the user's defined expression.

Copy link
Member

@john-bodley john-bodley Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mapledan it should only support ISO 8601—in addition to epoch_s and epoch_ms. The reason being is ISO 8601 guarantees lexigraphical ordering when comparing strings.

This is mentioned throughout the code.

Copy link
Contributor Author

@mapledan mapledan Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed the preceding sentence: 'which needs to adhere to the ISO 8601 standard to ensure that the lexicographical ordering coincides with the chronological ordering.'

In my test case, the dateutil.parser.isoparse strictly follows the ISO 8601 standard, so it does not allow YYYY/MM/DD format.
This will impact PR (#24113) as it allows the slash format.
And I traced the python_date_format.
It is using by pandas.to_datetime in superset.utils.core.normalize_dttm_col.

Which should we use as the validation, thepandas.to_datetime method, regex like (?P<date>%Y([-/]?%m([-/]?%d)?)?)([\sT](?P<time>%H(:%M(:%S(\.%f)?)?)?))?, or the dateutil.parser.isoparse method?

Copy link
Member

@john-bodley john-bodley Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mapledan for flagging the other PR. Per the tooltip (see attached) it clearly states that the format should be ISO 8601, though the placeholder text uses /. Granted the / (in addition to -) does adhere to lexicographical ordering, but we should strictly adhere to the ISO 8601 standard for historical reasons.

Screenshot 2023-11-17 at 8 48 07 AM

@jfrag1 this PR reverts the logic you defined in #24113. I've also authored #26017 which updates the placeholder text which likely lead people astray.

@mapledan feel free to merge the logic from #26017 into your PR if you think it makes sense to have both changes combined.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we want to move towards strict ISO, we definitely need a migration to update existing python_date_format's that currently use /. Without a migration, any existing datasets with columns that use / will not be able to be updated without changing the python_date_format since they'll fail the validation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfrag1 a migration isn't possible as the string format is specified by the underly column in the dataset. Granted this change has the potential to churn users who were exposed to the relaxation in #24113, i.e., since 3.0 and thus they would need to update there dataset definition to use a SQL expression instead.

@michael-s-molina this is likely another great example of a breaking change and thus we likely need to wait until 4.0 to re-restrict the eligible formats.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@john-bodley #24113 didn't expose a new relaxation to users; that PR was a follow-up to #23678, which changed the endpoint used to update dataset definitions. The old endpoint did no validation at all for python_date_format, so there were people using slashes before that too (since the placeholder indicated to do so).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I'm not sure what you mean by "a migration isn't possible", can't we have a migration that updates the table that stores column definitions?

Copy link
Member

@john-bodley john-bodley Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfrag1 the column in question isn't in the metadata database it's in the underlying analytical database(s), i.e., Trino, Hive, etc. where the temporal format is defined and thus Superset merely reflects how the data is defined. It's difficult to estimate the impact of non-strict ISO 8601 validation, though since #24113 it only impacts newly registered datasets where the format is %Y/%m/%d[...]. Organizations likely have many registered temporal columns where the format is complete nonsense pre-#23678.

I think there's two option:

  1. Relax requiring the ISO 8601 standard to include /, i.e., what is currently implemented (for right or wrong)*. This is potentially risky as the ISO 8601 standard guarantees lexicographical ordering** whereas custom formatters don't.
  2. Wait until 4.0 and mention in UPDATING.md the (re)restriction that only ISO 8601 formats are acceptable (enforced by way of validation at the API and database level) and that dataset owners will need to use a SQL expression instead to convert their string columns of the form %Y/%m/%d etc. to a DATE, DATETIME, etc. type.

* Note due to lack of validation prior to #23678 (when we switched to the new RESTful API) there are likely organizations which have datasets which have a slew of formats in table_columns.python_datetime_format column which don't adhere to the current format and thus wont work as expected. I was able to confirm this in Airbnb's Superset metadata database. As part of 4.0 we should add a database level CHECK constraint for the table_columns.python_datetime_format column which will force admins and dataset owners to correct any violating temporal columns.

** Lexicographical ordering is essential when we're filtering temporal string columns via the >, >=, <, and <= operators.

This is a classic "shift left" problem where the underlying validation should also reside in the database layer. This is mentioned in [SIP-99C] Proposal for model and business validation and something that has been somewhat of an achilles heal for community over the years.

tests/unit_tests/datasets/schema_tests.py Outdated Show resolved Hide resolved
superset/datasets/schemas.py Show resolved Hide resolved
superset/datasets/schemas.py Outdated Show resolved Hide resolved
superset/datasets/schemas.py Outdated Show resolved Hide resolved
@john-bodley john-bodley added the hold! On hold label Nov 17, 2023
Copy link

codecov bot commented Nov 18, 2023

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (04445a3) 69.48% compared to head (daf90cb) 59.59%.

Files Patch % Lines
superset/daos/dataset.py 29.41% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #25510      +/-   ##
==========================================
- Coverage   69.48%   59.59%   -9.90%     
==========================================
  Files        1894     1894              
  Lines       74151    74173      +22     
  Branches     8243     8243              
==========================================
- Hits        51527    44200    -7327     
- Misses      20555    27904    +7349     
  Partials     2069     2069              
Flag Coverage Δ
hive 53.78% <31.03%> (-0.02%) ⬇️
javascript 56.86% <ø> (ø)
mysql ?
postgres ?
presto 53.73% <31.03%> (-0.02%) ⬇️
python 62.50% <58.62%> (-20.52%) ⬇️
sqlite ?
unit 56.40% <58.62%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@john-bodley
Copy link
Member

john-bodley commented Jan 18, 2024

@mapledan and @michael-s-molina I sense this PR should be resurrected and merged as part of the 4.0 breaking window. I realize that the proposal window has closed, but this is actually a (breaking) fix and thus I gather it's exempt from those rules.

@mapledan in addition to the current logic you've outlined here I sense the PR should be updated per (2) in #25510 (comment) to include:

  • Updating UPDATING.md. [required]
  • Adding a CHECK constraint (or similar) to the table_columns.python_datetime_format column to ensure that the specified format is either ISO 8601 compliant, epoch_ms, or epoch_s. [optional]

@john-bodley john-bodley added the v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch label Jan 18, 2024
@michael-s-molina
Copy link
Member

I sense this PR should be resurrected and merged as part of the 4.0 breaking window. I realize that the proposal window has closed, but this is actually a (breaking) fix and thus I gather it's exempt from those rules.

This missed the 4.0 breaking window and was not submitted for lazy consensus. Someone might have disagreed with the proposed change if it was submitted for lazy consensus. I think the correct process here is to add a card to Punted to 5.0 column so we don't miss this again in the next breaking window.

@michael-s-molina michael-s-molina removed the v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch label Jan 19, 2024
@sfirke
Copy link
Member

sfirke commented Jan 19, 2024

Maybe we could discuss at the townhall today? I feel like a bug fix should not be subject to as rigorous a community approval as the changes that were passed via lazy consensus and think perhaps we should be able to merge breaking bug fixes during this window.

Copy link
Member

@john-bodley john-bodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woot! LGTM. Thanks @mapledan for the fix and addressing the various requests.

@john-bodley john-bodley merged commit c2ab9bb into apache:master Jan 23, 2024
33 checks passed
@john-bodley john-bodley removed the hold! On hold label Jan 23, 2024
@mapledan mapledan deleted the fix/dataset_datetime_format branch January 24, 2024 02:17
rusackas pushed a commit that referenced this pull request Jan 24, 2024
Co-authored-by: John Bodley <john.bodley@gmail.com>
eschutho pushed a commit to preset-io/superset that referenced this pull request Jan 31, 2024
Co-authored-by: John Bodley <john.bodley@gmail.com>
sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024
Co-authored-by: John Bodley <john.bodley@gmail.com>
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 4.0.0 labels Apr 17, 2024
vinothkumar66 pushed a commit to vinothkumar66/superset that referenced this pull request Nov 11, 2024
Co-authored-by: John Bodley <john.bodley@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 4.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants