fix: change the validation logic for python_date_format #25510
Thanks for submitting this! ping @Antonio-RiveroMartnez and @jfrag1 for review, they worked on that PR you linked.
Thanks @mapledan for the PR. Would you mind completing the PR description to include testing instructions? Additionally, adding unit tests would be highly beneficial.
Sure, I have added the testing instructions and included unit tests as well.
superset/datasets/schemas.py
Outdated
if value in ("epoch_s", "epoch_ms"):
    return
try:
    datetime.now().strftime(value or "")
@mapledan would you mind using the dateutil.parser.isoparse method instead?
Sure, I can use the dateutil.parser.isoparse method to make it support a more accurate ISO 8601 format than the original regular expression validation.
However, let me confirm again, does python_date_format only support the ISO 8601 format?
Because based on the description I read, it is written as follows: "If the timestamp format does not adhere to the ISO 8601 standard, you will need to define an expression and type for transforming the string into a date or timestamp."
That's why I chose to use strftime for validation to handle the user's defined expression.
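For readers following along, the strftime-based check in the snippet under review could be completed roughly as below. This is only a sketch: the function name `is_valid_python_date_format` and the decision to catch `ValueError` are assumptions for illustration, not the code in the PR.

```python
from datetime import datetime
from typing import Optional


def is_valid_python_date_format(value: Optional[str]) -> bool:
    """Hypothetical helper: accept epoch markers or anything strftime can render."""
    # The two epoch markers are accepted as-is, mirroring the snippet under review.
    if value in ("epoch_s", "epoch_ms"):
        return True
    try:
        # Render the current time with the candidate format string.
        datetime.now().strftime(value or "")
    except ValueError:
        # Assumption: a rejected format string surfaces as ValueError.
        return False
    return True


print(is_valid_python_date_format("%Y%m%d"))
print(is_valid_python_date_format("epoch_ms"))
```

Python hands the format string to the platform C library's strftime, so which strings are rejected can differ between platforms. A check like this is therefore quite permissive and does not establish that a format is ISO 8601.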
Sorry, I missed the preceding sentence: 'which needs to adhere to the ISO 8601 standard to ensure that the lexicographical ordering coincides with the chronological ordering.'
In my test case, dateutil.parser.isoparse strictly follows the ISO 8601 standard, so it does not allow the YYYY/MM/DD format.
This will impact PR (#24113) as it allows the slash format.
I also traced python_date_format: it is used by pandas.to_datetime in superset.utils.core.normalize_dttm_col.
Which should we use for the validation: the pandas.to_datetime method, a regex like (?P<date>%Y([-/]?%m([-/]?%d)?)?)([\sT](?P<time>%H(:%M(:%S(\.%f)?)?)?))?, or the dateutil.parser.isoparse method?
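To make the regex option concrete, here is a small sketch that applies the pattern quoted above to the format string itself. The helper name `matches_quoted_regex` and the use of `re.fullmatch` (rather than `re.match`) are assumptions for illustration; the thread does not say how the pattern would be applied.

```python
import re

# The pattern quoted in the comment above, split across two raw strings for readability.
PATTERN = re.compile(
    r"(?P<date>%Y([-/]?%m([-/]?%d)?)?)"
    r"([\sT](?P<time>%H(:%M(:%S(\.%f)?)?)?))?"
)


def matches_quoted_regex(fmt: str) -> bool:
    """Hypothetical helper: True if the whole format string fits the pattern."""
    return PATTERN.fullmatch(fmt) is not None


for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d", "%Y-%m-%dT%H:%M:%S", "%d/%m/%Y"):
    print(fmt, matches_quoted_regex(fmt))
```

Under these assumptions the pattern accepts dash, slash, and separator-free year-first formats, and rejects day-first formats such as %d/%m/%Y.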
Thanks @mapledan for flagging the other PR. Per the tooltip (see attached) it clearly states that the format should be ISO 8601, though the placeholder text uses /. Granted the / (in addition to -) does adhere to lexicographical ordering, but we should strictly adhere to the ISO 8601 standard for historical reasons.
@jfrag1 this PR reverts the logic you defined in #24113. I've also authored #26017 which updates the placeholder text which likely led people astray.
@mapledan feel free to merge the logic from #26017 into your PR if you think it makes sense to have both changes combined.
I think if we want to move towards strict ISO, we definitely need a migration to update existing python_date_format values that currently use /. Without a migration, any existing datasets with columns that use / will not be able to be updated without changing the python_date_format, since they'll fail the validation.
@jfrag1 a migration isn't possible as the string format is specified by the underlying column in the dataset. Granted this change has the potential to churn users who were exposed to the relaxation in #24113, i.e., since 3.0, and thus they would need to update their dataset definition to use a SQL expression instead.
@michael-s-molina this is likely another great example of a breaking change and thus we likely need to wait until 4.0 to re-restrict the eligible formats.
@john-bodley #24113 didn't expose a new relaxation to users; that PR was a follow-up to #23678, which changed the endpoint used to update dataset definitions. The old endpoint did no validation at all for python_date_format, so there were people using slashes before that too (since the placeholder indicated to do so).
Also I'm not sure what you mean by "a migration isn't possible", can't we have a migration that updates the table that stores column definitions?
@jfrag1 the column in question isn't in the metadata database; it's in the underlying analytical database(s), i.e., Trino, Hive, etc., where the temporal format is defined, and thus Superset merely reflects how the data is defined. It's difficult to estimate the impact of non-strict ISO 8601 validation, though since #24113 it only impacts newly registered datasets where the format is %Y/%m/%d[...]. Organizations likely have many registered temporal columns where the format is complete nonsense pre-#23678.
I think there are two options:
- Relax requiring the ISO 8601 standard to include /, i.e., what is currently implemented (for right or wrong)*. This is potentially risky as the ISO 8601 standard guarantees lexicographical ordering** whereas custom formatters don't.
- Wait until 4.0 and mention in UPDATING.md the (re)restriction that only ISO 8601 formats are acceptable (enforced by way of validation at the API and database level) and that dataset owners will need to use a SQL expression instead to convert their string columns of the form %Y/%m/%d etc. to a DATE, DATETIME, etc. type.
* Note: due to the lack of validation prior to #23678 (when we switched to the new RESTful API) there are likely organizations which have datasets with a slew of formats in the table_columns.python_datetime_format column which don't adhere to the current format and thus won't work as expected. I was able to confirm this in Airbnb's Superset metadata database. As part of 4.0 we should add a database level CHECK constraint for the table_columns.python_datetime_format column which will force admins and dataset owners to correct any violating temporal columns.
** Lexicographical ordering is essential when we're filtering temporal string columns via the >, >=, <, and <= operators.
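The ordering point in the second footnote can be illustrated with a short sketch. The sample dates and the helper name `sorts_chronologically` are made up for illustration, and the check only covers this small sample rather than proving the property in general.

```python
from datetime import date

# Arbitrary sample dates, deliberately listed out of order.
days = [date(2023, 1, 15), date(2022, 12, 31), date(2023, 2, 1)]


def sorts_chronologically(fmt: str) -> bool:
    """True if sorting the formatted strings matches sorting the dates themselves."""
    sorted_as_strings = sorted(d.strftime(fmt) for d in days)
    sorted_as_dates = [d.strftime(fmt) for d in sorted(days)]
    return sorted_as_strings == sorted_as_dates


for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y"):
    print(fmt, sorts_chronologically(fmt))
```

For these dates the year-first formats (with either - or /) sort the same way as the dates, while %d/%m/%Y does not, which is why string comparisons with >, >=, <, and <= are only meaningful for year-first formats.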
This is a classic "shift left" problem where the underlying validation should also reside in the database layer. This is mentioned in [SIP-99C] Proposal for model and business validation and is something that has been somewhat of an Achilles heel for the community over the years.
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #25510 +/- ##
==========================================
- Coverage 69.48% 59.59% -9.90%
==========================================
Files 1894 1894
Lines 74151 74173 +22
Branches 8243 8243
==========================================
- Hits 51527 44200 -7327
- Misses 20555 27904 +7349
Partials 2069 2069
@mapledan and @michael-s-molina I sense this PR should be resurrected and merged as part of the 4.0 breaking window. I realize that the proposal window has closed, but this is actually a (breaking) fix and thus I gather it's exempt from those rules. @mapledan in addition to the current logic you've outlined here I sense the PR should be updated per (2) in #25510 (comment) to include:
…dataset_datetime_format
This missed the 4.0 breaking window and was not submitted for lazy consensus. Someone might have disagreed with the proposed change if it had been submitted for lazy consensus. I think the correct process here is to add a card to the Punted to 5.0 column so we don't miss this again in the next breaking window.
Maybe we could discuss at the townhall today? I feel like a bug fix should not be subject to as rigorous a community approval as the changes that were passed via lazy consensus and think perhaps we should be able to merge breaking bug fixes during this window.
5c4b82f to 7f3055a
…dataset_datetime_format
Woot! LGTM. Thanks @mapledan for the fix and addressing the various requests.
Co-authored-by: John Bodley <john.bodley@gmail.com>
SUMMARY
The PUT /api/v1/dataset/{id} endpoint includes the python_date_format validation. PR #24113 optimizes the validation, but there's still an issue with the '%Y%m%d' format.
According to the DATETIME FORMAT description, "If the timestamp format does not adhere to the ISO 8601 standard, you will need to define an expression and type for transforming the string into a date or timestamp." Therefore, the validation should be more flexible.
In this PR, I have modified the logic to use the strftime function for validating the format string.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
COLUMNS tab.
DATETIME FORMAT to '%Y%m%d'.
python_date_format.
ADDITIONAL INFORMATION