Make pipelines aware of a timezone configuration #249
Conversation
Branch updated from e3d4e24 to 082d134 (compare).
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No coverage information.
LGTM! You could, however, add a test. What do you think?
timezone: timestamp feature transformations will assume this timezone when they don't have a tz suffix.
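As a plain-Python sketch of what that assumption means (the configured value here is just an example, not the project's default), a timestamp string without a tz suffix can be localized to the configured zone:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

ASSUMED_TZ = "America/Sao_Paulo"  # example configured timezone

raw = "2020-09-13 09:26:40"  # timestamp with no tz suffix
naive = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Interpret the naive timestamp in the configured timezone.
aware = naive.replace(tzinfo=ZoneInfo(ASSUMED_TZ))
print(aware.isoformat())  # 2020-09-13T09:26:40-03:00
```

Without such a configured assumption, the interpretation of `raw` would fall back to whatever timezone the host system happens to use.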
I'd just note here that the Spark config (spark.sql.session.timeZone) and an env variable (TZ) will be set with this value.
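A minimal sketch of that idea (the helper name is hypothetical; the actual PR wires this into the pipeline itself): set the TZ environment variable for the Python side and, when a session is available, the matching Spark option:

```python
import os
import time


def set_pipeline_timezone(timezone: str, spark_session=None) -> None:
    """Apply one timezone to both the Python side and, optionally, Spark.

    Hypothetical helper sketch; `spark_session` is expected to expose the
    standard pyspark runtime-config API when provided.
    """
    os.environ["TZ"] = timezone  # naive Python datetimes will use this zone
    time.tzset()  # re-read TZ (Unix only)
    if spark_session is not None:
        spark_session.conf.set("spark.sql.session.timeZone", timezone)


set_pipeline_timezone("America/Sao_Paulo")
```

Setting both from a single value is what keeps Spark's TimestampType conversions and Python's `datetime` objects consistent.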
@@ -1,4 +1,6 @@
"""FeatureSetPipeline entity."""
import os
Why? 📖
While Spark's TimestampType timezone is controlled by the `spark.sql.session.timeZone` configuration option, Python's `datetime` objects have their timezone controlled by the system's timezone (when they don't have a fixed tz suffix). This means some transformations can have their timestamps converted in different ways when running on different systems.

An example of possible irregular results happens when we automatically set the `start_date` of `AggregatedFeatureSet`s (here). Spark and the system can have different timezones, meaning that a timestamp coming from the Spark dataframe, when collected into plain Python as a `datetime` object, can change, generating a `start_date` different than expected.

What? 🔧
This PR applies a timezone configuration that every pipeline is aware of and that is the same for both Spark and the system. The timezone is configurable.
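The irregularity described above can be reproduced without Spark at all: the same epoch instant, collected into a naive Python `datetime`, yields different wall-clock values depending on the system timezone (the timezone names below are just examples):

```python
import os
import time
from datetime import datetime

EPOCH = 1_600_000_000  # the same instant everywhere: 2020-09-13 12:26:40 UTC


def local_wallclock(tz: str) -> str:
    """Render EPOCH as a naive local datetime under the given system tz."""
    os.environ["TZ"] = tz
    time.tzset()  # re-read TZ (Unix only)
    return datetime.fromtimestamp(EPOCH).strftime("%Y-%m-%d %H:%M")


print(local_wallclock("UTC"))                # 2020-09-13 12:26
print(local_wallclock("America/Sao_Paulo"))  # 2020-09-13 09:26
```

Pinning a single timezone for both `spark.sql.session.timeZone` and the system (via TZ) removes exactly this source of variation.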
Type of change
Please delete options that are not relevant.
How was everything tested? 📏
TODO.
Checklist
Labels: `bug`, `enhancement`, `feature`, and `review`.
Attention Points ⚠️
Replace me for what the reviewer will need to pay attention to in the PR or just to cover any concerns after the merge.