[SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows #36537
Conversation
Fix problems with pyspark on Windows: 1. Fixed datetime conversion to timestamp before 1970; 2. Fixed datetime conversion when the timestamp is negative;
Add dateTime to test code in RDD
            else time.mktime(dt.timetuple()))
except:
    # On Windows, time.mktime fails for datetimes before 1970,
    # so compute the seconds relative to the epoch instead
    seconds = (dt - datetime.datetime.fromtimestamp(int(time.localtime(0).tm_sec) / 1000)).total_seconds()
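The fallback computes the offset from the Unix epoch instead of calling `time.mktime`. A simplified, platform-independent sketch of that idea (this treats a naive datetime as UTC, which differs from `mktime`'s local-time semantics; the names are illustrative, not the PR's exact code):

```python
import datetime

# time.mktime raises an error on Windows for datetimes before 1970.
# Subtracting the epoch avoids mktime entirely and also yields the
# expected negative values for pre-1970 datetimes.
EPOCH = datetime.datetime(1970, 1, 1)

def to_epoch_seconds(dt):
    # Assumes a naive datetime interpreted as UTC.
    return (dt - EPOCH).total_seconds()

print(to_epoch_seconds(datetime.datetime(1969, 12, 31)))  # → -86400.0
```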
Is this a Windows-specific issue?
Yes, Linux does not have this problem. It appears to be a Python 3 bug on Windows, but this approach works around it.
IIRC the 1970 handling issue is not an OS-specific problem. It would be great if you could link some reported issues related to that.
Can one of the admins verify this patch?

Is there a supervisor for approval?
python/pyspark/sql/types.py
Outdated
try:
    seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
               else time.mktime(dt.timetuple()))
except:
I think we'd better not rely on exception handling for the regular data-parsing path.
Can we do this with an if-else on the OS and a negative-value check?
Sure, I'll change the test again.
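One possible shape for the suggested if-else, sketched under the assumption that the check can branch on `platform.system()` (not the PR's final code):

```python
import datetime
import platform
import time

EPOCH = datetime.datetime(1970, 1, 1)

def to_timestamp(dt):
    # Branch explicitly instead of relying on exception handling:
    # on Windows, mktime cannot handle datetimes before 1970.
    if platform.system() == "Windows" and dt < EPOCH:
        # Compute the (negative) offset from the epoch directly.
        return (dt - EPOCH).total_seconds()
    return time.mktime(dt.timetuple())
```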
wr_s21 = rdd.sample(True, 0.4, 21).collect()
self.assertNotEqual(set(wr_s11), set(wr_s21))

def test_datetime(self):
Should probably add a comment like:
SPARK-39176: ...
It has been added and modified; please review it again.
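A hedged sketch of what such a JIRA-tagged test could look like (illustrative names only; the actual PySpark test exercises an RDD round-trip):

```python
import datetime
import unittest

class DateTimeTests(unittest.TestCase):
    def test_datetime(self):
        # SPARK-39176: datetimes before 1970 must convert to (negative)
        # epoch seconds and round-trip back, including on Windows.
        dt = datetime.datetime(1965, 6, 1, 12, 0, 0)
        epoch = datetime.datetime(1970, 1, 1)
        seconds = (dt - epoch).total_seconds()
        self.assertLess(seconds, 0)
        self.assertEqual(epoch + datetime.timedelta(seconds=seconds), dt)
```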
@AnywalkerGiser mind creating a PR against master?

@HyukjinKwon It hasn't been tested on master; I found the problem in 3.0.1, and I can test it on master later.
What changes were proposed in this pull request?
Fix problems with pyspark on Windows:
1. Fixed datetime conversion to timestamp before 1970;
2. Fixed datetime conversion when the timestamp is negative;
Why are the changes needed?
PySpark has problems serializing pre-1970 datetimes on Windows.
An exception occurs when executing datetime-to-timestamp conversions under Windows; after updating the code, the same conversions run successfully.
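The failing snippets were not captured above; a minimal reproduction consistent with the description, assuming the exception comes from `time.mktime`, might look like:

```python
import datetime
import time

dt = datetime.datetime(1901, 1, 1)
try:
    # On Windows this raises for datetimes before 1970;
    # on other platforms it typically succeeds.
    time.mktime(dt.timetuple())
except (OverflowError, OSError) as e:
    print("conversion failed:", e)

# The fix avoids mktime for such values by subtracting the epoch,
# which produces the expected negative timestamp.
print((dt - datetime.datetime(1970, 1, 1)).total_seconds() < 0)  # → True
```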
Does this PR introduce any user-facing change?
No
How was this patch tested?
New and existing test suites