Conversation

@dingsl-giser

What changes were proposed in this pull request?

Fix problems with PySpark on Windows:

  1. Fixed conversion of pre-1970 datetimes to timestamps;
  2. Fixed conversion back from negative timestamps to datetimes;
  3. Added a test script.

Why are the changes needed?

PySpark fails to serialize pre-1970 datetimes on Windows.

An exception occurs when executing the following code on Windows:

from datetime import datetime

rdd = sc.parallelize([('a', datetime(1957, 1, 9, 0, 0)),
                      ('b', datetime(2014, 1, 27, 0, 0))])
df = spark.createDataFrame(rdd, ["id", "date"])

df.show()
df.printSchema()

print(df.collect())

This fails with:
  File "...\spark\python\lib\pyspark.zip\pyspark\sql\types.py", line 195, in toInternal
    else time.mktime(dt.timetuple()))
OverflowError: mktime argument out of range

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	... 1 more
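
For context, the overflow comes from `time.mktime`, which on Windows cannot represent pre-epoch times. A minimal standalone sketch (plain Python, no Spark required) that reproduces the behavior on a Windows machine:

import time
from datetime import datetime

# On Windows the C runtime rejects pre-epoch times, so this raises
# OverflowError; on Linux it returns a negative float instead.
try:
    print(time.mktime(datetime(1957, 1, 9).timetuple()))
except OverflowError as e:
    print("mktime failed:", e)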

and the deserialization path fails similarly when the timestamp is negative:

  File "...\spark\python\lib\pyspark.zip\pyspark\sql\types.py", line 207, in fromInternal
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

OSError: [Errno 22] Invalid argument
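
`datetime.fromtimestamp` has the matching limitation on the read path: on Windows it raises `OSError` for negative (pre-epoch) timestamps. One portable way around it is to add the offset to the epoch explicitly; a minimal sketch of that idea (an illustration, not the exact patch in this PR):

from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def from_micros(ts):
    # Unlike datetime.fromtimestamp, which raises OSError on Windows
    # for negative values, this handles pre-epoch timestamps fine.
    # Caveat: it interprets ts as UTC, while fromtimestamp uses local time.
    return EPOCH + timedelta(microseconds=ts)

print(from_micros(-409536000 * 1000000))  # 1957-01-09 00:00:00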

After the fix, the same code runs successfully:

+---+-------------------+
| id|               date|
+---+-------------------+
|  a|1957-01-08 16:00:00|
|  b|2014-01-26 16:00:00|
+---+-------------------+

root
 |-- id: string (nullable = true)
 |-- date: timestamp (nullable = true)

[Row(id='a', date=datetime.datetime(1957, 1, 9, 0, 0)), Row(id='b', date=datetime.datetime(2014, 1, 27, 0, 0))] 

Does this PR introduce any user-facing change?

No

How was this patch tested?

New and existing test suites

Fix problems with PySpark on Windows:
1. Fixed datetime conversion to timestamp before 1970;
2. Fixed datetime conversion when the timestamp is negative;
3. Added a datetime case to the RDD test code.
               else time.mktime(dt.timetuple()))
except:
    # On Windows, fall back to computing the timestamp manually
    # when the datetime is before 1970, since time.mktime overflows there.
    seconds = (dt - datetime.datetime.fromtimestamp(int(time.localtime(0).tm_sec) / 1000)).total_seconds()

Member:

Is this a Windows-specific issue?

Author:

Yes, Linux does not have this problem; it seems to be a Python 3 bug, but this method works around it.

Member:

IIRC the 1970 handling issue is not an OS-specific problem. It would be great if you could link some reported issues related to that.

@AmplabJenkins

Can one of the admins verify this patch?

@dingsl-giser dingsl-giser requested a review from HyukjinKwon May 13, 2022 14:49
@dingsl-giser
Author

Is there a maintainer who can approve this?

try:
    seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
               else time.mktime(dt.timetuple()))
except:

Member:

I think we'd better not rely on exception handling for the regular data-parsing path.

Member:

Can we do this with an if-else using an OS and negative-value check?
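
A minimal sketch of what that suggestion might look like (a hypothetical helper, not the committed patch):

import calendar
import sys
import time
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

def to_seconds(dt):
    # Hypothetical helper: branch on timezone, platform, and range
    # up front rather than relying on exception handling.
    if dt.tzinfo:
        return calendar.timegm(dt.utctimetuple())
    if sys.platform != "win32" or dt >= EPOCH:
        return time.mktime(dt.timetuple())
    # Pre-epoch on Windows: compute the offset from the epoch directly.
    # Caveat: this treats the naive datetime as UTC, whereas mktime
    # uses local time.
    return (dt - EPOCH).total_seconds()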

Author:

Sure, I'll change the test again.

        wr_s21 = rdd.sample(True, 0.4, 21).collect()
        self.assertNotEqual(set(wr_s11), set(wr_s21))

    def test_datetime(self):

Member:

Should probably add a comment like:

SPARK-39176: ... 

See also https://spark.apache.org/contributing.html
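
For example, the test body might look roughly like this (a sketch only; the `self.spark` session fixture is an assumption, not the actual committed test):

    def test_datetime(self):
        # SPARK-39176: pre-1970 datetimes failed to serialize on Windows.
        # Assumes a SparkSession fixture available as self.spark.
        from datetime import datetime
        data = [("a", datetime(1957, 1, 9)), ("b", datetime(2014, 1, 27))]
        df = self.spark.createDataFrame(data, ["id", "date"])
        rows = [(row.id, row.date) for row in df.collect()]
        self.assertEqual(rows, data)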

Author:

The comment has been added and the code updated; please review it again.

@HyukjinKwon
Member

@AnywalkerGiser mind creating a PR against the master branch?

@dingsl-giser
Author

@HyukjinKwon It hasn't been tested on master; I found the problem in 3.0.1, and I can test it on master later.

@dingsl-giser dingsl-giser reopened this May 16, 2022
@dingsl-giser dingsl-giser requested a review from HyukjinKwon May 16, 2022 06:23
@dingsl-giser dingsl-giser changed the title [SPARK-39176][PYSPARK][WINDOWS] Fixed a problem with pyspark serializing pre-1970 datetime in windows [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows May 16, 2022