[SPARK-27612][PYTHON] Use Python's default protocol instead of highest protocol

## What changes were proposed in this pull request?

This PR partially reverts #20691

After we changed the pickle protocol to the highest one, it seems to have introduced a correctness bug. This potentially affects all Python-related code paths.

I suspect a bug related to Pyrolite (maybe the opcodes `MEMOIZE` and `FRAME`, and/or our `RowPickler`). I would like to stick to the default protocol for now and investigate the issue separately.

I will investigate later how to bring the highest protocol back.
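As background, a minimal sketch (not part of this commit) of why the protocol choice matters: protocol 4 emits the `FRAME` opcode and uses `MEMOIZE` instead of explicit `PUT` opcodes, and any external unpickler such as Pyrolite must handle those correctly.

```python
import pickle
import pickletools

payload = [[1, 2, 3, 4]] * 3  # repeated object, so memoization kicks in

def opcode_names(protocol):
    """Return the set of pickle opcode names used at the given protocol."""
    data = pickle.dumps(payload, protocol=protocol)
    return {op.name for op, _, _ in pickletools.genops(data)}

# Protocol 2 predates FRAME and MEMOIZE; it uses BINPUT/BINGET for the memo.
assert "FRAME" not in opcode_names(2)
assert "MEMOIZE" not in opcode_names(2)

# Protocol 4 (HIGHEST_PROTOCOL on Python 3.4-3.7) uses both.
assert "FRAME" in opcode_names(4)
assert "MEMOIZE" in opcode_names(4)
```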

## How was this patch tested?

A unit test was added:

```bash
./run-tests --python-executables=python3.7 --testname "pyspark.sql.tests.test_serde SerdeTests.test_int_array_serialization"
```

Closes #24519 from HyukjinKwon/SPARK-27612.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon committed May 3, 2019
1 parent 3859ca3 commit 5c47924
Showing 2 changed files with 8 additions and 1 deletion.
3 changes: 2 additions & 1 deletion python/pyspark/serializers.py
```diff
@@ -62,11 +62,12 @@
 if sys.version < '3':
     import cPickle as pickle
     from itertools import izip as zip, imap as map
+    pickle_protocol = 2
 else:
     import pickle
     basestring = unicode = str
     xrange = range
-    pickle_protocol = pickle.HIGHEST_PROTOCOL
+    pickle_protocol = 3

 from pyspark import cloudpickle
 from pyspark.util import _exception_message
```
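The hunk above pins the protocol per Python major version instead of using `pickle.HIGHEST_PROTOCOL`. A standalone sketch of that pattern (the `dumps` helper is illustrative, not Spark's actual API):

```python
import sys

# Pin the protocol per Python major version, mirroring the hunk above.
if sys.version_info[0] < 3:
    import cPickle as pickle  # Python 2's C-accelerated pickle
    pickle_protocol = 2       # highest protocol available on Python 2
else:
    import pickle
    pickle_protocol = 3       # Python 3's default protocol on 3.0-3.7

def dumps(obj):
    # Serialize with the pinned protocol rather than HIGHEST_PROTOCOL, so
    # the byte stream avoids protocol-4 opcodes such as FRAME and MEMOIZE.
    return pickle.dumps(obj, protocol=pickle_protocol)

assert pickle.loads(dumps([1, 2, 3, 4])) == [1, 2, 3, 4]
```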
6 changes: 6 additions & 0 deletions python/pyspark/sql/tests/test_serde.py
```diff
@@ -126,6 +126,12 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()

+    def test_int_array_serialization(self):
+        # Note that this test seems dependent on parallelism.
+        data = self.spark.sparkContext.parallelize([[1, 2, 3, 4]] * 100, numSlices=12)
+        df = self.spark.createDataFrame(data, "array<integer>")
+        self.assertEqual(len(list(filter(lambda r: None in r.value, df.collect()))), 0)
+

 if __name__ == "__main__":
     import unittest
```
