[SPARK-27612][PYTHON] Use Python's default protocol instead of highest protocol

## What changes were proposed in this pull request?

This PR partially reverts #20691

After we changed the pickle protocol to the highest one, it seems to have introduced a correctness bug. This potentially affects all Python-related code paths.

I suspect a bug related to Pyrolite (maybe the opcodes `MEMOIZE` and `FRAME`, and/or our `RowPickler`). I would like to stick to the default protocol for now and investigate the issue separately.

I will investigate later how to bring the highest protocol back.
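As background, a minimal sketch (not part of this commit) of why the protocol choice matters: protocol 4 emits the `FRAME` opcode and uses `MEMOIZE` instead of explicit `PUT` opcodes, and any external unpickler such as Pyrolite must handle those correctly.

```python
import pickle
import pickletools

payload = [[1, 2, 3, 4]] * 3  # repeated object, so memoization kicks in

def opcode_names(protocol):
    """Return the set of pickle opcode names used at the given protocol."""
    data = pickle.dumps(payload, protocol=protocol)
    return {op.name for op, _, _ in pickletools.genops(data)}

# Protocol 2 predates FRAME and MEMOIZE; it uses BINPUT/BINGET for the memo.
assert "FRAME" not in opcode_names(2)
assert "MEMOIZE" not in opcode_names(2)

# Protocol 4 (HIGHEST_PROTOCOL on Python 3.4-3.7) uses both.
assert "FRAME" in opcode_names(4)
assert "MEMOIZE" in opcode_names(4)
```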

## How was this patch tested?

A unit test was added:

```bash
./run-tests --python-executables=python3.7 --testname "pyspark.sql.tests.test_serde SerdeTests.test_int_array_serialization"
```

Closes #24519 from HyukjinKwon/SPARK-27612.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon committed May 3, 2019
1 parent 3859ca3 commit 5c47924
Showing 2 changed files with 8 additions and 1 deletion.
3 changes: 2 additions & 1 deletion python/pyspark/serializers.py
```diff
@@ -62,11 +62,12 @@
 if sys.version < '3':
     import cPickle as pickle
     from itertools import izip as zip, imap as map
+    pickle_protocol = 2
 else:
     import pickle
     basestring = unicode = str
     xrange = range
-    pickle_protocol = pickle.HIGHEST_PROTOCOL
+    pickle_protocol = 3

 from pyspark import cloudpickle
 from pyspark.util import _exception_message
```
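The hunk above pins the protocol per Python major version instead of using `pickle.HIGHEST_PROTOCOL`. A standalone sketch of that pattern (the `dumps` helper is illustrative, not Spark's actual API):

```python
import sys

# Pin the protocol per Python major version, mirroring the hunk above.
if sys.version_info[0] < 3:
    import cPickle as pickle  # Python 2's C-accelerated pickle
    pickle_protocol = 2       # highest protocol available on Python 2
else:
    import pickle
    pickle_protocol = 3       # Python 3's default protocol on 3.0-3.7

def dumps(obj):
    # Serialize with the pinned protocol rather than HIGHEST_PROTOCOL, so
    # the byte stream avoids protocol-4 opcodes such as FRAME and MEMOIZE.
    return pickle.dumps(obj, protocol=pickle_protocol)

assert pickle.loads(dumps([1, 2, 3, 4])) == [1, 2, 3, 4]
```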
6 changes: 6 additions & 0 deletions python/pyspark/sql/tests/test_serde.py
```diff
@@ -126,6 +126,12 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()

+    def test_int_array_serialization(self):
+        # Note that this test seems dependent on parallelism.
+        data = self.spark.sparkContext.parallelize([[1, 2, 3, 4]] * 100, numSlices=12)
+        df = self.spark.createDataFrame(data, "array<integer>")
+        self.assertEqual(len(list(filter(lambda r: None in r.value, df.collect()))), 0)
+

 if __name__ == "__main__":
     import unittest
```
