
[SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8 #26194

Closed
wants to merge 8 commits

Conversation

@HyukjinKwon (Member) commented Oct 21, 2019

What changes were proposed in this pull request?

Upgrade the cloudpickle inlined in PySpark to cloudpickle 1.1.1. See https://github.com/cloudpipe/cloudpickle/blob/v1.1.1/cloudpickle/cloudpickle.py

cloudpipe/cloudpickle#269 added Python 3.8 support (fixed as of 1.1.0). Using 1.2.2 seems to break PyPy 2 due to cloudpipe/cloudpickle#278, so this PR currently uses 1.1.1.

Once we drop Python 2 support, we can switch to the latest version.

Why are the changes needed?

Positional-only arguments were newly introduced in Python 3.8 (see https://docs.python.org/3/whatsnew/3.8.html#positional-only-parameters).
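
For example (a small illustrative snippet, not from the PR), parameters declared before a / in a signature become positional-only:

def divide(numerator, denominator, /):
    # numerator and denominator are positional-only (PEP 570, Python 3.8+)
    return numerator / denominator

divide(10, 2)                          # OK -> 5.0
# divide(numerator=10, denominator=2)  # TypeError: positional-only arguments
#                                      # passed as keyword arguments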

In particular, the newly added argument to types.CodeType was the problem (https://docs.python.org/3/whatsnew/3.8.html#changes-in-the-python-api):

> types.CodeType has a new parameter in the second position of the constructor (posonlyargcount) to support positional-only arguments defined in PEP 570. The first argument (argcount) now represents the total number of positional arguments (including positional-only arguments). The new replace() method of types.CodeType can be used to make the code future-proof.
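
To make the breakage concrete, here is a minimal sketch (an illustration, not the actual cloudpickle patch) of rebuilding a code object under both constructor signatures:

import sys
import types

def clone_code(co):
    # Illustrative sketch, not the actual cloudpickle patch. On Python 3.8+
    # the CodeType constructor takes posonlyargcount as its second argument,
    # so a 3.7-era positional call shifts every later argument by one slot
    # and typically fails with "TypeError: an integer is required (got type bytes)".
    if sys.version_info >= (3, 8):
        return co.replace()  # the future-proof path from the 3.8 release notes
    return types.CodeType(   # pre-3.8 signature: no posonlyargcount slot
        co.co_argcount, co.co_kwonlyargcount, co.co_nlocals, co.co_stacksize,
        co.co_flags, co.co_code, co.co_consts, co.co_names, co.co_varnames,
        co.co_filename, co.co_name, co.co_firstlineno, co.co_lnotab,
        co.co_freevars, co.co_cellvars)

print(clone_code(clone_code.__code__).co_name)  # prints: clone_code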

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested. Note that the optional dependency PyArrow does not appear to support Python 3.8 yet; therefore, it was not tested. See the details below.

cd python
./run-tests --python-executables=python3.8
Running PySpark tests. Output is in /Users/hyukjin.kwon/workspace/forked/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.8']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Starting test(python3.8): pyspark.ml.tests.test_algorithms
Starting test(python3.8): pyspark.ml.tests.test_feature
Starting test(python3.8): pyspark.ml.tests.test_base
Starting test(python3.8): pyspark.ml.tests.test_evaluation
Finished test(python3.8): pyspark.ml.tests.test_base (12s)
Starting test(python3.8): pyspark.ml.tests.test_image
Finished test(python3.8): pyspark.ml.tests.test_evaluation (14s)
Starting test(python3.8): pyspark.ml.tests.test_linalg
Finished test(python3.8): pyspark.ml.tests.test_feature (23s)
Starting test(python3.8): pyspark.ml.tests.test_param
Finished test(python3.8): pyspark.ml.tests.test_image (22s)
Starting test(python3.8): pyspark.ml.tests.test_persistence
Finished test(python3.8): pyspark.ml.tests.test_param (25s)
Starting test(python3.8): pyspark.ml.tests.test_pipeline
Finished test(python3.8): pyspark.ml.tests.test_linalg (37s)
Starting test(python3.8): pyspark.ml.tests.test_stat
Finished test(python3.8): pyspark.ml.tests.test_pipeline (7s)
Starting test(python3.8): pyspark.ml.tests.test_training_summary
Finished test(python3.8): pyspark.ml.tests.test_stat (21s)
Starting test(python3.8): pyspark.ml.tests.test_tuning
Finished test(python3.8): pyspark.ml.tests.test_persistence (45s)
Starting test(python3.8): pyspark.ml.tests.test_wrapper
Finished test(python3.8): pyspark.ml.tests.test_algorithms (83s)
Starting test(python3.8): pyspark.mllib.tests.test_algorithms
Finished test(python3.8): pyspark.ml.tests.test_training_summary (32s)
Starting test(python3.8): pyspark.mllib.tests.test_feature
Finished test(python3.8): pyspark.ml.tests.test_wrapper (20s)
Starting test(python3.8): pyspark.mllib.tests.test_linalg
Finished test(python3.8): pyspark.mllib.tests.test_feature (32s)
Starting test(python3.8): pyspark.mllib.tests.test_stat
Finished test(python3.8): pyspark.mllib.tests.test_algorithms (70s)
Starting test(python3.8): pyspark.mllib.tests.test_streaming_algorithms
Finished test(python3.8): pyspark.mllib.tests.test_stat (37s)
Starting test(python3.8): pyspark.mllib.tests.test_util
Finished test(python3.8): pyspark.mllib.tests.test_linalg (70s)
Starting test(python3.8): pyspark.sql.tests.test_arrow
Finished test(python3.8): pyspark.sql.tests.test_arrow (1s) ... 53 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_catalog
Finished test(python3.8): pyspark.mllib.tests.test_util (15s)
Starting test(python3.8): pyspark.sql.tests.test_column
Finished test(python3.8): pyspark.sql.tests.test_catalog (24s)
Starting test(python3.8): pyspark.sql.tests.test_conf
Finished test(python3.8): pyspark.sql.tests.test_column (21s)
Starting test(python3.8): pyspark.sql.tests.test_context
Finished test(python3.8): pyspark.ml.tests.test_tuning (125s)
Starting test(python3.8): pyspark.sql.tests.test_dataframe
Finished test(python3.8): pyspark.sql.tests.test_conf (9s)
Starting test(python3.8): pyspark.sql.tests.test_datasources
Finished test(python3.8): pyspark.sql.tests.test_context (29s)
Starting test(python3.8): pyspark.sql.tests.test_functions
Finished test(python3.8): pyspark.sql.tests.test_datasources (32s)
Starting test(python3.8): pyspark.sql.tests.test_group
Finished test(python3.8): pyspark.sql.tests.test_dataframe (39s) ... 3 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf (1s) ... 6 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_cogrouped_map
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_cogrouped_map (0s) ... 14 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_agg
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_agg (1s) ... 15 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_map
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_map (1s) ... 20 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_scalar
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_scalar (1s) ... 49 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_window
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_window (1s) ... 14 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_readwriter
Finished test(python3.8): pyspark.sql.tests.test_functions (29s)
Starting test(python3.8): pyspark.sql.tests.test_serde
Finished test(python3.8): pyspark.sql.tests.test_group (20s)
Starting test(python3.8): pyspark.sql.tests.test_session
Finished test(python3.8): pyspark.mllib.tests.test_streaming_algorithms (126s)
Starting test(python3.8): pyspark.sql.tests.test_streaming
Finished test(python3.8): pyspark.sql.tests.test_serde (25s)
Starting test(python3.8): pyspark.sql.tests.test_types
Finished test(python3.8): pyspark.sql.tests.test_readwriter (38s)
Starting test(python3.8): pyspark.sql.tests.test_udf
Finished test(python3.8): pyspark.sql.tests.test_session (32s)
Starting test(python3.8): pyspark.sql.tests.test_utils
Finished test(python3.8): pyspark.sql.tests.test_utils (17s)
Starting test(python3.8): pyspark.streaming.tests.test_context
Finished test(python3.8): pyspark.sql.tests.test_types (45s)
Starting test(python3.8): pyspark.streaming.tests.test_dstream
Finished test(python3.8): pyspark.sql.tests.test_udf (44s)
Starting test(python3.8): pyspark.streaming.tests.test_kinesis
Finished test(python3.8): pyspark.streaming.tests.test_kinesis (0s) ... 2 tests were skipped
Starting test(python3.8): pyspark.streaming.tests.test_listener
Finished test(python3.8): pyspark.streaming.tests.test_context (28s)
Starting test(python3.8): pyspark.tests.test_appsubmit
Finished test(python3.8): pyspark.sql.tests.test_streaming (60s)
Starting test(python3.8): pyspark.tests.test_broadcast
Finished test(python3.8): pyspark.streaming.tests.test_listener (11s)
Starting test(python3.8): pyspark.tests.test_conf
Finished test(python3.8): pyspark.tests.test_conf (17s)
Starting test(python3.8): pyspark.tests.test_context
Finished test(python3.8): pyspark.tests.test_broadcast (39s)
Starting test(python3.8): pyspark.tests.test_daemon
Finished test(python3.8): pyspark.tests.test_daemon (5s)
Starting test(python3.8): pyspark.tests.test_join
Finished test(python3.8): pyspark.tests.test_context (31s)
Starting test(python3.8): pyspark.tests.test_profiler
Finished test(python3.8): pyspark.tests.test_join (9s)
Starting test(python3.8): pyspark.tests.test_rdd
Finished test(python3.8): pyspark.tests.test_profiler (12s)
Starting test(python3.8): pyspark.tests.test_readwrite
Finished test(python3.8): pyspark.tests.test_readwrite (23s) ... 3 tests were skipped
Starting test(python3.8): pyspark.tests.test_serializers
Finished test(python3.8): pyspark.tests.test_appsubmit (94s)
Starting test(python3.8): pyspark.tests.test_shuffle
Finished test(python3.8): pyspark.streaming.tests.test_dstream (110s)
Starting test(python3.8): pyspark.tests.test_taskcontext
Finished test(python3.8): pyspark.tests.test_rdd (42s)
Starting test(python3.8): pyspark.tests.test_util
Finished test(python3.8): pyspark.tests.test_serializers (11s)
Starting test(python3.8): pyspark.tests.test_worker
Finished test(python3.8): pyspark.tests.test_shuffle (12s)
Starting test(python3.8): pyspark.accumulators
Finished test(python3.8): pyspark.tests.test_util (7s)
Starting test(python3.8): pyspark.broadcast
Finished test(python3.8): pyspark.accumulators (8s)
Starting test(python3.8): pyspark.conf
Finished test(python3.8): pyspark.broadcast (8s)
Starting test(python3.8): pyspark.context
Finished test(python3.8): pyspark.tests.test_worker (19s)
Starting test(python3.8): pyspark.ml.classification
Finished test(python3.8): pyspark.conf (4s)
Starting test(python3.8): pyspark.ml.clustering
Finished test(python3.8): pyspark.context (22s)
Starting test(python3.8): pyspark.ml.evaluation
Finished test(python3.8): pyspark.tests.test_taskcontext (49s)
Starting test(python3.8): pyspark.ml.feature
Finished test(python3.8): pyspark.ml.clustering (43s)
Starting test(python3.8): pyspark.ml.fpm
Finished test(python3.8): pyspark.ml.evaluation (27s)
Starting test(python3.8): pyspark.ml.image
Finished test(python3.8): pyspark.ml.image (8s)
Starting test(python3.8): pyspark.ml.linalg.__init__
Finished test(python3.8): pyspark.ml.linalg.__init__ (0s)
Starting test(python3.8): pyspark.ml.recommendation
Finished test(python3.8): pyspark.ml.classification (63s)
Starting test(python3.8): pyspark.ml.regression
Finished test(python3.8): pyspark.ml.fpm (23s)
Starting test(python3.8): pyspark.ml.stat
Finished test(python3.8): pyspark.ml.stat (30s)
Starting test(python3.8): pyspark.ml.tuning
Finished test(python3.8): pyspark.ml.regression (51s)
Starting test(python3.8): pyspark.mllib.classification
Finished test(python3.8): pyspark.ml.feature (93s)
Starting test(python3.8): pyspark.mllib.clustering
Finished test(python3.8): pyspark.ml.tuning (39s)
Starting test(python3.8): pyspark.mllib.evaluation
Finished test(python3.8): pyspark.mllib.classification (38s)
Starting test(python3.8): pyspark.mllib.feature
Finished test(python3.8): pyspark.mllib.evaluation (25s)
Starting test(python3.8): pyspark.mllib.fpm
Finished test(python3.8): pyspark.mllib.clustering (64s)
Starting test(python3.8): pyspark.mllib.linalg.__init__
Finished test(python3.8): pyspark.ml.recommendation (131s)
Starting test(python3.8): pyspark.mllib.linalg.distributed
Finished test(python3.8): pyspark.mllib.linalg.__init__ (0s)
Starting test(python3.8): pyspark.mllib.random
Finished test(python3.8): pyspark.mllib.feature (36s)
Starting test(python3.8): pyspark.mllib.recommendation
Finished test(python3.8): pyspark.mllib.fpm (31s)
Starting test(python3.8): pyspark.mllib.regression
Finished test(python3.8): pyspark.mllib.random (16s)
Starting test(python3.8): pyspark.mllib.stat.KernelDensity
Finished test(python3.8): pyspark.mllib.stat.KernelDensity (1s)
Starting test(python3.8): pyspark.mllib.stat._statistics
Finished test(python3.8): pyspark.mllib.stat._statistics (25s)
Starting test(python3.8): pyspark.mllib.tree
Finished test(python3.8): pyspark.mllib.regression (44s)
Starting test(python3.8): pyspark.mllib.util
Finished test(python3.8): pyspark.mllib.recommendation (49s)
Starting test(python3.8): pyspark.profiler
Finished test(python3.8): pyspark.mllib.linalg.distributed (53s)
Starting test(python3.8): pyspark.rdd
Finished test(python3.8): pyspark.profiler (14s)
Starting test(python3.8): pyspark.serializers
Finished test(python3.8): pyspark.mllib.tree (30s)
Starting test(python3.8): pyspark.shuffle
Finished test(python3.8): pyspark.shuffle (2s)
Starting test(python3.8): pyspark.sql.avro.functions
Finished test(python3.8): pyspark.mllib.util (30s)
Starting test(python3.8): pyspark.sql.catalog
Finished test(python3.8): pyspark.serializers (17s)
Starting test(python3.8): pyspark.sql.column
Finished test(python3.8): pyspark.rdd (31s)
Starting test(python3.8): pyspark.sql.conf
Finished test(python3.8): pyspark.sql.conf (7s)
Starting test(python3.8): pyspark.sql.context
Finished test(python3.8): pyspark.sql.avro.functions (19s)
Starting test(python3.8): pyspark.sql.dataframe
Finished test(python3.8): pyspark.sql.catalog (16s)
Starting test(python3.8): pyspark.sql.functions
Finished test(python3.8): pyspark.sql.column (27s)
Starting test(python3.8): pyspark.sql.group
Finished test(python3.8): pyspark.sql.context (26s)
Starting test(python3.8): pyspark.sql.readwriter
Finished test(python3.8): pyspark.sql.group (52s)
Starting test(python3.8): pyspark.sql.session
Finished test(python3.8): pyspark.sql.dataframe (73s)
Starting test(python3.8): pyspark.sql.streaming
Finished test(python3.8): pyspark.sql.functions (75s)
Starting test(python3.8): pyspark.sql.types
Finished test(python3.8): pyspark.sql.readwriter (57s)
Starting test(python3.8): pyspark.sql.udf
Finished test(python3.8): pyspark.sql.types (13s)
Starting test(python3.8): pyspark.sql.window
Finished test(python3.8): pyspark.sql.session (32s)
Starting test(python3.8): pyspark.streaming.util
Finished test(python3.8): pyspark.streaming.util (1s)
Starting test(python3.8): pyspark.util
Finished test(python3.8): pyspark.util (0s)
Finished test(python3.8): pyspark.sql.streaming (30s)
Finished test(python3.8): pyspark.sql.udf (27s)
Finished test(python3.8): pyspark.sql.window (22s)
Tests passed in 855 seconds

@HyukjinKwon (Member Author)

cc @BryanCutler, @viirya, @ueshin

@SparkQA commented Oct 21, 2019

Test build #112392 has finished for PR 26194 at commit e8a2acb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • skeleton_class = type_constructor(name, bases, type_kwargs)
  • enum_class = metacls.__new__(metacls, name, bases, classdict)

@HyukjinKwon changed the title from "[SPARK-29536][PYTHON] Upgrade cloudpickle to 1.2.2 to support Python 3.8" to "[WIP][SPARK-29536][PYTHON] Upgrade cloudpickle to 1.2.2 to support Python 3.8" on Oct 21, 2019.
@HyukjinKwon (Member Author)

Hm... let me investigate the test failures further.

@srowen (Member) left a comment:

Seems OK pending tests

@BryanCutler (Member) left a comment:

Sounds like a good reason to upgrade. Would now be a good time to stop including the file and use the official package, or include it as a zip file in pyspark/lib?

@viirya (Member) left a comment:

Should be fine. I've tried this locally with the 3.8 beta before.

@HyukjinKwon (Member Author) commented Oct 22, 2019

@BryanCutler, yea, I think we should. I will take a look separately if you don't mind; it will need a fix here and there. One side concern is that it's pretty difficult to use it as the official package. We can try it with a zip, I suspect.

@HyukjinKwon (Member Author)

retest this please

@SparkQA commented Oct 22, 2019

Test build #112421 has finished for PR 26194 at commit e8a2acb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • skeleton_class = type_constructor(name, bases, type_kwargs)
  • enum_class = metacls.__new__(metacls, name, bases, classdict)

@SparkQA commented Oct 22, 2019

Test build #112424 has finished for PR 26194 at commit 876fbff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 22, 2019

Test build #112429 has finished for PR 26194 at commit a897b85.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 22, 2019

Test build #112432 has finished for PR 26194 at commit fc7ffd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 22, 2019

Test build #112438 has finished for PR 26194 at commit cbdf92c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author) commented Oct 22, 2019

The test failure is caused by cloudpipe/cloudpickle#278. Let me stick to 1.1.1 for now since we haven't dropped Python 2 support yet.

@HyukjinKwon changed the title from "[WIP][SPARK-29536][PYTHON] Upgrade cloudpickle to 1.2.2 to support Python 3.8" to "[SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8" on Oct 22, 2019.
@SparkQA commented Oct 22, 2019

Test build #112441 has finished for PR 26194 at commit 847079d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 22, 2019

Test build #112443 has finished for PR 26194 at commit 9b2f117.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

The last commit just adds metadata, which does not affect pip check. I double-checked that the Python linter passes.

Merged to master.

Thanks, @srowen, @BryanCutler and @viirya

@jackhhh commented Jul 5, 2020

> Spark 3.0 will support. You can try 3.0 preview.

So, does it mean that I have to upgrade Spark and PySpark to 3.0 to use Python 3.8? I tried to run with Spark 2.3.2 and PySpark 3.0.0, but it failed.

@HyukjinKwon (Member Author)

Yup.

@dongjoon-hyun (Member)

I updated SPARK-29536 by adding 2.4.7 into Affected Versions.

@HyukjinKwon (Member Author)

Thanks @dongjoon-hyun.

@gliptak (Contributor) commented Nov 28, 2020

capitalone/datacompy#88

Ennosigaeon pushed a commit to Ennosigaeon/spark-jobserver that referenced this pull request (pushed repeatedly between Jan 20 and Jan 25, 2021), and bsikander pushed the same commit to spark-jobserver/spark-jobserver on Jan 26, 2021:

Python 2 reached EOL last year and should not be used anymore. This commit replaces all references to the "python" binary with the more explicit "python3" binary. If desired, the build can still be performed for Python 2 by setting the "PYTHON_EXECUTABLE" environment variable to an appropriate version. Additionally, Python wheels are the preferred way to distribute Python code (see https://packaging.python.org/discussions/wheel-vs-egg/). This commit additionally builds the job-server-python wheel.

Spark 2.4 does not support Python >= 3.8 (see apache/spark#26194), leading to failed test cases (TypeError: an integer is required (got type bytes)). If you encounter these issues, try to state a Python executable < 3.8 explicitly.
@Neeraj9697

@HyukjinKwon, I am using Python 3.9 and cloudpickle.py from https://github.com/cloudpipe/cloudpickle/blob/v1.1.1/cloudpickle/cloudpickle.py.

But I am getting the below error message while initializing PySpark on Windows:

Python 3.9.1 (tags/v3.9.1:1e5d33e, Dec 7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "C:\BigDataLocalSetup\Spark\python\pyspark\shell.py", line 31, in <module>
    from pyspark import SparkConf
  File "C:\BigDataLocalSetup\Spark\python\lib\pyspark\pyspark\__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "C:\BigDataLocalSetup\Spark\python\lib\pyspark\pyspark\context.py", line 33, in <module>
    from pyspark.broadcast import Broadcast, BroadcastPickleRegistry
  File "C:\BigDataLocalSetup\Spark\python\lib\pyspark\pyspark\broadcast.py", line 25, in <module>
    from pyspark.cloudpickle import print_exec
ImportError: cannot import name 'print_exec' from 'pyspark.cloudpickle' (C:\BigDataLocalSetup\Spark\python\lib\pyspark\pyspark\cloudpickle.py)

@dongjoon-hyun (Member)

Hi, @Neeraj9697. The following is the result of Apache Spark 3.1.1 with Python 3.9.1 on macOS.

  • Which Spark Version are you using?
  • Are you reporting that PySpark doesn't work on Windows OS specifically?
$ bin/pyspark
Python 3.9.1 (default, Jan 19 2021, 12:49:23)
[Clang 12.0.0 (clang-1200.0.32.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
21/03/05 11:34:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Python version 3.9.1 (default, Jan 19 2021 12:49:23)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1614972879606).
SparkSession available as 'spark'.
>>>

@dongjoon-hyun (Member)

Could you file a JIRA issue with your specific information please, @Neeraj9697 ?

@dongjoon-hyun (Member) commented Mar 5, 2021

We officially documented, as quoted below, that some features (like Arrow) don't work on Python 3.9 due to their own issues. Python 3.9 is not yet tested thoroughly in the Apache Spark community.

> For Python 3.9, Arrow optimization and pandas UDFs might not work due to the supported Python versions in Apache Arrow. Please refer to the latest Python Compatibility page.

@HyukjinKwon (Member Author)

Judging from the error message, it looks like you have multiple PySpark installations locally and the paths got mixed up for some reason. pyspark.cloudpickle is a package now, but your error message says it is a module (as it was before this fix).
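
A quick way to check which copy the interpreter resolves (a minimal diagnostic sketch, not from the original thread):

import pyspark.cloudpickle as cp
# In Spark 3.x pyspark.cloudpickle is a package, so the path ends in
# cloudpickle\__init__.py; a path ending in cloudpickle.py means a stale
# pre-3.x module is being picked up from another installation on sys.path.
print(cp.__file__)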

@Neeraj9697
Thanks @HyukjinKwon and @dongjoon-hyun!
Using Spark 3.1.1, it's working fine now!

https://gist.github.com/Neeraj9697/f7aa1c6951bd3021eb48ccb919cbfd57
Please help me with this problem also.

@dongjoon-hyun (Member) commented Mar 7, 2021

Great! Thank you for the confirmation, @Neeraj9697.

For your new issue, please install python3 on your system.

java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(Unknown Source)
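
If a suitable interpreter is installed but not named python3, pointing PySpark at it explicitly should also work (a hedged sketch; the install path below is hypothetical, but PYSPARK_PYTHON is the standard environment variable PySpark consults for worker processes):

import os
# Set before creating the SparkSession so worker processes pick it up.
# The path is illustrative; use your actual interpreter location.
os.environ["PYSPARK_PYTHON"] = r"C:\Python39\python.exe"

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.range(3).count())  # 3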

superstar305 added a commit to superstar305/spark_jobserver that referenced this pull request on Mar 19, 2023, with the same commit message as above.