
GH-37254: [Python] Parametrize all pickling tests to use both the pickle and cloudpickle modules #37255

Merged
merged 7 commits into from
Aug 24, 2023

Conversation

danepitkin
Member

@danepitkin danepitkin commented Aug 18, 2023

Rationale for this change

Cloudpickle was not tested in most parts of the pyarrow test suite. Improving this coverage will make the Cython 3.0.0 upgrade cleaner, as cloudpickle was failing in a few places where the default pickle module was not. This has been verified using Cython 0.29.36.

What changes are included in this PR?

  • __reduce__ methods that need to pass kwargs have been changed from classmethod to staticmethod
  • All pytests that pickle objects are parameterized to use both pickle and cloudpickle
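The parametrization described above can be sketched with a pytest fixture like the one below. This is a minimal illustration, not the actual pyarrow fixture; the `builtin_pickle` id matches the test names visible later in the CI logs, but the fixture name and structure here are assumptions.

```python
# Sketch of parametrizing pickling tests over both pickle modules.
# Hypothetical fixture; pyarrow's real implementation may differ.
import pickle

import pytest

try:
    import cloudpickle
except ImportError:
    cloudpickle = None

@pytest.fixture(params=[
    pytest.param(pickle, id="builtin_pickle"),
    pytest.param(
        cloudpickle, id="cloudpickle",
        marks=pytest.mark.skipif(cloudpickle is None,
                                 reason="cloudpickle not installed")),
])
def pickle_module(request):
    # Each test using this fixture runs once per pickle implementation.
    return request.param

def test_roundtrip(pickle_module):
    data = {"a": [1, 2, 3]}
    assert pickle_module.loads(pickle_module.dumps(data)) == data
```

Tests that previously called `pickle.dumps` directly take the `pickle_module` fixture instead, so every pickling test exercises both implementations.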

Are these changes tested?

Yes, pytests run successfully with Cython 0.29.36.

Are there any user-facing changes?

No.

@danepitkin danepitkin requested review from kou and pitrou August 18, 2023 19:23
@github-actions

⚠️ GitHub issue #37254 has been automatically assigned in GitHub to PR creator.

@danepitkin
Member Author

danepitkin commented Aug 18, 2023

Cython 3 will set binding=True by default in the compiler. With Cython 0.29.36, we just need to add the binding=True compiler directive to make it backwards compatible. Here's a small repro as an example with Cython 0.29:

my_test.pyx

# cython: language_level=3
import cython

cdef class Foo():
    @staticmethod
    @cython.binding(True)  # This works
    def _reconstruct():
        return Foo()

    def __reduce__(self):
        return Foo._reconstruct, tuple()


cdef class Bar():
    @staticmethod  # This does not work
    def _reconstruct():
        return Bar()

    def __reduce__(self):
        return Bar._reconstruct, tuple()

Usage here:

In [1]: from my_test import Foo, Bar

In [2]: f = Foo()

In [3]: b = Bar()

In [4]: import pickle

In [5]: pickle.dumps(f)
Out[5]: b'\x80\x04\x95#\x00\x00\x00\x00\x00\x00\x00\x8c\x07my_test\x94\x8c\x10Foo._reconstruct\x94\x93\x94)R\x94.'

In [6]: pickle.dumps(b)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
Cell In[6], line 1
----> 1 pickle.dumps(b)

PicklingError: Can't pickle <built-in function _reconstruct>: it's not the same object as my_test._reconstruct
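For contrast, the same `staticmethod` pattern works in pure Python, because pickle can look up the reconstructor by its qualified name (`Foo._reconstruct`) at module level. The Cython failure above happens because, without `binding=True`, the staticmethod compiles to a builtin function that fails that lookup. A minimal pure-Python sketch (class names are illustrative):

```python
import pickle

class Good:
    @staticmethod
    def _reconstruct():
        # Plain function at class scope; pickle (protocol 4+) can
        # resolve it via its __qualname__, "Good._reconstruct".
        return Good()

    def __reduce__(self):
        return Good._reconstruct, ()

# Round-trips without a PicklingError.
obj = pickle.loads(pickle.dumps(Good()))
assert isinstance(obj, Good)
```

pickle verifies that the name it would store resolves back to the same object; the Cython 0.29 builtin-function wrapper fails that identity check, which is exactly what the `PicklingError` above reports.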

@danepitkin
Member Author

There seems to be an intermittent failure reported:

==================================== ERRORS ====================================
___ ERROR at setup of test_filesystem_pickling[builtin_pickle-S3FileSystem] ____

request = <SubRequest 's3fs' for <Function test_filesystem_pickling[builtin_pickle-S3FileSystem]>>
s3_server = {'connection': ('localhost', 38507, 'arrow', 'apachearrow'), 'process': <Popen: returncode: None args: ['minio', '--compat', 'server', '--quiet', '-...>, 'tempdir': '/tmp/tmpoas4aj3y'}

    @pytest.fixture
    def s3fs(request, s3_server):
        request.config.pyarrow.requires('s3')
        from pyarrow.fs import S3FileSystem
    
        host, port, access_key, secret_key = s3_server['connection']
        bucket = 'pyarrow-filesystem/'
    
        fs = S3FileSystem(
            access_key=access_key,
            secret_key=secret_key,
            endpoint_override='{}:{}'.format(host, port),
            scheme='http',
            allow_bucket_creation=True,
            allow_bucket_deletion=True
        )
>       fs.create_dir(bucket)

opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/tests/test_fs.py:250: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/_fs.pyx:603: in pyarrow._fs.FileSystem.create_dir
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   OSError: When creating bucket 'pyarrow-filesystem': AWS Error UNKNOWN (HTTP status 503) during HeadBucket operation: No response body.

Member

@kou kou left a comment


(Sorry, I can't review this deeply because I'm not familiar with pickle.)

Review comments (resolved): python/pyarrow/_dataset_parquet.pyx, python/pyarrow/tests/parquet/test_dataset.py
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Aug 18, 2023
@danepitkin
Member Author

danepitkin commented Aug 21, 2023

For the intermittent failure, I've also seen it in the CI tests for #37097. The 503 error means "Service Unavailable". Since this is a session-scoped pytest fixture and it has only failed on the very first invocation so far, I'm pretty sure that minio is not fully started yet when pytest first calls the server after starting it. We can introduce pytest_xprocess to better manage processes as fixtures. I'll add that to this PR. https://pytest-xprocess.readthedocs.io/en/latest/

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 21, 2023
@danepitkin
Member Author

danepitkin commented Aug 21, 2023


xprocess doesn't quite fit cleanly into what we want. Instead, I tried adding a connection_timeout parameter to s3fs in our tests to see if that helps. We could also add the minio Python client as a test dependency, which retries 503 errors automatically, but I'm not sure we want another test dependency just for that.
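The "server not fully started" race discussed above is typically handled by polling the server before running tests. A minimal sketch of that idea (function name and parameters are illustrative, not pyarrow's actual fixture code):

```python
# Hypothetical helper: block until a freshly started server (e.g. minio)
# accepts TCP connections, or give up after a timeout.
import socket
import time

def wait_for_server(host, port, timeout=10.0, interval=0.2):
    """Return True once (host, port) accepts a connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Server not up yet (connection refused / timed out); retry.
            time.sleep(interval)
    return False
```

A session-scoped fixture could call such a helper after spawning the minio process and before creating the bucket, avoiding the 503 on the first request.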

@danepitkin
Member Author

@github-actions crossbow submit python

@github-actions

Unable to match any tasks for `python`
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/5930280856

@danepitkin
Member Author

@github-actions crossbow submit python

@github-actions

Revision: 0ddbff5

Submitted crossbow builds: ursacomputing/crossbow @ actions-7eabdb808a

Task Status
example-python-minimal-build-fedora-conda Github Actions
example-python-minimal-build-ubuntu-venv Github Actions
python-sdist Github Actions
test-conda-python-3.10 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.4.1 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.4.1 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions
verify-rc-source-python-linux-almalinux-8-amd64 Github Actions
verify-rc-source-python-linux-conda-latest-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-python-macos-amd64 Github Actions
verify-rc-source-python-macos-arm64 Github Actions
verify-rc-source-python-macos-conda-amd64 Github Actions

@danepitkin
Member Author

@github-actions crossbow submit test-conda-python-3.11-hypothesis

@danepitkin danepitkin marked this pull request as draft August 22, 2023 00:57
@github-actions

Revision: fa83e39

Submitted crossbow builds: ursacomputing/crossbow @ actions-7d987d430e

Task Status
test-conda-python-3.11-hypothesis Github Actions

@danepitkin
Member Author

Still need to fix the intermittent failure with minio server startup.

Member

@jorisvandenbossche jorisvandenbossche left a comment


Actual code changes (apart from the S3 issue) look good, thanks a lot for this!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Aug 22, 2023
@danepitkin danepitkin marked this pull request as ready for review August 22, 2023 21:16
@danepitkin danepitkin force-pushed the danepitkin/pytest-cloudpickle branch from 06a21ac to 7710a86 Compare August 22, 2023 21:16
@danepitkin
Member Author

@github-actions crossbow submit python

@github-actions

Revision: 7710a86

Submitted crossbow builds: ursacomputing/crossbow @ actions-c53268ddd6

Task Status
example-python-minimal-build-fedora-conda Github Actions
example-python-minimal-build-ubuntu-venv Github Actions
python-sdist Github Actions
test-conda-python-3.10 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.4.1 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.4.1 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions
verify-rc-source-python-linux-almalinux-8-amd64 Github Actions
verify-rc-source-python-linux-conda-latest-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-python-macos-amd64 Github Actions
verify-rc-source-python-macos-arm64 Github Actions
verify-rc-source-python-macos-conda-amd64 Github Actions

@danepitkin
Member Author

@github-actions crossbow submit test-conda-python-3.10-hdfs*

@github-actions

Revision: 8eb925d

Submitted crossbow builds: ursacomputing/crossbow @ actions-a1c4d5ead7

Task Status
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions

@danepitkin danepitkin force-pushed the danepitkin/pytest-cloudpickle branch from 8eb925d to 1a5b32b Compare August 23, 2023 18:31
@danepitkin
Member Author

@github-actions crossbow submit python

@github-actions

Revision: 1a5b32b

Submitted crossbow builds: ursacomputing/crossbow @ actions-c0a5b0e9b9

Task Status
example-python-minimal-build-fedora-conda Github Actions
example-python-minimal-build-ubuntu-venv Github Actions
python-sdist Github Actions
test-conda-python-3.10 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.4.1 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.4.1 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions
verify-rc-source-python-linux-almalinux-8-amd64 Github Actions
verify-rc-source-python-linux-conda-latest-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-python-macos-amd64 Github Actions
verify-rc-source-python-macos-arm64 Github Actions
verify-rc-source-python-macos-conda-amd64 Github Actions

@danepitkin
Member Author

CI looks good! No S3 intermittent failures reported after 2 full runs. All current failures are unrelated.

Ready to merge IMO.

@kou kou merged commit 175b2a2 into apache:main Aug 24, 2023
10 checks passed
@kou kou removed the awaiting merge Awaiting merge label Aug 24, 2023
@jorisvandenbossche
Member

Nice, thanks a lot Dane!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 175b2a2.

There were 4 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…he pickle and cloudpickle modules (apache#37255)

* Closes: apache#37254

Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

Successfully merging this pull request may close these issues.

[Python] Include cloudpickle in pytests that pickle objects