
[SPARK-37465][PYTHON] Bump minimum pandas version to 1.0.5 #34717

Closed
wants to merge 1 commit

Conversation

@Yikun (Member) commented Nov 26, 2021

What changes were proposed in this pull request?

Bump the minimum pandas version to 1.0.5 (or later).

Why are the changes needed?

Initial discussion in SPARK-37465 and #34314 (comment).

Does this PR introduce any user-facing change?

Yes, it bumps the minimum pandas version.

How was this patch tested?

PySpark test passed with pandas v1.0.5.
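For context, PySpark gates the pandas version at import time with a minimum-version check (the real check lives in `pyspark/sql/pandas/utils.py` and historically used `LooseVersion`; the sketch below is simplified and dependency-free, with names chosen for illustration):

```python
# Sketch of the kind of import-time version gate this PR updates.
# MINIMUM_PANDAS_VERSION is the value the PR bumps from 0.23.2 to 1.0.5.
MINIMUM_PANDAS_VERSION = "1.0.5"

def _parse(version: str):
    # "1.0.5" -> (1, 0, 5); good enough for plain X.Y.Z release strings
    return tuple(int(part) for part in version.split("."))

def require_minimum_pandas_version(installed_version: str) -> None:
    """Raise ImportError if the installed pandas is older than the minimum."""
    if _parse(installed_version) < _parse(MINIMUM_PANDAS_VERSION):
        raise ImportError(
            "Pandas >= %s must be installed; however, your version was %s."
            % (MINIMUM_PANDAS_VERSION, installed_version)
        )
```

In the real code the installed version comes from `pandas.__version__`; here it is passed in so the sketch is self-contained.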

@Yikun (Member, Author) commented Nov 26, 2021

Just to start the discussion: using the SQL below, per [1], we can get the download statistics for pandas over the last 3 months.

SELECT
  file.version AS file_version,
  COUNT(*) AS num_downloads
FROM `the-psf.pypi.file_downloads`
WHERE file.project = 'pandas'
  -- Only query the last 3 months of history
  AND DATE(timestamp)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH)
    AND CURRENT_DATE()
GROUP BY `file_version`
ORDER BY `num_downloads` DESC

Here are the top 21 versions, accounting for about 77% of overall downloads; the complete results can be found here:

rank  version  downloads  percent
1     0.25.3   35149221   14.28%
2     1.1.5    28722806   11.67%
3     1.3.4    20944236    8.51%
4     1.3.3    16861573    6.85%
5     0.24.2   13235233    5.38%
6     1.0.5     9201989    3.74%
7     1.3.2     9077326    3.69%
8     1.2.5     7902532    3.21%
9     1.2.4     5754284    2.34%
10    1.1.4     5710439    2.32%
11    1.1.0     4760847    1.93%
12    1.1.2     4621441    1.88%
13    1.2.3     4607043    1.87%
14    1.0.3     4601230    1.87%
15    0.23.4    4251044    1.73%
16    0.25.0    3862673    1.57%
17    1.2.1     2952346    1.20%
18    1.0.1     2690006    1.09%
19    0.22.0    2680710    1.09%
20    1.2.0     2645339    1.07%
21    0.24.1    2635411    1.07%
  • More than 60% of users downloaded a 1.x version in the last 3 months.
  • More than 26% of users downloaded versions from v0.23.2 up to v1.0.
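The 1.x share can be roughly spot-checked from the top-21 rows alone (a rough check only: the long tail of versions outside the top 21 is excluded, so the split here is indicative, not the exact figure from the full query result):

```python
# Rough check of the major-version split using only the top-21 rows above.
downloads = {
    "0.25.3": 35149221, "1.1.5": 28722806, "1.3.4": 20944236,
    "1.3.3": 16861573, "0.24.2": 13235233, "1.0.5": 9201989,
    "1.3.2": 9077326, "1.2.5": 7902532, "1.2.4": 5754284,
    "1.1.4": 5710439, "1.1.0": 4760847, "1.1.2": 4621441,
    "1.2.3": 4607043, "1.0.3": 4601230, "0.23.4": 4251044,
    "0.25.0": 3862673, "1.2.1": 2952346, "1.0.1": 2690006,
    "0.22.0": 2680710, "1.2.0": 2645339, "0.24.1": 2635411,
}

total = sum(downloads.values())
by_major = {}
for version, count in downloads.items():
    major = version.split(".")[0]
    by_major[major] = by_major.get(major, 0) + count

for major, count in sorted(by_major.items()):
    print("pandas %s.x: %.1f%% of top-21 downloads" % (major, 100.0 * count / total))
```

Within the top 21 alone, 1.x already accounts for well over 60% of downloads, consistent with the first bullet above.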

[1] https://packaging.python.org/guides/analyzing-pypi-package-downloads/

@HyukjinKwon (Member):

cc @ueshin @xinrong-databricks @itholic FYI

@HyukjinKwon (Member):

Let's also update https://github.com/apache/spark/blob/master/python/docs/source/getting_started/install.rst#dependencies

@SparkQA commented Nov 26, 2021

Test build #145645 has finished for PR 34717 at commit 7986d55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50117/

@SparkQA commented Nov 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50117/

@Yikun Yikun force-pushed the pandas-min-version branch from 7986d55 to e521b76 Compare November 27, 2021 01:01
@Yikun Yikun changed the title from "Bump minimum pandas version to 1.0.0" to "[SPARK-37465][PYTHON] Bump minimum pandas version to 1.0.0" Nov 27, 2021
@SparkQA commented Nov 27, 2021

Test build #145671 has finished for PR 34717 at commit e521b76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50141/

@SparkQA commented Nov 27, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50141/

@Yikun Yikun marked this pull request as ready for review November 27, 2021 02:46
@srowen (Member) left a comment:

Seems OK to go ahead and require the stabler 1.x release.

@HyukjinKwon (Member) left a comment:

+1. cc @ueshin @BryanCutler @viirya @xinrong-databricks @itholic FYI

@rshkv (Contributor) commented Nov 28, 2021

I noticed that IntegralExtensionOpsTest.test_invert fails on pandas 1.0.0 and succeeds on 1.0.1 (error below), so it may be safer to require that version. Otherwise everything seems to work with 1.0.0.

Test failure
======================================================================
ERROR [0.404s]: test_invert (pyspark.pandas.tests.data_type_ops.test_num_ops.IntegralExtensionOpsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 122, in assertPandasEqual
    **kwargs
  File "/opt/pyenv/versions/3.7.3/lib/python3.7/site-packages/pandas/_testing.py", line 1137, in assert_series_equal
    assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
  File "/opt/pyenv/versions/3.7.3/lib/python3.7/site-packages/pandas/_testing.py", line 772, in assert_attr_equal
    raise_assert_detail(obj, msg, left_attr, right_attr)
  File "/opt/pyenv/versions/3.7.3/lib/python3.7/site-packages/pandas/_testing.py", line 915, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  object
[right]: Int8

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py", line 498, in test_invert
    self.check_extension(~pser, ~psser)
  File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/testing_utils.py", line 248, in check_extension
    self.assert_eq(left, right)
  File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 223, in assert_eq
    self.assertPandasEqual(lobj, robj, check_exact=check_exact)
  File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 130, in assertPandasEqual
    raise AssertionError(msg) from e
AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  object
[right]: Int8

Left:
dtype: object
object

Right:
dtype: Int8
Int8

@itholic (Contributor) commented Nov 29, 2021

> I noticed that IntegralExtensionOpsTest.test_invert fails on Pandas 1.0.0 and succeeds on 1.0.1 (error below). So maybe safer to recommend that version. Otherwise everything seems to work with 1.0.0.

Yeah, it seems to be a bug in pandas 1.0.0.

  • pandas 1.0.0
>>> pser = pd.Series([1, 2, 3, None], dtype="Int8")
>>> pser
0       1
1       2
2       3
3    <NA>
dtype: Int8
>>> ~pser
0      -2
1      -3
2      -4
3    <NA>
dtype: object  # this should've been `Int8`

Resolved in pandas 1.0.1.

  • pandas 1.0.1
>>> pser = pd.Series([1, 2, 3, None], dtype="Int8")
>>> pser
0       1
1       2
2       3
3    <NA>
dtype: Int8
>>> ~pser
0      -2
1      -3
2      -4
3    <NA>
dtype: Int8

To address this, we could either:

  1. special-case this test for pandas 1.0.0, or
  2. set the minimum pandas version to 1.0.1.

Not sure which way is better, but I think we can just go with 2 if there is no reason to stick with 1.0.0.
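For reference, option 1 would typically be a version-gated skip on the affected test. This is a hypothetical sketch (the version constant is a stand-in for `pandas.__version__`, parsed to a tuple, so the snippet is self-contained):

```python
import unittest

PANDAS_VERSION = (1, 0, 0)  # stand-in for the installed pandas version

class IntegralExtensionOpsTest(unittest.TestCase):
    # Option 1 from above: keep pandas 1.0.0 supported but gate this one
    # test, since `~` on an extension-dtype Series returns `object` there.
    @unittest.skipIf(
        PANDAS_VERSION < (1, 0, 1),
        "pandas < 1.0.1 loses the Int8 dtype on Series.__invert__",
    )
    def test_invert(self):
        pass  # the real Int8 invert assertions would go here

suite = unittest.defaultTestLoader.loadTestsFromTestCase(IntegralExtensionOpsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

On a 1.0.0 install the gated test is reported as skipped instead of failing; on 1.0.1+ it runs normally.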

@srowen (Member) commented Nov 29, 2021

Yeah, just require 1.0.1 for this reason.

@xinrong-meng (Member):

Thanks @Yikun, what do you think about bumping to 1.0.1?

@Yikun (Member, Author) commented Nov 29, 2021

Sure, thanks for the suggestion; I'll update it. I also added a simple test that installs pandas v1.0.1 and runs the tests on #34730; waiting for the result.

:( Update: pandas only publishes Ubuntu wheels after v1.2, so we would have to install many dependencies or `pip install pandas==1.0.1` fails. Instead I installed it in my local env (macOS, x86, which does have a 1.0.1 wheel) and ran `pip install 'pandas==1.0.1'` followed by `python/run-tests --modules=pyspark-pandas,pyspark-pandas-slow --parallelism=2 --python-executable=python3` to test it.

It looks like some test cases failed, for example:

Test failure
======================================================================
ERROR: test_astype (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py", line 204, in test_astype
    self.assert_eq(pser.astype(int), psser.astype(int))
  File "/Users/jiangyikun/spark/spark/python/pyspark/testing/pandasutils.py", line 224, in assert_eq
    robj = self._to_pandas(right)
  File "/Users/jiangyikun/spark/spark/python/pyspark/testing/pandasutils.py", line 245, in _to_pandas
    return obj.to_pandas()
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/series.py", line 1588, in to_pandas
    return self._to_pandas()
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/series.py", line 1594, in _to_pandas
    return self._to_internal_pandas().copy()
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/series.py", line 6349, in _to_internal_pandas
    return self._psdf._internal.to_pandas_frame[self.name]
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/utils.py", line 584, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/internal.py", line 1049, in to_pandas_frame
    pdf = sdf.toPandas()
  File "/Users/jiangyikun/spark/spark/python/pyspark/sql/pandas/conversion.py", line 185, in toPandas
    pdf = pd.DataFrame(columns=tmp_column_names).astype(
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 435, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 239, in init_dict
    val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1449, in construct_1d_arraylike_from_scalar
    dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'

----------------------------------------------------------------------

@Yikun (Member, Author) commented Nov 29, 2021

Completed all pyspark-pandas tests with:

python/run-tests --modules=pyspark-pandas --parallelism=2 --python-executable=python3

Several test cases failed (4 cases failed due to the same issue) on 1.0.1 with `AttributeError: type object 'object' has no attribute 'dtype'`, and passed with pandas v1.0.5 (it might have been fixed in pandas-dev/pandas#34667).

Test failure details
======================================================================
ERROR: test_astype (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py", line 204, in test_astype
    self.assert_eq(pser.astype(int), psser.astype(int))
  File "/Users/jiangyikun/spark/spark/python/pyspark/testing/pandasutils.py", line 224, in assert_eq
    robj = self._to_pandas(right)
  File "/Users/jiangyikun/spark/spark/python/pyspark/testing/pandasutils.py", line 245, in _to_pandas
    return obj.to_pandas()
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/series.py", line 1588, in to_pandas
    return self._to_pandas()
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/series.py", line 1594, in _to_pandas
    return self._to_internal_pandas().copy()
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/series.py", line 6349, in _to_internal_pandas
    return self._psdf._internal.to_pandas_frame[self.name]
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/utils.py", line 584, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/internal.py", line 1049, in to_pandas_frame
    pdf = sdf.toPandas()
  File "/Users/jiangyikun/spark/spark/python/pyspark/sql/pandas/conversion.py", line 185, in toPandas
    pdf = pd.DataFrame(columns=tmp_column_names).astype(
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 435, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 239, in init_dict
    val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1449, in construct_1d_arraylike_from_scalar
    dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'

======================================================================
ERROR: test_read_csv (pyspark.pandas.tests.test_csv.CsvTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/tests/test_csv.py", line 151, in test_read_csv
    check(usecols=[])
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/tests/test_csv.py", line 138, in check
    self.assert_eq(expected, actual, almost=True)
  File "/Users/jiangyikun/spark/spark/python/pyspark/testing/pandasutils.py", line 224, in assert_eq
    robj = self._to_pandas(right)
  File "/Users/jiangyikun/spark/spark/python/pyspark/testing/pandasutils.py", line 245, in _to_pandas
    return obj.to_pandas()
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/frame.py", line 4856, in to_pandas
    return self._to_pandas()
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/frame.py", line 4862, in _to_pandas
    return self._internal.to_pandas_frame.copy()
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/utils.py", line 584, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/internal.py", line 1049, in to_pandas_frame
    pdf = sdf.toPandas()
  File "/Users/jiangyikun/spark/spark/python/pyspark/sql/pandas/conversion.py", line 185, in toPandas
    pdf = pd.DataFrame(columns=tmp_column_names).astype(
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 435, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 239, in init_dict
    val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1449, in construct_1d_arraylike_from_scalar
    dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'

======================================================================
ERROR: test_kde_plot (pyspark.pandas.tests.plot.test_frame_plot_plotly.DataFramePlotPlotlyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/tests/plot/test_frame_plot_plotly.py", line 262, in test_kde_plot
    actual = psdf.plot.kde(bw_method=5, ind=3)
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/plot/core.py", line 946, in kde
    return self(kind="kde", bw_method=bw_method, ind=ind, **kwargs)
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/plot/core.py", line 498, in __call__
    return plot_backend.plot_pandas_on_spark(plot_data, kind=kind, **kwargs)
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/plot/plotly.py", line 44, in plot_pandas_on_spark
    return plot_kde(data, **kwargs)
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/plot/plotly.py", line 202, in plot_kde
    pd.DataFrame(
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 435, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 254, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 69, in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 322, in _homogenize
    val = sanitize_array(
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/construction.py", line 465, in sanitize_array
    subarr = construct_1d_arraylike_from_scalar(value, len(index), dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1461, in construct_1d_arraylike_from_scalar
    subarr = np.empty(length, dtype=dtype)
TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type

======================================================================
ERROR: test_kde_plot (pyspark.pandas.tests.plot.test_series_plot_plotly.SeriesPlotPlotlyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/tests/plot/test_series_plot_plotly.py", line 231, in test_kde_plot
    actual = psdf.a.plot.kde(bw_method=5, ind=3)
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/plot/core.py", line 946, in kde
    return self(kind="kde", bw_method=bw_method, ind=ind, **kwargs)
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/plot/core.py", line 498, in __call__
    return plot_backend.plot_pandas_on_spark(plot_data, kind=kind, **kwargs)
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/plot/plotly.py", line 44, in plot_pandas_on_spark
    return plot_kde(data, **kwargs)
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/plot/plotly.py", line 202, in plot_kde
    pd.DataFrame(
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 435, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 254, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 69, in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 322, in _homogenize
    val = sanitize_array(
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/construction.py", line 465, in sanitize_array
    subarr = construct_1d_arraylike_from_scalar(value, len(index), dtype)
  File "/Users/jiangyikun/venv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1461, in construct_1d_arraylike_from_scalar
    subarr = np.empty(length, dtype=dtype)
TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type

At this point, I'd prefer to bump to 1.0.5. I'm going to run pyspark-pandas-slow now.

@Yikun (Member, Author) commented Nov 29, 2021

There is only a precision error in one pyspark-pandas-slow test case; we could add the `almost` flag:

Test failure details
======================================================================
FAIL: test_mad (pyspark.pandas.tests.test_series.SeriesTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jiangyikun/spark/spark/python/pyspark/pandas/tests/test_series.py", line 2235, in test_mad
    self.assert_eq(pser.mad(), psser.mad())
  File "/Users/jiangyikun/spark/spark/python/pyspark/testing/pandasutils.py", line 240, in assert_eq
    self.assertEqual(lobj, robj)
AssertionError: 21.555555555555554 != 21.555555555555557

----------------------------------------------------------------------
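The two values above differ only in the last bits of a 64-bit float, likely because the two sides sum in a different order. That is exactly what a tolerance-based ("almost") comparison absorbs; a minimal sketch of the idea, with the actual `assert_eq(..., almost=True)` machinery in `pyspark.testing.pandasutils` simplified away:

```python
import math

# The two mad() results from the failure above.
left, right = 21.555555555555554, 21.555555555555557

# Exact comparison fails: the values differ by a rounding artifact.
print(left == right)                              # False

# A tolerance-based check passes, since the relative difference is
# on the order of 1e-16, far below any reasonable tolerance.
print(math.isclose(left, right, rel_tol=1e-9))    # True
```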

@Yikun Yikun force-pushed the pandas-min-version branch from e521b76 to 7b1de6d Compare November 29, 2021 11:45
@Yikun Yikun changed the title [SPARK-37465][PYTHON] Bump minimum pandas version to 1.0.0 [SPARK-37465][PYTHON] Bump minimum pandas version to 1.0.5 Nov 29, 2021
@Yikun (Member, Author) commented Nov 29, 2021

As a conclusion here:

I bumped the minimum pandas version to v1.0.5, which is also the latest release of the pandas 1.0 series.

Ready for review. :)

@SparkQA commented Nov 29, 2021

Test build #145720 has finished for PR 34717 at commit 7b1de6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50189/

@SparkQA commented Nov 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50189/

@viirya (Member) left a comment:

Seems okay. One comment about the doc.

@@ -387,7 +387,7 @@ working with timestamps in ``pandas_udf``\s to get the best performance, see
Recommended Pandas and PyArrow Versions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-For usage with pyspark.sql, the minimum supported versions of Pandas is 0.23.2 and PyArrow is 1.0.0.
+For usage with pyspark.sql, the minimum supported versions of Pandas is 1.0.5 and PyArrow is 1.0.0.
A member commented:

Should we mention there are some issues with versions like 1.0.0, 1.0.1?

@Yikun (Member, Author) commented Nov 30, 2021:

How about:

For usage with pyspark.sql, the minimum supported versions of pandas is 1.0.5 and PyArrow is 1.0.0. Lower versions (note that there are some known issues with v1.0.0 and v1.0.1; see more in link) or higher versions may be used; however, compatibility and data correctness cannot be guaranteed and should be verified by the user.

Maybe this needs more suggestions from a native speaker. T_T If necessary, we could do it in a later commit in this PR or in a follow-up.

@BryanCutler (Member) left a comment:

LGTM, I think v1.0.5 is a reasonable minimum.

@itholic (Contributor) commented Nov 30, 2021

LGTM if remaining comments are resolved.

@Yikun Yikun force-pushed the pandas-min-version branch from 7b1de6d to 054905f Compare November 30, 2021 03:23
@SparkQA commented Nov 30, 2021

Test build #145744 has finished for PR 34717 at commit 054905f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50215/

@SparkQA commented Nov 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50215/

@HyukjinKwon (Member):

Merged to master.
