Skip to content

Deprecate read_feather nthreads argument + update feather-format to pyarrow.feather #23112

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 1, 2018
Merged

Conversation

ingwinlu
Copy link
Contributor

@ingwinlu ingwinlu commented Oct 12, 2018

@pep8speaks
Copy link

pep8speaks commented Oct 12, 2018

Hello @ingwinlu! Thanks for updating the PR.

Comment last updated on October 15, 2018 at 05:42 Hours UTC

@jreback
Copy link
Contributor

jreback commented Oct 12, 2018

this would need a whatsnew note
and i believe this is tested so pls update the tests and catch the warning

@codecov
Copy link

codecov bot commented Oct 13, 2018

Codecov Report

Merging #23112 into master will increase coverage by <.01%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23112      +/-   ##
==========================================
+ Coverage   92.21%   92.22%   +<.01%     
==========================================
  Files         161      161              
  Lines       51187    51191       +4     
==========================================
+ Hits        47202    47210       +8     
+ Misses       3985     3981       -4
Flag Coverage Δ
#multiple 90.6% <16.66%> (-0.05%) ⬇️
#single 42.26% <83.33%> (+0.04%) ⬆️
Impacted Files Coverage Δ
pandas/io/feather_format.py 89.74% <83.33%> (+12.6%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9019582...e7d576d. Read the comment docs.

@ingwinlu
Copy link
Contributor Author

I test that the warning is issued now. Also rebased and followed the commit message guidelines. Added an whatsnew entry as well.

@jorisvandenbossche
Copy link
Member

Shouldn't it be replaced with use_threads ?

@ingwinlu
Copy link
Contributor Author

this would require pinning pyarrow > 0.10.0 as a dependency

@@ -33,6 +33,8 @@ Fixed Regressions
Development
~~~~~~~~~~~
- The minimum required pytest version has been increased to 3.6 (:issue:`22319`)
- Deprecated the `nthreads` keyword of `pandas.read_feather()` in favor of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the :func: reference here

@@ -96,6 +101,11 @@ def read_feather(path, nthreads=1):
Number of CPU threads to use when reading to pandas.DataFrame

.. versionadded 0.21.0
.. deprecated 0.23.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.24.0

@ingwinlu
Copy link
Contributor Author

@jreback can you have another look?

path = _stringify_path(path)

if LooseVersion(feather.__version__) < LooseVersion('0.4.0'):
if LooseVersion(feather.__version__) < LooseVersion('0.4.0') or \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: start the line with parentheses instead of using a backslash to continue the line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a suggestion for the alternative? It does not line up as nicely (since it would require an additional indent of the second line). Making it in my opinion harder to read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (LooseVersion(feather.__version__) < LooseVersion('0.4.0') or
        LooseVersion(pyarrow.__version__) < LooseVersion('0.11.0')):
    return feather.read_dataframe(path)

@@ -979,3 +979,5 @@ Other
- :meth:`~pandas.io.formats.style.Styler.bar` now also supports tablewise application (in addition to rowwise and columnwise) with ``axis=None`` and setting clipping range with ``vmin`` and ``vmax`` (:issue:`21548` and :issue:`21526`). ``NaN`` values are also handled properly.
- Logical operations ``&, |, ^`` between :class:`Series` and :class:`Index` will no longer raise ``ValueError`` (:issue:`22092`)
- Bug in :meth:`DataFrame.combine_first` in which column types were unexpectedly converted to float (:issue:`20699`)
- Deprecated the `nthreads` keyword of :func:`pandas.read_feather()` in favor of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the parentheses in read_feather()

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this emit any warnings in the tests?

I would actually be ok with removing support for feather < 0.4.0 if it makes it easier here. you may actually need to do this as importing an older feather doesn't have pyarrow as a dep so the importer might fail.

@jreback jreback added Deprecate Functionality to remove in pandas IO Data IO issues that don't fit into a more specific label labels Oct 15, 2018
@ingwinlu
Copy link
Contributor Author

ingwinlu commented Oct 15, 2018

does this emit any warnings in the tests?

None, but ci does not test with feather 0.3.1 AFAIK.

@jreback can we directly depend on pyarrow without getting there via feather? That would make it simpler.

I am not sure if going over feather + pyarrow combinations and checking them all for how to call them is the way to go.

feather-format, 0.3.1 - no pyarrow - does not support any args
feather-format, 0.4.0 - pyarrow > 0.4.0 - does support nthreads
feather-format, 0.4.0 - pyarrow >= 0.11.0 - does not support nthreads, supports use_threads.

So the current implementation with import pyarrow should not be merged as it would not play nice in case someone is still on feather-format 0.3.1.

//edit: maybe ask feather to release a new version that pins to a higher pyarrow version...

@jreback
Copy link
Contributor

jreback commented Oct 17, 2018

@ingwinlu actually I will revise the though above. Am ok with dropping support for feather entirely and just using pyarrow here. Can you revise?

@ingwinlu
Copy link
Contributor Author

Sure. Will probably work on it later today.

@ingwinlu
Copy link
Contributor Author

@jreback do you want to depend on pyarrow 0.4.0 or go for 0.11.0 directly?

@jreback
Copy link
Contributor

jreback commented Oct 17, 2018

if u can use our min version of pyarrow (we use for parquet)

we could bump that slightly also but.l not past 0.8.0

@ingwinlu
Copy link
Contributor Author

@jreback done. don't think the windows test fail is related to the changes in the PR.

@ingwinlu
Copy link
Contributor Author

This is still missing the removal of feather-format refs in ci configs as well as some doc entry in source/io I missed.

@h-vetinari
Copy link
Contributor

If dropping feather, will also close #21639

@jorisvandenbossche jorisvandenbossche changed the title Deprecate nthreads argument Deprecate read_feather nthreads argument + update feather-format to pyarrow.feather Oct 18, 2018
@ingwinlu
Copy link
Contributor Author

I did not rewrite the io.feather section of the docs. I feel like the currently linked repository (feather) provides more information.

If you feel like it is necessary we can add a point to the caveats listed where we express that we directly require the upstream pyarrow library.

@@ -1004,3 +1004,7 @@ Other
- :meth:`~pandas.io.formats.style.Styler.bar` now also supports tablewise application (in addition to rowwise and columnwise) with ``axis=None`` and setting clipping range with ``vmin`` and ``vmax`` (:issue:`21548` and :issue:`21526`). ``NaN`` values are also handled properly.
- Logical operations ``&, |, ^`` between :class:`Series` and :class:`Index` will no longer raise ``ValueError`` (:issue:`22092`)
- Bug in :meth:`DataFrame.combine_first` in which column types were unexpectedly converted to float (:issue:`20699`)
- Deprecated the `nthreads` keyword of :func:`pandas.read_feather` in favor of
`use_threads` to reflect the changes in pyarrow 0.11.0. (:issue:`23053`)
- Drop `feather-format` as a dependency for feather based storage and use
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to api-breaking changes section

@jreback jreback added this to the 0.24.0 milestone Oct 18, 2018
@@ -233,6 +233,10 @@ If installed, we now require:
| scipy | 0.18.1 | |
+-----------------+-----------------+----------+


Additionally we no longer depend on `feather-format` for feather based storage
and replaced it with references to `pyarrow`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue refernce here

@jreback
Copy link
Contributor

jreback commented Oct 22, 2018

going to merge after passing: #23281, which you will need to revert the xfails.

@jreback
Copy link
Contributor

jreback commented Oct 23, 2018

can you rebase this and check changes in #23281

@ingwinlu
Copy link
Contributor Author

Rebased and reactivated part of the disabled tests from #23281. Did not check if the rest could also be resolved by the changes in this PR (parquet + gbq).

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. let's just add a prior to change pyarrow version to test (0.10.0), ping on green.

@@ -235,6 +235,11 @@ If installed, we now require:
+-----------------+-----------------+----------+
| scipy | 0.18.1 | |
+-----------------+-----------------+----------+
| pyarrow | 0.4.1 | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently already the minimal supported version no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but not if you don't have feather

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ingwinlu actually maybe take out pyarrow from the table, your comment below is fine

if LooseVersion(feather.__version__) < LooseVersion('0.4.0'):
return feather.read_dataframe(path)
if LooseVersion(pyarrow.__version__) < LooseVersion('0.11.0'):
return feather.read_feather(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should still pass nthreads here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ingwinlu can you check this comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nthreads is not available anymore after the conversion in the wrapper function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need to do it in some other way, but not passing it here means breaking the functionality for people having pyarrow < 0.11 (which we still support)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ingwinlu what you could do is remove the mapping part of the deprecation, then you can easily just pass the argument here as read_feather(path, nthreads=int(use_threads)) I think would work, then on the other branch, just pass read_feather(path, use_threads=bool(use_threads))

@jreback
Copy link
Contributor

jreback commented Oct 26, 2018

can you rebase and fixup?

@ingwinlu
Copy link
Contributor Author

I can not replicate the

>   import pyarrow.compat as compat
E   AttributeError: module 'pyarrow' has no attribute 'compat'

issues locally. Any pointers?

@TomAugspurger
Copy link
Contributor

Maybe a different version of pyarrow? https://travis-ci.org/pandas-dev/pandas/jobs/446626253#L2219

@ingwinlu
Copy link
Contributor Author

ingwinlu commented Oct 27, 2018

I replicated the conda environment of that test run. parquet tests run correctly but feather tests are not run at all due to an import arrow on pyarrow:

Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/winlu/miniconda3/envs/pandas/lib/python3.6/site-packages/pyarrow/__init__.py", line 60, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: /home/winlu/miniconda3/envs/pandas/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.1: undefined symbol: _ZN5boost13match_resultsIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS5_EEEE12maybe_assignERKS9_
>>> 

Could that be some caching issue?

Was not using boost from conda-forge and hence had a missmatch in ABI's.

The other issue with compat seems to be related to apache/arrow#2634 which I will test now.

@ingwinlu
Copy link
Contributor Author

@jreback should be good now.

if LooseVersion(feather.__version__) < LooseVersion('0.4.0'):
return feather.read_dataframe(path)
if LooseVersion(pyarrow.__version__) < LooseVersion('0.11.0'):
return feather.read_feather(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ingwinlu can you check this comment?

@@ -235,6 +235,11 @@ If installed, we now require:
+-----------------+-----------------+----------+
| scipy | 0.18.1 | |
+-----------------+-----------------+----------+
| pyarrow | 0.4.1 | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ingwinlu actually maybe take out pyarrow from the table, your comment below is fine

if LooseVersion(feather.__version__) < LooseVersion('0.4.0'):
return feather.read_dataframe(path)
if LooseVersion(pyarrow.__version__) < LooseVersion('0.11.0'):
return feather.read_feather(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ingwinlu what you could do is remove the mapping part of the deprecation, then you can easily just pass the argument here as read_feather(path, nthreads=int(use_threads)) I think would work, then on the other branch, just pass read_feather(path, use_threads=bool(use_threads))

@ingwinlu
Copy link
Contributor Author

nthreads=0 makes some fun results >.>

@jreback
Copy link
Contributor

jreback commented Oct 28, 2018

@ingwinlu thanks for the responsiveness. @jorisvandenbossche over to you.

@jreback
Copy link
Contributor

jreback commented Nov 1, 2018

can you rebase once more.

@ingwinlu
Copy link
Contributor Author

ingwinlu commented Nov 1, 2018

done

The nthreads argument is no longer supported since pyarrow 0.11.0 and
was replaced with use_threads.
Hence we deprecate the argument now as well so we can remove it in the
future.

This commit also:
- removes feather-format as a dependency and replaces it with usage of
  pyarrow directly.
- sets CI dependencies to respect the changes above.

We test backwards compatibility with pyarrow 0.9.0 as conda does not
provide a pyarrow 0.10.0 and the conda-forge version has comatibility
issues with the rest of the installed packages.

Resolves #23053.
Resolves #21639.

return feather.read_dataframe(path, nthreads=nthreads)
return feather.read_feather(path, use_threads=bool(use_threads))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is fully correct. If someone did before nthreads=1 (which meant: no additional threads), this will be translated into use_threads=True.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, maybe that is not really a problem since the default in pyarrow also actually changed from nthreads=1 to use_threads=True

@jorisvandenbossche jorisvandenbossche merged commit 6b9318c into pandas-dev:master Nov 1, 2018
@jorisvandenbossche
Copy link
Member

@ingwinlu Thanks a lot!

thoo added a commit to thoo/pandas that referenced this pull request Nov 3, 2018
…xamples

* repo_org/master: (66 commits)
  CLN: doc string (pandas-dev#23469)
  DOC: Add cookbook entry for triangular correlation matrix (GH22840) (pandas-dev#23032)
  add number of Errors, Warnings to scripts/validate_docstrings.py (pandas-dev#23150)
  BUG: Allow freq conversion from dt64 to period (pandas-dev#23460)
  ENH: Add FrozenList.union and .difference (pandas-dev#23394)
  REF: cython cleanup, typing, optimizations (pandas-dev#23464)
  strictness and checks for Timedelta _simple_new (pandas-dev#23433)
  Fixing flake8 problems new to flake8 3.6.0 (pandas-dev#23472)
  DOC: Updating the docstring of Series.dot  (pandas-dev#22890)
  TST: Fixturize series/test_analytics.py (pandas-dev#22755)
  BUG/ENH: Handle NonexistentTimeError in date rounding (pandas-dev#23406)
  PERF: speed up concat on Series by making _get_axis_number() a classmethod (pandas-dev#23404)
  REF: Remove DatetimelikeArrayMixin._shallow_copy (pandas-dev#23430)
  REF: strictness/simplification in DatetimeArray/Index _simple_new (pandas-dev#23431)
  REF: cython cleanup, typing, optimizations (pandas-dev#23456)
  TST: tweak Hypothesis configuration and idioms (pandas-dev#23441)
  BUG: fix HDFStore.append with all empty strings error (GH12242) (pandas-dev#23435)
  TST: Skip 32bit failing IntervalTree tests (pandas-dev#23442)
  BUG: Deprecate nthreads argument (pandas-dev#23112)
  style: fix import format at pandas/core/reshape (pandas-dev#23387)
  ...
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
The nthreads argument is no longer supported since pyarrow 0.11.0 and
was replaced with use_threads.
Hence we deprecate the argument now as well so we can remove it in the
future.

This commit also:
- removes feather-format as a dependency and replaces it with usage of
  pyarrow directly.
- sets CI dependencies to respect the changes above.

We test backwards compatibility with pyarrow 0.9.0 as conda does not
provide a pyarrow 0.10.0 and the conda-forge version has comatibility
issues with the rest of the installed packages.

Resolves pandas-dev#23053.
Resolves pandas-dev#21639.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
The nthreads argument is no longer supported since pyarrow 0.11.0 and
was replaced with use_threads.
Hence we deprecate the argument now as well so we can remove it in the
future.

This commit also:
- removes feather-format as a dependency and replaces it with usage of
  pyarrow directly.
- sets CI dependencies to respect the changes above.

We test backwards compatibility with pyarrow 0.9.0 as conda does not
provide a pyarrow 0.10.0 and the conda-forge version has comatibility
issues with the rest of the installed packages.

Resolves pandas-dev#23053.
Resolves pandas-dev#21639.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
The nthreads argument is no longer supported since pyarrow 0.11.0 and
was replaced with use_threads.
Hence we deprecate the argument now as well so we can remove it in the
future.

This commit also:
- removes feather-format as a dependency and replaces it with usage of
  pyarrow directly.
- sets CI dependencies to respect the changes above.

We test backwards compatibility with pyarrow 0.9.0 as conda does not
provide a pyarrow 0.10.0 and the conda-forge version has comatibility
issues with the rest of the installed packages.

Resolves pandas-dev#23053.
Resolves pandas-dev#21639.
@ingwinlu ingwinlu deleted the deprecate_nthreads branch April 27, 2021 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pandas/io/feather_format.py should call use_threads instead of nthreads to prevent breakage in pyarrow 0.11.0
6 participants