Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid Kedro fsspec requirements being mutually incompatible with pandas 1.1.0 #489

Closed
JMBurley opened this issue Aug 20, 2020 · 17 comments
Closed
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@JMBurley
Copy link

JMBurley commented Aug 20, 2020

Description

Kedro 16.4 enforces fsspec<0.7.0,>=0.5.1 & PANDAS = "pandas>=0.24, <1.0.4" in setup.py

However,

  1. the pd upper version limit is not enforced, so newer pd versions can coexist after pip install kedro
  2. The blocking issue (for kedro including higher pd versions) on parquet reads in pandas is resolved re: BUG: read_parquet no longer supports file-like objects pandas-dev/pandas#34467 in pd >1.0.5, so the key blocker on Kedro upgrading pandas versions no longer exists.

IF kedro allows pandas 1.1.0, then you are going to hit an incompatibility with fsspec, as pandas 1.1.0 requires fsspec>=0.7.4.

Context

You cannot run kedro and pandas>1.1.0 on the same environment. Pandas needs fsspec>0.7.4.

There are meaningful improvements in newer pandas, so I would like to be able to run them together out-of-the-box.

The current reason to not allow higher pd versions as per setup.py in the kedro source code (pandas-dev/pandas#34467) is no longer applicable, so i think it is time to make this change unless fsspec has deal-breaking problems in later versions.

Possible Implementation

Can you just update fsspec version requirements to fsspec<0.7.4,>=0.5.1? I'm not aware of any major problems in the newer versions

Possible Alternatives

Raise warnings or errors if kedro co-exists with pd>=1.1.0

@JMBurley JMBurley added the Issue: Feature Request New feature or improvement to existing feature label Aug 20, 2020
@JMBurley
Copy link
Author

JMBurley commented Aug 20, 2020

Happy to make pull request with my suggested implementation if desired. Although it is only a 1-liner edit so barely saving any labour...

@JMBurley
Copy link
Author

nb./ This issue is a superset of issue #488. Modin compatibility will be fully resolved with an update to fsspec and pandas 1.1.0

@fjp
Copy link

fjp commented Aug 25, 2020

@JMBurley would it also be possible to update to fsspec 0.8.0 if you create a PR or do you see any problems with the latest version?

@JMBurley
Copy link
Author

JMBurley commented Aug 25, 2020

@fjp I could update fsspec in kedro setup files, but I don't think I should.

Kedro has a lot of I/O functionality that could intersect with fsspec in a lot of ways. Unless quantumBlack is very confident in their hooks and tests I don't think I should be making that change (particularly as I don't understand fsspec very deeply).

On the other hand, I am a (minor) Pandas contributor and understand that library well enough to know that 1.1.0 is a good upgrade and will not break data pipelines if the fsspec dependency is correctly managed (gcsfs and similar have been stable for long enough that it shouldn't be an issue for kedro)

@lorenabalan
Copy link
Contributor

lorenabalan commented Aug 27, 2020

Hi guys, thanks a lot for the interest here. Unfortunately fsspec 0.7.* introduced a breaking change for us (namely on versioning), which was addressed in develop: 3218055. We've unpinned pandas for now, but broader fsspec boundaries will be made available in the next major release (0.17.0).

@JMBurley
Copy link
Author

JMBurley commented Aug 27, 2020

Thanks @lorenabalan, if I'm reading develop: 3218055 correctly then it would make sense to allow broader pandas boundaries as part of the same feature release in 0.17.0?

The reason quantum Black unpinned pandas (pandas-dev/pandas#34467) is solved, and the fsspec problem that would have been introduced by a pd update is now fixed.

Would only need to change the appropriate pandas-determining lines in requirements.txt and setup.py.

@tamsanh
Copy link
Contributor

tamsanh commented Aug 31, 2020

Oops. FYI, I'm running into an incompatibility with s3fs 0.5.0 as it requires fsspec>=0.8.0.

@sansagara
Copy link

Oops. FYI, I'm running into an incompatibility with s3fs 0.5.0 as it requires fsspec>=0.8.0.

Im having this same issue. Its keeping me from creating my environment. A lower version of s3fs is not really a good option for me.

@lorenabalan
Copy link
Contributor

Hey @JMBurley , I just noticed with Kedro 0.16.5, I get new pandas==1.1.2 and fsspec==0.6.3 after kedro install. Are you able to validate on your end that it's indeed the case and the original parquet problem is fixed? Feel free to close the issue once that's confirmed.

@JMBurley
Copy link
Author

JMBurley commented Sep 16, 2020

Hi @lorenabalan ,

Thanks for the update, appreciate having a higher pandas version available, that's great.

However, if I'm reading your comment correctly kedro 0.16.5 has created the exact problem I was forewarning against, pandas>=1.1.0 necessitates fsspec>=0.7.4.

If kedro enforces pandas==1.1.2 and fsspec==0.6.3 then all of the s3 operations in pandas will fail (eg. pd.read_csv("s3://my-bucket/some-file-key.csv").

QuantumBlack can keep track of Pandas dependencies here: https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html

PS. To clear up any confusion, I mentioned parquet as it was the blocker on kedro allowing higher pd versions. It is not the key part of this issue, which is flagging up kedro compatibility issues if it allows higher pandas versions without higher fsspec

@DmitriiDeriabinQB
Copy link
Contributor

@JMBurley pandas==1.1.2 has fsspec in optional arguments, meaning that dependency resolution tools have no idea that it depends on fsspec. You can confirm that by installing pandas==1.1.2 and running pip show pandas:

Name: pandas
Version: 1.1.2
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: .../python3.7/site-packages
Requires: pytz, python-dateutil, numpy  # <<< no fsspec here
Required-by:

Kedro has not created a problem, Kedro declares its dependencies and the corresponding acceptable version ranges using the standard mechanism - install_requires in setup.py. If pandas is not specifying the dependencies the way that is recognised by the dependency resolution tools, then Kedro has no control over it.

pandas==1.1.2 works within Kedro 0.16.5 without issues since Kedro pulls the data from the filesystem using fsspec directly and then pass the file-like object to pandas.read_* methods. If you, however, intend to read/write to non-local locations using pandas directly from within thee environment where Kedro is installed, that will not work. This is the reason we strongly recommend having isolated environment for each project. The solution would be to cap the upper bound version of pandas in src/requirements.txt (or src/requirements.in if already compiled) of your Kedro project.

Lastly, correct dependency version resolution is not the responsibility of Kedro library, but rather package management tools like pip, pip-compile, poetry, etc. Kedro uses pip-compile under the hood when installing project dependencies, however pandas doesn't expose enough information about its dependencies to pip-compile to resolve the versions correctly.

@lorenabalan
Copy link
Contributor

lorenabalan commented Sep 17, 2020

Hello again!
So what happened was we relaxed the pandas requirements, to be just >0.24.0. fsspec is an optional requirement for pandas, so it's not an impediment in fetching the latest pandas. fsspec is pulled according to Kedro core requirements, <0.7.0.
The way Kedro interacts with fsspec means pandas doesn't have to, so there are actually no compatibility issues on that front. We read the data using fsspec directly and then pass a file-like object to the pandas.read_csv. The example you gave, with passing a path/string indeed won't work, but fortunately that is not how Kedro datasets work so it should be safe.

We hear you all on the fsspec bump, and as I said this has been implemented on develop and will be made available on 0.17.0. Hopefully this release will come sooner rather than later, thanks to you all raising awareness/visibility on your issues! :)

@lorenabalan
Copy link
Contributor

@DmitriiDeriabinQB has beat me to it. 😂 ⏲️

@JMBurley
Copy link
Author

JMBurley commented Sep 17, 2020

Thanks both,

Would I be correct in understanding the official position here as:
"all file loads should be handled by .yml in your catalog folder; it is anti-pattern to directly read files with pandas, therefore we don't view it as critical for kedro to facilitate pandas s3 functionality, nor maintain it if it is incidentally present in some kedro versions"?

I understand the position if so, although I think there are good reasons to prototype eda in a project using pandas s3 functionality rather than hopping in and out of the data catalog to create potentially disposable items.

Regardless, looking forward to v0.17.0!

PS. I agree that anyone who doesn't use virtual environments deserves the problems they encounter

@DmitriiDeriabinQB
Copy link
Contributor

@JMBurley, we would be happy not to break pandas, however, since we require older version of fsspec and pandas doesn't register fsspec in its core dependencies, it's rather hard to do. We could cap pandas version in our requirements, however this would mean that other users (that don't require direct s3 access from it) would be unable to use the newest version of pandas. We will have a look at upgrading our dependency on fsspec, however it may affect other datasets.

Using pandas directly for eda makes perfect sense, we will have a look at what can be done here.

@andrii-ivaniuk
Copy link
Contributor

fsspec maximum required version was updated in 151cbcd commit.

fsspec>=0.5.1, <0.9

@andrii-ivaniuk
Copy link
Contributor

The changes will be available in the upcoming Kedro 0.17.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

7 participants