Skip to content

Stop inheriting from pyarrow.filesystem for pyarrow>=2.0 #411

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jorisvandenbossche
Copy link
Contributor

See discussion in #295

Starting with pyarrow 2.0, the legacy pyarrow.filesystem FileSystems will be officially deprecated. As preparation, I think that fsspec can also stop inheriting from it.

Almost all methods on the base class from pyarrow are actually overridden / implemented in fsspec, so no longer inheriting from it shouldn't have a big impact, except for isinstance checks. I am doing some changes in pyarrow to allow fsspec filesystems instead of doing a strict isinstance.

@@ -68,7 +69,10 @@ def __call__(cls, *args, **kwargs):
# TODO: it should be possible to disable this
import pyarrow as pa

up = pa.filesystem.DaskFileSystem
if LooseVersion(pa.__version__) < LooseVersion("2.0"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to use an else clause for this, rather than putting it in the try block, so that the code in the try block is minimized?

@martindurant
Copy link
Member

Great to see this.

It would be nice not to import pyarrow at all, to improve fsspec import times. Can the version check be done without importing? This works, but needs distutils:

import pkg_resources
pkg_resources.get_distribution('pyarrow').version

I am doing some changes in pyarrow

Can you please link?

@jorisvandenbossche
Copy link
Contributor Author

The pyarrow PR is here: apache/arrow#8149

@jorisvandenbossche
Copy link
Contributor Author

It would be nice not to import pyarrow at all, to improve fsspec import times. Can the version check be done without importing? This works, but needs distutils:

In my development environment, I get an error for this:

In [6]: import pkg_resources
   ...: pkg_resources.get_distribution('pyarrow').version
---------------------------------------------------------------------------
DistributionNotFound                      Traceback (most recent call last)
<ipython-input-6-da7d87549900> in <module>
      1 import pkg_resources
----> 2 pkg_resources.get_distribution('pyarrow').version

~/miniconda3/envs/arrow-dev/lib/python3.7/site-packages/pkg_resources/__init__.py in get_distribution(dist)
    479         dist = Requirement.parse(dist)
    480     if isinstance(dist, Requirement):
--> 481         dist = get_provider(dist)
    482     if not isinstance(dist, Distribution):
    483         raise TypeError("Expected string, Requirement, or Distribution", dist)

~/miniconda3/envs/arrow-dev/lib/python3.7/site-packages/pkg_resources/__init__.py in get_provider(moduleOrReq)
    355     """Return an IResourceProvider for the named module or requirement"""
    356     if isinstance(moduleOrReq, Requirement):
--> 357         return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
    358     try:
    359         module = sys.modules[moduleOrReq]

~/miniconda3/envs/arrow-dev/lib/python3.7/site-packages/pkg_resources/__init__.py in require(self, *requirements)
    898         included, even if they were already activated in this working set.
    899         """
--> 900         needed = self.resolve(parse_requirements(requirements))
    901 
    902         for dist in needed:

~/miniconda3/envs/arrow-dev/lib/python3.7/site-packages/pkg_resources/__init__.py in resolve(self, requirements, env, installer, replace_conflicting, extras)
    784                     if dist is None:
    785                         requirers = required_by.get(req, None)
--> 786                         raise DistributionNotFound(req, requirers)
    787                 to_activate.append(dist)
    788             if dist not in req:

DistributionNotFound: The 'pyarrow' distribution was not found and is required by the application

Maybe that doesn't work with editable installs? (as in another environment it does work)

Based on a quick non-scientific test (time python -c "import pyarrow" vs time python -c "import pkg_resources; pkg_resources.get_distribution('pyarrow').version"), the pkg_resources way is 2x faster

@martindurant
Copy link
Member

Apparently importlib.metadata does this, but that only exists in py38. A backport exists, and is really fast

import importlib_metadata
importlib_metadata.distribution('pyarrow').version

Does this work with your dev env?

@jorisvandenbossche
Copy link
Contributor Author

Nope, that gives PackageNotFoundError: pyarrow.

Now, I could also add another layer of try/except, and first try the pkg_resources way, and only if that fails, still try an actual import. Then at least for the majority that has a normal pyarrow install and not a dev install, it will work faster?

@martindurant
Copy link
Member

Asking the anaconda crowd, someone must have a nice way. This would be a nice little function in utils, rather than all the logic in this import/don't import block.

@martindurant
Copy link
Member

I am only getting regex suggestions, so I would say:

Make a function in utils which, for a given package name, in order

  • checks sys.modules
  • uses importlib.metadata
  • tries importlib_metadata
  • tries pkg_resources
  • does the import

@jamesmyatt
Copy link
Contributor

jamesmyatt commented Sep 17, 2020

There's no reason you can't make importlib_metadata an optional dependency, as described in #5 of https://packaging.python.org/guides/single-sourcing-package-version/. Then you never need to try pkg_resources or the import.

@martindurant
Copy link
Member

Then you never need to try pkg_resources or the import.

If it's optional, it might not be there - so importing fsspec would fail unless it's a hard dependency.

@jamesmyatt
Copy link
Contributor

jamesmyatt commented Sep 17, 2020

Then you never need to try pkg_resources or the import.

If it's optional, it might not be there - so importing fsspec would fail unless it's a hard dependency.

Sorry, I meant "conditional" rather than "optional".

setup(
    ...
    install_requires=[
        ...
        'importlib-metadata ~= 1.0 ; python_version < "3.8"',
        ...
    ],
    ...
)

@martindurant
Copy link
Member

I am uncertain, as fsspec currently has no dependencies at all.
Can we try with the set of checks I outlines, which doesn't assume any dependencies? That it might in some rarer cases take a little longer is OK with me.

@jorisvandenbossche
Copy link
Contributor Author

Make a function in utils which, for a given package name, in order

Could that potentially be left for a separate PR?
It would be nice to avoid the pyarrow import for sure, but in the end, this PR is not adding any import of pyarrow. That's already the case on master as well.

@martindurant
Copy link
Member

OK fine, I'll try to get around to doing that myself.
Last question: _isfilestore -> False, what does this do, and should it not depend on whether we have inherited from (legacy) filesystem or not?

@jorisvandenbossche
Copy link
Contributor Author

Yeah, I don't really know if the is_filestore is needed. But, it's actually already included in fsspec's local filesystem as well:

https://github.com/intake/filesystem_spec/blob/fbc04cd38acd43a1e9768bd1c6312f5226286de2/fsspec/implementations/local.py#L150-L154

and we actually use it in one place in pyarrow (to create a new directory): https://github.com/apache/arrow/blob/7a532edeabc6f30838e5a53dfef35f37fdf99737/python/pyarrow/parquet.py#L1718-L1723
I suppose we could also do that without checking the attribute (eg in a try / except or so), but for backwards compatibility it seemed best to just add this here for the time being.

From checking the other methods, this _isfilestore seems the only thing that fsspec actually inherited from pyarrow (other methods were in practice overridden by fsspec)

@martindurant
Copy link
Member

OK, I'll take your word for it :)
Have you run dask parquet tests against these changes, which is the place they are most likely to show problems from our point of view?

@jorisvandenbossche
Copy link
Contributor Author

Good idea, I only ran the pyarrow parquet tests with it (but those have only limited tests with fsspec filesystems), will run the dask parquet tests as well

@martindurant
Copy link
Member

I intend to release soon, so please let me know how the dask tests go.

@jorisvandenbossche
Copy link
Contributor Author

Tests seem to pass

@martindurant
Copy link
Member

"Seem" !!

@martindurant martindurant merged commit f23346e into fsspec:master Sep 24, 2020
@jorisvandenbossche
Copy link
Contributor Author

"Seem" !!

To be honest, I would be more confident I was correctly testing it if there was a failure to fix .. ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants