Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Petl use fsspec #494

Merged
merged 6 commits into from
Jun 29, 2020
Merged

Petl use fsspec #494

merged 6 commits into from
Jun 29, 2020

Conversation

juarezr
Copy link
Member

@juarezr juarezr commented Jun 25, 2020

Changes

  • handle remote files using fsspec
  • removed sources and codecs created in v1.5.0 as all cases are handled by fsspec.
  • Keep SMBHandler class as fsspec does not handle it yet.

Checklist

  • Includes unit tests
  • New functions have docstrings with examples that can be run with doctest
  • New functions are included in API docs
  • Docstrings include notes for any changes to API or behaviour
  • Travis CI passes (unit tests run under Linux)
  • AppVeyor CI passes (unit tests run under Windows)
  • Unit test coverage has not decreased (see Coveralls)
  • All changes documented in docs/changes.rst

use fsspec for handling remote protocols
Add RemoteSource for handling all remote sources with fsspec
Remove S3Source replaced by fsspec
Keep SMBSource as it isn't handled by fsspec
Refactored namespaces to remotes
remove codec implementation in petl
@coveralls
Copy link

coveralls commented Jun 25, 2020

Coverage Status

Coverage decreased (-1.08%) to 90.594% when pulling a620194 on juarezr:petl_use_fsspec into ee769a7 on petl-developers:master.

@alimanfoo
Copy link
Collaborator

Hi @juarezr, looking good so far. Please feel free to request a review whenever you're ready.

@juarezr
Copy link
Member Author

juarezr commented Jun 25, 2020

Oh! Sorry! I forgot to request your review.

I think it's ready to review and merge after:

  1. The sources and codecs introduced in v1.5.0 were replaced by fsspec except SMBHandler that is missing.
  2. I'd tried to maintain compatibility by:
    1. Handling all protocols through fsspec except http and https
    2. Continuing to handle raw paths and http through old sources for avoiding breakages.
    3. Updated docs for deprecating the direct use of some sources.
    4. In a future break compatibility release the code could be simplified further.
  3. Kept the source register functions public because:
    1. Should work as workaround in some shortcomings
    2. One should decide use them instead of fsspec
    3. fsspec doesn't works for python2.

Can you review it, please?

Copy link
Collaborator

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @juarezr, a couple of minor comments.

One slightly bigger question, currently this does not support local files or http(s) URLs with zstd, lz4 or lzma (xz) compression, because these are not opened via fsspec, is that correct?

petl/io/remotes.py Outdated Show resolved Hide resolved
petl/io/remotes.py Outdated Show resolved Hide resolved
Ready to merge.

Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
@juarezr
Copy link
Member Author

juarezr commented Jun 26, 2020

Hi @juarezr, a couple of minor comments.

One slightly bigger question, currently this does not support local files or http(s) URLs with zstd, lz4 or lzma (xz) compression, because these are not opened via fsspec, is that correct?

If using /path/to/file.csv.gz it uses GzipSource handling for not breaking compatibility.

If using file:///path/to/file.csv.gz it uses fsspec handling that supports zstd, lz4 or xz.

@alimanfoo
Copy link
Collaborator

If using /path/to/file.csv.gz it uses GzipSource handling for not breaking compatibility.

If using file:///path/to/file.csv.gz it uses fsspec handling that supports zstd, lz4 or xz.

I see, thanks.

Would it be possible to use fsspec for all values except those with 'smb://' protocol? If so, then all compressors would be supported for local files without having to add the 'file://' protocol. I don't think there is any issue with compatibility, i.e., reading '/path/to/file.csv.gz' should still behave exactly the same as before, it just gets opened via fsspec.

@alimanfoo
Copy link
Collaborator

Btw I noticed that fsspec does not support Python 2, but petl can drop support for Python 2 also, it is EOL. I realise that dropping Python 2 support may take some work, however, and so might be too much to deal with in this PR.

@juarezr
Copy link
Member Author

juarezr commented Jun 26, 2020

Btw I noticed that fsspec does not support Python 2, but petl can drop support for Python 2 also, it is EOL

Also PIP is hammering another nail in the python2's coffin. It's planned for in pip 21.0 (Jan 2021). But I don't know how this impacts existing python2 packages.

Would it be possible to use fsspec for all values except those with 'smb://' protocol?

For doing this we need a little change: prefix raw paths with file:// and remove existing sources.

Doing this will also makes fsspec a hard requirement for petl.

I think that supporting petl standardalone is worth it. And existing sources enable petl working without dependencies.

Because of this decoupling, I kept in this PR:

  1. raw paths are handled by existing sources
  2. http:// and https:// continue to be handled by URLSource
    1. In this case fsspec doesn't add much more. They also don't support upload.
    2. However this handler doesn't benefits from additional codecs Zstandard, lz4, snappy and xz.
  3. kepping this behaviour reduces the changes of breakages also.

If so, then all compressors would be supported for local files without having to add the 'file://' protocol.

It's possible to ressurect the codec handlers removed in this PR (#494). But:

  1. It's required installing an additional package for each codec working. In this case also installing fsspec looks like an acceptable trade off.
  2. The excpetion is the xz compression that is included in python3. Maybe adding just a new XZSource in the future, is worth it.

I realise that dropping Python 2 support may take some work, however, and so might be too much to deal with in this PR.

I agree.

BTW, I started digging in a smb:// filesystem for fsspec. But the work for getting it working is bigger because i'ts required to implement all filesystem functionality: connect, open, ls, mkdir, rm. etc...

@alimanfoo
Copy link
Collaborator

Thanks @juarezr. I'm happy to follow what you think is the best approach here.

@juarezr
Copy link
Member Author

juarezr commented Jun 26, 2020

Thans @alimanfoo !
It has been pretty fun working in this codebase.
And pretty challenging trying to respect english language semantics... 😄

@alimanfoo
Copy link
Collaborator

It has been pretty fun working in this codebase.
And pretty challenging trying to respect english language semantics... 😄

Well I speak only one language, and even that badly, so I am in awe of anyone who can do technical communication in a second language 😄

@alimanfoo
Copy link
Collaborator

Ready to merge?

@juarezr
Copy link
Member Author

juarezr commented Jun 29, 2020

If you don't have any objections, I think it's ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants