Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial support for filesystems in FileSet #374

Merged
merged 20 commits into from
Nov 10, 2020

Conversation

gerritholl
Copy link
Contributor

@gerritholl gerritholl commented Oct 6, 2020

This PR adds partial support for fsspec AbstractFileSystem implementation in FileSet, in particular for finding files. An example using sf3s:

import s3fs
from typhon.files.fileset import FileSet
abi_fileset = FileSet(
        path="noaa-goes16/ABI-L1b-RadF/{year}/{doy}/{hour}/OR_ABI-L1b-RadF-M6C02_G16_s{year}{doy}{hour}{minute}{second}*_e{end_year}{end_doy}{end_hour}{end_minute}{end_second}*_c*.nc",                                                                                                 
        name="abi",
        fs=s3fs.S3FileSystem(anon=True))
for f in abi_fileset.find("2019-11-18T05:30", "2019-11-18T06:15"):
    print(f)

Gives:

noaa-goes16/ABI-L1b-RadF/2019/322/05/OR_ABI-L1b-RadF-M6C02_G16_s20193220530285_e20193220539593_c20193220540030.nc
noaa-goes16/ABI-L1b-RadF/2019/322/05/OR_ABI-L1b-RadF-M6C02_G16_s20193220540285_e20193220549593_c20193220550041.nc
noaa-goes16/ABI-L1b-RadF/2019/322/05/OR_ABI-L1b-RadF-M6C02_G16_s20193220550285_e20193220559593_c20193220600041.nc
noaa-goes16/ABI-L1b-RadF/2019/322/06/OR_ABI-L1b-RadF-M6C02_G16_s20193220600285_e20193220609593_c20193220610030.nc
noaa-goes16/ABI-L1b-RadF/2019/322/06/OR_ABI-L1b-RadF-M6C02_G16_s20193220610285_e20193220619593_c20193220620040.nc

First steps to add filesystem support in fsspec.
It turns out that glob.glob adds a trailing / for directories, whereas
FileSystem.glob does not.  Add this trailing / manually for directories,
as further processing expects this.
Because Bill and his friends decided to use the forward slash for
parameters because they didn't have directories anyway before I was even
born, they're making life harder for us.
@gerritholl
Copy link
Contributor Author

I don't know why it still fails on Windows despite replacing / by os.sep.

Add unit tests to specifically test that other filesystems work with
fileset (not passing yet).
Make necessary adaptations to actually support other file systems.
Not all file systems have a root (s3fs, zipfile do not), and if they
don't, then the FileHandler insistance on absolute paths means it's not
possible for any file to exist on such a system.  The tests pass on my
system on Python 3.8.
@gerritholl
Copy link
Contributor Author

This still fails on Windows (I don't know why) and it doesn't work as intended with S3FileSystem (it seems globbing behaves differently there): fs_s3.glob("noaa-goes16/ABI-L1b-RadF/2020/045/*") returns nothing even though fs_s3.ls("noaa-goes16/ABI-L1b-RadF/2020/045/") does, and fs_s3.glob("noaa-goes16/ABI-L1b-RadF/2020/*") is very slow, suggesting there is something recursive going on.

@gerritholl
Copy link
Contributor Author

Finding files is extremely slow due to globbing in s3fs being extremely slow, probably due to the problem reported at fsspec/s3fs#378.

@gerritholl gerritholl changed the title Start work on fileset filesystem support. Partial support for filesystems in FileSet Oct 8, 2020
@gerritholl gerritholl marked this pull request as ready for review October 8, 2020 08:38
@gerritholl
Copy link
Contributor Author

gerritholl commented Oct 8, 2020

Note that at present the s3fs example only works with s3fs 0.5.0 and not with the latest released s3fs 0.5.1, due to the bug reported at fsspec/s3fs#378. Hopefully that problem will be fixed so that this searching works with s3fs 0.5.2 or 0.6.

I don't understand why this raises IndexError on Windows and I don't
have a Windows machine with Python setup, but maybe adding it to the
exception list makes this pass.  Not sure if it would.
Remove unused imports and convert a static method to a regular method to
account for the need of self.file_system.
@gerritholl
Copy link
Contributor Author

Can anyone with access to a Windows Python installation shed light on why the tests may be failing on Windows? Did the tests actually succeed on Windows prior to this PR?

Fix one more hardcoded / replacing or extending by os.sep.  I don't know
what happens if on Windows one interacts with remote filesystems such as
FTP or S3FS (/ or \?), so test for both.
@olemke
Copy link
Member

olemke commented Oct 9, 2020

Hi @gerritholl! I can test it on a Windows machine, but not before some time later next week since I don't have a setup ready.
However, the tests were working before this PR as you can see in the build log: https://github.com/atmtools/typhon/actions/runs/288892462

@gerritholl
Copy link
Contributor Author

Meanwhile fsspec/s3fs#379 was merged so the typhon functionality and the documented example should work with the latest s3fs master or the next release.

@olemke
Copy link
Member

olemke commented Oct 16, 2020

This seems to be caused by inconsistent handling of path separators. One place I found to be inconsistent is in

https://github.com/gerritholl/typhon/blob/af22c48f0721e2961388a9305cc1cde26a7cf6a1/typhon/files/fileset.py#L1328-L1329

self.file_system.glob returns a path with / separators while regex uses the \\ version. There might be other places where this causes a problem, but this is all I had time to look at now.

@gerritholl
Copy link
Contributor Author

Ah, that makes sense. If we search on a remote file system like this PR intends to support, then the assumption that the file system uses os.sep as a separator is no longer guaranteed to be true. I will try to think of a solution.

In a fileset, accept \\ to match / on Windows.  The generic nature of
AbstractFileSystem means that os.sep may not match the separator used on
the filesystem.  fsspec solves this by always using /, even on Windows,
but forcing this would break backward compatibility, therefore accept
both.
With filesystems support, filesestyms always use / even on Windows,
therefore use posixpath.join to construct test cases.
Replace separater in base testdir because fsspec always uses normal
separators, not windows ones.
@gerritholl
Copy link
Contributor Author

With all the commits to make it work on Windows it has become regrettably somewhat ugly, and perhaps breaking backward compatibility despite tests passing. Can Windows users please comment if it breaks their workflow or not?

@olemke
Copy link
Member

olemke commented Oct 21, 2020

I don't know if there are any Windows users of Typhon at the moment, so don't expect an answer to your question. We mainly added Windows as a supported platform because Typhon is a Python-only package and it seemed like the right thing to do to be good Python citizens.
Thanks for putting in the work to keep this cross-platform. From my side this is ready to be merged. I'll have a chat with Lukas if he's also fine with it when he's back next week. I don't think you'll need to add John to PRs in the future since he has moved on to other things.

Copy link
Member

@lkluft lkluft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general 👍 I only left some minor comments that you may address or ignore ;)

def _ensure_local_filesystem(self, file_info):
if not isinstance(file_info.file_system, LocalFileSystem):
raise NotImplementedError(
f"File handler {str(type(self).__name__):s} can only "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess type(...).__name__ is always a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, simplified.

Comment on lines 330 to +334
def __repr__(self):
return f"FileInfo(\n '{self.path}',\n" \
f" times={self.times},\n" \
f" attr={self.attr}\n)"
f" attr={self.attr},\n" \
f" fs={self.file_system})"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The __repr__ method is used to represent an object in containers (e.g. a list of FileInfo) so this implementation might be a bit wordy. But this is just a fine-tuning comment that you can happily ignore.

typhon/tests/files/test_fileset.py Outdated Show resolved Hide resolved
@gerritholl
Copy link
Contributor Author

I thought I had write access here, but apparently not (anymore?) as no "merge" button shows up ("Only those with write access to this repository can merge pull requests.).

@olemke
Copy link
Member

olemke commented Nov 10, 2020

Merges can only be done by maintainers which are currently Lukas and me.

@olemke olemke merged commit abf98bd into atmtools:master Nov 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: Add fsspec FileSystem support in FileSet
3 participants