S3FileSystem.find appears not to respect max_depth #378

Closed
gerritholl opened this issue Oct 7, 2020 · 7 comments · Fixed by #379

Comments

@gerritholl
Contributor

What happened:

When I call find(d, maxdepth=1) on a directory with 24 subdirectories, it returns not only the 24 subdirectories but also all the files contained in those directories. That is, the output is:

24
2304

What you expected to happen:

Due to the maxdepth=1 parameter, I expected it to return only the 24 subdirectories. That is, I expect the output to be:

24
24

Minimal Complete Verifiable Example:

import s3fs
fs = s3fs.S3FileSystem(anon=True)
d = "noaa-goes16/ABI-L1b-RadF/2020/045/"
print(len(fs.ls(d)))
print(len(fs.find(d, maxdepth=1)))

Anything else we need to know?:

This makes globbing in higher level directories so slow as to be unusable. For example, a glob of "noaa-goes16/ABI-L1b-RadF/*" will call .find("noaa-goes16/ABI-L1b-RadF", maxdepth=1) which (ignoring the maxdepth parameter) will list the entire contents of the subdirectory, millions of files, and will take a very long time to complete.
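To illustrate the two options a maxdepth-aware find has: either prune during recursion, or list everything and post-filter by path depth. A minimal sketch of the post-filtering idea follows (the paths and the helper name are hypothetical, not s3fs code):

```python
# Hypothetical sketch: post-filtering a flat listing by depth relative to
# the base path, one way a maxdepth-aware find could prune a bulk listing.
def filter_by_depth(base, paths, maxdepth):
    base_depth = base.rstrip("/").count("/")
    return [p for p in paths
            if p.rstrip("/").count("/") - base_depth <= maxdepth]

listing = [
    "noaa-goes16/ABI-L1b-RadF/2020",
    "noaa-goes16/ABI-L1b-RadF/2020/045/file.nc",
]
print(filter_by_depth("noaa-goes16/ABI-L1b-RadF", listing, 1))
# → ['noaa-goes16/ABI-L1b-RadF/2020']
```

Note that post-filtering still pays for the full listing, which is exactly the slowness described above; only pruning during recursion avoids fetching the millions of deep keys in the first place.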

Environment:

  • s3fs version: s3fs-0.5.1+24.gf94478e (latest master)
  • Python version: 3.8.5
  • Operating System: OpenSUSE 15.0
  • Install method (conda, pip, source): pip install from git master
@martindurant
Member

Sorry about this, a case of choosing one optimisation over another.

#360 changed find to not be recursive, so that a nested directory tree can be listed in a single call, because find is usually used to get everything. The previous version was usually slower, except in cases like the one you have here, or for a small value of maxdepth.

We could resurrect the old code and use it when maxdepth is small. What do you think?
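The recursive approach being discussed could be sketched roughly as follows, using a nested dict as a stand-in for per-directory listings (the function name and structure are illustrative only, not s3fs's actual code):

```python
# Hypothetical sketch of a depth-limited recursive find: dicts represent
# directories, None represents a file, and recursion stops at maxdepth.
def find_limited(tree, prefix, maxdepth=None, depth=1):
    """List entries under prefix, recursing at most maxdepth levels."""
    out = []
    for name, child in tree.items():
        path = f"{prefix}/{name}"
        out.append(path)
        if isinstance(child, dict) and (maxdepth is None or depth < maxdepth):
            out.extend(find_limited(child, path, maxdepth, depth + 1))
    return out

tree = {"sub1": {"file.nc": None}, "sub2": {}}
print(find_limited(tree, "bucket/dir", maxdepth=1))
# → ['bucket/dir/sub1', 'bucket/dir/sub2']
```

With maxdepth=1 only the immediate children come back, matching what the issue expects; with no maxdepth it degenerates to a full recursive walk, which is the case the flat single-call listing from #360 optimises.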

@gerritholl
Contributor Author

gerritholl commented Oct 7, 2020

I haven't followed the history, so it's difficult for me to comment on the optimisation of find (and I don't understand the linked PR; from its diff it looks like the method didn't exist at all before). But the current situation, in which a maxdepth argument exists and is silently ignored, seems undesirable. I'm not personally using find directly, but rather fs.glob("noaa-goes16/ABI-L1B-RadF/*") (or a different pattern, but with a single * only). So if find becomes too complicated when maintaining two approaches for small and large values of maxdepth, it would work for me if glob with a single * got its listing elsewhere (such as from ls?).
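The ls-based idea suggested above could look roughly like this (a hypothetical helper over a fake listing; not the actual fsspec glob implementation):

```python
import fnmatch

# Hypothetical sketch: serve a single-level glob from an ls()-style listing
# plus fnmatch, instead of a deep find. `entries` stands in for fs.ls(...).
def shallow_glob(entries, pattern):
    return [p for p in entries if fnmatch.fnmatch(p, pattern)]

entries = [
    "noaa-goes16/ABI-L1b-RadF/2020",
    "noaa-goes16/ABI-L1b-RadF/2019",
]
print(shallow_glob(entries, "noaa-goes16/ABI-L1b-RadF/2*"))
# → ['noaa-goes16/ABI-L1b-RadF/2020', 'noaa-goes16/ABI-L1b-RadF/2019']
```

Since ls only ever returns one level, this sidesteps the deep listing entirely for single-* patterns, at the cost of a special case alongside the general find-based path.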

@gerritholl
Contributor Author

Oops, I just realised that the glob method implementation is not in s3fs at all, but inherited from fsspec's AbstractFileSystem. So it would still need to be resolved here (changing the implementation of find then seems more sensible than overriding glob).

@gerritholl
Contributor Author

I can confirm that in v0.5.0 the MCVE (adding withdirs=True) works as expected, but in v0.5.1 there is a problem. Slightly longer example:

import s3fs
fs = s3fs.S3FileSystem(anon=True)
d = "noaa-goes16/ABI-L1b-RadF/2020/045/"
print(len(fs.ls(d)))
print(len(fs.find(d, maxdepth=1, withdirs=True)))
print(len(fs.glob(d + "*")))

With 0.5.0:

24
24
24

With 0.5.1:

24
2304
0

I.e. it's not just trading one optimisation for another; I think we can call this broken.

@martindurant
Member

I would appreciate it if you could turn this into a (failing) test function with expectations in a PR, so that a fix is easier to make.

@martindurant
Member

Note that with current master, the last result is also 24, so something got fixed along the line.

gerritholl added a commit to gerritholl/s3fs that referenced this issue Oct 8, 2020
Added a test to verify that the find method respects the maxdepth
parameter.  With ``maxdepth=1``, the results of ``find`` should be
the same as those of ``ls``, without returning subdirectories.  See also
issue 378.
@gerritholl
Contributor Author

See #379 for a unit test exposing the problem. I need to focus on something else right now.
