Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow pruning dirnames in a topdown walk to improve speed on large filesystems #1240

Closed
wants to merge 0 commits into from

Conversation

epizut
Copy link
Contributor

@epizut epizut commented Apr 6, 2023

Solving #1223

Allowing modifying dirnames in order to improve browsing large filesystems:

os.walk

When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment),
and walk() will only recurse into the subdirectories whose names remain in dirnames;
this can be used to prune the search, impose a specific order of visiting, or even to inform walk()
about directories the caller creates or renames before it resumes walk() again.

Related issue: Add topdown kwarg to AbstractFileSystem.walk()

@martindurant
Copy link
Member

Sorry, this one seems to have passed me by. Please ping me again, and I'll look early next week.

@epizut
Copy link
Contributor Author

epizut commented May 9, 2023 via email

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ianthomas23 , I wouldn't mind your eyes on this. I think it looks good, but I have not had time to think about any unintended side effects.

fsspec/spec.py Outdated
Comment on lines 378 to 381
When topdown is True, the caller can modify the dirs list in-place,
and walk() will only recurse into the subdirectories whose names remain in dirs.
Modifying dirs when topdown is False has no effect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When topdown is True, the caller can modify the dirs list in-place,
and walk() will only recurse into the subdirectories whose names remain in dirs.
Modifying dirs when topdown is False has no effect.

Text repeats below

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be inclined to remove this text block as the block below is similar and contains more information.

@martindurant
Copy link
Member

(please fix the linting failure, for which you might choose to run locally:

pre-commit install
pre-commit run -a

)

@@ -302,18 +302,30 @@ def test_find(self, scenario: ArchiveTestScenario):
assert fs.find("deeply/") == fs.find("deeply")

@pytest.mark.parametrize("topdown", [True, False])
def test_walk(self, scenario: ArchiveTestScenario, topdown):
@pytest.mark.parametrize("purne_nested", [True, False])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that purne should be prune?

fsspec/spec.py Outdated
and walk() will only recurse into the subdirectories whose names remain in dirs.
Modifying dirs when topdown is False has no effect.

When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI linting failures are all about these lines being too long. @epizut If you keep each line to 88 characters then the CI will pass.

list(m.walk(dir0, maxdepth=0, topdown=False))
list(m.walk(dir1, maxdepth=0, topdown=False))

# purne dir111
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

purne should be prune again.

Copy link
Collaborator

@ianthomas23 ianthomas23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally this looks good to me, and I don't think there are any side-effects. Just a few nitpick comments to address from my point of view.

@efiop
Copy link
Member

efiop commented May 10, 2023

For the record: we were doing a similar thing in our localfs https://github.com/iterative/dvc-objects/blob/8dfef0b07f91c8c91dacbbfe5130d34fbaca68cb/src/dvc_objects/fs/local.py#L73 but didn't get to contributing it back yet. I'm really happy to see this PR not only tackle regular implementation but also async. Thank you @epizut 🙏

@martindurant
Copy link
Member

@epizut , I made fixes, and this closed your PR, not sure why.

@martindurant
Copy link
Member

#1263 replaces this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants