Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster HfFileSystem.find #1809

Merged
merged 13 commits into from
Nov 10, 2023
Merged

Faster HfFileSystem.find #1809

merged 13 commits into from
Nov 10, 2023

Conversation

mariosasko
Copy link
Contributor

Supersedes #1443

Makes the HfFileSystem.find method faster by making it "one-shot" instead of calling ls on each subdirectory. It also makes HfFileSystem.info faster by replacing the default implementation that fetches the parent path's /tree with a call to the /paths-info endpoint.

To implement this, this PR adds the following methods to the HfApi:

  • list_repo_tree (to optimize HfFileSystem.find)
  • get_paths_info (to optimize HfFileSystem.get_paths_info)

Additional improvements:

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Nov 8, 2023

The documentation is not available anymore as the PR was closed or merged.

@mariosasko
Copy link
Contributor Author

A HfFileSystem.find speed comparison on datasets/bigcode/the-stack-dedup between this PR's branch and main:

faster-hffs-find:
  hffs.find("datasets/bigcode/the-stack-dedup", detail=False) = 1.63 s
  hffs.find("datasets/bigcode/the-stack-dedup", detail=True) = 24.2 s

main:
  hffs.find("datasets/bigcode/the-stack-dedup", detail=False) = 46.2 s
  hffs.find("datasets/bigcode/the-stack-dedup", detail=True) = 47.3 s

@mariosasko mariosasko marked this pull request as ready for review November 8, 2023 20:08
@mariosasko mariosasko requested review from Wauplin and lhoestq November 8, 2023 20:08
Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mariosasko for the PR and all the tests. I reviewed the logic and it looks good to me but I'd prefer a second pair of eyes on this (@lhoestq?). I only spotted 1 potential issue when detecting the common path of a list of dirs.

Otherwise, I like the list_repo_tree and get_paths_info that are more consistent with the server-side API. I was wondering if we could get rid of list_paths_info that is a mix of both, without any optimization. I saw it is used in datasets but list_repo_tree can be a drop-in replacement for it. In any case let's not do it in this PR that is already quite big.

src/huggingface_hub/hf_api.py Outdated Show resolved Hide resolved
@@ -226,18 +226,21 @@ def repo_type_and_id_from_hf_id(hf_id: str, hub_url: Optional[str] = None) -> Tu
return repo_type, namespace, repo_id


class LastCommitInfo(TypedDict, total=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it is not your fault but I'm not a fan of those TypedDict... 😕

I think we should open a separate PR to switch all of them (BlobLfsInfo, LastCommitInfo, BlobSecurityInfo, TransformersInfo, SafeTensorsInfo) to dataclasses, in a backward compatible way (and with a deprecation warning when dict-only method is used). This way we'll finally have a single type to represent data returned by the server (now that you've removed ReprMixin).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Maybe this can be a "good first issue"?

if (recursive and dirs_not_in_dircache) or (detail and dirs_not_expanded):
# If the dircache is incomplete, find the common path of the missing and non-expanded entries
# and extend the output with the result of `_ls_tree(common_path, recursive=True)`
common_path = os.path.commonprefix(dirs_not_in_dircache + dirs_not_expanded).rstrip("/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to Python docs, common_path is not guaranteed to be a valid path. For example,

>>> import os
>>> os.path.commonprefix(["data/train/", "data/test/"])
"data/t"

which can lead to an unexpected result in self._ls_tree(common_path, ...) just below.

I saw there is os.path.commonpath as well but I'm afraid it will not work on Windows. What we can do is to use PurePosixPath like this:

from pathlib import PurePosixPath
from typing import List


def unix_commonpath(paths: List[str]) -> str:
    """Return the longest common subpath of a list of Unix paths."""
    common_parents = set.intersection(
        *[set(parent.as_posix() for parent in PurePosixPath(path).parents) for path in paths]
    )
    return max(common_parents, key=lambda p: len(p))
>>> unix_commonpath(["data/train/", "data/test/"])
"data"

Let me know if you think of a simpler solution. If we go with this one, could you add a few test cases to ensure it works cross-platform?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also feel free to add "/" at the end of the common path, otherwise the subsequent .startswith(common_path) calls will match directories like "data_with_a_suffix"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a simpler fix to this. I added a test to ensure it works.

"name": path,
"size": 0,
"type": "directory",
"tree_id": None, # TODO: tree_id of the root directory?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Not sure there's one.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice ! 🔥 🔥 🔥

if (recursive and dirs_not_in_dircache) or (detail and dirs_not_expanded):
# If the dircache is incomplete, find the common path of the missing and non-expanded entries
# and extend the output with the result of `_ls_tree(common_path, recursive=True)`
common_path = os.path.commonprefix(dirs_not_in_dircache + dirs_not_expanded).rstrip("/")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also feel free to add "/" at the end of the common path, otherwise the subsequent .startswith(common_path) calls will match directories like "data_with_a_suffix"

src/huggingface_hub/hf_file_system.py Show resolved Hide resolved

def info(self, path: str, **kwargs) -> Dict[str, Any]:
resolved_path = self.resolve_path(path)
def info(self, path: str, refresh: bool = False, revision: Optional[str] = None, **kwargs) -> Dict[str, Any]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we want the info with detail=False ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec doesn't expose the detail parameter in .info: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L639.

IMO, not exposing this parameter makes sense, as the .info method should return all the existing info about a file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make calls to isdir pretty slow no ? We use it a lot in datasets

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally isdir calls info() with detail=False and info() doesn't need to call expand=True in this case, which is slow on the Hub side.

It's a bit out of scope of this PR though, we can see later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change to allow calling .info with detail=False (through the kwargs) to support efficient exists/isdir/isfile, but detail is not epoxsed as the method parameter to follow the spec.

Co-authored-by: Lucain <lucainp@gmail.com>
@mariosasko
Copy link
Contributor Author

@Wauplin

Otherwise, I like the list_repo_tree and get_paths_info that are more consistent with the server-side API. I was wondering if we could get rid of list_paths_info that is a mix of both, without any optimization. I saw it is used in datasets but list_repo_tree can be a drop-in replacement for it. In any case let's not do it in this PR that is already quite big.

list_paths_info is not in the API. Did you mean list_files_info?

@Wauplin
Copy link
Contributor

Wauplin commented Nov 9, 2023

list_paths_info is not in the API. Did you mean list_files_info?

Ah yes, I meant list_files_info

@mariosasko mariosasko requested review from lhoestq and Wauplin November 9, 2023 17:48
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

@lhoestq
Copy link
Member

lhoestq commented Nov 10, 2023

I also opened #1815 (continuation of this PR) to apply this optimization to glob and introduce the expand_info argument. Let me know what you think :)

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for making the changes!

@mariosasko mariosasko merged commit 7297957 into main Nov 10, 2023
12 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants