Transfer the hffs code to hfh #1420

Merged
merged 19 commits into from
Apr 6, 2023
Changes from 12 commits
4 changes: 4 additions & 0 deletions docs/source/_toctree.yml
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@
title: Repository
- local: guides/search
title: Search
- local: guides/filesystem
title: Filesystem
- local: guides/inference
title: Inference
- local: guides/community
Expand Down Expand Up @@ -52,6 +54,8 @@
title: Mixins & serialization methods
- local: package_reference/inference_api
title: Inference API
- local: package_reference/hf_filesystem
title: Hugging Face Hub Filesystem
- local: package_reference/utilities
title: Utilities
- local: package_reference/community
Expand Down
105 changes: 105 additions & 0 deletions docs/source/guides/filesystem.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
# Interact with the Hub through the Filesystem API

In addition to the [`HfApi`], the `huggingface_hub` library provides [`HfFileSystem`], a pythonic [fsspec-compatible](https://filesystem-spec.readthedocs.io/en/latest/) file interface to the Hugging Face Hub. The [`HfFileSystem`] builds on top of the [`HfApi`] and offers typical filesystem-style operations like `cp`, `mv`, `ls`, `du`, `glob`, `get_file`, and `put_file`.

## Usage

```python
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()

>>> # List all files in a directory
>>> fs.ls("datasets/my-username/my-dataset-repo/data", detail=False)
['datasets/my-username/my-dataset-repo/data/train.csv', 'datasets/my-username/my-dataset-repo/data/test.csv']

>>> # List all ".csv" files in a repo
>>> fs.glob("datasets/my-username/my-dataset-repo/**.csv")
['datasets/my-username/my-dataset-repo/data/train.csv', 'datasets/my-username/my-dataset-repo/data/test.csv']

>>> # Read a remote file
>>> with fs.open("datasets/my-username/my-dataset-repo/data/train.csv", "r") as f:
...     train_data = f.readlines()

>>> # Read the contents of a remote file as a string
>>> train_data = fs.read_text("datasets/my-username/my-dataset-repo/data/train.csv", revision="dev")

>>> # Write a remote file
>>> with fs.open("datasets/my-username/my-dataset-repo/data/validation.csv", "w") as f:
...     f.write("text,label\n")
...     f.write("Fantastic movie!,good\n")
```

The optional `revision` argument can be passed to run an operation against a specific revision, such as a branch, a tag name, or a commit hash.

Unlike Python's built-in `open`, `fsspec`'s `open` defaults to binary mode, `"rb"`. To work in text mode, you must explicitly set the mode to `"r"` for reading and `"w"` for writing.
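
The difference between the default binary mode and an explicit text mode can be tried locally without touching the Hub. The following is a minimal sketch that assumes `fsspec`'s built-in in-memory filesystem behaves like [`HfFileSystem`] with respect to the `mode` argument; the path is made up for illustration.

```python
import fsspec

# Illustration only: an in-memory filesystem stands in for the Hub so that
# nothing is uploaded. The path below is hypothetical.
fs = fsspec.filesystem("memory")
path = "/hf-guide-demo/validation.csv"

# Text mode must be requested explicitly with "w".
with fs.open(path, "w") as f:
    f.write("text,label\n")
    f.write("Fantastic movie!,good\n")

# With no mode argument, fsspec opens the file as "rb" and returns bytes.
with fs.open(path) as f:
    raw = f.read()

# Passing "r" returns decoded text instead.
with fs.open(path, "r") as f:
    text = f.read()

print(type(raw).__name__, type(text).__name__)
```

Libraries that expect text, such as the `csv` module, need the `"r"` variant, while parsers for binary formats such as Parquet can use the default.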

## Integrations

The [`HfFileSystem`] can be used with any library that integrates `fsspec`, provided the URL follows the scheme:

```
hf://[<repo_type_prefix>]<repo_id>[@<revision>]/<path/in/repo>
```

The `repo_type_prefix` is `datasets/` for datasets and `spaces/` for spaces. Models don't need a prefix in the URL.
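
As a rough illustration of the scheme, the helper below assembles such a URL from its parts. `build_hf_url` and its parameter names are hypothetical and not part of `huggingface_hub`; the sketch also assumes the revision contains no `/`, which would likely need extra handling.

```python
from typing import Optional

# Prefixes as described in the guide; models take no prefix.
_REPO_TYPE_PREFIXES = {"model": "", "dataset": "datasets/", "space": "spaces/"}


def build_hf_url(
    repo_id: str,
    path_in_repo: str,
    repo_type: str = "model",
    revision: Optional[str] = None,
) -> str:
    """Hypothetical helper: assemble an hf:// URL following the scheme above."""
    if repo_type not in _REPO_TYPE_PREFIXES:
        raise ValueError(f"Unknown repo_type: {repo_type!r}")
    url = f"hf://{_REPO_TYPE_PREFIXES[repo_type]}{repo_id}"
    if revision is not None:
        # The revision is appended to the repo id after an "@".
        url += f"@{revision}"
    return f"{url}/{path_in_repo.lstrip('/')}"


dataset_url = build_hf_url("my-username/my-dataset-repo", "train.csv", repo_type="dataset")
model_url = build_hf_url("my-username/my-model-repo", "array-store", revision="dev")
print(dataset_url)
print(model_url)
```

The resulting strings can be passed to any `fsspec`-aware reader, such as the integrations listed in this guide.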

Some interesting integrations where [`HfFileSystem`] simplifies interacting with the Hub are listed below:

* Reading/writing a [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-writing-remote-files) DataFrame from/to a Hub repository:

```python
>>> import pandas as pd

>>> # Read a remote CSV file into a dataframe
>>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

>>> # Write a dataframe to a remote CSV file
>>> df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")
```

The same workflow can also be used for [Dask](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html) and [Polars](https://pola-rs.github.io/polars/py-polars/html/reference/io.html) DataFrames.

* Querying (remote) Hub files with [DuckDB](https://duckdb.org/docs/guides/python/filesystems):

```python
>>> from huggingface_hub import HfFileSystem
>>> import duckdb

>>> fs = HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> # Query a remote file and get the result back as a dataframe
>>> fs_query_file = "hf://datasets/my-username/my-dataset-repo/data_dir/data.parquet"
>>> df = duckdb.query(f"SELECT * FROM '{fs_query_file}' LIMIT 10").df()
```

* Using the Hub as an array store with [Zarr](https://zarr.readthedocs.io/en/stable/tutorial.html#io-with-fsspec):

```python
>>> import numpy as np
>>> import zarr

>>> embeddings = np.random.randn(50000, 1000).astype("float32")

>>> # Write an array to a repo
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="w") as root:
...     foo = root.create_group("embeddings")
...     foobar = foo.zeros('experiment_0', shape=(50000, 1000), chunks=(10000, 1000), dtype='f4')
...     foobar[:] = embeddings

>>> # Read an array from a repo
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="r") as root:
...     first_row = root["embeddings/experiment_0"][0]
```

## Authentication

In many cases, you must be logged in with a Hugging Face account to interact with the Hub. Refer to the [Login](../quick-start#login) section of the documentation to learn more about authentication methods on the Hub.

It is also possible to log in programmatically by passing your `token` as an argument to [`HfFileSystem`]:

```python
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem(token=token)
```

If you log in this way, be careful not to accidentally leak the token when sharing your source code!
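
One way to keep the token out of shared source code is to read it from an environment variable at runtime. This is a minimal sketch, not a prescribed method: the helper names `read_token` and `make_filesystem` are hypothetical, and the variable name `HF_TOKEN` is an assumption you can replace with whatever your setup uses.

```python
import os
from typing import Optional


def read_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Hypothetical helper: return the token from the environment, or None if unset or empty."""
    return os.environ.get(env_var) or None


def make_filesystem(env_var: str = "HF_TOKEN"):
    """Hypothetical helper: create an HfFileSystem without hard-coding the token."""
    # Imported lazily so that read_token can be used without huggingface_hub installed.
    from huggingface_hub import HfFileSystem

    return HfFileSystem(token=read_token(env_var))
```

With this approach the token never appears in the file itself, so the code can be committed or shared as-is.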
9 changes: 9 additions & 0 deletions docs/source/guides/overview.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,15 @@ Take a look at these guides to learn how to use huggingface_hub to solve real-wo
</p>
</a>

<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
href="./filesystem">
<div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
Filesystem
</div><p class="text-gray-700">
How to interact with the Hub through a convenient interface that mimics Python's file interface?
</p>
</a>

<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
href="./inference">
<div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
Expand Down
12 changes: 12 additions & 0 deletions docs/source/package_reference/hf_filesystem.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# Filesystem API

The `HfFileSystem` class provides a pythonic file interface to the Hugging Face Hub based on [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/).

## HfFileSystem

`HfFileSystem` is based on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), so it is compatible with most of the APIs that it offers. For more details, check out fsspec's [API Reference](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem).

[[autodoc]] HfFileSystem
- __init__
- resolve_path
- ls
1 change: 1 addition & 0 deletions setup.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ known_third_party =
faiss-cpu
fastprogress
fire
fsspec
fugashi
git
graphviz
Expand Down
1 change: 1 addition & 0 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ def get_version() -> str:

install_requires = [
"filelock",
"fsspec",
"requests",
"tqdm>=4.42.1",
"pyyaml>=5.1",
Expand Down
10 changes: 10 additions & 0 deletions src/huggingface_hub/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,11 @@
"upload_folder",
"whoami",
],
"hf_file_system": [
"HfFile",
"HfFileSystem",
"ResolvedPath",
],
"hub_mixin": [
"ModelHubMixin",
"PyTorchModelHubMixin",
Expand Down Expand Up @@ -421,6 +426,11 @@ def __dir__():
upload_folder, # noqa: F401
whoami, # noqa: F401
)
from .hf_file_system import (
HfFile, # noqa: F401
HfFileSystem, # noqa: F401
ResolvedPath, # noqa: F401
)
from .hub_mixin import (
ModelHubMixin, # noqa: F401
PyTorchModelHubMixin, # noqa: F401
Expand Down