Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC API docstring improvements #731

Merged
merged 9 commits into from
Mar 15, 2022
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
223 changes: 180 additions & 43 deletions src/huggingface_hub/file_download.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,20 +86,49 @@ def hf_hub_url(
repo_type: Optional[str] = None,
revision: Optional[str] = None,
) -> str:
"""
Resolve a model identifier, a file name, and an optional revision id, to a huggingface.co-hosted url, redirecting
to Cloudfront (a Content Delivery Network, or CDN) for large files (more than a few MBs).

Cloudfront is replicated over the globe so downloads are way faster for the end user (and it also lowers our
bandwidth costs).
"""Resolve a url of a file from the given information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe people don't know what resolve a URL means. WDYT Of something along these lines?

Suggested change
"""Resolve a url of a file from the given information.
"""Creates a URL of a file based in the given information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think since there's an example, it should be clear. I'm not sure about using create here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "construct the URL"?


Cloudfront aggressively caches files by default (default TTL is 24 hours), however this is not an issue here
because we implement a git-based versioning system on huggingface.co, which means that we store the files on S3/Cloudfront
in a content-addressable way (i.e., the file name is its hash). Using content-addressable filenames means cache
can't ever be stale.
The resolved address can either be a huggingface.co-hosted url, or a link
to Cloudfront (a Content Delivery Network, or CDN) for large files which
are more than a few MBs.

In terms of client-side caching from this library, we base our caching on the objects' ETag. An object's ETag is:
its git-sha1 if stored in git, or its sha256 if stored in git-lfs.
Args:
repo_id: A user or an organization name and a repo name seperated by a
adrinjalali marked this conversation as resolved.
Show resolved Hide resolved
``/``.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think we add empty lines between parameters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put them there in cases where there are many arguments. Without the lines it's not very readable to me, and they don't affect the rendered version anyway. Happy to remove them if you think they really should be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go for consistency now and update later, WDYT?

filename: The name of the file in the repo.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
filename: The name of the file in the repo.
filename (``str``):
The name of the file in the repo.


subfolder: An optional value corresponding to a folder inside the model
repo.
adrinjalali marked this conversation as resolved.
Show resolved Hide resolved

repo_type: Set to :obj:`"dataset"` or :obj:`"space"` if uploading
to a dataset or space, :obj:`None` or :obj:`"model"` if uploading
to a model. Default is :obj:`None`.

revision: An optional Git revision id which can be a branch name, a
tag, or a commit hash.

Example:
>>> from huggingface_hub import hf_hub_url
>>> hf_hub_url(
... repo_id="julien-c/EsperBERTo-small", filename="pytorch_model.bin"
... )
'https://huggingface.co/julien-c/EsperBERTo-small/resolve/main/pytorch_model.bin'
adrinjalali marked this conversation as resolved.
Show resolved Hide resolved

Notes:
Cloudfront is replicated over the globe so downloads are way faster for
the end user (and it also lowers our bandwidth costs).

Cloudfront aggressively caches files by default (default TTL is 24
hours), however this is not an issue here because we implement a
git-based versioning system on huggingface.co, which means that we
store the files on S3/Cloudfront in a content-addressable way (i.e.,
the file name is its hash). Using content-addressable filenames means
cache can't ever be stale.

In terms of client-side caching from this library, we base our caching
on the objects' ETag. An object's ETag is: its git-sha1 if stored in
adrinjalali marked this conversation as resolved.
Show resolved Hide resolved
git, or its sha256 if stored in git-lfs.
"""
if subfolder is not None:
filename = f"{subfolder}/{filename}"
Expand All @@ -118,11 +147,20 @@ def hf_hub_url(


def url_to_filename(url: str, etag: Optional[str] = None) -> str:
"""
Convert `url` into a hashed filename in a repeatable way. If `etag` is specified, append its hash to the url's,
delimited by a period. If the url ends with .h5 (Keras HDF5 weights) adds '.h5' to the name so that TF 2.0 can
identify it as a HDF5 file (see
"""Generate a local filename from a url.

Convert `url` into a hashed filename in a reproducible way. If `etag` is
specified, append its hash to the url's, delimited by a period. If the url
ends with .h5 (Keras HDF5 weights) adds '.h5' to the name so that TF 2.0
can identify it as a HDF5 file (see
https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1380)

Args:
url: The address to the file
adrinjalali marked this conversation as resolved.
Show resolved Hide resolved
adrinjalali marked this conversation as resolved.
Show resolved Hide resolved
etag: The ETag of the file
adrinjalali marked this conversation as resolved.
Show resolved Hide resolved

Returns:
The generated filename.
"""
url_bytes = url.encode("utf-8")
filename = sha256(url_bytes).hexdigest()
Expand Down Expand Up @@ -168,8 +206,16 @@ def http_user_agent(
library_version: Optional[str] = None,
user_agent: Union[Dict, str, None] = None,
) -> str:
"""
Formats a user-agent string with basic info about a request.
"""Formats a user-agent string with basic info about a request.

Args:
library_name: The name of the library to which the object corresponds.
library_version: The version of the library.
user_agent: The user agent info in the form of a dictionary or a single
string.

Returns:
The formated user-agent string.
"""
if library_name is not None:
ua = f"{library_name}/{library_version}"
Expand Down Expand Up @@ -216,12 +262,21 @@ def _request_with_retry(
Note that if the environment variable HF_HUB_OFFLINE is set to 1, then a OfflineModeIsEnabled error is raised.

Args:
method (str): HTTP method, such as 'GET' or 'HEAD'
url (str): The URL of the ressource to fetch
max_retries (int): Maximum number of retries, defaults to 0 (no retries)
base_wait_time (float): Duration (in seconds) to wait before retrying the first time. Wait time between
retries then grows exponentially, capped by max_wait_time.
max_wait_time (float): Maximum amount of time between two retries, in seconds
method: HTTP method, such as 'GET' or 'HEAD'

url: The URL of the ressource to fetch

max_retries: Maximum number of retries, defaults to 0 (no retries)

base_wait_time: Duration (in seconds) to wait before retrying the first
time. Wait time between retries then grows exponentially, capped by
``max_wait_time``.

max_wait_time: Maximum amount of time between two retries, in seconds

timeout: How many seconds to wait for the server to send data before
giving up which is passed to ``requests.request``.

**params: Params to pass to `requests.request`
"""
_raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
Expand Down Expand Up @@ -303,15 +358,58 @@ def cached_download(
use_auth_token: Union[bool, str, None] = None,
local_files_only=False,
) -> Optional[str]: # pragma: no cover
"""
Given a URL, look for the corresponding file in the local cache. If it's not there, download it. Then return the
path to the cached file.
"""Download a given file if it's not already present in the local cache.
adrinjalali marked this conversation as resolved.
Show resolved Hide resolved

Given a URL, this function looks for the corresponding file in the local
cache. If it's not there, download it. Then return the path to the cached
file.

Args:
url: The path to the file to be downloaded.

library_name: The name of the library to which the object corresponds.

library_version: The version of the library.

cache_dir: Path to the folder where cached files are stored.

user_agent: The user-agent info in the form of a dictionary or a
string.

force_download: Whether the file should be downloaded even if it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there is a default value other than None:

Suggested change
force_download: Whether the file should be downloaded even if it
force_download (``bool``, `optional`, defaults to ``False``):
Whether the file should be downloaded even if it

already exists in the local cache.

force_filename: Use this name instead of a generated file name.

proxies: Dictionary mapping protocol to the URL of the proxy passed to
``requests.request``.

etag_timeout: When fetching ETag, how many seconds to wait for the
server to send data before giving up which is passed to
``requests.request``.

resume_download: If ``True``, resume a previously interrupted download.

use_auth_token: A token to be used for the download.
- If ``True``, the token is read from the HuggingFace config
folder.
- If a string, it's used as the authentication token.

local_files_only: If ``True``, avoid downloading the file and return
the path to the local cached file if it exists.

Return:
Local path (string) of file or if networking is off, last version of file cached on disk.
Local path (string) of file or if networking is off, last version of
file cached on disk.

Raises:
In case of non-recoverable file (non-existent or inaccessible url + no cache on disk).
- ``EnvironmentError`` if ``use_auth_token=True`` and the token cannot
be found.

- ``OSError`` if ETag cannot be determined.

- ``ValueError`` if the file cannot be downloaded and cannot be found
locally.
"""
if cache_dir is None:
cache_dir = HUGGINGFACE_HUB_CACHE
Expand Down Expand Up @@ -503,28 +601,67 @@ def hf_hub_download(
use_auth_token: Union[bool, str, None] = None,
local_files_only=False,
):
"""
Resolve a model identifier, a file name, and an optional revision id, to a huggingface.co file distributed through
Cloudfront (a Content Delivery Network, or CDN) for large files (more than a few MBs).
"""Download a given file if it's not already present in the local cache.

Args:
repo_id: A user or an organization name and a repo name seperated by a
``/``.

filename: The name of the file in the repo.

The file is cached locally: look for the corresponding file in the local cache. If it's not there,
download it. Then return the path to the cached file.
subfolder: An optional value corresponding to a folder inside the model
repo.

Cloudfront is replicated over the globe so downloads are way faster for the end user.
repo_type: Set to :obj:`"dataset"` or :obj:`"space"` if uploading
to a dataset or space, :obj:`None` or :obj:`"model"` if uploading
to a model. Default is :obj:`None`.

Cloudfront aggressively caches files by default (default TTL is 24 hours), however this is not an issue here
because we implement a git-based versioning system on huggingface.co, which means that we store the files on S3/Cloudfront
in a content-addressable way (i.e., the file name is its hash). Using content-addressable filenames means cache
can't ever be stale.
revision: An optional Git revision id which can be a branch name, a
tag, or a commit hash.

In terms of client-side caching from this library, we base our caching on the objects' ETag. An object's ETag is:
its git-sha1 if stored in git, or its sha256 if stored in git-lfs.
library_name: The name of the library to which the object corresponds.

library_version: The version of the library.

cache_dir: Path to the folder where cached files are stored.

user_agent: The user-agent info in the form of a dictionary or a
string.

force_download: Whether the file should be downloaded even if it
already exists in the local cache.

force_filename: Use this name instead of a generated file name.

proxies: Dictionary mapping protocol to the URL of the proxy passed to
``requests.request``.

etag_timeout: When fetching ETag, how many seconds to wait for the
server to send data before giving up which is passed to
``requests.request``.

resume_download: If ``True``, resume a previously interrupted download.

use_auth_token: A token to be used for the download.
- If ``True``, the token is read from the HuggingFace config
folder.
- If a string, it's used as the authentication token.

local_files_only: If ``True``, avoid downloading the file and return
the path to the local cached file if it exists.

Return:
Local path (string) of file or if networking is off, last version of file cached on disk.
Local path (string) of file or if networking is off, last version of
file cached on disk.

Raises:
In case of non-recoverable file (non-existent or inaccessible url + no cache on disk).
- ``EnvironmentError`` if ``use_auth_token=True`` and the token cannot
be found.

- ``OSError`` if ETag cannot be determined.

- ``ValueError`` if the file cannot be downloaded and cannot be found
locally.
Comment on lines 692 to +699
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love that!

"""
url = hf_hub_url(
repo_id, filename, subfolder=subfolder, repo_type=repo_type, revision=revision
Expand Down
58 changes: 53 additions & 5 deletions src/huggingface_hub/snapshot_download.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,8 @@ def snapshot_download(
allow_regex: Optional[Union[List[str], str]] = None,
ignore_regex: Optional[Union[List[str], str]] = None,
) -> str:
"""
"""Download all files of a repo.

Downloads a whole snapshot of a repo's files at the specified revision.
This is useful when you want all files from a repo, because you don't know
which ones you will need a priori.
Expand All @@ -43,13 +44,60 @@ def snapshot_download(
An alternative would be to just clone a repo but this would require that
the user always has git and git-lfs installed, and properly configured.

Note: at some point maybe this format of storage should actually replace
the flat storage structure we've used so far (initially from allennlp
if I remember correctly).
Args:
repo_id: A user or an organization name and a repo name seperated by a
``/``.

revision: An optional Git revision id which can be a branch name, a
tag, or a commit hash.

cache_dir: Path to the folder where cached files are stored.

library_name: The name of the library to which the object corresponds.

library_version: The version of the library.

user_agent: The user-agent info in the form of a dictionary or a
string.

proxies: Dictionary mapping protocol to the URL of the proxy passed to
``requests.request``.

etag_timeout: When fetching ETag, how many seconds to wait for the
server to send data before giving up which is passed to
``requests.request``.

resume_download: If ``True``, resume a previously interrupted download.

use_auth_token: A token to be used for the download.
- If ``True``, the token is read from the HuggingFace config
folder.
- If a string, it's used as the authentication token.

local_files_only: If ``True``, avoid downloading the file and return
the path to the local cached file if it exists.

allow_regex: If provided, only files matching this regex are downladed.

ignore_regex: If provided, files matching this regex are not
downloaded.

Return:
Local folder path (string) of repo snapshot

Raises:
- ``EnvironmentError`` if ``use_auth_token=True`` and the token cannot
be found.

- ``OSError`` if ETag cannot be determined.

- ``ValueError`` if the file cannot be downloaded and cannot be found
locally.
"""
# Note: at some point maybe this format of storage should actually replace
# the flat storage structure we've used so far (initially from allennlp
# if I remember correctly).

Comment on lines +95 to +98
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better suited here, indeed!

if cache_dir is None:
cache_dir = HUGGINGFACE_HUB_CACHE
if revision is None:
Expand All @@ -68,7 +116,7 @@ def snapshot_download(
else:
token = None

# remove all `/` occurances to correctly convert repo to directory name
# remove all `/` occurrences to correctly convert repo to directory name
repo_id_flattened = repo_id.replace("/", REPO_ID_SEPARATOR)

# if we have no internet connection we will look for the
Expand Down