HuggingFace --> Hugging Face #618

Merged · 1 commit · Mar 29, 2024
4 changes: 2 additions & 2 deletions README.md
@@ -26,7 +26,7 @@ The library provides:
- Support for checkpoints in various formats, including checkpoints in HF format
- Training recipes for popular fine-tuning techniques with reference benchmarks and comprehensive correctness checks
- Evaluation of trained models with EleutherAI Eval Harness
- Integration with HuggingFace Datasets for training
- Integration with Hugging Face Datasets for training
- Support for distributed training using FSDP from PyTorch Distributed
- YAML configs for easily configuring training runs
- [Upcoming] Support for lower precision dtypes and quantization techniques from [TorchAO](https://github.com/pytorch-labs/ao)
@@ -182,7 +182,7 @@ TorchTune embodies PyTorch’s design philosophy [[details](https://pytorch.org/

#### Native PyTorch

TorchTune is a native-PyTorch library. While we provide integrations with the surrounding ecosystem (e.g., HuggingFace Datasets, EleutherAI Eval Harness), all of the core functionality is written in PyTorch.
TorchTune is a native-PyTorch library. While we provide integrations with the surrounding ecosystem (e.g., Hugging Face Datasets, EleutherAI Eval Harness), all of the core functionality is written in PyTorch.

#### Simplicity and Extensibility

6 changes: 3 additions & 3 deletions docs/source/examples/first_finetune_tutorial.rst
@@ -25,13 +25,13 @@ job using TorchTune.
Downloading a model
-------------------
First, you need to download a model. TorchTune supports an integration
with the `HuggingFace Hub <https://huggingface.co/docs/hub/en/index>`_ - a collection of the latest and greatest model weights.
with the `Hugging Face Hub <https://huggingface.co/docs/hub/en/index>`_ - a collection of the latest and greatest model weights.

For this tutorial, you're going to use the `Llama2 model from Meta <https://llama.meta.com/>`_. Llama2 is a "gated model",
meaning that you need to be granted access in order to download the weights. Follow `these instructions <https://huggingface.co/meta-llama>`_ on the official Meta page
hosted on HuggingFace to complete this process. (This should take less than 5 minutes.)
hosted on Hugging Face to complete this process. (This should take less than 5 minutes.)

Once you have authorization, you will need to authenticate with HuggingFace Hub. The easiest way to do so is to provide an
Once you have authorization, you will need to authenticate with Hugging Face Hub. The easiest way to do so is to provide an
access token to the download script. You can find your token `here <https://huggingface.co/settings/tokens>`_.

Then, it's as simple as:
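The command itself falls outside this diff hunk, but the CLI help text later in this same PR (see torchtune/_cli/download.py below) suggests it has this shape:

$ tune download meta-llama/Llama-2-7b-hf --hf-token <TOKEN> --output-dir /tmp/model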
4 changes: 2 additions & 2 deletions docs/source/overview.rst
@@ -22,7 +22,7 @@ TorchTune provides:
- Modular native-PyTorch implementations of popular LLMs
- Interoperability with popular model zoos through checkpoint-conversion utilities
- Training recipes for a variety of fine-tuning techniques
- Integration with `HuggingFace Datasets <https://huggingface.co/docs/datasets/en/index>`_ for training and `EleutherAI's Eval Harness <https://github.com/EleutherAI/lm-evaluation-harness>`_ for evaluation
- Integration with `Hugging Face Datasets <https://huggingface.co/docs/datasets/en/index>`_ for training and `EleutherAI's Eval Harness <https://github.com/EleutherAI/lm-evaluation-harness>`_ for evaluation
- Support for distributed training using `FSDP <https://pytorch.org/docs/stable/fsdp.html>`_
- YAML configs for easily configuring training runs

@@ -55,7 +55,7 @@ TorchTune embodies `PyTorch’s design philosophy <https://pytorch.org/docs/stab

**Native PyTorch**

TorchTune is a native-PyTorch library. While we provide integrations with the surrounding ecosystem (e.g., HuggingFace Datasets, EleutherAI Eval Harness), all of the core functionality is written in PyTorch.
TorchTune is a native-PyTorch library. While we provide integrations with the surrounding ecosystem (e.g., Hugging Face Datasets, EleutherAI Eval Harness), all of the core functionality is written in PyTorch.


**Simplicity and Extensibility**
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
# HuggingFace Integration Reqs
# Hugging Face Integration Reqs
datasets
huggingface_hub

2 changes: 1 addition & 1 deletion tests/torchtune/_cli/test_download.py
@@ -45,7 +45,7 @@ def test_download_calls_snapshot(self, capsys, monkeypatch, snapshot_download):
with pytest.raises(SystemExit, match="2"):
runpy.run_path(TUNE_PATH, run_name="__main__")
err = capsys.readouterr().err
assert "not found on the HuggingFace Hub" in err
assert "not found on the Hugging Face Hub" in err

# Call the third time and get the expected output
runpy.run_path(TUNE_PATH, run_name="__main__")
2 changes: 1 addition & 1 deletion tests/torchtune/data/test_templates.py
@@ -14,7 +14,7 @@
SummarizeTemplate,
)

# Taken from Open-Orca/SlimOrca-Dedup on HuggingFace:
# Taken from Open-Orca/SlimOrca-Dedup on Hugging Face:
# https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
CHAT_SAMPLE = {
"system": "You are an AI assistant. User will you give you a task. Your goal is to complete the task as faithfully as you can. While performing the task think step-by-step and justify your steps.", # noqa: B950
20 changes: 10 additions & 10 deletions torchtune/_cli/download.py
@@ -24,28 +24,28 @@ def __init__(self, subparsers: argparse._SubParsersAction):
"download",
prog="tune download",
usage="tune download <repo-id> [OPTIONS]",
help="Download a model from the HuggingFace Hub.",
description="Download a model from the HuggingFace Hub.",
help="Download a model from the Hugging Face Hub.",
description="Download a model from the Hugging Face Hub.",
epilog=textwrap.dedent(
"""\
examples:
# Download a model from the HuggingFace Hub with a Hugging Face API token
# Download a model from the Hugging Face Hub with a Hugging Face API token
$ tune download meta-llama/Llama-2-7b-hf --hf-token <TOKEN> --output-dir /tmp/model
Successfully downloaded model repo and wrote to the following locations:
./model/config.json
./model/README.md
./model/consolidated.00.pth
...

# Download an ungated model from the HuggingFace Hub
# Download an ungated model from the Hugging Face Hub
$ tune download mistralai/Mistral-7B-Instruct-v0.2
Successfully downloaded model repo and wrote to the following locations:
./model/config.json
./model/README.md
./model/model-00001-of-00002.bin
...

For a list of all models, visit the HuggingFace Hub https://huggingface.co/models.
For a list of all models, visit the Hugging Face Hub https://huggingface.co/models.
"""
),
formatter_class=argparse.RawTextHelpFormatter,
@@ -58,7 +58,7 @@ def _add_arguments(self) -> None:
self._parser.add_argument(
"repo_id",
type=str,
help="Name of the repository on HuggingFace Hub.",
help="Name of the repository on Hugging Face Hub.",
)
self._parser.add_argument(
"--output-dir",
@@ -72,7 +72,7 @@ def _add_arguments(self) -> None:
type=str,
required=False,
default=os.getenv("HF_TOKEN", None),
help="HuggingFace API token. Needed for gated models like Llama2.",
help="Hugging Face API token. Needed for gated models like Llama2.",
)
self._parser.add_argument(
"--ignore-patterns",
@@ -84,7 +84,7 @@ def _add_arguments(self) -> None:
)

def _download_cmd(self, args: argparse.Namespace) -> None:
"""Downloads a model from the HuggingFace Hub."""
"""Downloads a model from the Hugging Face Hub."""
# Download the tokenizer and PyTorch model files
try:
true_output_dir = snapshot_download(
@@ -96,13 +96,13 @@ def _download_cmd(self, args: argparse.Namespace) -> None:
except GatedRepoError:
self._parser.error(
"It looks like you are trying to access a gated repository. Please ensure you "
"have access to the repository and have provided the proper HuggingFace API token "
"have access to the repository and have provided the proper Hugging Face API token "
"using the option `--hf-token` or by running `huggingface-cli login`."
"You can find your token by visiting https://huggingface.co/settings/tokens"
)
except RepositoryNotFoundError:
self._parser.error(
f"Repository '{args.repo_id}' not found on the HuggingFace Hub."
f"Repository '{args.repo_id}' not found on the Hugging Face Hub."
)
except Exception as e:
self._parser.error(e)
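For readers skimming this diff, here is a minimal sketch of the flow `_download_cmd` implements, calling `huggingface_hub` directly. The `local_dir` argument and the literal values are assumptions for illustration, not torchtune's exact code:

```python
# Sketch only: mirrors the try/except structure shown in the diff above.
from huggingface_hub import snapshot_download
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

try:
    true_output_dir = snapshot_download(
        "meta-llama/Llama-2-7b-hf",  # repo_id, as passed to `tune download`
        local_dir="/tmp/model",      # assumed mapping for --output-dir
        token="<TOKEN>",             # --hf-token; required for gated repos
    )
    print(f"Successfully downloaded model repo to {true_output_dir}")
except GatedRepoError:
    print("Gated repository: request access and provide a valid Hugging Face API token.")
except RepositoryNotFoundError:
    print("Repository not found on the Hugging Face Hub.")
```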
2 changes: 1 addition & 1 deletion torchtune/data/_templates.py
@@ -240,7 +240,7 @@ class ChatMLTemplate(PromptTemplate):
"""
OpenAI's Chat Markup Language used by their chat models:
https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md
It is the default template used by HuggingFace models.
It is the default template used by Hugging Face models.

Example:
<|im_start|>system
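The docstring's example is truncated in this view; as a rough sketch, ChatML wraps each message in `<|im_start|>` / `<|im_end|>` markers. The following illustrates the format only and is not torchtune's template code; the message-dict structure is an assumption:

```python
# Illustrative only: formats role-tagged messages in ChatML style.
def format_chatml(messages: list) -> str:
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

print(format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]))
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
```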
2 changes: 1 addition & 1 deletion torchtune/datasets/_alpaca.py
@@ -15,7 +15,7 @@ def alpaca_dataset(
use_clean: bool = False,
) -> InstructDataset:
"""
Support for the Alpaca dataset and its variants from HuggingFace Datasets.
Support for the Alpaca dataset and its variants from Hugging Face Datasets.
https://huggingface.co/datasets/tatsu-lab/alpaca

Data input format: https://huggingface.co/datasets/tatsu-lab/alpaca#data-instances
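Based on the signature above, usage presumably looks like the sketch below; the import path and the tokenizer are assumptions not confirmed by this diff:

```python
from torchtune.datasets import alpaca_dataset  # import path is an assumption

# `tokenizer` stands in for any torchtune Tokenizer exposing encode/decode.
ds = alpaca_dataset(tokenizer=tokenizer, use_clean=True)  # use the cleaned variant
print(len(ds))  # number of examples
print(ds[0])    # first tokenized sample
```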
4 changes: 2 additions & 2 deletions torchtune/datasets/_chat.py
@@ -45,7 +45,7 @@ class ChatDataset(Dataset):

Args:
tokenizer (Tokenizer): Tokenizer used to encode data. The tokenizer must implement `encode` and `decode` methods.
source (str): path string of dataset, anything supported by HuggingFace's `load_dataset`
source (str): path string of dataset, anything supported by Hugging Face's `load_dataset`
(https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path)
convert_to_dialogue (Callable[[Mapping[str, Any]], Dialogue]): function that keys into the desired field in the sample
and converts to a list of `Messages` that follows the llama format with the expected keys
@@ -151,7 +151,7 @@ def chat_dataset(

Args:
tokenizer (Tokenizer): Tokenizer used to encode data. The tokenizer must implement `encode` and `decode` methods.
source (str): path string of dataset, anything supported by HuggingFace's `load_dataset`
source (str): path string of dataset, anything supported by Hugging Face's `load_dataset`
(https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path)
conversation_format (str): string specifying expected format of conversations in the dataset
for automatic conversion to the llama format. Supported formats are: "sharegpt"
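A hypothetical builder call matching the documented parameters; the import path is an assumption, the source reuses the Open-Orca/SlimOrca-Dedup dataset referenced elsewhere in this PR, and any additional required arguments are unknown from this diff alone:

```python
from torchtune.datasets import chat_dataset  # import path is an assumption

ds = chat_dataset(
    tokenizer=tokenizer,                # any Tokenizer with encode/decode
    source="Open-Orca/SlimOrca-Dedup",  # anything load_dataset accepts
    conversation_format="sharegpt",     # the only format listed as supported
)
```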
2 changes: 1 addition & 1 deletion torchtune/datasets/_grammar.py
@@ -14,7 +14,7 @@ def grammar_dataset(
train_on_input: bool = False,
) -> InstructDataset:
"""
Support for the Grammar dataset and its variants from HuggingFace Datasets.
Support for the Grammar dataset and its variants from Hugging Face Datasets.
https://huggingface.co/datasets/liweili/c4_200m

Data input format: https://huggingface.co/datasets/liweili/c4_200m#description
Expand Down
4 changes: 2 additions & 2 deletions torchtune/datasets/_instruct.py
@@ -34,7 +34,7 @@ class InstructDataset(Dataset):

Args:
tokenizer (Tokenizer): Tokenizer used to encode data. The tokenizer must implement `encode` and `decode` methods.
source (str): path string of dataset, anything supported by HuggingFace's `load_dataset`
source (str): path string of dataset, anything supported by Hugging Face's `load_dataset`
(https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path)
template (PromptTemplate): template used to format the prompt. If the placeholder variable
names in the template do not match the column/key names in the dataset, use `column_map` to map them.
@@ -103,7 +103,7 @@ def instruct_dataset(

Args:
tokenizer (Tokenizer): Tokenizer used to encode data. The tokenizer must implement `encode` and `decode` methods.
source (str): path string of dataset, anything supported by HuggingFace's `load_dataset`
source (str): path string of dataset, anything supported by Hugging Face's `load_dataset`
(https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path)
template (str): class name of template used to format the prompt. If the placeholder variable
names in the template do not match the column/key names in the dataset, use `column_map` to map them.
2 changes: 1 addition & 1 deletion torchtune/datasets/_samsum.py
@@ -14,7 +14,7 @@ def samsum_dataset(
train_on_input: bool = False,
) -> InstructDataset:
"""
Support for the Summarize dataset and its variants from HuggingFace Datasets.
Support for the Summarize dataset and its variants from Hugging Face Datasets.
https://huggingface.co/datasets/samsum

Data input format: https://huggingface.co/datasets/samsum#data-fields
2 changes: 1 addition & 1 deletion torchtune/modules/lr_schedulers.py
@@ -22,7 +22,7 @@ def get_cosine_schedule_with_warmup(
0.0 to lr over num_warmup_steps, then decreases to 0.0 on a cosine schedule over
the remaining num_training_steps-num_warmup_steps (assuming num_cycles = 0.5).

This is based on the HuggingFace implementation
This is based on the Hugging Face implementation
https://github.com/huggingface/transformers/blob/v4.23.1/src/transformers/optimization.py#L104.

Args:
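The schedule described in the docstring can be sketched as a multiplier function for `torch.optim.lr_scheduler.LambdaLR`; this mirrors the referenced Hugging Face formula rather than torchtune's exact implementation:

```python
import math

def lr_lambda(step: int, num_warmup_steps: int, num_training_steps: int,
              num_cycles: float = 0.5) -> float:
    # Linear warmup: the multiplier rises from 0.0 to 1.0 over num_warmup_steps.
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    # Cosine decay: with num_cycles = 0.5 (half a period), the multiplier
    # falls from 1.0 to 0.0 over the remaining training steps.
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))
```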