update doc #1019

Merged
merged 1 commit into from
Oct 2, 2024
18 changes: 15 additions & 3 deletions OPTIONS.md
@@ -62,15 +62,18 @@ The script `configure.py` in the project root can be used via `python configure.
Provided by Hugging Face, the optimum-quanto library has robust support across all supported platforms.

- `int8-quanto` is the most broadly compatible and probably produces the best results
  - fastest training for RTX4090 and probably other GPUs
  - uses hardware-accelerated matmul on CUDA devices for int8, int4
  - int4 is still abysmally slow
  - works with `TRAINING_DYNAMO_BACKEND=inductor` (`torch.compile()`)
- `fp8uz-quanto` is an experimental fp8 variant for CUDA and ROCm devices.
  - better-supported on AMD silicon such as Instinct or newer architecture
  - can be slightly faster than `int8-quanto` on a 4090 for training, but not inference (1 second slower)
  - works with `TRAINING_DYNAMO_BACKEND=inductor` (`torch.compile()`)
- `fp8-quanto` will not (currently) use fp8 matmul, does not work on Apple systems.
  - does not have hardware fp8 matmul yet on CUDA or ROCm devices, so it will possibly be noticeably slower than int8
  - incompatible with dynamo, will automatically switch to `int8-quanto` for you and keep dynamo enabled for speedup.
  - uses MARLIN kernel for fp8 GEMM
  - incompatible with dynamo, will automatically disable dynamo if the combination is attempted.
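
The `fp8-quanto` and dynamo interaction can be sketched as follows, based on the bullet stating that dynamo is disabled automatically. This is a minimal illustration and not SimpleTuner's actual code: the function name `resolve_dynamo_backend`, the set `DYNAMO_INCOMPATIBLE`, and the use of `None` to mean "dynamo off" are all assumptions.

```python
from typing import Optional

# Precision levels the notes above describe as incompatible with dynamo.
# (Hypothetical constant, introduced only for this sketch.)
DYNAMO_INCOMPATIBLE = {"fp8-quanto"}


def resolve_dynamo_backend(precision: str, requested_backend: Optional[str]) -> Optional[str]:
    """Return the dynamo backend to use, or None when dynamo should stay off."""
    if requested_backend is None:
        # Dynamo was never requested, so there is nothing to resolve.
        return None
    if precision in DYNAMO_INCOMPATIBLE:
        # Keep the requested precision and drop dynamo instead.
        return None
    return requested_backend
```

Under these assumptions, `resolve_dynamo_backend("fp8-quanto", "inductor")` yields `None`, while `int8-quanto` and `fp8uz-quanto` keep the requested `inductor` backend.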

#### TorchAO

@@ -79,8 +82,9 @@ A newer library from Pytorch, AO allows us to replace the linears and 2D convolu

- `int8-torchao` will reduce memory consumption to the same level as any of Quanto's precision levels
  - at the time of writing, runs slightly slower (11s/iter) than Quanto does (9s/iter) on Apple MPS
  - Same speed and memory use as `int8-quanto` on CUDA devices, unknown speed profile on ROCm
- `fp8-torchao` is not enabled for use due to bugs in the implementation.
  - When not using `torch.compile`, same speed and memory use as `int8-quanto` on CUDA devices, unknown speed profile on ROCm
  - When using `torch.compile`, slower than `int8-quanto`
- `fp8-torchao` is not enabled due to bugs in the implementation.

#### Torch Dynamo

@@ -89,6 +93,14 @@ To enable `torch.compile()`, add the following line to `config/config.env`:
```bash
TRAINING_DYNAMO_BACKEND=inductor
```

If you wish to use added features like max-autotune, run the following:

```bash
accelerate config
```

Carefully answer the questions and use bf16 mixed precision training when prompted. Say **yes** to using Dynamo, **no** to fullgraph, and **yes** to max-autotune.

Note that the first several steps of training will be slower than usual because of compilation occurring in the background.
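
For orientation, the answers recommended above roughly correspond to the following `torch.compile()` keyword arguments. This is a hedged sketch of the mapping, not what `accelerate` or SimpleTuner literally executes: `DYNAMO_KWARGS` and `compile_model` are hypothetical names, and PyTorch is imported lazily so the settings can be inspected without it installed.

```python
# Assumed mapping of the `accelerate config` answers onto torch.compile() arguments.
DYNAMO_KWARGS = {
    "backend": "inductor",   # same backend as TRAINING_DYNAMO_BACKEND=inductor
    "mode": "max-autotune",  # "yes" to max-autotune
    "fullgraph": False,      # "no" to fullgraph
}


def compile_model(model):
    """Wrap a torch.nn.Module with torch.compile() using the settings above."""
    import torch  # lazy import: only needed when a model is actually compiled

    # torch.compile() returns a wrapper; compilation itself happens on the
    # first forward passes, which is why early training steps run slower.
    return torch.compile(model, **DYNAMO_KWARGS)
```

Calling `compile_model(model)` returns immediately. The compilation cost is paid during the first few training steps, which matches the slowdown noted above.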

---