From e10a07d8e75d328d7b4b55765a058e9ae7dfb12e Mon Sep 17 00:00:00 2001
From: bghira
Date: Wed, 2 Oct 2024 13:53:10 -0600
Subject: [PATCH] update guidance

---
 OPTIONS.md | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/OPTIONS.md b/OPTIONS.md
index 29eb601b..95c56ec3 100644
--- a/OPTIONS.md
+++ b/OPTIONS.md
@@ -62,7 +62,9 @@ The script `configure.py` in the project root can be used via `python configure.
 Provided by Hugging Face, the optimum-quanto library has robust support across all supported platforms.
 
 - `int8-quanto` is the most broadly compatible and probably produces the best results
+  - fastest training for RTX4090 and probably other GPUs
   - uses hardware-accelerated matmul on CUDA devices for int8, int4
+    - int4 is still abysmally slow
   - works with `TRAINING_DYNAMO_BACKEND=inductor` (`torch.compile()`)
 - `fp8uz-quanto` is an experimental fp8 variant for CUDA and ROCm devices.
   - better-supported on AMD silicon such as Instinct or newer architecture
@@ -70,7 +72,8 @@ Provided by Hugging Face, the optimum-quanto library has robust support across a
   - works with `TRAINING_DYNAMO_BACKEND=inductor` (`torch.compile()`)
 - `fp8-quanto` will not (currently) use fp8 matmul, does not work on Apple systems.
   - does not have hardware fp8 matmul yet on CUDA or ROCm devices, so it will possibly be noticeably slower than int8
-  - incompatible with dynamo, will automatically switch to `int8-quanto` for you and keep dynamo enabled for speedup.
+  - uses MARLIN kernel for fp8 GEMM
+  - incompatible with dynamo, will automatically disable dynamo if the combination is attempted.
 
 #### TorchAO
 
@@ -79,8 +82,9 @@ A newer library from Pytorch, AO allows us to replace the linears and 2D convolu
 
 - `int8-torchao` will reduce memory consumption to the same level as any of Quanto's precision levels
   - at the time of writing, runs slightly slower (11s/iter) than Quanto does (9s/iter) on Apple MPS
-  - Same speed and memory use as `int8-quanto` on CUDA devices, unknown speed profile on ROCm
-- `fp8-torchao` is not enabled for use due to bugs in the implementation.
+  - When not using `torch.compile`, same speed and memory use as `int8-quanto` on CUDA devices, unknown speed profile on ROCm
+  - When using `torch.compile`, slower than `int8-quanto`
+- `fp8-torchao` is not enabled due to bugs in the implementation.
 
 #### Torch Dynamo
 
@@ -89,6 +93,14 @@ To enable `torch.compile()`, add the following line to `config/config.env`:
 ```
 TRAINING_DYNAMO_BACKEND=inductor
 ```
+If you wish to use added features like max-autotune, run the following:
+
+```bash
+accelerate config
+```
+
+Carefully answer the questions and use bf16 mixed precision training when prompted. Say **yes** to using Dynamo, **no** to fullgraph, and **yes** to max-autotune.
+
 Note that the first several steps of training will be slower than usual because of compilation occuring in the background.
 
 ---
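
For reference, the answers given to `accelerate config` end up in accelerate's saved configuration file. The sketch below assumes the default config path and the field names used by recent accelerate releases (`mixed_precision`, `dynamo_config`); exact keys and file location vary by version, so treat it as illustrative rather than canonical:

```bash
# Illustrative check only: the path and key names are assumed from recent
# accelerate releases and may differ on your install.
cat ~/.cache/huggingface/accelerate/default_config.yaml
# With the answers described above, expect entries along these lines:
#   mixed_precision: bf16
#   dynamo_config:
#     dynamo_backend: INDUCTOR
#     dynamo_mode: max-autotune
#     dynamo_use_fullgraph: false
```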