From e10a07d8e75d328d7b4b55765a058e9ae7dfb12e Mon Sep 17 00:00:00 2001
From: bghira
Date: Wed, 2 Oct 2024 13:53:10 -0600
Subject: [PATCH] update guidance

---
 OPTIONS.md | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/OPTIONS.md b/OPTIONS.md
index 29eb601b..95c56ec3 100644
--- a/OPTIONS.md
+++ b/OPTIONS.md
@@ -62,7 +62,9 @@ The script `configure.py` in the project root can be used via `python configure.
 Provided by Hugging Face, the optimum-quanto library has robust support across all supported platforms.
 
 - `int8-quanto` is the most broadly compatible and probably produces the best results
+  - fastest training for RTX4090 and probably other GPUs
   - uses hardware-accelerated matmul on CUDA devices for int8, int4
+    - int4 is still abysmally slow
   - works with `TRAINING_DYNAMO_BACKEND=inductor` (`torch.compile()`)
 - `fp8uz-quanto` is an experimental fp8 variant for CUDA and ROCm devices.
   - better-supported on AMD silicon such as Instinct or newer architecture
@@ -70,7 +72,8 @@ Provided by Hugging Face, the optimum-quanto library has robust support across a
   - works with `TRAINING_DYNAMO_BACKEND=inductor` (`torch.compile()`)
 - `fp8-quanto` will not (currently) use fp8 matmul, does not work on Apple systems.
   - does not have hardware fp8 matmul yet on CUDA or ROCm devices, so it will possibly be noticeably slower than int8
-  - incompatible with dynamo, will automatically switch to `int8-quanto` for you and keep dynamo enabled for speedup.
+  - uses MARLIN kernel for fp8 GEMM
+  - incompatible with dynamo, will automatically disable dynamo if the combination is attempted.
 
 #### TorchAO
 
@@ -79,8 +82,9 @@ A newer library from Pytorch, AO allows us to replace the linears and 2D convolu
 
 - `int8-torchao` will reduce memory consumption to the same level as any of Quanto's precision levels
   - at the time of writing, runs slightly slower (11s/iter) than Quanto does (9s/iter) on Apple MPS
-  - Same speed and memory use as `int8-quanto` on CUDA devices, unknown speed profile on ROCm
-- `fp8-torchao` is not enabled for use due to bugs in the implementation.
+  - When not using `torch.compile`, same speed and memory use as `int8-quanto` on CUDA devices, unknown speed profile on ROCm
+  - When using `torch.compile`, slower than `int8-quanto`
+- `fp8-torchao` is not enabled due to bugs in the implementation.
 
 #### Torch Dynamo
 
@@ -89,6 +93,14 @@ To enable `torch.compile()`, add the following line to `config/config.env`:
 ```
 TRAINING_DYNAMO_BACKEND=inductor
 ```
+If you wish to use added features like max-autotune, run the following:
+
+```bash
+accelerate config
+```
+
+Carefully answer the questions and use bf16 mixed precision training when prompted. Say **yes** to using Dynamo, **no** to fullgraph, and **yes** to max-autotune.
+
 Note that the first several steps of training will be slower than usual because of compilation occuring in the background.
 
 ---
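
For reference, the answers given to `accelerate config` end up in accelerate's saved configuration file. The sketch below assumes the default config path and the field names used by recent accelerate releases (`mixed_precision`, `dynamo_config`); exact keys and file location vary by version, so treat it as illustrative rather than canonical:

```bash
# Illustrative check only: the path and key names are assumed from recent
# accelerate releases and may differ on your install.
cat ~/.cache/huggingface/accelerate/default_config.yaml
# With the answers described above, expect entries along these lines:
#   mixed_precision: bf16
#   dynamo_config:
#     dynamo_backend: INDUCTOR
#     dynamo_mode: max-autotune
#     dynamo_use_fullgraph: false
```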