Update PEFT doc #8664

Merged (2 commits) on Mar 16, 2024
20 changes: 11 additions & 9 deletions docs/source/nlp/nemo_megatron/peft/landing_page.rst

points, PEFT achieves comparable performance to full finetuning at a
fraction of the computational and storage costs.

NeMo supports four PEFT methods which can be used with various
transformer-based models. `Here <https://github.com/NVIDIA/NeMo/tree/main/scripts/nlp_language_modeling>`__
is a collection of conversion scripts that convert
popular models from Hugging Face (HF) format to NeMo format.

==================== ===== ======== ========= ====== ========= ===== ==
\                    GPT 3 Nemotron LLaMa 1/2 Falcon Starcoder Gemma T5
==================== ===== ======== ========= ====== ========= ===== ==
LoRA                 ✅    ✅       ✅        ✅     ✅        ✅    ✅
P-Tuning             ✅    ✅       ✅        ✅     ✅        ✅    ✅
Adapters (Canonical) ✅    ✅       ✅        ✅     ✅        ✅
IA3                  ✅    ✅       ✅        ✅     ✅        ✅
==================== ===== ======== ========= ====== ========= ===== ==

Learn more about PEFT in NeMo with the :ref:`peftquickstart`, which provides an overview of how PEFT works
in NeMo. Read about the supported PEFT methods
24 changes: 16 additions & 8 deletions docs/source/nlp/nemo_megatron/peft/supported_methods.rst

NeMo supports the following PEFT tuning methods:
each case, the output linear layer is initialized to 0 to ensure
that an untrained adapter does not affect the normal forward pass
of the transformer layer.
- In NeMo, you can customize the adapter bottleneck dimension, the
adapter dropout amount, and the type and position of the
normalization layer (see the sketch below).
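
A minimal PyTorch sketch of the canonical adapter described above: a
down-projection, a nonlinearity, and a zero-initialized up-projection added
residually to the frozen layer's output. It illustrates the idea only and is
not NeMo's implementation; the class name, default bottleneck size, and
LayerNorm placement are illustrative assumptions.

.. code-block:: python

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Canonical adapter: LayerNorm -> down-projection -> ReLU -> up-projection.

        The up-projection starts at zero, so an untrained adapter leaves the
        frozen transformer layer's output unchanged.
        """

        def __init__(self, hidden_dim: int, bottleneck_dim: int = 32, dropout: float = 0.0):
            super().__init__()
            self.norm = nn.LayerNorm(hidden_dim)   # type/position of the norm is configurable in NeMo
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, hidden_dim)
            self.drop = nn.Dropout(dropout)
            nn.init.zeros_(self.up.weight)         # output linear layer initialized to 0
            nn.init.zeros_(self.up.bias)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # Residual bottleneck: x + Up(Dropout(ReLU(Down(Norm(x)))))
            return hidden_states + self.up(self.drop(torch.relu(self.down(self.norm(hidden_states)))))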

2. **LoRA**: `LoRA: Low-Rank Adaptation of Large Language
Models <http://arxiv.org/abs/2106.09685>`__

- LoRA makes fine-tuning efficient by representing weight updates
with two low rank decomposition matrices. The original model
weights remain frozen, while the low rank decomposition matrices
are updated to adapt to the new data, so the number of trainable
parameters is kept low. In contrast with adapters, the original
model weights and adapted weights can be combined during
inference, avoiding any architectural change or additional latency
in the model at inference time.
- In NeMo, you can customize the adapter bottleneck dimension and
the target modules to which LoRA is applied. LoRA can be applied to
any linear layer; in a transformer model, this includes 1) the Q, K, V
attention projections, 2) the attention output layer, and 3) either or
both of the two transformer MLP layers. Because NeMo's attention
implementation fuses QKV into a single projection, our LoRA
implementation learns a single low-rank projection for the combined
QKV (see the sketch below).
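
To make the factorization concrete, here is a PyTorch sketch that wraps a
single frozen linear layer with a trainable low-rank update ``B @ A`` and shows
how the update can be folded back into the base weights for zero-overhead
inference. It is a generic illustration using common LoRA conventions (rank,
alpha scaling, zero-initialized ``B``), not NeMo's fused-QKV implementation;
all names and defaults are assumptions.

.. code-block:: python

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen linear layer plus a trainable low-rank update B @ A."""

        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)        # original weights stay frozen
            if self.base.bias is not None:
                self.base.bias.requires_grad_(False)
            self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no effect at start
            self.scale = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

        @torch.no_grad()
        def merge(self) -> nn.Linear:
            # Fold the low-rank update into the frozen weights: no extra latency at inference.
            self.base.weight.add_(self.scale * (self.lora_b @ self.lora_a))
            return self.base

Applying one such wrapper to the fused QKV projection is, conceptually, what
learning a single low-rank projection for the combined QKV means.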

3. **IA3**: `Few-Shot Parameter-Efficient Fine-Tuning is Better and
Cheaper than In-Context Learning <http://arxiv.org/abs/2205.05638>`__
learning rescaling vectors can also be merged with the base
weights, leading to no architectural change and no additional
latency at inference time.
- There is no hyperparameter to tune for the IA3 adapter.
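
As an illustration of the rescaling idea, the sketch below defines the learned
vectors for one transformer layer following the formulation in the IA3 paper
linked above: elementwise scaling of the attention keys, values, and the
intermediate MLP activations. This is a toy example, not NeMo's code, and the
class and attribute names are assumptions. Initializing the vectors to ones
keeps the untrained adapter a no-op, and after training they can be folded into
the adjacent frozen weight matrices.

.. code-block:: python

    import torch
    import torch.nn as nn

    class IA3Scalers(nn.Module):
        """IA3 learned rescaling vectors for a single transformer layer.

        These vectors are the only trainable parameters; the base model
        weights remain frozen.
        """

        def __init__(self, kv_dim: int, ffn_dim: int):
            super().__init__()
            self.l_k = nn.Parameter(torch.ones(kv_dim))    # rescales attention keys
            self.l_v = nn.Parameter(torch.ones(kv_dim))    # rescales attention values
            self.l_ff = nn.Parameter(torch.ones(ffn_dim))  # rescales intermediate MLP activations

        def scale_keys(self, keys: torch.Tensor) -> torch.Tensor:
            return keys * self.l_k

        def scale_values(self, values: torch.Tensor) -> torch.Tensor:
            return values * self.l_v

        def scale_ffn(self, hidden: torch.Tensor) -> torch.Tensor:
            return hidden * self.l_ff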

4. **P-Tuning**: `GPT Understands,
Too <https://arxiv.org/abs/2103.10385>`__
vocabulary. They are simply 1D vectors that match the
dimensionality of real tokens which make up the model's
vocabulary.
- In p-tuning, an intermediate MLP model is used to generate
virtual token embeddings. We refer to this intermediate model as
our ``prompt_encoder``. The prompt encoder parameters are randomly
initialized at the start of p-tuning. All base model parameters
are frozen, and only the prompt encoder weights are updated at
each training step.
- In NeMo, you can customize the number of virtual tokens, as well
as the embedding and MLP bottleneck dimensions (see the sketch below).
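
The toy sketch below shows what such a prompt encoder can look like: a small,
randomly initialized MLP maps learnable seed embeddings to virtual token
embeddings, which are prepended to the frozen base model's input embeddings.
The class name, argument names, and default bottleneck size are illustrative
assumptions, not NeMo's exact configuration.

.. code-block:: python

    import torch
    import torch.nn as nn

    class PromptEncoder(nn.Module):
        """MLP prompt encoder that produces virtual token embeddings for p-tuning."""

        def __init__(self, num_virtual_tokens: int, hidden_dim: int, bottleneck_dim: int = 2048):
            super().__init__()
            # Randomly initialized seeds; only these and the MLP below are trained.
            self.seed = nn.Parameter(torch.randn(num_virtual_tokens, hidden_dim) * 0.02)
            self.mlp = nn.Sequential(
                nn.Linear(hidden_dim, bottleneck_dim),
                nn.ReLU(),
                nn.Linear(bottleneck_dim, hidden_dim),
            )

        def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
            # input_embeds: (batch, seq_len, hidden_dim) from the frozen model's embedding table.
            # Returns (batch, num_virtual_tokens + seq_len, hidden_dim).
            batch_size = input_embeds.size(0)
            virtual = self.mlp(self.seed).unsqueeze(0).expand(batch_size, -1, -1)
            return torch.cat([virtual, input_embeds], dim=1)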