Megatron LLM documentation updates #7400

Merged
merged 116 commits on Oct 17, 2023

Commits (116)
59d9325
Create pos_emb.rst
ssh-meister Sep 8, 2023
35ba3ec
Update pos_emb.rst
ssh-meister Sep 8, 2023
2939667
Update pos_emb.rst
ssh-meister Sep 8, 2023
99a2350
Update pos_emb.rst
ssh-meister Sep 8, 2023
2e5af17
Update pos_emb.rst
ssh-meister Sep 8, 2023
5314d2d
Update pos_emb.rst
ssh-meister Sep 8, 2023
36de4c3
Update pos_emb.rst
ssh-meister Sep 8, 2023
28d5b0d
Update and rename docs/source/nlp/pos_emb.rst to docs/source/nlp/nemo…
ssh-meister Sep 8, 2023
85f4e63
Rename positional_embeddings.rst to positional_embeddings.rst
ssh-meister Sep 8, 2023
ce6113d
Create flash_attention.rst
ssh-meister Sep 8, 2023
2e03e6c
Changed value for model.seq_len_interpolation_factor to 2
ssh-meister Sep 11, 2023
b27d077
fixed flash_attention enabling for t5
ssh-meister Sep 11, 2023
252123d
[TTS] Added a callback for logging initial data (#7384)
anteju Sep 8, 2023
5341a73
Update Core Commit (#7402)
aklife97 Sep 9, 2023
68142d6
Use cfg attribute in bert (#7394)
maanug-nv Sep 9, 2023
2a60fb0
Add support for bias conversion in Swiglu models (#7386)
titu1994 Sep 9, 2023
8b5ef41
Update save_to and restore_from for dist checkpointing (#7343)
ericharper Sep 9, 2023
b24fdb9
fix forward for with mcore=false (#7403)
JimmyZhang12 Sep 9, 2023
c85dcba
Fix logging to remove 's/it' from progress bar in Megatron models and…
athitten Sep 9, 2023
ae5d3ce
Set Activation Checkpointing Defaults (#7404)
aklife97 Sep 9, 2023
153c53b
make loss mask default to false (#7407)
ericharper Sep 9, 2023
b61b9ab
Add dummy userbuffer config files (#7408)
erhoo82 Sep 9, 2023
3f36e11
add missing ubconf files (#7412)
aklife97 Sep 11, 2023
53224a2
New tutorial on Speech Data Explorer (#7405)
Jorjeous Sep 11, 2023
a8b85b7
Update ptl training ckpt conversion script to work with dist ckpt (#7…
ericharper Sep 12, 2023
a81bdbe
Allow disabling sanity checking when num_sanity_val_steps=0 (#7413)
athitten Sep 12, 2023
b79ec2a
Add comprehensive error messages (#7261)
PeganovAnton Sep 12, 2023
0d85ab8
check NEMO_PATH (#7418)
karpnv Sep 12, 2023
f7fc3bd
layer selection for ia3 (#7417)
arendu Sep 13, 2023
3c90b5e
Fix missing pip package 'einops' (#7397)
RobinDong Sep 14, 2023
12e77d4
Fix failure of pyaudio in Google Colab (#7396)
RobinDong Sep 15, 2023
aa3f36d
Update README.md: output_path --> output_manifest_filepath (#7442)
popcornell Sep 18, 2023
0a8eaf2
Add rope dynamic linear scaling (#7437)
hsiehjackson Sep 18, 2023
5bb2940
Fix None dataloader issue in PTL2.0 (#7455)
KunalDhawan Sep 19, 2023
bdcde46
[ASR] Confidence measure -> method renames (#7434)
GNroy Sep 19, 2023
3b45cbe
Add steps for document of getting dataset 'SF Bilingual Speech' (#7378)
RobinDong Sep 19, 2023
b9ddba7
RNN-T confidence and alignment bugfix (#7381)
GNroy Sep 19, 2023
a549204
Fix resume from checkpoint in exp_manager (#7424) (#7426)
github-actions[bot] Sep 19, 2023
0ff844f
Fix checking of cuda/cpu device for inputs of Decoder (#7444)
RobinDong Sep 19, 2023
b01b01f
Fix failure of ljspeech's get_data.py (#7430)
RobinDong Sep 19, 2023
422d464
[TTS] Fix audio codec type checks (#7373)
rlangman Sep 19, 2023
b4bf4ee
[TTS] Add dataset to path of logged artifacts (#7462)
rlangman Sep 20, 2023
7b444dc
Fix sft dataset truncation (#7464)
hsiehjackson Sep 20, 2023
f550583
Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330)
burchim Sep 20, 2023
fd17915
HF StarCoder to NeMo conversion script (#7421)
janekl Sep 20, 2023
0076724
fix bug when loading dist ckpt in peft (#7452)
lhb8125 Sep 21, 2023
45babd2
Fix adding positional embeddings in-place in transformer module (#7440)
The0nix Sep 21, 2023
84584c0
Fix (#7478)
hsiehjackson Sep 22, 2023
37cc67f
add sleep (#7498) (#7499)
github-actions[bot] Sep 24, 2023
2b76509
Fix exp manager check for sleep (#7503) (#7504)
github-actions[bot] Sep 25, 2023
2585e0a
bugfix: trainer.accelerator=auto from None. (#7492) (#7493)
github-actions[bot] Sep 25, 2023
9c83365
[doc] fix broken link (#7481)
stas00 Sep 25, 2023
92f0eec
[TTS] Read audio as int32 to avoid flac read errors (#7477)
rlangman Sep 26, 2023
c5d917b
Add dataset 'AISHELL-3' from OpenSLR for training mandarin TTS (#7409)
RobinDong Sep 26, 2023
e12d30a
dllogger - log on rank 0 only (#7513)
stas00 Sep 26, 2023
e60cd38
Fix TTS FastPitch tutorial (#7494) (#7516)
github-actions[bot] Sep 26, 2023
1c3c43c
Fix get_dist() tensor dimension (#7506) (#7515)
github-actions[bot] Sep 26, 2023
c2d70e6
bugfix: specify trainer.strategy=auto when devices=1 (#7509) (#7512)
github-actions[bot] Sep 26, 2023
9317d15
fix (#7511)
aklife97 Sep 26, 2023
20b15f3
[TTS] Fix FastPitch data prep tutorial (#7524)
rlangman Sep 27, 2023
8b15984
add italian tokenization (#7486)
GiacomoLeoneMaria Sep 27, 2023
e81dda2
Replace None strategy with auto in tutorial notebooks (#7521) (#7527)
github-actions[bot] Sep 27, 2023
1be2b40
unpin setuptools (#7534) (#7535)
github-actions[bot] Sep 27, 2023
32c06fa
remove auto generated examples (#7510)
arendu Sep 27, 2023
3a1818f
Add the `strategy` argument to `MegatronGPTModel.generate()` (#7264)
odelalleau Sep 27, 2023
fd59a84
Fix PTL2.0 related ASR bugs in r1.21.0: Val metrics logging, None dat…
github-actions[bot] Sep 27, 2023
57116f4
gpus -> devices (#7542) (#7545)
github-actions[bot] Sep 28, 2023
40c8f08
Update FFMPEG version to fix issue with torchaudio (#7551) (#7553)
github-actions[bot] Sep 28, 2023
9bc0238
PEFT GPT & T5 Refactor (#7308)
meatybobby Sep 28, 2023
1da24ef
fix a typo (#7496)
BestJuly Sep 28, 2023
c3d3ffc
[TTS] remove curly braces from ${BRANCH} in jupyer notebook cell. (#7…
github-actions[bot] Sep 28, 2023
6849c94
add youtube embed url (#7570)
XuesongYang Sep 29, 2023
ec4f8c3
Remap speakers to continuous range of speaker_id for dataset AISHELL3…
RobinDong Sep 29, 2023
4858db4
fix validation_step_outputs initialization for multi-dataloader (#754…
github-actions[bot] Sep 29, 2023
122eced
Append output of val step to self.validation_step_outputs (#7530) (#7…
github-actions[bot] Sep 29, 2023
ec9e251
[TTS] fixed trainer's accelerator and strategy. (#7569) (#7574)
github-actions[bot] Sep 29, 2023
e99d530
Append val/test output to instance variable in EncDecSpeakerLabelMode…
github-actions[bot] Sep 29, 2023
e7b3b71
Fix CustomProgressBar for resume (#7427) (#7522)
github-actions[bot] Sep 30, 2023
abbef3b
fix typos in nfa and speech enhancement tutorials (#7580) (#7583)
github-actions[bot] Sep 30, 2023
468e8b0
Add strategy as ddp_find_unused_parameters_true for glue_benchmark.py…
github-actions[bot] Sep 30, 2023
8406b69
update strategy (#7577) (#7578)
github-actions[bot] Sep 30, 2023
02dac3b
Fix typos (#7581)
Kipok Oct 2, 2023
e434de9
Change hifigan finetune strategy to ddp_find_unused_parameters_true (…
github-actions[bot] Oct 2, 2023
6a2a145
[BugFix] Add missing quotes for auto strategy in tutorial notebooks (…
github-actions[bot] Oct 2, 2023
6db0b2b
add build os key (#7596) (#7599)
github-actions[bot] Oct 2, 2023
8f2a31c
StarCoder SFT test + bump PyT NGC image to 23.09 (#7540)
janekl Oct 2, 2023
8f306a2
defaults changed (#7600)
arendu Oct 3, 2023
2f6fa29
add ItalianPhonemesTokenizer (#7587)
GiacomoLeoneMaria Oct 3, 2023
71f327f
best ckpt fix (#7564) (#7588)
github-actions[bot] Oct 3, 2023
6517360
Add files via upload (#7598)
Jorjeous Oct 3, 2023
e52c99b
Fix validation in G2PModel and ThutmoseTaggerModel (#7597) (#7606)
github-actions[bot] Oct 3, 2023
cbb499c
Broadcast loss only when using pipeline parallelism and within the pi…
github-actions[bot] Oct 3, 2023
2da5c02
Safeguard nemo_text_processing installation on ARM (#7485)
blisc Oct 3, 2023
5f01aab
Bound transformers version in requirements (#7620)
athitten Oct 4, 2023
36ba71f
fix llama2 70b lora tuning bug (#7622)
cuichenx Oct 4, 2023
f8980ba
Fix import error no module name model_utils (#7629)
menon92 Oct 4, 2023
c8aa8ac
add fc large ls models (#7641)
nithinraok Oct 4, 2023
ad3a4de
bugfix: trainer.gpus, trainer.strategy, trainer.accelerator (#7621) (…
github-actions[bot] Oct 5, 2023
f83edf6
fix ssl models ptl monitor val through logging (#7608) (#7614)
github-actions[bot] Oct 5, 2023
99a914b
Fix metrics for SE tutorial (#7604) (#7612)
github-actions[bot] Oct 5, 2023
f1f3835
Add ddp_find_unused_parameters=True and change accelerator to auto (#…
github-actions[bot] Oct 5, 2023
1f280a9
Fix py3.11 dataclasses issue (#7616)
github-actions[bot] Oct 5, 2023
2ab6aa3
Fix issues with Dockerfile (#7650) (#7652)
github-actions[bot] Oct 6, 2023
290e228
[ASR] RNN-T greedy decoding max_frames fix for alignment and confiden…
GNroy Oct 6, 2023
25e86ab
[ASR] Fix type error in jasper (#7636) (#7653)
github-actions[bot] Oct 6, 2023
d4b6a75
[TTS] Add STFT and SI-SDR loss to audio codec recipe (#7468)
rlangman Oct 6, 2023
03b2846
Create per.py (#7538)
ssh-meister Oct 7, 2023
6c777ad
conversion issue fix (#7648) (#7668)
github-actions[bot] Oct 10, 2023
b85e9b4
layernorm1p fix (#7523) (#7567)
github-actions[bot] Oct 10, 2023
84e18eb
generalized chat sft prompt (#7655)
yidong72 Oct 10, 2023
182b1aa
Added References
ssh-meister Oct 10, 2023
2e17555
added to toctree
ssh-meister Oct 10, 2023
085a5ce
Merge branch 'main' into llm_docs_upd
ssh-meister Oct 10, 2023
453b6e0
Merge branch 'main' into llm_docs_upd
ssh-meister Oct 10, 2023
da2723f
Merge branch 'main' into llm_docs_upd
ssh-meister Oct 16, 2023
057767d
Merge branch 'main' into llm_docs_upd
ssh-meister Oct 17, 2023
28 changes: 28 additions & 0 deletions docs/source/nlp/nemo_megatron/flash_attention.rst
@@ -0,0 +1,28 @@
Flash attention
---------------
Flash Attention :cite:`nlp-megatron-dao2022flashattention` is a method designed to enhance the efficiency of Transformer models, which are widely used in applications such as natural language processing. Traditional Transformers are slow and consume a lot of memory, especially with long sequences, due to the quadratic time and memory complexity of self-attention. FlashAttention is an IO-aware exact attention algorithm that uses tiling to minimize the number of memory reads/writes between the GPU's high-bandwidth memory (HBM) and on-chip SRAM, making it more efficient in terms of IO complexity than standard attention implementations.
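
The snippet below is a minimal, self-contained PyTorch sketch (not the NeMo code path) of the point above: flash attention is *exact*, so swapping the kernel changes memory traffic and speed but not the result. It assumes a CUDA device, half-precision inputs, and PyTorch 2.x, where :code:`torch.nn.functional.scaled_dot_product_attention` can dispatch to a flash kernel.

.. code:: python

import torch
import torch.nn.functional as F

# Long sequence: a naive implementation materializes a (seq_len x seq_len)
# attention matrix per head in HBM; the flash kernel never does.
batch, heads, seq_len, head_dim = 2, 16, 4096, 64
q, k, v = (torch.randn(batch, heads, seq_len, head_dim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

# Standard ("math") attention.
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True,
                                    enable_mem_efficient=False):
    out_math = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Flash attention: tiled, IO-aware, exact.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                    enable_mem_efficient=False):
    out_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# The two outputs agree up to fp16 round-off.
print((out_math - out_flash).abs().max())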

GPT
^^^
To enable Flash Attention during Megatron GPT model training or fine-tuning, set the following configuration option:

.. code::

model.use_flash_attention=True

T5
^^
To enable Flash Attention during Megatron T5 model training, set the following configuration options:

.. code::

model.encoder.use_flash_attention=True
model.decoder.use_flash_attention=True

References
----------

.. bibliography:: ../nlp_all.bib
:style: plain
:labelprefix: nlp-megatron
:keyprefix: nlp-megatron-
2 changes: 2 additions & 0 deletions docs/source/nlp/nemo_megatron/intro.rst
@@ -26,6 +26,8 @@ team at NVIDIA. NeMo Megatron supports several types of models:
retro/retro_model
hiddens/hiddens_module
peft/landing_page
flash_attention
positional_embeddings


References
111 changes: 111 additions & 0 deletions docs/source/nlp/nemo_megatron/positional_embeddings.rst
@@ -0,0 +1,111 @@
Positional embeddings
---------------------

Positional embeddings are used to give the model information about the position of each element in a sequence. Megatron LLM supports the following positional embedding types:

GPT
^^^

.. list-table:: *Supported positional embeddings in GPT models*
:widths: 10 30 60
:header-rows: 1

* - Parameter value
- How to use
- Description

* - **learned_absolute**
- .. code::

model.position_embedding_type='learned_absolute'
- Absolute Position Encodings :cite:`nlp-megatron-vaswani2023attention` are position embeddings used in Transformer-based models, added to input embeddings in the encoder and decoder sections. These encodings match the dimension of embeddings and are created using sine and cosine functions of various frequencies. Each dimension in the encoding corresponds to a sinusoid with wavelengths forming a geometric progression.

* - **rope**
- .. code::

model.position_embedding_type='rope'
model.rotary_percentage=1.0
- Rotary Position Embedding (RoPE) :cite:`nlp-megatron-su2022roformer` encodes the absolute position of each token with a rotation matrix while preserving relative positional relationships in the self-attention formulation. It leverages the geometric properties of vectors and complex numbers, applying to the word embeddings a rotation whose angle depends on a preset non-zero constant and on the token positions. A minimal sketch of the rotation follows this table.

* - **alibi**
- .. code::

model.position_embedding_type='alibi'
- Attention with Linear Biases (ALiBi) :cite:`nlp-megatron-press2022train` modifies the way attention scores are computed in the attention sublayer of the network. ALiBi introduces a static, non-learned bias after the query-key dot product during the computation of attention scores. This bias is added in the form of a head-specific slope that is determined before training, creating a geometric sequence of slopes for the different heads in the model. The method has an inductive bias towards recency, penalizing attention scores between distant query-key pairs with the penalty increasing as the distance grows, and it leverages different rates of penalty increase across different heads based on the slope magnitude.

* - **kerple**
- .. code::

model.position_embedding_type='kerple'
- Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) :cite:`nlp-megatron-chi2022kerple` generalizes relative positional embeddings (RPE) by kernelizing positional differences using conditionally positive definite (CPD) kernels known for generalizing distance metrics. They transform CPD kernels into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach allows for a variety of RPEs that facilitate length extrapolation in a principled manner.

* - **xpos**
- .. code::

model.position_embedding_type='xpos'
- Extrapolatable Position Embedding (xPos) :cite:`nlp-megatron-sun2022lengthextrapolatable` extends rotary embeddings with an exponential decay applied to the rotated queries and keys, so that the contribution of distant tokens to the attention score is attenuated, which improves stability and length extrapolation.

* - **sandwich**
- .. code::

model.position_embedding_type='sandwich'
- Sandwich :cite:`nlp-megatron-chi2023dissecting` is a parameter-free relative positional embedding that reuses the cross term of sinusoidal position embeddings as a bias on the attention logits; the bias shrinks as the query-key distance grows, which helps length extrapolation.

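The following is a minimal, illustrative sketch of the rotation that RoPE applies to queries and keys (simplified, and not the NeMo implementation; the helper :code:`rope_rotate` is invented for the example). It demonstrates the defining property: the attention score between a rotated query and a rotated key depends only on their relative offset, not on their absolute positions.

.. code:: python

import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) dimension pair of x (seq_len, head_dim) by a position-dependent angle."""
    head_dim = x.shape[-1]
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq_len, head_dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

torch.manual_seed(0)
q, k = torch.randn(2, 8), torch.randn(2, 8)   # two tokens, head_dim = 8
pos_a = torch.tensor([3, 7])                  # query at position 3, key at position 7
pos_b = torch.tensor([103, 107])              # same offset (4), shifted by 100

score_a = rope_rotate(q, pos_a)[0] @ rope_rotate(k, pos_a)[1]
score_b = rope_rotate(q, pos_b)[0] @ rope_rotate(k, pos_b)[1]
print(torch.allclose(score_a, score_b, atol=1e-4))  # True: the score depends only on the offset
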
T5
^^

.. list-table:: *Supported positional embeddings in T5 models*
:widths: 10 30 60
:header-rows: 1

* - Parameter value
- How to use
- Description

* - **learned_absolute**
- .. code::

model.encoder.position_embedding_type='learned_absolute'
model.decoder.position_embedding_type='learned_absolute'
- Absolute Position Encodings :cite:`nlp-megatron-vaswani2023attention` are position embeddings used in Transformer-based models, added to input embeddings in the encoder and decoder sections. These encodings match the dimension of embeddings and are created using sine and cosine functions of various frequencies. Each dimension in the encoding corresponds to a sinusoid with wavelengths forming a geometric progression.

* - **relative**
- .. code::

model.encoder.position_embedding_type='relative'
model.decoder.position_embedding_type='relative'
- Relative Position Representations :cite:`nlp-megatron-shaw2018selfattention` extend self-attention to take the relative distances between sequence elements into account, adding learned embeddings of the (clipped) relative positions instead of relying on absolute position encodings.

* - **alibi**
- .. code::

model.encoder.position_embedding_type='alibi'
model.decoder.position_embedding_type='alibi'
- Attention with Linear Biases (ALiBi) :cite:`nlp-megatron-press2022train` modifies the way attention scores are computed in the attention sublayer of the network. ALiBi introduces a static, non-learned bias after the query-key dot product during the computation of attention scores. This bias is added in the form of a head-specific slope that is determined before training, creating a geometric sequence of slopes for the different heads in the model. The method has an inductive bias towards recency, penalizing attention scores between distant query-key pairs with the penalty increasing as the distance grows, and it leverages different rates of penalty increase across different heads based on the slope magnitude. A minimal sketch of the bias construction follows this table.

* - **kerple**
- .. code::

model.encoder.position_embedding_type='kerple'
model.decoder.position_embedding_type='kerple'
- Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) :cite:`nlp-megatron-chi2022kerple` generalizes relative positional embeddings (RPE) by kernelizing positional differences using conditionally positive definite (CPD) kernels known for generalizing distance metrics. They transform CPD kernels into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach allows for a variety of RPEs that facilitate length extrapolation in a principled manner.

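The ALiBi bias referenced above can be sketched in a few lines (illustrative only; not the NeMo implementation). Each head receives a fixed slope from a geometric sequence, and the bias added to the attention logits grows linearly with the query-key distance.

.. code:: python

import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    """Head-specific slopes forming a geometric sequence (assumes num_heads is a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return start ** torch.arange(1, num_heads + 1).float()

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Static bias added to the attention logits before softmax."""
    positions = torch.arange(seq_len)
    relative = positions[None, :] - positions[:, None]   # entry (i, j) is j - i
    # Keys in the past (j < i) get an increasingly negative bias; future keys
    # (j > i) are removed by the causal mask in a decoder anyway.
    return alibi_slopes(num_heads)[:, None, None] * relative  # (num_heads, seq_len, seq_len)

print(alibi_slopes(8))      # 0.5, 0.25, ..., 1/256
print(alibi_bias(8, 5)[0])  # head 0: penalty of 0.5 per token of distance
# Usage: scores = q @ k.transpose(-1, -2) / head_dim ** 0.5 + alibi_bias(heads, seq_len)
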
Positional interpolation
------------------------
Position Interpolation (PI) :cite:`nlp-megatron-chen2023extending` is a method introduced to extend the context window sizes of Rotary Position Embedding (RoPE)-based pretrained large language models (LLMs). The central principle of PI is to reduce the position indices so that they align with the initial context window size through interpolation.

Position Interpolation is supported in Megatron GPT SFT models. To enable it, set the RoPE sequence-length interpolation factor, :code:`model.seq_len_interpolation_factor`, as in the example below; a short sketch of the index scaling follows.

.. code::

model.position_embedding_type='rope'
model.rotary_percentage=1.0
model.seq_len_interpolation_factor=2

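The index scaling behind :code:`seq_len_interpolation_factor` can be illustrated with a short sketch (simplified; not the NeMo implementation): the position indices fed into RoPE are divided by the interpolation factor, so an extended context maps back onto the position range seen during pre-training.

.. code:: python

import torch

pretrained_max_positions = 4096           # illustrative pre-training context size
seq_len_interpolation_factor = 2          # matches the configuration example above
extended_seq_len = pretrained_max_positions * seq_len_interpolation_factor

positions = torch.arange(extended_seq_len, dtype=torch.float32)
interpolated = positions / seq_len_interpolation_factor

print(positions.max().item())     # 8191.0 -- outside the pre-training position range
print(interpolated.max().item())  # 4095.5 -- back inside [0, 4096), at half-integer steps
# The interpolated indices are then multiplied by the RoPE inverse frequencies as usual.
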
References
----------

.. bibliography:: ../nlp_all.bib
:style: plain
:labelprefix: nlp-megatron
:keyprefix: nlp-megatron-
81 changes: 81 additions & 0 deletions docs/source/nlp/nlp_all.bib
@@ -225,3 +225,84 @@ @misc{antonova2023spellmapper
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{dao2022flashattention,
title={FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness},
author={Tri Dao and Daniel Y. Fu and Stefano Ermon and Atri Rudra and Christopher Ré},
year={2022},
eprint={2205.14135},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

@misc{vaswani2023attention,
title={Attention Is All You Need},
author={Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
year={2023},
eprint={1706.03762},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{su2022roformer,
title={RoFormer: Enhanced Transformer with Rotary Position Embedding},
author={Jianlin Su and Yu Lu and Shengfeng Pan and Ahmed Murtadha and Bo Wen and Yunfeng Liu},
year={2022},
eprint={2104.09864},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{press2022train,
title={Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation},
author={Ofir Press and Noah A. Smith and Mike Lewis},
year={2022},
eprint={2108.12409},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{chi2022kerple,
title={KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation},
author={Ta-Chung Chi and Ting-Han Fan and Peter J. Ramadge and Alexander I. Rudnicky},
year={2022},
eprint={2205.09921},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{sun2022lengthextrapolatable,
title={A Length-Extrapolatable Transformer},
author={Yutao Sun and Li Dong and Barun Patra and Shuming Ma and Shaohan Huang and Alon Benhaim and Vishrav Chaudhary and Xia Song and Furu Wei},
year={2022},
eprint={2212.10554},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{chi2023dissecting,
title={Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis},
author={Ta-Chung Chi and Ting-Han Fan and Alexander I. Rudnicky and Peter J. Ramadge},
year={2023},
eprint={2212.10356},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{shaw2018selfattention,
title={Self-Attention with Relative Position Representations},
author={Peter Shaw and Jakob Uszkoreit and Ashish Vaswani},
year={2018},
eprint={1803.02155},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{chen2023extending,
title={Extending Context Window of Large Language Models via Positional Interpolation},
author={Shouyuan Chen and Sherman Wong and Liangjian Chen and Yuandong Tian},
year={2023},
eprint={2306.15595},
archivePrefix={arXiv},
primaryClass={cs.CL}
}