Megatron LLM documentation updates #7400

Merged
merged 116 commits on Oct 17, 2023

Commits (116)
59d9325
Create pos_emb.rst
ssh-meister Sep 8, 2023
35ba3ec
Update pos_emb.rst
ssh-meister Sep 8, 2023
2939667
Update pos_emb.rst
ssh-meister Sep 8, 2023
99a2350
Update pos_emb.rst
ssh-meister Sep 8, 2023
2e5af17
Update pos_emb.rst
ssh-meister Sep 8, 2023
5314d2d
Update pos_emb.rst
ssh-meister Sep 8, 2023
36de4c3
Update pos_emb.rst
ssh-meister Sep 8, 2023
28d5b0d
Update and rename docs/source/nlp/pos_emb.rst to docs/source/nlp/nemo…
ssh-meister Sep 8, 2023
85f4e63
Rename positional_embeddings.rst to positional_embeddings.rst
ssh-meister Sep 8, 2023
ce6113d
Create flash_attention.rst
ssh-meister Sep 8, 2023
2e03e6c
Changed value for model.seq_len_interpolation_factor to 2
ssh-meister Sep 11, 2023
b27d077
fixed flash_attention enabling for t5
ssh-meister Sep 11, 2023
252123d
[TTS] Added a callback for logging initial data (#7384)
anteju Sep 8, 2023
5341a73
Update Core Commit (#7402)
aklife97 Sep 9, 2023
68142d6
Use cfg attribute in bert (#7394)
maanug-nv Sep 9, 2023
2a60fb0
Add support for bias conversion in Swiglu models (#7386)
titu1994 Sep 9, 2023
8b5ef41
Update save_to and restore_from for dist checkpointing (#7343)
ericharper Sep 9, 2023
b24fdb9
fix forward for with mcore=false (#7403)
JimmyZhang12 Sep 9, 2023
c85dcba
Fix logging to remove 's/it' from progress bar in Megatron models and…
athitten Sep 9, 2023
ae5d3ce
Set Activation Checkpointing Defaults (#7404)
aklife97 Sep 9, 2023
153c53b
make loss mask default to false (#7407)
ericharper Sep 9, 2023
b61b9ab
Add dummy userbuffer config files (#7408)
erhoo82 Sep 9, 2023
3f36e11
add missing ubconf files (#7412)
aklife97 Sep 11, 2023
53224a2
New tutorial on Speech Data Explorer (#7405)
Jorjeous Sep 11, 2023
a8b85b7
Update ptl training ckpt conversion script to work with dist ckpt (#7…
ericharper Sep 12, 2023
a81bdbe
Allow disabling sanity checking when num_sanity_val_steps=0 (#7413)
athitten Sep 12, 2023
b79ec2a
Add comprehensive error messages (#7261)
PeganovAnton Sep 12, 2023
0d85ab8
check NEMO_PATH (#7418)
karpnv Sep 12, 2023
f7fc3bd
layer selection for ia3 (#7417)
arendu Sep 13, 2023
3c90b5e
Fix missing pip package 'einops' (#7397)
RobinDong Sep 14, 2023
12e77d4
Fix failure of pyaudio in Google Colab (#7396)
RobinDong Sep 15, 2023
aa3f36d
Update README.md: output_path --> output_manifest_filepath (#7442)
popcornell Sep 18, 2023
0a8eaf2
Add rope dynamic linear scaling (#7437)
hsiehjackson Sep 18, 2023
5bb2940
Fix None dataloader issue in PTL2.0 (#7455)
KunalDhawan Sep 19, 2023
bdcde46
[ASR] Confidence measure -> method renames (#7434)
GNroy Sep 19, 2023
3b45cbe
Add steps for document of getting dataset 'SF Bilingual Speech' (#7378)
RobinDong Sep 19, 2023
b9ddba7
RNN-T confidence and alignment bugfix (#7381)
GNroy Sep 19, 2023
a549204
Fix resume from checkpoint in exp_manager (#7424) (#7426)
github-actions[bot] Sep 19, 2023
0ff844f
Fix checking of cuda/cpu device for inputs of Decoder (#7444)
RobinDong Sep 19, 2023
b01b01f
Fix failure of ljspeech's get_data.py (#7430)
RobinDong Sep 19, 2023
422d464
[TTS] Fix audio codec type checks (#7373)
rlangman Sep 19, 2023
b4bf4ee
[TTS] Add dataset to path of logged artifacts (#7462)
rlangman Sep 20, 2023
7b444dc
Fix sft dataset truncation (#7464)
hsiehjackson Sep 20, 2023
f550583
Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330)
burchim Sep 20, 2023
fd17915
HF StarCoder to NeMo conversion script (#7421)
janekl Sep 20, 2023
0076724
fix bug when loading dist ckpt in peft (#7452)
lhb8125 Sep 21, 2023
45babd2
Fix adding positional embeddings in-place in transformer module (#7440)
The0nix Sep 21, 2023
84584c0
Fix (#7478)
hsiehjackson Sep 22, 2023
37cc67f
add sleep (#7498) (#7499)
github-actions[bot] Sep 24, 2023
2b76509
Fix exp manager check for sleep (#7503) (#7504)
github-actions[bot] Sep 25, 2023
2585e0a
bugfix: trainer.accelerator=auto from None. (#7492) (#7493)
github-actions[bot] Sep 25, 2023
9c83365
[doc] fix broken link (#7481)
stas00 Sep 25, 2023
92f0eec
[TTS] Read audio as int32 to avoid flac read errors (#7477)
rlangman Sep 26, 2023
c5d917b
Add dataset 'AISHELL-3' from OpenSLR for training mandarin TTS (#7409)
RobinDong Sep 26, 2023
e12d30a
dllogger - log on rank 0 only (#7513)
stas00 Sep 26, 2023
e60cd38
Fix TTS FastPitch tutorial (#7494) (#7516)
github-actions[bot] Sep 26, 2023
1c3c43c
Fix get_dist() tensor dimension (#7506) (#7515)
github-actions[bot] Sep 26, 2023
c2d70e6
bugfix: specify trainer.strategy=auto when devices=1 (#7509) (#7512)
github-actions[bot] Sep 26, 2023
9317d15
fix (#7511)
aklife97 Sep 26, 2023
20b15f3
[TTS] Fix FastPitch data prep tutorial (#7524)
rlangman Sep 27, 2023
8b15984
add italian tokenization (#7486)
GiacomoLeoneMaria Sep 27, 2023
e81dda2
Replace None strategy with auto in tutorial notebooks (#7521) (#7527)
github-actions[bot] Sep 27, 2023
1be2b40
unpin setuptools (#7534) (#7535)
github-actions[bot] Sep 27, 2023
32c06fa
remove auto generated examples (#7510)
arendu Sep 27, 2023
3a1818f
Add the `strategy` argument to `MegatronGPTModel.generate()` (#7264)
odelalleau Sep 27, 2023
fd59a84
Fix PTL2.0 related ASR bugs in r1.21.0: Val metrics logging, None dat…
github-actions[bot] Sep 27, 2023
57116f4
gpus -> devices (#7542) (#7545)
github-actions[bot] Sep 28, 2023
40c8f08
Update FFMPEG version to fix issue with torchaudio (#7551) (#7553)
github-actions[bot] Sep 28, 2023
9bc0238
PEFT GPT & T5 Refactor (#7308)
meatybobby Sep 28, 2023
1da24ef
fix a typo (#7496)
BestJuly Sep 28, 2023
c3d3ffc
[TTS] remove curly braces from ${BRANCH} in jupyer notebook cell. (#7…
github-actions[bot] Sep 28, 2023
6849c94
add youtube embed url (#7570)
XuesongYang Sep 29, 2023
ec4f8c3
Remap speakers to continuous range of speaker_id for dataset AISHELL3…
RobinDong Sep 29, 2023
4858db4
fix validation_step_outputs initialization for multi-dataloader (#754…
github-actions[bot] Sep 29, 2023
122eced
Append output of val step to self.validation_step_outputs (#7530) (#7…
github-actions[bot] Sep 29, 2023
ec9e251
[TTS] fixed trainer's accelerator and strategy. (#7569) (#7574)
github-actions[bot] Sep 29, 2023
e99d530
Append val/test output to instance variable in EncDecSpeakerLabelMode…
github-actions[bot] Sep 29, 2023
e7b3b71
Fix CustomProgressBar for resume (#7427) (#7522)
github-actions[bot] Sep 30, 2023
abbef3b
fix typos in nfa and speech enhancement tutorials (#7580) (#7583)
github-actions[bot] Sep 30, 2023
468e8b0
Add strategy as ddp_find_unused_parameters_true for glue_benchmark.py…
github-actions[bot] Sep 30, 2023
8406b69
update strategy (#7577) (#7578)
github-actions[bot] Sep 30, 2023
02dac3b
Fix typos (#7581)
Kipok Oct 2, 2023
e434de9
Change hifigan finetune strategy to ddp_find_unused_parameters_true (…
github-actions[bot] Oct 2, 2023
6a2a145
[BugFix] Add missing quotes for auto strategy in tutorial notebooks (…
github-actions[bot] Oct 2, 2023
6db0b2b
add build os key (#7596) (#7599)
github-actions[bot] Oct 2, 2023
8f2a31c
StarCoder SFT test + bump PyT NGC image to 23.09 (#7540)
janekl Oct 2, 2023
8f306a2
defaults changed (#7600)
arendu Oct 3, 2023
2f6fa29
add ItalianPhonemesTokenizer (#7587)
GiacomoLeoneMaria Oct 3, 2023
71f327f
best ckpt fix (#7564) (#7588)
github-actions[bot] Oct 3, 2023
6517360
Add files via upload (#7598)
Jorjeous Oct 3, 2023
e52c99b
Fix validation in G2PModel and ThutmoseTaggerModel (#7597) (#7606)
github-actions[bot] Oct 3, 2023
cbb499c
Broadcast loss only when using pipeline parallelism and within the pi…
github-actions[bot] Oct 3, 2023
2da5c02
Safeguard nemo_text_processing installation on ARM (#7485)
blisc Oct 3, 2023
5f01aab
Bound transformers version in requirements (#7620)
athitten Oct 4, 2023
36ba71f
fix llama2 70b lora tuning bug (#7622)
cuichenx Oct 4, 2023
f8980ba
Fix import error no module name model_utils (#7629)
menon92 Oct 4, 2023
c8aa8ac
add fc large ls models (#7641)
nithinraok Oct 4, 2023
ad3a4de
bugfix: trainer.gpus, trainer.strategy, trainer.accelerator (#7621) (…
github-actions[bot] Oct 5, 2023
f83edf6
fix ssl models ptl monitor val through logging (#7608) (#7614)
github-actions[bot] Oct 5, 2023
99a914b
Fix metrics for SE tutorial (#7604) (#7612)
github-actions[bot] Oct 5, 2023
f1f3835
Add ddp_find_unused_parameters=True and change accelerator to auto (#…
github-actions[bot] Oct 5, 2023
1f280a9
Fix py3.11 dataclasses issue (#7616)
github-actions[bot] Oct 5, 2023
2ab6aa3
Fix issues with Dockerfile (#7650) (#7652)
github-actions[bot] Oct 6, 2023
290e228
[ASR] RNN-T greedy decoding max_frames fix for alignment and confiden…
GNroy Oct 6, 2023
25e86ab
[ASR] Fix type error in jasper (#7636) (#7653)
github-actions[bot] Oct 6, 2023
d4b6a75
[TTS] Add STFT and SI-SDR loss to audio codec recipe (#7468)
rlangman Oct 6, 2023
03b2846
Create per.py (#7538)
ssh-meister Oct 7, 2023
6c777ad
conversion issue fix (#7648) (#7668)
github-actions[bot] Oct 10, 2023
b85e9b4
layernorm1p fix (#7523) (#7567)
github-actions[bot] Oct 10, 2023
84e18eb
generalized chat sft prompt (#7655)
yidong72 Oct 10, 2023
182b1aa
Added References
ssh-meister Oct 10, 2023
2e17555
added to toctree
ssh-meister Oct 10, 2023
085a5ce
Merge branch 'main' into llm_docs_upd
ssh-meister Oct 10, 2023
453b6e0
Merge branch 'main' into llm_docs_upd
ssh-meister Oct 10, 2023
da2723f
Merge branch 'main' into llm_docs_upd
ssh-meister Oct 16, 2023
057767d
Merge branch 'main' into llm_docs_upd
ssh-meister Oct 17, 2023
28 changes: 28 additions & 0 deletions docs/source/nlp/nemo_megatron/flash_attention.rst
@@ -0,0 +1,28 @@
Flash attention
---------------
Flash Attention :cite:`nlp-megatron-dao2022flashattention` is a method designed to enhance the efficiency of Transformer models, which are widely used in applications such as natural language processing. Traditional Transformers are slow and consume a lot of memory, especially with long sequences, due to the quadratic time and memory complexity of self-attention. FlashAttention is an IO-aware exact attention algorithm that uses tiling to minimize the number of memory reads/writes between the GPU's high-bandwidth memory (HBM) and on-chip SRAM, making it more efficient in terms of IO complexity than standard attention implementations.
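
The snippet below is a minimal, self-contained PyTorch sketch (not the NeMo code path) of the point above: flash attention is *exact*, so swapping the kernel changes memory traffic and speed but not the result. It assumes a CUDA device, half-precision inputs, and PyTorch 2.x, where :code:`torch.nn.functional.scaled_dot_product_attention` can dispatch to a flash kernel.

.. code:: python

import torch
import torch.nn.functional as F

# Long sequence: a naive implementation materializes a (seq_len x seq_len)
# attention matrix per head in HBM; the flash kernel never does.
batch, heads, seq_len, head_dim = 2, 16, 4096, 64
q, k, v = (torch.randn(batch, heads, seq_len, head_dim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

# Standard ("math") attention.
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True,
                                    enable_mem_efficient=False):
    out_math = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Flash attention: tiled, IO-aware, exact.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                    enable_mem_efficient=False):
    out_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# The two outputs agree up to fp16 round-off.
print((out_math - out_flash).abs().max())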

GPT
^^^
To enable Flash Attention during Megatron GPT model training or fine-tuning, set the following configuration option:

.. code::

model.use_flash_attention=True

T5
^^
To enable Flash Attention during Megatron T5 model training, set the following configuration options:

.. code::

model.encoder.use_flash_attention=True
model.decoder.use_flash_attention=True

References
----------

.. bibliography:: ../nlp_all.bib
:style: plain
:labelprefix: nlp-megatron
:keyprefix: nlp-megatron-
2 changes: 2 additions & 0 deletions docs/source/nlp/nemo_megatron/intro.rst
@@ -26,6 +26,8 @@ team at NVIDIA. NeMo Megatron supports several types of models:
retro/retro_model
hiddens/hiddens_module
peft/landing_page
flash_attention
positional_embeddings


References
111 changes: 111 additions & 0 deletions docs/source/nlp/nemo_megatron/positional_embeddings.rst
@@ -0,0 +1,111 @@
Positional embeddings
---------------------

Positional embeddings are used to give the model information about the position of each element in a sequence. Megatron LLM supports the following positional embedding types:

GPT
^^^

.. list-table:: *Supported positional embeddings in GPT models*
:widths: 10 30 60
:header-rows: 1

* - Parameter value
- How to use
- Description

* - **learned_absolute**
- .. code::

model.position_embedding_type='learned_absolute'
- Absolute Position Encodings :cite:`nlp-megatron-vaswani2023attention` are position embeddings used in Transformer-based models, added to input embeddings in the encoder and decoder sections. These encodings match the dimension of embeddings and are created using sine and cosine functions of various frequencies. Each dimension in the encoding corresponds to a sinusoid with wavelengths forming a geometric progression.

* - **rope**
- .. code::

model.position_embedding_type='rope'
model.rotary_percentage=1.0
- Rotary Position Embedding (RoPE) :cite:`nlp-megatron-su2022roformer` encodes the absolute position of each token with a rotation matrix while preserving relative positional relationships in the self-attention formulation. It leverages the geometric properties of vectors and complex numbers, applying to the word embeddings a rotation whose angle depends on a preset non-zero constant and on the token positions. A minimal sketch of the rotation follows this table.

* - **alibi**
- .. code::

model.position_embedding_type='alibi'
- Attention with Linear Biases (ALiBi) :cite:`nlp-megatron-press2022train` modifies the way attention scores are computed in the attention sublayer of the network. ALiBi introduces a static, non-learned bias after the query-key dot product during the computation of attention scores. This bias is added in the form of a head-specific slope that is determined before training, creating a geometric sequence of slopes for the different heads in the model. The method has an inductive bias towards recency, penalizing attention scores between distant query-key pairs with the penalty increasing as the distance grows, and it leverages different rates of penalty increase across different heads based on the slope magnitude.

* - **kerple**
- .. code::

model.position_embedding_type='kerple'
- Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) :cite:`nlp-megatron-chi2022kerple` generalizes relative positional embeddings (RPE) by kernelizing positional differences using conditionally positive definite (CPD) kernels known for generalizing distance metrics. They transform CPD kernels into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach allows for a variety of RPEs that facilitate length extrapolation in a principled manner.

* - **xpos**
- .. code::

model.position_embedding_type='xpos'
- Extrapolatable Position Embedding (xPos) :cite:`nlp-megatron-sun2022lengthextrapolatable` extends rotary embeddings with an exponential decay applied to the rotated queries and keys, so that the contribution of distant tokens to the attention score is attenuated, which improves stability and length extrapolation.

* - **sandwich**
- .. code::

model.position_embedding_type='sandwich'
- Sandwich :cite:`nlp-megatron-chi2023dissecting` is a parameter-free relative positional embedding that reuses the cross term of sinusoidal position embeddings as a bias on the attention logits; the bias shrinks as the query-key distance grows, which helps length extrapolation.

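The following is a minimal, illustrative sketch of the rotation that RoPE applies to queries and keys (simplified, and not the NeMo implementation; the helper :code:`rope_rotate` is invented for the example). It demonstrates the defining property: the attention score between a rotated query and a rotated key depends only on their relative offset, not on their absolute positions.

.. code:: python

import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) dimension pair of x (seq_len, head_dim) by a position-dependent angle."""
    head_dim = x.shape[-1]
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq_len, head_dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

torch.manual_seed(0)
q, k = torch.randn(2, 8), torch.randn(2, 8)   # two tokens, head_dim = 8
pos_a = torch.tensor([3, 7])                  # query at position 3, key at position 7
pos_b = torch.tensor([103, 107])              # same offset (4), shifted by 100

score_a = rope_rotate(q, pos_a)[0] @ rope_rotate(k, pos_a)[1]
score_b = rope_rotate(q, pos_b)[0] @ rope_rotate(k, pos_b)[1]
print(torch.allclose(score_a, score_b, atol=1e-4))  # True: the score depends only on the offset
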
T5
^^

.. list-table:: *Supported positional embeddings in T5 models*
:widths: 10 30 60
:header-rows: 1

* - Parameter value
- How to use
- Description

* - **learned_absolute**
- .. code::

model.encoder.position_embedding_type='learned_absolute'
model.decoder.position_embedding_type='learned_absolute'
- Absolute Position Encodings :cite:`nlp-megatron-vaswani2023attention` are position embeddings used in Transformer-based models, added to input embeddings in the encoder and decoder sections. These encodings match the dimension of embeddings and are created using sine and cosine functions of various frequencies. Each dimension in the encoding corresponds to a sinusoid with wavelengths forming a geometric progression.

* - **relative**
- .. code::

model.encoder.position_embedding_type='relative'
model.decoder.position_embedding_type='relative'
- Relative Position Representations :cite:`nlp-megatron-shaw2018selfattention` extend self-attention to take the relative distances between sequence elements into account, adding learned embeddings of the (clipped) relative positions instead of relying on absolute position encodings.

* - **alibi**
- .. code::

model.encoder.position_embedding_type='alibi'
model.decoder.position_embedding_type='alibi'
- Attention with Linear Biases (ALiBi) :cite:`nlp-megatron-press2022train` modifies the way attention scores are computed in the attention sublayer of the network. ALiBi introduces a static, non-learned bias after the query-key dot product during the computation of attention scores. This bias is added in the form of a head-specific slope that is determined before training, creating a geometric sequence of slopes for the different heads in the model. The method has an inductive bias towards recency, penalizing attention scores between distant query-key pairs with the penalty increasing as the distance grows, and it leverages different rates of penalty increase across different heads based on the slope magnitude. A minimal sketch of the bias construction follows this table.

* - **kerple**
- .. code::

model.encoder.position_embedding_type='kerple'
model.decoder.position_embedding_type='kerple'
- Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) :cite:`nlp-megatron-chi2022kerple` generalizes relative positional embeddings (RPE) by kernelizing positional differences using conditionally positive definite (CPD) kernels known for generalizing distance metrics. They transform CPD kernels into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach allows for a variety of RPEs that facilitate length extrapolation in a principled manner.

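The ALiBi bias referenced above can be sketched in a few lines (illustrative only; not the NeMo implementation). Each head receives a fixed slope from a geometric sequence, and the bias added to the attention logits grows linearly with the query-key distance.

.. code:: python

import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    """Head-specific slopes forming a geometric sequence (assumes num_heads is a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return start ** torch.arange(1, num_heads + 1).float()

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Static bias added to the attention logits before softmax."""
    positions = torch.arange(seq_len)
    relative = positions[None, :] - positions[:, None]   # entry (i, j) is j - i
    # Keys in the past (j < i) get an increasingly negative bias; future keys
    # (j > i) are removed by the causal mask in a decoder anyway.
    return alibi_slopes(num_heads)[:, None, None] * relative  # (num_heads, seq_len, seq_len)

print(alibi_slopes(8))      # 0.5, 0.25, ..., 1/256
print(alibi_bias(8, 5)[0])  # head 0: penalty of 0.5 per token of distance
# Usage: scores = q @ k.transpose(-1, -2) / head_dim ** 0.5 + alibi_bias(heads, seq_len)
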
Positional interpolation
------------------------
Position Interpolation (PI) :cite:`nlp-megatron-chen2023extending` is a method introduced to extend the context window sizes of Rotary Position Embedding (RoPE)-based pretrained large language models (LLMs). The central principle of PI is to reduce the position indices so that they align with the initial context window size through interpolation.

Position Interpolation is supported in Megatron GPT SFT models. To enable it, set the RoPE sequence-length interpolation factor, :code:`model.seq_len_interpolation_factor`, as in the example below; a short sketch of the index scaling follows.

.. code::

model.position_embedding_type='rope'
model.rotary_percentage=1.0
model.seq_len_interpolation_factor=2

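The index scaling behind :code:`seq_len_interpolation_factor` can be illustrated with a short sketch (simplified; not the NeMo implementation): the position indices fed into RoPE are divided by the interpolation factor, so an extended context maps back onto the position range seen during pre-training.

.. code:: python

import torch

pretrained_max_positions = 4096           # illustrative pre-training context size
seq_len_interpolation_factor = 2          # matches the configuration example above
extended_seq_len = pretrained_max_positions * seq_len_interpolation_factor

positions = torch.arange(extended_seq_len, dtype=torch.float32)
interpolated = positions / seq_len_interpolation_factor

print(positions.max().item())     # 8191.0 -- outside the pre-training position range
print(interpolated.max().item())  # 4095.5 -- back inside [0, 4096), at half-integer steps
# The interpolated indices are then multiplied by the RoPE inverse frequencies as usual.
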
References
----------

.. bibliography:: ../nlp_all.bib
:style: plain
:labelprefix: nlp-megatron
:keyprefix: nlp-megatron-
81 changes: 81 additions & 0 deletions docs/source/nlp/nlp_all.bib
@@ -225,3 +225,84 @@ @misc{antonova2023spellmapper
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{dao2022flashattention,
title={FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness},
author={Tri Dao and Daniel Y. Fu and Stefano Ermon and Atri Rudra and Christopher Ré},
year={2022},
eprint={2205.14135},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

@misc{vaswani2023attention,
title={Attention Is All You Need},
author={Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
year={2023},
eprint={1706.03762},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{su2022roformer,
title={RoFormer: Enhanced Transformer with Rotary Position Embedding},
author={Jianlin Su and Yu Lu and Shengfeng Pan and Ahmed Murtadha and Bo Wen and Yunfeng Liu},
year={2022},
eprint={2104.09864},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{press2022train,
title={Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation},
author={Ofir Press and Noah A. Smith and Mike Lewis},
year={2022},
eprint={2108.12409},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{chi2022kerple,
title={KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation},
author={Ta-Chung Chi and Ting-Han Fan and Peter J. Ramadge and Alexander I. Rudnicky},
year={2022},
eprint={2205.09921},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{sun2022lengthextrapolatable,
title={A Length-Extrapolatable Transformer},
author={Yutao Sun and Li Dong and Barun Patra and Shuming Ma and Shaohan Huang and Alon Benhaim and Vishrav Chaudhary and Xia Song and Furu Wei},
year={2022},
eprint={2212.10554},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{chi2023dissecting,
title={Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis},
author={Ta-Chung Chi and Ting-Han Fan and Alexander I. Rudnicky and Peter J. Ramadge},
year={2023},
eprint={2212.10356},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{shaw2018selfattention,
title={Self-Attention with Relative Position Representations},
author={Peter Shaw and Jakob Uszkoreit and Ashish Vaswani},
year={2018},
eprint={1803.02155},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{chen2023extending,
title={Extending Context Window of Large Language Models via Positional Interpolation},
author={Shouyuan Chen and Sherman Wong and Liangjian Chen and Yuandong Tian},
year={2023},
eprint={2306.15595},
archivePrefix={arXiv},
primaryClass={cs.CL}
}