fix NAN loss of rope long context training #399

inkcherry · 2024-06-05T07:24:20Z

Based on #392, we got NaN loss during long-context training using ds-sp (Ulysses) for a Llama-style model.

We found that this issue is caused by a precision problem: representing the RoPE position sequence in half precision loses accuracy at long context lengths. A similar modification has also been applied in Hugging Face transformers:
https://github.com/huggingface/transformers/blob/63fb253df0d976b95d9b4b9a7b0012e5f8a37896/src/transformers/models/llama/modeling_llama.py#L111
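
To illustrate the precision problem described above, here is a minimal standard-library sketch (not the code changed in this PR). It rounds integer position indices to fp16 and bf16 and counts how many stay distinct. The helper names `to_fp16`, `to_bf16`, and `distinct_positions` are hypothetical and exist only for this illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (fp16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round a non-negative finite x to the nearest bfloat16 value.

    bfloat16 is the upper 16 bits of a float32, so we round the float32
    bit pattern to nearest-even at bit 16 and clear the lower 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def distinct_positions(seq_len: int, cast) -> int:
    """Count how many of the positions 0..seq_len-1 remain distinct after cast."""
    return len({cast(float(p)) for p in range(seq_len)})


seq_len = 4096
print("fp16 distinct positions:", distinct_positions(seq_len, to_fp16))
print("bf16 distinct positions:", distinct_positions(seq_len, to_bf16))
print("full-precision distinct positions:", distinct_positions(seq_len, float))
```

fp16 represents every integer exactly only up to 2048 and bf16 only up to 256, so beyond those points neighbouring positions collapse onto the same value and would receive identical rotary angles. Keeping the position sequence and the frequency product in float32, and casting only the resulting cos/sin values to the training dtype, avoids this collapse.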

shrutiramesh1988 · 2024-06-13T03:09:26Z

Even with this fix, I'm still getting loss=nan when running Llama 2 pre-training on single or multiple nodes with BF16, ZeRO stage 1, --use-rotary-position-embeddings, and a sequence length of 4096. Could you kindly help?

fix rope precision for long context

8bda975

inkcherry requested review from tjruwase, awan-10, eltonzheng, duli2012, mrwyattii, arashb, xiaoxiawu-microsoft and GuanhuaWang as code owners June 5, 2024 07:24

inkcherry mentioned this pull request Jun 5, 2024

Sequence Parallel is incompatible with Rotary Positional Embedding #385

Open
