Refactor freqs_cis slice to be safer for PP #321
Conversation
Unchanged: we precompute freqs_cis for max_seqlen, >> seqlen for a given batch.

Changed: instead of slicing self.freqs_cis down to seqlen at the top-level transformer based on the input token shape, we slice it down to seqlen inside a transformer layer, after it has been re-expanded to the full seqlen in cases where TP has sharded across seqlen.

In the PP case, stage 1's input may be seqlen/TP instead of seqlen, but we do not generally know this, which makes it hard for stage 1 to slice freqs_cis correctly. It is easy to do the slicing deeper inside, since at that point we know the full seqlen unambiguously.

Note: the full self.freqs_cis is stored in memory either way, and the thing passed into every layer is just a view, so this change should not be material for memory usage or otherwise.

ghstack-source-id: 20ef05e0734e53260366878dfe0fac5e1ab48f1d
Pull Request resolved: #321
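To make the change concrete, here is a minimal sketch of the idea, not the actual torchtitan code: `precompute_freqs_cis` below is a generic Llama-style RoPE table builder, and the sizes are made-up illustrations. The point is that the table is built once for max_seqlen, and the per-batch slice is taken where the full seqlen is unambiguous, as a view rather than a copy.

```python
import torch

def precompute_freqs_cis(dim: int, max_seqlen: int, theta: float = 10000.0) -> torch.Tensor:
    # Generic Llama-style RoPE table, built once for the *maximum* sequence length.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seqlen).float()
    return torch.polar(torch.ones(max_seqlen, dim // 2), torch.outer(t, freqs))

freqs_cis = precompute_freqs_cis(dim=64, max_seqlen=4096)

# Before: the top-level model sliced using the input token shape. On the first
# PP stage that shape can already be seqlen/TP (TP sharded across seqlen), so
# it is the wrong length to slice by.
# After: each transformer layer slices from its own activation shape, where the
# sequence has been re-expanded to the full seqlen.
seqlen = 2048                      # full seqlen as seen inside the layer
layer_freqs = freqs_cis[:seqlen]   # a view of the precomputed table, not a copy
assert layer_freqs.data_ptr() == freqs_cis.data_ptr()  # shares the same storage
```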
makes sense - lgtm!
lgtm!
@@ -76,7 +79,9 @@ def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """
    ndim = x.ndim
    assert 0 <= 1 < ndim
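For context on the hunk above: this helper is the standard Llama-style `reshape_for_broadcast`. The sketch below is a rough reconstruction of what the surrounding function typically looks like, not this repo's exact code, and it does not reproduce the lines the PR adds; the shapes in the usage example are illustrative assumptions.

```python
import torch

def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Reshape a (seqlen, head_dim // 2) freqs_cis so it broadcasts against
    # x of shape (bsz, seqlen, n_heads, head_dim // 2).
    ndim = x.ndim
    assert 0 <= 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)

# Illustrative shapes: bsz=2, seqlen=128, n_heads=8, head_dim // 2 = 32.
x = torch.randn(2, 128, 8, 32)
freqs_cis = torch.polar(torch.ones(128, 32), torch.zeros(128, 32))
print(reshape_for_broadcast(freqs_cis, x).shape)  # torch.Size([1, 128, 1, 32])
```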
not from this PR: I wonder what the point of the `0 <= 1` part is 😃.
lol. it's always good to check your assumptions
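An aside on the question above: Python chains comparisons, so `0 <= 1 < ndim` parses as `(0 <= 1) and (1 < ndim)`. The `0 <= 1` half is always true, which is why the assert is effectively just a check that ndim > 1. A tiny demonstration:

```python
ndim = 4
# Chained comparison: `0 <= 1` is vacuously true, so the whole
# expression reduces to `1 < ndim`, i.e. ndim > 1.
assert (0 <= 1 < ndim) == ((0 <= 1) and (1 < ndim)) == (ndim > 1)
```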