[PyTorch] Float8Tensor uses cached transpose if available #524

timmoon10 · 2023-11-19T10:51:09Z

This PR changes the transpose behavior of Float8Tensor:

If update_cache == True, it will compute the transpose and update the cache
If update_cache == False and the cache is empty, it will compute the transpose
If update_cache == False and the cache is populated, it will return the cached transpose
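
The three cases above can be sketched with a small stand-in class. This is a minimal illustration, not the Float8Tensor implementation: `CachedTransposeTensor`, `_transpose_cache`, and `num_transposes_computed` are hypothetical names, plain nested lists stand in for FP8 data, and the sketch assumes that the uncached `update_cache == False` case does not populate the cache, which the description does not state.

```python
from typing import List, Optional

Matrix = List[List[float]]


class CachedTransposeTensor:
    """Hypothetical stand-in for a 2D tensor with a transpose cache."""

    def __init__(self, data: Matrix) -> None:
        self.data = data
        self._transpose_cache: Optional[Matrix] = None
        self.num_transposes_computed = 0  # instrumentation for this example only

    def _compute_transpose(self) -> Matrix:
        # Stand-in for the real (expensive) FP8 transpose kernel.
        self.num_transposes_computed += 1
        return [list(row) for row in zip(*self.data)]

    def transpose(self, update_cache: bool = False) -> Matrix:
        if update_cache:
            # Case 1: always recompute and refresh the cache.
            self._transpose_cache = self._compute_transpose()
            return self._transpose_cache
        if self._transpose_cache is None:
            # Case 2: cache is empty, so compute the transpose.
            # Assumption: the result is not stored in the cache.
            return self._compute_transpose()
        # Case 3: cache is populated, so reuse it.
        return self._transpose_cache


w = CachedTransposeTensor([[1.0, 2.0], [3.0, 4.0]])
w.transpose()                   # cache empty: computed
w.transpose(update_cache=True)  # computed and cached
w.transpose()                   # served from the cache
```

In this sketch the caller never has to pass a microbatch flag: whoever modifies the data (for example an FP8-aware optimizer) would be responsible for refreshing the cache with `update_cache=True`, since a cached transpose goes stale once the underlying values change.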

This is somewhat of a kludge to support transpose caching with Megatron GPT (see NVIDIA/NeMo#7909). Its forward function doesn't keep track of gradient accumulation steps, so it doesn't pass is_first_microbatch to LayerNormLinear or Linear. E.g.:
https://github.com/NVIDIA/Megatron-LM/blob/9290c730d04b482be8fae92a4186fe4ff0c95270/megatron/core/transformer/attention.py#L271C31-L271C31
Compare to NeMo GPT, which contains TE-specific logic like is_first_microbatch:
https://github.com/NVIDIA/NeMo/blob/d81beac52423dbd04b48e4e04567b17df2428e3a/nemo/collections/nlp/modules/common/megatron/transformer.py#L1556

Discussion would be appreciated. This design ping-ponged a few times in #452, e.g. 00b9c31. This approach is convenient with an FP8-aware optimizer since the optimizer doesn't need any access to the TE modules, just the FP8 params. There are also some alternative approaches:

Add TE logic to Megatron, especially is_first_microbatch, to keep the current API
Add arguments to the TE module constructors to control transpose caching
Add attributes to the TE modules or FP8 tensors to control transpose caching

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2023-11-19T11:33:17Z

/te-ci pytorch

ptrendx · 2023-11-20T17:19:05Z

tests/pytorch/test_float8tensor.py

@@ -263,7 +263,7 @@ def test_transpose(
        dims: DimsType,
        transpose_dims: Tuple[int, int],
        fp8_dtype: tex.DType = tex.DType.kFloat8E4M3,
-        scale: float = 1,
+        scale: float = 0.5,


I thought there was a correctness issue that was hidden by scale=1, but I don't think it's actually an issue. In any case, a scale other than 1 stress-tests the transpose more thoroughly.
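
To illustrate why a non-unit scale makes a stronger test, here is a minimal sketch in plain Python. It does not use the Transformer Engine API: `quantize`, `dequantize`, and both round-trip helpers are hypothetical, and the "bug" is invented for illustration (a transpose path that forgets to undo the scale). With scale=1 the buggy and correct paths agree, so the test could not tell them apart.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def quantize(values: Matrix, scale: float) -> Matrix:
    # Stand-in for an FP8 cast: only the scaling is modeled here.
    return [[v * scale for v in row] for row in values]


def dequantize(q: Matrix, scale: float) -> Matrix:
    return [[v / scale for v in row] for row in q]


def correct_roundtrip(values: Matrix, scale: float) -> Matrix:
    return dequantize(transpose(quantize(values, scale)), scale)


def buggy_roundtrip(values: Matrix, scale: float) -> Matrix:
    # Deliberately wrong: forgets to undo the scale after transposing.
    return transpose(quantize(values, scale))


x = [[1.0, 2.0], [4.0, 8.0]]
expected = transpose(x)

# With scale=1 the bug is invisible; with scale=0.5 it is caught.
hidden = buggy_roundtrip(x, 1.0) == expected
caught = buggy_roundtrip(x, 0.5) != expected
still_correct = correct_roundtrip(x, 0.5) == expected
```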

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2023-11-20T23:56:22Z

/te-ci pytorch

timmoon10 · 2023-11-21T22:31:30Z

Test failures are ONNX-related. This is ready to go.

erhoo82 · 2023-12-08T17:09:53Z

@timmoon10
Can you re-open and close the PR? As I shared, I verified the functionality of this feature.

timmoon10 · 2023-12-08T17:55:42Z

The work in this PR was merged in #529.

Float8Tensor uses cached transpose if available

5215774

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 added the bug and enhancement labels Nov 19, 2023

timmoon10 requested review from sudhakarsingh27, ptrendx and ksivaman November 19, 2023 10:51

timmoon10 mentioned this pull request Nov 19, 2023

Add distopt support for FP8 params and BF16 optimizer state NVIDIA/NeMo#7909

Merged

8 tasks

ptrendx reviewed Nov 20, 2023


timmoon10 marked this pull request as draft November 20, 2023 19:21

Fix bug with non-2D transpose

22eccf6

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 marked this pull request as ready for review November 20, 2023 23:55

Merge branch 'main' into float8tensor-transpose-caching

0dcd8f4

timmoon10 requested a review from ptrendx November 21, 2023 22:30

Merge branch 'main' into float8tensor-transpose-caching

57c7e03

timmoon10 mentioned this pull request Nov 21, 2023

[PyTorch] Support pickling Float8Tensor #529

Merged

Merge branch 'main' into float8tensor-transpose-caching

3308da1

timmoon10 closed this in #529 Dec 7, 2023
