Simplify and fix match_fw_and_bw_saved_for_bw_proxies implementation #1754

Draft
wants to merge 3 commits into
base: ivan-1732-0

IvanYashchuk
Collaborator

@IvanYashchuk IvanYashchuk commented Feb 7, 2025

Base PR: #1756. This PR stays in draft mode to prevent it from being merged into the previous PR in the stack.

This change is needed to unblock @jjsjann123 for #1732.

The previous implementation uses both tensors and non-tensors saved for backward when constructing old_saved_for_backward_fw, but the mirrored object constructed from the backward trace ignores non-tensors. This mismatch leads to the error that @jjsjann123 observed while working on #1732.
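
As a rough illustration of the mismatch described above, here is a minimal sketch in plain Python. It is not Thunder's implementation: `ToyProxy`, `match_saved`, and all proxy names are hypothetical stand-ins, and the exact error raised in the real code may differ. The sketch only shows that pairing a forward list containing tensors and non-tensors with a backward list filtered to tensors cannot line up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyProxy:
    """Hypothetical stand-in for a trace proxy; not a Thunder class."""
    name: str
    is_tensor: bool


def match_saved(fw_saved, bw_saved):
    """Pair forward and backward saved-for-backward entries by position."""
    if len(fw_saved) != len(bw_saved):
        raise ValueError(
            f"forward saves {len(fw_saved)} entries, backward expects {len(bw_saved)}"
        )
    return {bw.name: fw.name for fw, bw in zip(fw_saved, bw_saved)}


# Forward side: tensors and non-tensors are both collected.
fw_saved = [ToyProxy("t0", True), ToyProxy("i1", False), ToyProxy("t2", True)]

# Backward side, collected consistently (tensors and non-tensors).
bw_saved_all = [ToyProxy("C0", True), ToyProxy("C1", False), ToyProxy("C2", True)]

# Backward side with non-tensors dropped, mirroring the inconsistency.
bw_saved_tensors_only = [p for p in bw_saved_all if p.is_tensor]

# Consistent collection pairs every entry.
mapping = match_saved(fw_saved, bw_saved_all)

# Inconsistent collection cannot be paired.
try:
    match_saved(fw_saved, bw_saved_tensors_only)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The point of the sketch is that both sides must apply the same selection rule (either both include non-tensors or both exclude them) for a positional match to be valid.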

cc @mruberry @lantiga @ali-alshaar7

@IvanYashchuk
Collaborator Author

Failures to fix:

FAILED thunder/tests/test_inplace_copy.py::test_prim_inplace_copy_bwd_nvfuser_cuda_thunder.dtypes.bfloat16 - AssertionError
FAILED thunder/tests/test_inplace_copy.py::test_prim_inplace_copy_bwd_nvfuser_cuda_thunder.dtypes.float16 - AssertionError
FAILED thunder/tests/test_torch_compile_executor.py::test_litgpt_fabric_for_callable - AssertionError
FAILED thunder/tests/test_torch_compile_executor.py::test_torch_compile_cat_rope_single_fusion - AssertionError
FAILED thunder/tests/test_transforms.py::test_disable_params_and_buffer_check - AssertionError
FAILED thunder/tests/test_jit_general.py::test_tom_overrides_proxy[cuda] - AssertionError
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-llama1-like] - AssertionError
FAILED thunder/tests/test_sdpaex_executor.py::test_sdpa_attn_mask[True-cuda-bf16] - AssertionError
FAILED thunder/tests/test_sdpaex_executor.py::test_sdpa_attn_mask[True-cuda-f16] - AssertionError
FAILED thunder/tests/test_sdpaex_executor.py::test_sdpa_attn_mask[False-cuda-f16] - AssertionError
FAILED thunder/tests/test_dynamo.py::test_ThunderCompilerGraphBenchmarking_LlamaMLPBenchmark - torch._dynamo.exc.BackendCompilerFailed: backend='?' raised:
AssertionError: 

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
FAILED thunder/tests/test_sdpaex_executor.py::test_sdpa_attn_mask[False-cuda-bf16] - AssertionError
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-gpt-neox-like] - AssertionError
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-long-context-like] - AssertionError
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-llama2-like] - AssertionError
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-codellama2-like] - AssertionError
= 16 failed, 2162 passed, 228 skipped, 22 xfailed, 7 xpassed, 51635 warnings in 426.08s (0:07:06) =

@IvanYashchuk IvanYashchuk changed the base branch from main to ivan-1732-0 February 7, 2025 13:28
@IvanYashchuk IvanYashchuk marked this pull request as draft February 7, 2025 13:28
@IvanYashchuk
Collaborator Author

One more test to fix:

FAILED thunder/tests/distributed/test_fsdp.py::FSDPTest::test_rematerialize_all_gather - RuntimeError: Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_distributed.py", line 726, in run_test
    getattr(self, test_name)()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_distributed.py", line 599, in wrapper
    fn()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 3120, in wrapper
    method(*args, **kwargs)
  File "/__w/3/s/thunder/tests/distributed/test_fsdp.py", line 136, in test_rematerialize_all_gather
    self.assertTrue(all(t in result_saved_for_bwd for t in sharded_param_names))
  File "/usr/lib/python3.10/unittest/case.py", line 687, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true
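
The failing check above only reports `False is not true`, which hides which sharded parameters are absent from the saved-for-backward names. A minimal sketch of a more informative check follows. The helper `missing_names` and the example names are hypothetical and are not taken from the test file.

```python
def missing_names(expected, actual):
    """Return the expected names that do not appear in actual, in order."""
    actual_set = set(actual)
    return [name for name in expected if name not in actual_set]


# Hypothetical values standing in for the variables used in the test.
sharded_param_names = ["t_fc1_weight", "t_fc2_weight"]
result_saved_for_bwd = ["t_fc1_weight", "t5"]

missing = missing_names(sharded_param_names, result_saved_for_bwd)
print(f"sharded params not saved for backward: {missing}")
```

Asserting that `missing` is empty would make the failure message list the absent names directly.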
