[functorch] test: try using reference_inputs in vmap tests #91355
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91355
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 Failures as of commit 21da1aa:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base c99a2a4:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
test/functorch/test_vmap.py (Outdated)

    sample_inputs_itr = op.sample_inputs(device, dtype, requires_grad=False)
    sample_inputs_op = {
        # Take too long
        "special.chebyshev_polynomial_t",
These ops already have a skip for taking a long time with reference inputs. E.g.:

pytorch/torch/testing/_internal/opinfo/definitions/special.py, lines 374 to 388 in 39d49db:
    BinaryUfuncInfo(
        "special.chebyshev_polynomial_t",
        dtypes=all_types_and(torch.bool),
        promotes_int_to_float=True,
        skips=(
            DecorateInfo(unittest.skip("Skipped!"), "TestCudaFuserOpInfo"),
            DecorateInfo(unittest.skip("Skipped!"), "TestNNCOpInfo"),
            DecorateInfo(
                unittest.skip("testing takes an unreasonably long time, #79528"),
                "TestCommon",
                "test_compare_cpu",
            ),
        ),
        supports_one_python_scalar=True,
        supports_autograd=False,
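For comparison, a hedged sketch of what an equivalent skip targeting the vmap tests might look like through the same DecorateInfo machinery (the test class and test name here are assumptions, not code from the PR):

    DecorateInfo(
        unittest.skip("takes an unreasonably long time with reference inputs"),
        "TestVmapOperatorsOpInfo",
        "test_vmap_exhaustive",
    ),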
Will take care of the ASAN failure after the review.
    xfail('__rsub__'),
    # RuntimeError: Batching rule not implemented for aten::moveaxis.int;
    # the fallback path doesn't work on out= or view ops.
    xfail('movedim'),
    # RuntimeError: NYI: querying is_contiguous inside of vmap for
    # memory_format other than torch.contiguous_format
    xfail('contiguous'),
    # RuntimeError: NYI: Tensor.clone(memory_format) inside vmap is only supported
    # with memory_format torch.preserve_format or torch.contiguous_format (got ChannelsLast)
    xfail('clone'),
    # RuntimeError: When vmap-ing torch.nn.functional.one_hot,
    # please provide an explicit positive num_classes argument.
    xfail('nn.functional.one_hot'),
Normally I'd feel bad about adding these xfails, but we do have manual tests for contiguous, clone, one_hot, and sub in the codebase; and movedim is tested just by virtue of being part of the vmap implementation.
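A minimal sketch of the one_hot limitation cited in the hunk above; the error text is quoted from the xfail comment, and passing num_classes explicitly is the workaround the message asks for:

    import torch
    from functorch import vmap

    x = torch.tensor([[0, 2, 1], [1, 0, 2]])

    # vmap(torch.nn.functional.one_hot)(x) raises:
    # RuntimeError: When vmap-ing torch.nn.functional.one_hot,
    # please provide an explicit positive num_classes argument.
    out = vmap(lambda row: torch.nn.functional.one_hot(row, num_classes=3))(x)
    print(out.shape)  # torch.Size([2, 3, 3])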
    # AssertionError
    # Mismatched elements: 18 / 20 (90.0%)
    # Greatest absolute difference: 14.031710147857666 at index (0, 5) (up to 0.0001 allowed)
    # Greatest relative difference: 2.9177700113052603 at index (0, 3) (up to 0.0001 allowed)
    xfail('narrow_copy', device_type='cpu'),
Can you file an issue for the silent correctness problem? Also, do you know which of the following is the actual problem?
- the non-contiguous test is failing?
- the batching rule is bogus?
- narrow_copy has inconsistent semantics on CPU/CUDA?
Sure, will file an issue.
- I don't think the non-contiguous sample is the issue, as we haven't added non-contiguous testing to the vmap tests.
- The batching rule for narrow_copy seems innocuous and doesn't have any special handling for CPU vs. CUDA, so the operator itself may have an issue.
Batching rule ref: pytorch/aten/src/ATen/functorch/BatchRulesViews.cpp, lines 506 to 515 in 3120054:
std::tuple<Tensor, optional<int64_t>> narrow_copy_batch_rule( | |
const Tensor &self, optional<int64_t> self_bdim, int64_t dim, c10::SymInt start, c10::SymInt length) | |
{ | |
TORCH_INTERNAL_ASSERT(self_bdim.has_value()); | |
auto self_ = moveBatchDimToFront(self, self_bdim); | |
auto logical_rank = rankWithoutBatchDim(self, self_bdim); | |
dim = maybe_wrap_dim(dim, logical_rank) + 1; | |
auto result = self_.narrow_copy_symint(dim, start, length); | |
return std::make_tuple(result, 0); |
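In Python terms, the rule above moves the batch dim to the front and shifts the logical dim by one; a quick equivalence check against a per-example loop (a sketch, not from the PR):

    import torch

    x = torch.randn(3, 4, 5)          # dim 0 plays the batch dim; logical shape (4, 5)
    batched = x.narrow_copy(1, 0, 2)  # logical dim 0 -> physical dim 0 + 1
    looped = torch.stack([xi.narrow_copy(0, 0, 2) for xi in x])
    assert torch.equal(batched, looped)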
If the operator is the problem: if we can come up with a repro that doesn't involve vmap and shows that, on the same input (on CPU/CUDA, with the same strides), it produces different outputs, that would be great. One idea for "getting rid of the vmap" is to use make_fx to trace out what's happening.
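A minimal sketch of that idea, assuming make_fx composes with vmap as it does for other functorch transforms (the repro function f is hypothetical):

    import torch
    from functorch import vmap
    from torch.fx.experimental.proxy_tensor import make_fx

    def f(t):
        return torch.narrow_copy(t, 0, 0, 2)

    x = torch.randn(4, 5)

    # Trace the batched computation; the resulting graph records the plain
    # aten ops that actually run under vmap.
    gm = make_fx(vmap(f))(x)
    print(gm.graph)

    # Replay the traced graph without vmap on both devices and compare.
    out_cpu = gm(x)
    if torch.cuda.is_available():
        out_cuda = gm(x.cuda())
        torch.testing.assert_close(out_cpu, out_cuda.cpu())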
Sure, thanks! I've assigned the issue to myself and will have a look soon.
More info here: #91690
LGTM. We should try to dig into whether some of the failures are important, and file issues for them if so.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: the following mandatory check(s) failed (Rule …). Dig deeper by viewing the failures on hud. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert
❌ 🤖 pytorchbot command failed:
Try …
@pytorchbot revert -m"Broke trunk" -c landrace
@pytorchbot successfully started a revert job. Check the current status here.
@kshitij12345 your PR has been successfully reverted.
…91355)" This reverts commit a51090d. Reverted #91355 on behalf of https://github.com/kshitij12345 due to Broke trunk
Force-pushed from eed545d to 4b2e3b5.
@pytorchbot merge -f"JIT failure looks unrelated"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Ref pytorch/functorch#1090

Timings:

test_vmap_exhaustive
- After PR
- Before PR

test_op_has_batch_rule
- After PR
- Before PR