[functorch] test: try using reference_inputs in vmap tests #91355

Closed

kshitij12345
Collaborator

@kshitij12345 kshitij12345 commented Dec 23, 2022

Ref pytorch/functorch#1090

Timings:

test_vmap_exhaustive

After PR

== 1168 passed, 55 skipped, 2353 deselected, 153 xfailed in 195.07s (0:03:15) ==

Before PR

== 1134 passed, 55 skipped, 2316 deselected, 150 xfailed in 77.18s (0:01:17) ==

test_op_has_batch_rule

After PR

== 988 passed, 57 skipped, 2353 deselected, 331 xfailed in 144.70s (0:02:24) ==

Before PR

== 969 passed, 57 skipped, 2316 deselected, 313 xfailed in 65.86s (0:01:05) ==
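
Read side by side, the summaries above show that reference inputs add a few dozen passing cases while roughly doubling the wall-clock time (about 2.5x and 2.2x). A minimal sketch of that arithmetic, with the numbers copied from the pytest summary lines; the dictionary layout and the helper name `summarize` are only illustrative:

```python
# Pytest summary numbers copied from the PR description: (passed, seconds).
timings = {
    "test_vmap_exhaustive": {"before": (1134, 77.18), "after": (1168, 195.07)},
    "test_op_has_batch_rule": {"before": (969, 65.86), "after": (988, 144.70)},
}


def summarize(before, after):
    """Return (extra passing tests, slowdown factor rounded to 2 decimals)."""
    passed_before, secs_before = before
    passed_after, secs_after = after
    return passed_after - passed_before, round(secs_after / secs_before, 2)


for name, runs in timings.items():
    extra, slowdown = summarize(runs["before"], runs["after"])
    print(f"{name}: +{extra} passed, {slowdown}x slower")
```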

@pytorch-bot pytorch-bot bot added the "topic: not user facing" label (topic category) Dec 23, 2022

pytorch-bot bot commented Dec 23, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91355

❌ 2 Failures

As of commit 21da1aa:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base c99a2a4:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

sample_inputs_itr = op.sample_inputs(device, dtype, requires_grad=False)
sample_inputs_op = {
    # Take too long
    "special.chebyshev_polynomial_t",
Collaborator Author

These ops already carry skips in other tests because they take too long with reference inputs.

For example:

BinaryUfuncInfo(
    "special.chebyshev_polynomial_t",
    dtypes=all_types_and(torch.bool),
    promotes_int_to_float=True,
    skips=(
        DecorateInfo(unittest.skip("Skipped!"), "TestCudaFuserOpInfo"),
        DecorateInfo(unittest.skip("Skipped!"), "TestNNCOpInfo"),
        DecorateInfo(
            unittest.skip("testing takes an unreasonably long time, #79528"),
            "TestCommon",
            "test_compare_cpu",
        ),
    ),
    supports_one_python_scalar=True,
    supports_autograd=False,
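
The diff hunk under review suggests a simple selection: ops named in `sample_inputs_op` keep using `sample_inputs`, presumably while every other op moves to the larger set of reference inputs. A minimal sketch of that selection, assuming this reading is right; `FakeOpInfo`, `pick_inputs`, `SAMPLE_INPUTS_ONLY`, and the toy input lists are hypothetical stand-ins, not PyTorch's OpInfo API:

```python
from dataclasses import dataclass, field

# Ops assumed too slow with reference inputs (name taken from the diff hunk).
SAMPLE_INPUTS_ONLY = {"special.chebyshev_polynomial_t"}


@dataclass
class FakeOpInfo:
    """Hypothetical stand-in for an OpInfo entry; not the real class."""

    name: str
    samples: list = field(default_factory=lambda: ["s0", "s1"])
    # In this toy, the reference inputs are a superset of the sample inputs.
    references: list = field(
        default_factory=lambda: ["s0", "s1", "r0", "r1", "r2"]
    )

    def sample_inputs(self):
        return iter(self.samples)

    def reference_inputs(self):
        return iter(self.references)


def pick_inputs(op):
    """Use the small sample set for slow ops, reference inputs otherwise."""
    if op.name in SAMPLE_INPUTS_ONLY:
        return op.sample_inputs()
    return op.reference_inputs()


slow = FakeOpInfo("special.chebyshev_polynomial_t")
other = FakeOpInfo("add")
print(len(list(pick_inputs(slow))), len(list(pick_inputs(other))))
```

Keeping the exceptions in one named set makes it easy to see which ops are still tested with the smaller input set, and why.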

@kshitij12345 kshitij12345 marked this pull request as ready for review January 2, 2023 19:00
@kshitij12345 kshitij12345 requested a review from zou3519 as a code owner January 2, 2023 19:00
@kshitij12345
Collaborator Author

I will take care of the ASAN failure after the review.

Comment on lines +3501 to +3513
xfail('__rsub__'),
# RuntimeError: Batching rule not implemented for aten::moveaxis.int;
# the fallback path doesn't work on out= or view ops.
xfail('movedim'),
# RuntimeError: NYI: querying is_contiguous inside of vmap for
# memory_format other than torch.contiguous_format
xfail('contiguous'),
# RuntimeError: NYI: Tensor.clone(memory_format) inside vmap is only supported
# with memory_format torch.preserve_format or torch.contiguous_format (got ChannelsLast)
xfail('clone'),
# RuntimeError: When vmap-ing torch.nn.functional.one_hot,
# please provide an explicit positive num_classes argument.
xfail('nn.functional.one_hot'),
Contributor

Normally I'd feel bad about adding these xfails, but we do have manual tests for contiguous, clone, one_hot, and sub in the codebase, and movedim is tested just by virtue of being part of the vmap implementation.

Comment on lines +3728 to +3732
# AssertionError
# Mismatched elements: 18 / 20 (90.0%)
# Greatest absolute difference: 14.031710147857666 at index (0, 5) (up to 0.0001 allowed)
# Greatest relative difference: 2.9177700113052603 at index (0, 3) (up to 0.0001 allowed)
xfail('narrow_copy', device_type='cpu'),
Contributor

Can you file an issue for this silent correctness problem? Also, do you know which of the following is the actual problem?

  • The non-contiguous test is failing.
  • The batching rule is bogus.
  • narrow_copy has inconsistent semantics on CPU/CUDA.

Collaborator Author

Sure, I will file an issue.

  • I don't think the non-contiguous sample is the issue, as we haven't added non-contiguous testing to the vmap tests.
  • The batching rule for narrow_copy seems innocuous and has no special handling for CPU versus CUDA.

So the operator itself may have an issue.

Batching Rule Ref:

std::tuple<Tensor, optional<int64_t>> narrow_copy_batch_rule(
    const Tensor &self, optional<int64_t> self_bdim, int64_t dim, c10::SymInt start, c10::SymInt length)
{
  TORCH_INTERNAL_ASSERT(self_bdim.has_value());
  // Move the batch dim to the front so logical dims shift by one.
  auto self_ = moveBatchDimToFront(self, self_bdim);
  auto logical_rank = rankWithoutBatchDim(self, self_bdim);
  dim = maybe_wrap_dim(dim, logical_rank) + 1;
  auto result = self_.narrow_copy_symint(dim, start, length);
  return std::make_tuple(result, 0);
}

Contributor

If the operator is the problem, it would be great to have a repro that doesn't involve vmap and shows that the same input (on CPU and CUDA, with the same strides) produces different outputs. One idea to "get rid of the vmap" is to use make_fx to trace out what's happening.
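
One way to follow this suggestion, sketched under assumptions: trace the vmapped call with make_fx so that the resulting graph contains only plain tensor ops, then run that graph on CPU and CUDA with the same input. The function `f`, the input shape, and the narrow arguments below are illustrative choices, not the failing sample from the test, and `torch.func.vmap` assumes PyTorch 2.0 or later (older releases expose `vmap` through `functorch`).

```python
import torch
from torch.func import vmap
from torch.fx.experimental.proxy_tensor import make_fx


def f(x):
    # Per-example call: copy 2 entries starting at index 1 along dim 1.
    return torch.narrow_copy(x, 1, 1, 2)


x = torch.randn(3, 4, 5)  # the leading dim of size 3 is the vmap batch dim

# Trace through vmap; the graph records the ops the batching rule emits.
gm = make_fx(vmap(f))(x)
print(gm.code)

# The traced graph no longer involves vmap and can be rerun on any device.
out_cpu = gm(x)
if torch.cuda.is_available():
    out_cuda = gm(x.cuda())
    print("max abs diff:", (out_cpu - out_cuda.cpu()).abs().max().item())
```

If the two devices disagree on the traced graph, the printed ops can be copied into a standalone script, which gives a repro that does not mention vmap at all.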

Collaborator Author

Sure, thanks!

I have assigned the issue to myself and will have a look soon.

Collaborator Author

More info here: #91690

Contributor

@zou3519 zou3519 left a comment

LGTM. We should dig into whether some of these failures are important and, if so, file issues for them.

@kshitij12345
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) Jan 4, 2023
@kshitij12345 kshitij12345 added the "keep-going" label (Don't stop on first failure, keep running tests until the end) Jan 4, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: The following mandatory check(s) failed (Rule superuser):

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

@kshitij12345
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@kshitij12345 kshitij12345 deleted the fix/vmap/use-reference-inputs branch January 6, 2023 08:29
@kshitij12345 kshitij12345 restored the fix/vmap/use-reference-inputs branch January 6, 2023 09:52
@kshitij12345
Collaborator Author

@pytorchbot revert


pytorch-bot bot commented Jan 6, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message, -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@kshitij12345 kshitij12345 reopened this Jan 6, 2023
@kshitij12345
Collaborator Author

@pytorchbot revert -m"Broke trunk" -c landrace

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.

@pytorchmergebot
Collaborator

@kshitij12345 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jan 6, 2023
@kshitij12345 kshitij12345 force-pushed the fix/vmap/use-reference-inputs branch from eed545d to 4b2e3b5 Compare January 6, 2023 10:29
@kshitij12345
Collaborator Author

@pytorchbot merge -f"JIT failure looks unrelated"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Labels: ciflow/trunk, keep-going, Merged, open source, Reverted, topic: not user facing
4 participants