[nvfuser] view size is not compatible with input tensor's size and stride #93648

Closed
anijain2305 opened this issue Aug 26, 2022 · 9 comments

Labels: module: inductor, oncall: pt2, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

@anijain2305 (Contributor) commented Aug 26, 2022

Instructions

  • Uncompress the repro file: repro.tar.gz
  • Run python repro/repro.py

The repro was minified using pytorch/torchdynamo#1056.
This only fails with the aot_nvfuser backend.

cc @ezyang @msaroufim @wconstab @bdhirsh @zou3519 @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @soumith @ngimel

@rdspring1 (Contributor)

nvfuser is returning a contiguous tensor, while eager mode creates a strided tensor.
The subsequent transpose then produces a strided tensor, which causes the runtime error at the final linear layer.

Here is a standalone repro:

import torch
import torchdynamo
from torchdynamo.testing import rand_strided
from torchdynamo.debug_utils import run_fwd_maybe_bwd

torch._C._jit_set_nvfuser_single_node_mode(True)

from torch.nn import *
class ReproModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_layer = Linear(196, 256)
        self.alpha = torch.randn(1, 1, 3, device='cuda')

    def forward(self, x : torch.Tensor):
        flatten_1 = torch.flatten(x, 2);
        transpose_2 = torch.transpose(flatten_1, 1, 2);

        # returns expected strided tensor like transpose_2
        # add_3 = torch.add(transpose_2, self.alpha)
        # returns contiguous tensor like alpha
        add_3 = torch.add(self.alpha, transpose_2)

        transpose_4 = torch.transpose(add_3, 1, 2);
        return self.linear_layer(transpose_4)

args = [((128, 3, 14, 14), (5376, 42, 14, 1), torch.float32, 'cuda', False)]
args = [rand_strided(sh, st, dt, de).requires_grad_(rg) for (sh, st, dt, de, rg) in args]

mod = ReproModule().cuda()
opt_mod = torchdynamo.optimize("aot_nvfuser")(mod)
res = run_fwd_maybe_bwd(opt_mod, args)
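
For reference, the non-contiguous layout produced by the transpose can be checked directly in eager mode. This is a minimal sketch using the repro's shapes; it is independent of torchdynamo/nvfuser:

import torch

x = torch.randn(128, 3, 14, 14)
flat = torch.flatten(x, 2)        # shape (128, 3, 196), contiguous
t = torch.transpose(flat, 1, 2)   # shape (128, 196, 3), a non-contiguous view

print(t.is_contiguous())  # False
print(t.stride())         # (588, 1, 196); contiguous strides for this shape would be (588, 3, 1)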

cc @jjsjann123

@jjsjann123 (Collaborator)

Yeah... this is expected behavior. Note that we do intend to improve memory format support after we wrap up our transpose scheduler.

Currently we don't promise any memory format on output tensors: because we lack a performant transpose scheduler, permuting the output could hurt our kernel performance.

We intend to improve this by following the memory format that TorchScript profiling records for the output. I don't currently have a timeline for that, but the PR for the permutation scheduler is getting there: csarofeen#1927. We'll still need to plumb it through after that PR lands, likely separately for the TS integration and nvprims.

@jjsjann123 (Collaborator)

Just to clarify:

"nvfuser is returning a contiguous tensor while eager mode creates a strided tensor."
^^^ This is expected behavior in general. We make a best effort to match what eager produces, but it's not a guarantee.

        flatten_1 = torch.flatten(x, 2);
        transpose_2 = torch.transpose(flatten_1, 1, 2);

        add_3 = torch.add(self.alpha, transpose_2)

        transpose_4 = torch.transpose(add_3, 1, 2);
        return self.linear_layer(transpose_4)

Hmmm... looking at this example, I'm not sure how things get messed up with nvfuser. We are not fusing the transposes, and our pointwise memory format handling should be pretty close to eager's. But we also have self.alpha = torch.randn(1, 1, 3, device='cuda'), which is simply contiguous, so it should dominate the output memory format.

We can take a look at the TS graph passed to nvfuser and figure out what went wrong there.

@ezyang (Contributor) commented Feb 1, 2023

Our new attitude is that backends MUST match the original eager striding exactly, at least until @dagitses finishes stride-agnostic PyTorch.

@jjsjann123 (Collaborator)

"at least until @dagitses finishes stride agnostic pytorch"

Is that currently a WIP? It sounds like a pretty big endeavor; do we have any rough estimate on delivery time?

@ezyang (Contributor) commented Feb 1, 2023

It's a big endeavor; there's no estimate on timing.

@ngimel (Collaborator) commented Feb 1, 2023

In the meantime, the existing tracing mechanisms (mostly) correctly propagate strides, with a few quirks here and there, mostly for the ambiguous strides of tensors with 0 elements or size-1 dimensions. Those strides can be queried on the FX graph nodes, so backends don't need to develop their own systems of stride propagation; they only need to query and match the strides the tracer gives them at the boundaries of the compiled region.
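
As an illustration only, and not part of any PyTorch API, here is a rough sketch of what "query and match the strides at the boundaries" could look like from a backend's side. It assumes the FakeTensor metadata that Dynamo/AOT tracing stores in node.meta['val']; the helper name match_output_strides is hypothetical:

import torch
import torch.fx

def match_output_strides(gm: torch.fx.GraphModule, outputs):
    # Restride backend outputs to the strides recorded by the tracer.
    # Assumes each graph output node carries a FakeTensor in node.meta['val']
    # and that the output node returns a flat tuple of tensors.
    output_node = next(n for n in gm.graph.nodes if n.op == "output")
    restrided = []
    for node, out in zip(output_node.args[0], outputs):
        meta = node.meta.get("val") if node is not None else None
        if not isinstance(out, torch.Tensor) or meta is None:
            restrided.append(out)
            continue
        if out.stride() == meta.stride():
            restrided.append(out)  # already matches the traced (eager) striding
        else:
            # Copy into a buffer laid out with the expected strides.
            buf = torch.empty_strided(meta.shape, meta.stride(),
                                      dtype=out.dtype, device=out.device)
            buf.copy_(out)
            restrided.append(buf)
    return restrided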

albanD added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label on Feb 7, 2023
@ydwu4 (Contributor) commented Nov 22, 2023

It's difficult to come up with a repro. Labeling it as an inductor issue since it's specific to the backend.

@jjsjann123 (Collaborator)

FYI, nvfuser has been removed from TorchScript; we should close this issue.

I see an inductor label was added. Not sure if that's intended or just a mistake.

ydwu4 closed this as completed on Nov 22, 2023