OpInfo has problems testing define_tensor. #3225

Closed
wujingyue opened this issue Oct 18, 2024 · 6 comments
Assignees
Labels
enhancement New feature or request Testing e.g. improving test infra and test coverage

Comments

@wujingyue
Collaborator

wujingyue commented Oct 18, 2024

Context: https://github.com/NVIDIA/Fuser/pull/3222/files#diff-577ed6d3703dbc615028823a5113fdef10881ffb1247b9a79c7f17270650124fR11-R14

To repro, patch b0ccb48 and run

pytest tests/python/test_ops.py -k define_tensor

cc @jjsjann123 and @rdspring1

@wujingyue
Collaborator Author

Since #3222 is merged, you can now reproduce this by doing the following:

$ git checkout wjy/define

$ pytest tests/python/test_ops.py -k test_correctness_define_tensor_float32 -s
========================================================================================================================================================================================================================================= test session starts =========================================================================================================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.5.0
Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /opt/pytorch/nvfuser
plugins: xdist-3.6.1, timestamper-0.0.10, hypothesis-6.112.2, cov-5.0.0, timeout-2.3.1, random-order-1.1.1, mpi-0.6, benchmark-4.0.0, shard-0.1.2, typeguard-4.3.0
collected 896 items / 895 deselected / 1 selected
Running 1 items in this shard

tests/python/test_ops.py F

============================================================================================================================================================================================================================================== FAILURES ===============================================================================================================================================================================================================================================
_______________________________________________________________________________________________________________________________________________________________________________________________________________________________ test_correctness_define_tensor_float32 ________________________________________________________________________________________________________________________________________________________________________________________________________________________________

    def test():
        # Ref: https://github.com/pytorch/pytorch/blob/aa8ea1d787a9d21b064b664c5344376265feea6c/torch/testing/_internal/common_utils.py#L2251-L2263
        # > CUDA device side error will cause subsequence test cases to fail.
        # > stop entire test suite if catches RuntimeError during torch.cuda.synchronize().
        if torch.cuda.is_initialized():
            try:
                torch.cuda.synchronize()
            except RuntimeError as rte:
                pytest.exit(
                    "TEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failure"
                )

>       return template(opinfo, dtype)

tests/python/opinfo_framework.py:30:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/python/test_ops.py:215: in test_correctness
    return serde_test_fn(op, dtype)
tests/python/test_ops.py:206: in serde_test_fn
    result = correctness_test_fn(op.reference_type, op, sample)
tests/python/test_ops.py:190: in correctness_test_fn
    return torch_correctness_test_fn(_fd_fn, nvf_op, sample)
tests/python/test_ops.py:86: in torch_correctness_test_fn
    nvfuser_result = fd.execute(inputs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[1, -1], contiguity=[None, Tr...ue, None], dtype=DataType.Float, is_cpu=False, stride_order=[0, 1])
    T2 = fd.ops.add(T0, T1)
    fd.add_output(T2)

, inputs = [tensor([[-6.7103,  5.7013]], device='cuda:0')]

    def execute(
        self,
        inputs,
        *,
        device=None,
        override_user_schedule=False,
        capture_debug_output=False,
        print_repro=False,
        profile=False,
        save_repro_inputs=False,
    ):
        """
        Executes an nvFuser set of kernels for a given Fusion

        The FusionDefinition will be executed on a single CUDA device.
        Typically, which device to run on is determined by the devices where
        the input tensors reside. However, if the Fusion is defined such that
        none of the inputs are tensors, we are not able to infer a device from
        the inputs. For example, the following FusionDefinition will be unable
        to unambiguously infer the device of its output:

            with FusionDefinition() as fd:
                tv1 = fd.ops.full([5])
                fd.add_output(tv1)

        In that case, we default to selecting the first CUDA
        device, i.e. `torch.device("cuda:0")`. This method enables selecting an
        alternative preferred device.

        Args:
            inputs (List[Union[Tensor, Scalar]]): A list of inputs to fusion.

        Kwargs:
            device (Optional[Union[int, str, torch.device]]): This is a hint to run
                the Fusion on the given CUDA device. This is not typically
                necessary, as the device is usually inferred from the locations
                of input tensors. However, for some fusion definitions, no
                tensors will be input (for example when all tensors are
                generated with `full` or `uniform` ops). In these cases, we
                must either tell NVFuser where to run the resulting kernel, or
                let it default to 0. Note that passing this option providing
                and input tensors that lie on another device is an error.
            override_user_schedule (bool): For a user defined schedule,
                override with auto-generated schedule (default: False)
            capture_debug_output (bool): Whether to capture any printed
                debugging information as a string. If True, the string can be
                retrieved after execution using :meth:`get_debug_output`. If False,
                then that method will return None when called.
            print_repro (bool): Prints a reproduction script to stdout.
            profile (bool): Captures a CUPTI based profile of a fusion.
            save_repro_inputs (bool): Saves the inputs for last_repro_script() to
                provide a provide a reproduction script.

        Returns:
            List[Tensor]
        """
        self.profiled = profile

        if device is not None:
            if not isinstance(device, torch.device):
                device = torch.device(device)
            assert (
                device.type == "cuda"
            ), "If device argument is passed it must be a CUDA device"
            device = device.index

        # if definition is not defined by a context manager, try a child class
        if self.id() is None:
            self._setup_definition()
            self.definition()
            self._finalize_definition()

        defined_multidevice_schedule = hasattr(
            self, "multidevice_schedule"
        ) and isinstance(self.multidevice_schedule, Callable)
        defined_schedule = hasattr(self, "schedule") and isinstance(
            self.schedule, Callable
        )
        assert not (
            defined_multidevice_schedule and defined_schedule
        ), "I haven't tested what if both are defined. We don't plan to support this use case although it may just work."

        if defined_multidevice_schedule:
            # Unlike `schedule`, `multidevice_schedule` is designed for inter-device
            # scheduling, The scheduling is done before concretization and therefore
            # before pre-segmentation. `schedule` however assumes the FusionDefinition
            # has been concretized and pre-segmented, and therefore requires
            # `_setup_schedule` and `_finalize_schedule` to be called before and after.
            #
            # Note: there's a plan to embed multidevice schedules into FusionDefinition
            # as annotating nodes. This may eventually replace `multidevice_schedule`.
            self.multidevice_schedule()

        # If schedule is defined by child class and schedule is not defined for
        # inputs, make a schedule.
        if defined_schedule:
            # Schedule fusion if it does not exist yet or profiling fusion
            if profile or not self._exist_schedule(inputs):
                self._setup_schedule(inputs, overwrite_existing_schedule=profile)
                self.schedule()
                self._finalize_schedule(inputs)

        if save_repro_inputs:
            from torch._subclasses.fake_tensor import FakeTensorMode

            fake_mode = FakeTensorMode()
            self.fake_inputs = [fake_mode.from_tensor(inp) for inp in inputs]

        results = None
        try:
>           results = self._execute(
                inputs,
                device=device,
                override_user_schedule=override_user_schedule,
                capture_debug_output=capture_debug_output,
                profile=profile,
            )
E           RuntimeError:  INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/runtime/executor_utils.cpp":708, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. KernelArgumentHolder contains less argument than kernel's input.
E           Exception raised from bindInputs at /opt/pytorch/nvfuser/csrc/runtime/executor_utils.cpp:708 (most recent call first):
E           frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7ff7946f48e7 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E           frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x53 (0x7ff794aac533 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E           frame #2: nvfuser::executor_utils::bindInputs(nvfuser::KernelArgumentHolder const&, nvfuser::Fusion*) + 0xb3a (0x7ff794d8fb3a in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E           frame #3: <unknown function> + 0x7f41cc (0x7ff794da91cc in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E           frame #4: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa9 (0x7ff794daaa39 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E           frame #5: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, std::optional<signed char>, bool, bool, bool) const + 0x796 (0x7ff794f195a6 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E           frame #6: <unknown function> + 0x1cc00e (0x7ff79478100e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E           frame #7: <unknown function> + 0x24a21f (0x7ff7947ff21f in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E           frame #8: <unknown function> + 0x2df550 (0x7ff794894550 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E           frame #9: <unknown function> + 0x15cb2e (0x57fe59be9b2e in /usr/bin/python3)
E           frame #10: _PyObject_MakeTpCall + 0x25b (0x57fe59be02db in /usr/bin/python3)
E           frame #11: <unknown function> + 0x16b55b (0x57fe59bf855b in /usr/bin/python3)
E           frame #12: _PyEval_EvalFrameDefault + 0x1983 (0x57fe59bd3b93 in /usr/bin/python3)
E           frame #13: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #14: _PyEval_EvalFrameDefault + 0x8ab (0x57fe59bd2abb in /usr/bin/python3)
E           frame #15: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #16: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E           frame #17: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #18: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E           frame #19: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #20: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E           frame #21: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #22: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E           frame #23: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #24: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
E           frame #25: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #26: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
E           frame #27: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #28: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
E           frame #29: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
E           frame #30: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
E           frame #31: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #32: _PyObject_FastCallDictTstate + 0x16d (0x57fe59bdf51d in /usr/bin/python3)
E           frame #33: _PyObject_Call_Prepend + 0x5c (0x57fe59bf52bc in /usr/bin/python3)
E           frame #34: <unknown function> + 0x2826d0 (0x57fe59d0f6d0 in /usr/bin/python3)
E           frame #35: _PyObject_MakeTpCall + 0x25b (0x57fe59be02db in /usr/bin/python3)
E           frame #36: _PyEval_EvalFrameDefault + 0x72ea (0x57fe59bd94fa in /usr/bin/python3)
E           frame #37: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #38: _PyEval_EvalFrameDefault + 0x8ab (0x57fe59bd2abb in /usr/bin/python3)
E           frame #39: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #40: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
E           frame #41: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #42: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
E           frame #43: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
E           frame #44: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
E           frame #45: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #46: _PyObject_FastCallDictTstate + 0x16d (0x57fe59bdf51d in /usr/bin/python3)
E           frame #47: _PyObject_Call_Prepend + 0x5c (0x57fe59bf52bc in /usr/bin/python3)
E           frame #48: <unknown function> + 0x2826d0 (0x57fe59d0f6d0 in /usr/bin/python3)
E           frame #49: PyObject_Call + 0xbb (0x57fe59bf8ebb in /usr/bin/python3)
E           frame #50: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
E           frame #51: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #52: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E           frame #53: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
E           frame #54: _PyEval_EvalFrameDefault + 0x1983 (0x57fe59bd3b93 in /usr/bin/python3)
E           frame #55: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #56: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E           frame #57: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #58: _PyEval_EvalFrameDefault + 0x1983 (0x57fe59bd3b93 in /usr/bin/python3)
E           frame #59: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #60: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
E           frame #61: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E           frame #62: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
E           frame #63: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)

nvfuser/__init__.py:181: RuntimeError
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Captured log call ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ERROR    nvfuser:__init__.py:192 An error occurred while executing nvFuser FusionDefinition 0.
If you believe this is a bug or need assistance, please file an issue at https://github.com/NVIDIA/Fuser/issues/new
Here's a script to reproduce the error:
```python
# CUDA devices:
#  0: NVIDIA RTX 6000 Ada Generation
#  1: NVIDIA RTX 6000 Ada Generation
# torch version: 2.6.0a0+git0eba7e5
# cuda version: 12.6
# nvfuser version: 0.2.15+gitf01caf7
import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True], dtype=DataType.Float, is_cpu=False, stride_order=[1, 0])
    T1 = fd.define_tensor(shape=[1, 2], contiguity=[True, None], dtype=DataType.Float, is_cpu=False, stride_order=[0, 1])
    T2 = fd.ops.add(T0, T1)
    fd.add_output(T2)

with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)

inputs = [
    torch.testing.make_tensor((1, 2), dtype=torch.float32, device='cuda:0'),
]
fd.execute(inputs)
```

Traceback (most recent call last):
  File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 181, in execute
    results = self._execute(
RuntimeError: INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/runtime/executor_utils.cpp":708, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. KernelArgumentHolder contains less argument than kernel's input.
Exception raised from bindInputs at /opt/pytorch/nvfuser/csrc/runtime/executor_utils.cpp:708 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7ff7946f48e7 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x53 (0x7ff794aac533 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #2: nvfuser::executor_utils::bindInputs(nvfuser::KernelArgumentHolder const&, nvfuser::Fusion*) + 0xb3a (0x7ff794d8fb3a in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x7f41cc (0x7ff794da91cc in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #4: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa9 (0x7ff794daaa39 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #5: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, std::optional<signed char>, bool, bool, bool) const + 0x796 (0x7ff794f195a6 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x1cc00e (0x7ff79478100e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x24a21f (0x7ff7947ff21f in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #8: <unknown function> + 0x2df550 (0x7ff794894550 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x15cb2e (0x57fe59be9b2e in /usr/bin/python3)
frame #10: _PyObject_MakeTpCall + 0x25b (0x57fe59be02db in /usr/bin/python3)
frame #11: <unknown function> + 0x16b55b (0x57fe59bf855b in /usr/bin/python3)
frame #12: _PyEval_EvalFrameDefault + 0x1983 (0x57fe59bd3b93 in /usr/bin/python3)
frame #13: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #14: _PyEval_EvalFrameDefault + 0x8ab (0x57fe59bd2abb in /usr/bin/python3)
frame #15: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #16: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
frame #17: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #18: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
frame #19: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #20: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
frame #21: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
frame #23: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
frame #25: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
frame #29: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
frame #31: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #32: _PyObject_FastCallDictTstate + 0x16d (0x57fe59bdf51d in /usr/bin/python3)
frame #33: _PyObject_Call_Prepend + 0x5c (0x57fe59bf52bc in /usr/bin/python3)
frame #34: <unknown function> + 0x2826d0 (0x57fe59d0f6d0 in /usr/bin/python3)
frame #35: _PyObject_MakeTpCall + 0x25b (0x57fe59be02db in /usr/bin/python3)
frame #36: _PyEval_EvalFrameDefault + 0x72ea (0x57fe59bd94fa in /usr/bin/python3)
frame #37: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #38: _PyEval_EvalFrameDefault + 0x8ab (0x57fe59bd2abb in /usr/bin/python3)
frame #39: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #40: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
frame #41: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #42: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
frame #43: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
frame #44: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
frame #45: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #46: _PyObject_FastCallDictTstate + 0x16d (0x57fe59bdf51d in /usr/bin/python3)
frame #47: _PyObject_Call_Prepend + 0x5c (0x57fe59bf52bc in /usr/bin/python3)
frame #48: <unknown function> + 0x2826d0 (0x57fe59d0f6d0 in /usr/bin/python3)
frame #49: PyObject_Call + 0xbb (0x57fe59bf8ebb in /usr/bin/python3)
frame #50: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
frame #51: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #52: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
frame #53: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x1983 (0x57fe59bd3b93 in /usr/bin/python3)
frame #55: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #56: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
frame #57: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #58: _PyEval_EvalFrameDefault + 0x1983 (0x57fe59bd3b93 in /usr/bin/python3)
frame #59: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #60: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
frame #61: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
frame #62: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
frame #63: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
======================================================================================================================================================================================================================================= short test summary info =======================================================================================================================================================================================================================================
FAILED tests/python/test_ops.py::test_correctness_define_tensor_float32 - RuntimeError: INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/runtime/executor_utils.cpp":708, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. KernelArgumentHolder contains less argument than kernel's input.
================================================================================================================================================================================================================================== 1 failed, 895 deselected in 2.00s ==================================================================================================================================================================================================================================

@kevinstephano kevinstephano added Testing e.g. improving test infra and test coverage Triage labels Oct 30, 2024
@kevinstephano
Collaborator

kevinstephano commented Nov 2, 2024

I can't tell from the description what the problems are. Was some new testing attempted for define_tensor, given the link to the PR?

@wujingyue
Collaborator Author

#3225 (comment) has an updated repro. So far, test_ops.py has been testing ops.define_tensor only for invalid cases to see if it throws the right error/exception. When I attempted to test ops.define_tensor for valid cases, I didn't manage to find a way to get the "generator" to work.

That said, I'm unsure about the root cause, and the generator may turn out to be trivial to fix.

@rdspring1
Collaborator

Just by looking at the failed definition, the fusion expects two input tensors but receives only one.
Did you try returning two tensors in the SampleInput for your define_tensor_generator?

```python
def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True], dtype=DataType.Float, is_cpu=False, stride_order=[1, 0])
    T1 = fd.define_tensor(shape=[1, 2], contiguity=[True, None], dtype=DataType.Float, is_cpu=False, stride_order=[0, 1])
    T2 = fd.ops.add(T0, T1)
    fd.add_output(T2)

with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)

inputs = [
    torch.testing.make_tensor((1, 2), dtype=torch.float32, device='cuda:0'),
]
```
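
The mismatch described here can be illustrated without a GPU. The following is a toy stand-in, not nvFuser code: `ToyFusion` and its methods are hypothetical names that only count declared inputs and mimic the spirit of the argument-count check that fails in `bindInputs`.

```python
class ToyFusion:
    """Hypothetical stand-in that only tracks how many inputs were declared."""

    def __init__(self):
        self.num_inputs = 0

    def define_tensor(self, shape):
        # Every define_tensor call declares one more fusion input.
        index = self.num_inputs
        self.num_inputs += 1
        return f"T{index}"

    def execute(self, inputs):
        # Each declared input needs a matching runtime argument.
        if len(inputs) < self.num_inputs:
            raise RuntimeError(
                f"expected {self.num_inputs} inputs, got {len(inputs)}"
            )
        return len(inputs)


fd = ToyFusion()
t0 = fd.define_tensor(shape=[1, -1])
t1 = fd.define_tensor(shape=[1, 2])

try:
    fd.execute(["tensor_a"])  # one runtime argument for two declared inputs
    one_input_error = None
except RuntimeError as err:
    one_input_error = str(err)

# Supplying one argument per declared input passes the check.
two_input_result = fd.execute(["tensor_a", "tensor_b"])
```

In this toy model, the one-element input list fails for the same reason as the repro: two inputs are declared, but only one argument is supplied.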

@kevinstephano
Collaborator

Jingyue is trying to extend the test to valid test cases. He came across this during multi-GPU testing.

@wujingyue wujingyue removed the Triage label Nov 4, 2024
@rdspring1 rdspring1 added enhancement New feature or request on hold This issue should be revisited in the future labels Nov 4, 2024
@rdspring1
Collaborator

A SampleInput consists of a PyTorch tensor argument and some keyword arguments. It is meant to be passed to an intermediate operation.
define_tensor creates the input argument for the fusion itself, so it doesn't need the tensor argument. This issue seems like a feature request for the OpInfo testing.
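
One way to picture why the generated fusion ended up with two inputs: the sketch below assumes a harness that first declares a fusion input for every tensor in the sample and then calls the op under test. This is a simplified model inferred from the repro earlier in the thread. `ToySample`, `build_fusion`, `toy_add`, and `toy_define_tensor` are hypothetical names, and the real OpInfo framework may work differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToySample:
    """Hypothetical sample: positional tensor args plus keyword args."""
    args: list
    kwargs: dict = field(default_factory=dict)


def build_fusion(op, sample):
    """Return the fusion inputs declared while building a toy fusion."""
    declared = []
    operands = []
    # Step 1 (assumed): declare one fusion input per tensor in the sample.
    for i, _ in enumerate(sample.args):
        declared.append(f"from_sample_{i}")
        operands.append(declared[-1])
    # Step 2 (assumed): call the op under test on those operands.
    op(declared, operands, **sample.kwargs)
    return declared


def toy_add(declared, operands):
    # An intermediate op consumes existing inputs and declares nothing new.
    return "+".join(operands)


def toy_define_tensor(declared, operands, shape):
    # define_tensor itself declares a new fusion input.
    declared.append(f"from_op_{shape}")
    return declared[-1]


add_sample = ToySample(args=["a", "b"])
define_sample = ToySample(args=["a"], kwargs={"shape": (1, 2)})

add_inputs = build_fusion(toy_add, add_sample)
define_inputs = build_fusion(toy_define_tensor, define_sample)
```

In this model, the `toy_add` sample supplies two tensors for two declared inputs, while the `toy_define_tensor` sample supplies one tensor for two declared inputs, which has the same shape as the failure reported in the thread.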

@rdspring1 rdspring1 closed this as not planned Nov 4, 2024
@wujingyue wujingyue removed the on hold This issue should be revisited in the future label Nov 4, 2024