building with triton support? #166

Open
ngam opened this issue Apr 5, 2023 · 58 comments
Labels
help wanted Extra attention is needed

Comments

@ngam
Contributor

ngam commented Apr 5, 2023

  • I am a bit confused about what depends on what. As far as I can see, pytorch depends on torchtriton, which in turn seems to depend on pytorch.
  • Is there any difference between torchtriton and triton?
  • We already package an old version of triton in conda-forge and have a PR open for version 2.0.0 (triton v2.0.0 triton-feedstock#2). Would this 2.0.0 version be suitable as dependency for here?

Originally posted by @Tobias-Fischer in #165 (comment)

--

More background: #151

@ngam ngam mentioned this issue Apr 5, 2023
5 tasks
@ngam
Contributor Author

ngam commented Apr 9, 2023

@Tobias-Fischer Are you aware of any quick example demonstrating the usage of triton? No worries if not, I will look upstream. I am now back to home base and can help debug this further

@Tobias-Fischer
Contributor

The first code snippet is an easy example: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html

@ngam
Contributor Author

ngam commented Apr 9, 2023

Alright, it fails with InvalidCxxCompiler

InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++')

The above exception was the direct cause of the following exception:

BackendCompilerFailed                     Traceback (most recent call last)

...
...
...

BackendCompilerFailed: debug_wrapper raised InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++')

Set torch._dynamo.config.verbose=True for more information


You can suppress this exception and fall back to eager by setting:
    torch._dynamo.config.suppress_errors = True

@ngam
Contributor Author

ngam commented Apr 9, 2023

@h-vetinari and @hmaarrfk, just fyi. My current assessment is that we will likely need to wait for triton 2.x, then simply add it as a run dependency and see if things work out. I tried adding torchtriton (from the pytorch channel) and it didn't work because it was looking for system C libraries that were not linked correctly. It seems that all components are in place and we just need the triton component, but I may be wrong.

@Tobias-Fischer
Contributor

I just confirmed that this example works: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html
when using conda-forge/triton-feedstock#3 and installing the cxx_compiler :)

@Tobias-Fischer
Contributor

Ah, something surprising (?): It also works without installing triton, just by having a cxx_compiler installed (e.g. mamba install compilers)
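
For anyone hitting the InvalidCxxCompiler error above, here is a minimal sketch of a pre-flight check for a C++ compiler, using only the standard library. The helper name `find_cxx_compiler`, the `CXX` lookup, and the candidate list are assumptions for illustration; this is not how inductor itself searches (the error only shows that it tried `g++`).

```python
import os
import shutil


def find_cxx_compiler(candidates=("g++", "c++", "clang++"), env=None):
    """Return the path of the first C++ compiler found, or None.

    Assumption: look at $CXX first, then at an illustrative list of
    compiler names on PATH.
    """
    env = os.environ if env is None else env
    names = []
    if env.get("CXX"):
        names.append(env["CXX"])
    names.extend(candidates)
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_cxx_compiler()
    print(found or "no C++ compiler found; try `mamba install compilers`")
```

If this prints a path inside the active environment, the compiler from the `compilers` package is the one being picked up.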

@Tobias-Fischer
Contributor

Made some progress - see conda-forge/triton-feedstock#6

The open question in my mind is still the circular dependency from both torch to triton and from triton to torch. Not sure how to best deal with it .. a run_constrained maybe?

@h-vetinari
Member

Not sure how to best deal with it .. a run_constrained maybe?

Probably with a -base package that's built here, depended on by triton to build itself, and then we can include triton here as a dependency of the complete pytorch package.
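
To illustrate why a -base split would break the cycle, here is a small sketch that tries to compute a build order for both layouts. The package name `pytorch-base` and the exact edges are assumptions for illustration; no such recipe outputs are confirmed in this thread.

```python
from graphlib import CycleError, TopologicalSorter


def install_order(deps):
    """Return a valid build order, or None if the graph has a cycle.

    `deps` maps each package to the set of packages it depends on.
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


# The situation described above: each package needs the other.
cyclic = {"pytorch": {"triton"}, "triton": {"pytorch"}}

# The proposed split; "pytorch-base" is a hypothetical name.
split = {
    "pytorch-base": set(),
    "triton": {"pytorch-base"},
    "pytorch": {"pytorch-base", "triton"},
}

print(install_order(cyclic))  # None
print(install_order(split))   # ['pytorch-base', 'triton', 'pytorch']
```

With the split, triton builds against the base package only, and the complete pytorch package can then list triton as a run dependency.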

@h-vetinari h-vetinari mentioned this issue May 6, 2023
@benjaminrwilson
Contributor

I keep running into:

cannot find -lcuda: No such file or directory

when using torch.compile. I've installed both cudatoolkit and cudatoolkit-dev.

Any idea what might be going on?

@Tobias-Fischer
Contributor

We need conda-forge/triton-feedstock#6

@RaulPPelaez
Contributor

Bumping this since conda-forge/triton-feedstock#6 was merged. Thanks for the good work!

@ngam
Contributor Author

ngam commented Jun 16, 2023

Can we quickly test this? Just use the package we have and install triton. Do things magically work or do we need more tweaks in this feedstock?

@RaulPPelaez
Contributor

Anecdotal experience, but pytorch and triton from conda-forge seem to pick up each other fine today (as in no segfaults when calling torch.compile with the inductor backend). I believe special care should be taken with dependency versions. Both triton and support for it in torch are really experimental and probably have quite narrow ranges of versions where they are supposed to work together.
On a side note, what we would really like to see is pytorch 2.0.1 (available in the pytorch channel) in conda-forge.

@ngam
Contributor Author

ngam commented Jun 16, 2023

@RaulPPelaez good idea on 2.0.1. Would you be interested in submitting a PR? I could start one now...

@ngam
Contributor Author

ngam commented Jun 16, 2023

see #172 we could also include formal triton support in that one too

@cread

cread commented Apr 29, 2024

Has anyone looked at this again recently?

@hmaarrfk hmaarrfk added the help wanted Extra attention is needed label Sep 26, 2024
@danpetry

Here's what we've done at Anaconda for v2.3.0: https://github.com/AnacondaRecipes/triton-feedstock

There are some notes in the meta.yaml about various choices we made. Let me know if you've got any questions.

@mgorny
Contributor

mgorny commented Nov 20, 2024

I'm going to try making a new pull request for 3.1.0, as that's the version required by PyTorch 2.5.1.

@danpetry, thanks. Curiously enough, I've just tried diffing the PyPI 3.1.0 package against the one provided by PyTorch, and, at least as far as .py files go, they seem the same. So I don't think we technically need a rename here.

@danpetry

danpetry commented Nov 20, 2024

I think pytorch call(ed) their conda package torchtriton too?
The problem is that triton vendors in (last time I checked) a random commit of llvm, which might make it incompatible with other packages, i.e. not usable outside pytorch. Hence the "torch" in the name.

@danpetry

I probably need to re-check the logic of that statement, but that's the conclusion I came to when I worked on it. At the very least, I wanted to be cautious and say, "this should not be used except with pytorch, with which it has been explicitly integration tested".

@danpetry

It uses llvm at runtime, rather than build time

@isuruf
Member

isuruf commented Nov 20, 2024

The issue was that triton did not have wheels and even when it did it took time to get releases in. There was also an issue with rocm support not getting merged. All of these have been resolved I think.

@danpetry

danpetry commented Nov 20, 2024

ok, so we can now use the wheels rather than the git repo to build? IIRC this wasn't possible

@isuruf
Member

isuruf commented Nov 20, 2024

In conda packaging? We don't want to use pre-compiled wheels in conda-build.

@mgorny
Contributor

mgorny commented Nov 20, 2024

I've pushed my WIP to conda-forge/triton-feedstock#26.

@danpetry

danpetry commented Nov 21, 2024

Worth bearing in mind that LLVM is only used by triton at runtime, to compile cuda kernels. And in the end, the binary format I guess is determined by the cuda compiler rather than llvm. So, keeping it vendored-in isn't an issue so far as compatibility with the rest of the distro is concerned, I think.

@danpetry

I don't know if anyone else can confirm this?

@mgorny
Contributor

mgorny commented Nov 21, 2024

I think the bigger issue here is that triton either downloads a prebuilt LLVM version if it detects a supported platform, or uses system LLVM (expecting this specific commit) when it doesn't.

@rgommers

Cc @amjames. Andrew, you had some useful insights into the PyTorch -> Triton -> LLVM coupling, so you may be interested in this topic and in conda-forge/triton-feedstock#26. Making Triton compatible with a proper LLVM release can be very useful for Conda-forge (and probably other distros as well). conda-forge/triton-feedstock#26 (comment) summarizes how this was achieved for Triton 3.1.0 with the LLVM 19 release. Some manual testing seems to confirm success at the "build and seems to compile stuff with nvcc" level - perhaps you have some suggestions into what subset of the PyTorch test suite to run to confirm that PyTorch + Triton works as designed?

@danpetry

There's a smoke test which tests torch.compile with cuda (if an environment variable is appropriately set)

@mgorny
Contributor

mgorny commented Nov 25, 2024

Thanks. Looks like I was wrong and some patching is necessary for regular CC to be able to find CUDA headers:

Testing smoke_test_compile for cuda and torch.float16
/tmp/tmp_itxw3hv/main.c:1:10: fatal error: cuda.h: No such file or directory
    1 | #include "cuda.h"
      |          ^~~~~~~~
compilation terminated.
/tmp/tmpwwhx19wm/main.c:1:10: fatal error: cuda.h: No such file or directory
    1 | #include "cuda.h"
      |          ^~~~~~~~
compilation terminated.
Traceback (most recent call last):
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/repro/after_dynamo.py", line 129, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/__init__.py", line 2234, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1521, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 179, in aot_dispatch_base
    compiled_fw = compiler(fw_module, updated_flat_args)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1350, in fw_compiler_base
    return _fw_compiler_base(model, example_inputs, is_inference)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1421, in _fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 475, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 661, in _compile_fx_inner
    compiled_graph = FxGraphCache.load(
                     ^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 1334, in load
    compiled_graph = compile_fx_fn(
                     ^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 570, in codegen_and_compile
    compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 878, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1913, in compile_to_fn
    return self.compile_to_module().call
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1839, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1845, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1784, in codegen
    self.scheduler.codegen()
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 3383, in codegen
    return self._codegen()
           ^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 3461, in _codegen
    self.get_backend(device).codegen_node(node)
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 80, in codegen_node
    return self._triton_scheduling.codegen_node(node)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1155, in codegen_node
    return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1364, in codegen_node_schedule
    src_code = kernel.codegen_kernel()
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 2661, in codegen_kernel
    **self.inductor_meta_common(),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 2532, in inductor_meta_common
    "backend_hash": torch.utils._triton.triton_hash_with_backend(),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/utils/_triton.py", line 53, in triton_hash_with_backend
    backend = triton_backend()
              ^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/utils/_triton.py", line 45, in triton_backend
    target = driver.active.get_current_target()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
                ^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
           ^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
    self.utils = CudaUtils()  # TODO: make static
                 ^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/runtime/build.py", line 48, in _build
    ret = subprocess.check_call(cc_cmd)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/mgorny/.conda/envs/pytorch/bin/x86_64-conda-linux-gnu-cc', '/tmp/tmpwwhx19wm/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpwwhx19wm/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-L/lib32', '-I/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpwwhx19wm', '-I/home/mgorny/.conda/envs/pytorch/include/python3.12']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mgorny/smoke_test.py", line 352, in <module>
    main()
  File "/home/mgorny/smoke_test.py", line 348, in main
    smoke_test_cuda(options.package, options.runtime_error_check, options.torch_compile_check)
  File "/home/mgorny/smoke_test.py", line 171, in smoke_test_cuda
    smoke_test_compile("cuda" if torch.cuda.is_available() else "cpu")
  File "/home/mgorny/smoke_test.py", line 261, in smoke_test_compile
    x_pt2 = torch.compile(foo)(x)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1269, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1064, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 526, in __call__
    return _compile(
           ^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 924, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 666, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 699, in _compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
    transformations(instructions, code_options)
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 219, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 634, in transform
    tracer.run()
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2796, in run
    super().run()
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 983, in run
    while self.step():
          ^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 895, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2987, in RETURN_VALUE
    self._return(inst)
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2972, in _return
    self.output.compile_subgraph(
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1117, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
    return self._call_user_compiler(gm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e) from e
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
CalledProcessError: Command '['/home/mgorny/.conda/envs/pytorch/bin/x86_64-conda-linux-gnu-cc', '/tmp/tmpwwhx19wm/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpwwhx19wm/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-L/lib32', '-I/home/mgorny/.conda/envs/pytorch/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpwwhx19wm', '-I/home/mgorny/.conda/envs/pytorch/include/python3.12']' returned non-zero exit status 1.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
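
The failing command above passes only Triton's bundled include directory, the temporary directory, and the Python headers via `-I`, so `cuda.h` has to live in one of those. A minimal sketch for checking where the header actually is in an environment follows; the two candidate locations are assumptions for illustration and may not match a given install.

```python
import os
from pathlib import Path


def dirs_with_header(candidates, header="cuda.h"):
    """Return the candidate directories that actually contain `header`."""
    return [str(d) for d in map(Path, candidates) if (d / header).is_file()]


def candidate_include_dirs(prefix):
    """Guess where an environment might keep CUDA headers.

    Both locations are assumptions; check your own environment.
    """
    prefix = Path(prefix)
    return [
        prefix / "include",
        prefix / "targets" / "x86_64-linux" / "include",
    ]


if __name__ == "__main__":
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix:
        print(dirs_with_header(candidate_include_dirs(prefix)))
    else:
        print("CONDA_PREFIX is not set; activate an environment first")
```

Any directory this prints that is missing from the `-I` flags in the failing command is a candidate for the path-search patch.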

@amjames

amjames commented Nov 25, 2024

@rgommers Thanks!

Some manual testing seems to confirm success at the "build and seems to compile stuff with nvcc" level - perhaps you have some suggestions into what subset of the PyTorch test suite to run to confirm that PyTorch + Triton works as designed?

What kind of time-limits are we working with? Full coverage is probably a non-starter, but the inductor tests will be the best place to focus on.

I would start with:

Some of these tests will require a specific GPU like A100, but in general the tests should be annotated to skip if they have special requirements like that which are not met.

@mgorny
Contributor

mgorny commented Nov 25, 2024

The plot thickens. After fixing the path errors, I'm getting:

RuntimeError: Internal Triton PTX codegen error: 
ptxas /tmp/tmp_at1jgs8.ptx, line 5; fatal   : Unsupported .version 8.6; current version is '8.5'
ptxas fatal   : Ptx assembly aborted due to errors

My guess would be that it doesn't support CUDA 12.6 for some reason. But 12.0 is too old, and I don't think we can cleanly do 12.5 just for Triton without changing PyTorch. Will try to figure out a good solution tomorrow, but I'd appreciate any hints.
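
Reading the message literally: the generated PTX declares `.version 8.6`, while the ptxas that ran only accepts up to 8.5. A small sketch that pulls both numbers out of such a message follows; the helper name `ptx_version_mismatch` is made up for illustration, and the regular expression is written against the exact wording shown above.

```python
import re

PTX_ERROR = re.compile(
    r"Unsupported \.version (\d+)\.(\d+); current version is '(\d+)\.(\d+)'"
)


def ptx_version_mismatch(message):
    """Return ((requested), (supported)) PTX versions, or None if no match."""
    match = PTX_ERROR.search(message)
    if match is None:
        return None
    req_major, req_minor, sup_major, sup_minor = map(int, match.groups())
    return (req_major, req_minor), (sup_major, sup_minor)


msg = (
    "ptxas /tmp/tmp_at1jgs8.ptx, line 5; fatal   : "
    "Unsupported .version 8.6; current version is '8.5'"
)
requested, supported = ptx_version_mismatch(msg)
print(requested, supported)   # (8, 6) (8, 5)
print(requested > supported)  # True: the PTX is newer than ptxas accepts
```

So either the code generator has to emit an older PTX version, or a newer ptxas has to be found first on the search path.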

@Tobias-Fischer
Contributor

Are you sure it’s not picking up your system CUDA (12.5?) and then competing against conda CUDA (12.6?)?

@danpetry

It might be a problem that the feedstock uses cuda-nvcc in the run section directly, instead of compiler('cuda') with the activation scripts. https://github.com/conda-forge/triton-feedstock/blob/7f07922287846f050f2c65f4cfced35f1e1b311d/recipe/meta.yaml#L66

@mgorny
Contributor

mgorny commented Nov 26, 2024

Actually, it turns out we need one more upstream patch for CUDA 12.6 support. But I also need to fix path search, I'll make a pull request later.

It might be a problem that the feedstock uses cuda-nvcc in the run section directly, instead of compiler('cuda') with the activation scripts. https://github.com/conda-forge/triton-feedstock/blob/7f07922287846f050f2c65f4cfced35f1e1b311d/recipe/meta.yaml#L66

Hmm, is that actually wrong? I didn't think compiler('cuda') actually implies cuda-nvcc too, but I can remove that if it's redundant.

@danpetry

AFAIU compiler('cuda') resolves to cuda-nvcc_, partly because of this setting here

But cuda-nvcc pulls in cuda-nvcc_<platform> anyway, so the activation scripts should be running.

Maybe it's to do with the "conda-build" conditional in the link above, so the -I/-L flags aren't being added?

@mgorny
Contributor

mgorny commented Nov 26, 2024

Already solved via conda-forge/triton-feedstock#28. Just wondering if I should update the dependencies while at it.

@Tobias-Fischer
Contributor

@mgorny - does any more work need to be done here?

@mgorny
Contributor

mgorny commented Dec 26, 2024

The remaining question is: do we want to add a requirement from PyTorch to Triton, so that it's pulled in automatically? Unless I'm mistaken, Conda doesn't have a "recommends" kind of requirement, so it would probably have to be unconditional (and then it would create a cycle).

@rgommers

On PyPI, triton has only one dependency (filelock) and torch depends on triton. That sounds about right to me. I don't understand why triton in conda-forge has the reverse runtime dependency on pytorch =*=cuda*.

The remaining question is: do we want to add a requirement from PyTorch to Triton

That would be desirable, but only for cuda builds - and I'm not sure that that is possible? If not, then it seems best to leave out the dependency.

@mgorny
Contributor

mgorny commented Dec 26, 2024

On PyPI, triton has only one dependency (filelock) and torch depends on triton. That sounds about right to me. I don't understand why triton in conda-forge has the reverse runtime dependency on pytorch =*=cuda*.

Well, at least some of triton's Python modules do import torch, so I guess it's at least an optional dependency.

The remaining question is: do we want to add a requirement from PyTorch to Triton

That would be desirable, but only for cuda builds - and I'm not sure that that is possible? If not, then it seems best to leave out the dependency.

Yes, that should be fine.

@rgommers

Well, at least some of triton's Python modules do import torch, so I guess it's at least an optional dependency.

There's a bunch of import torch in tests and examples, but only one outside of that, which is not at the top level:

https://github.com/triton-lang/triton/blob/f8b5301a92459199e1b9faf7aadf1a7c10bb9866/python/triton/backends/driver.py#L37-L41

So it's optional at most. It's also possible to use Triton with JAX: https://jax.readthedocs.io/en/latest/pallas/design/design.html#lowering-pallas-to-triton-for-gpu. So it looks to me like the dependency in the triton feedstock should be removed. I don't know why it was added in the first place though.
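
For readers unfamiliar with the pattern, an import that is "not at the top level" keeps the package usable when the other library is absent. A generic sketch follows; this is not Triton's actual code, and the helper names `module_available` and `describe_backend` are made up for illustration.

```python
import importlib.util


def module_available(name):
    """True if the top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def describe_backend():
    """Hypothetical helper that only touches torch when it is installed."""
    if not module_available("torch"):
        return "torch not installed"
    import torch  # deferred import keeps torch an optional dependency

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(describe_backend())
```

Because the import happens inside the function, importing the enclosing module never fails just because torch is missing, which is what makes a hard run dependency unnecessary.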

@hmaarrfk
Contributor

Likely for the tests.

I agree it can be removed!

Thanks for looking into it!

@mgorny
Contributor

mgorny commented Dec 26, 2024

Okay, I'll look into it tomorrow. I also think I'm ready to give rattler-build in pytorch another try!

@hmaarrfk
Contributor

Okay, I'll look into it tomorrow. I also think I'm ready to give rattler-build in pytorch another try!

Let's try to give the windows folk a break on the rebasing.

They seem pretty close, and the new resources in the form of CIs make it finally a possibility.

Given that pytorch build times are dominated by mathematical compilation, I don't see much of an advantage in working on this immediately

mgorny added a commit to mgorny/triton-feedstock that referenced this issue Dec 27, 2024
Per the discussion on pytorch-cpu-feedstock:
conda-forge/pytorch-cpu-feedstock#166 (comment)
(and below)
@mgorny
Contributor

mgorny commented Dec 27, 2024

Okay, I'll look into it tomorrow. I also think I'm ready to give rattler-build in pytorch another try!

Let's try to give the windows folk a break on the rebasing.

Sure, I'll wait.

Given that pytorch build times are dominated by mathematical compilation, I don't see much of an advantage in working on this immediately

Well, just for the record, it would make my local dev builds faster, since (with an up-to-date ccache available) conda-build is actually being a bottleneck :-).

@hmaarrfk
Contributor

Well, just for the record, it would make my local dev builds faster, since (with an up-to-date ccache available) conda-build is actually being a bottleneck :-).

Interesting.....

I know you're working on a lot, but I would be willing to learn if you can document your process of getting ccache working with conda-forge's isolated builds!

My strategy is currently to follow a variant of conda-forge/tensorflow-feedstock#360

@mgorny
Contributor

mgorny commented Dec 27, 2024

Right, I suppose it makes sense to add it to the README as well. Basically, I'm running conda build ... directly, and PyTorch's CMake simply picks up ccache from my host root. Hardly a clean solution but it just happened to work here.

Oh, and I'm passing --no-build-id --croot /var/tmp/conda-bld so that the cache entries match between builds. My ~/.config/ccache/ccache.conf contains:

compiler_check=none
compression=true
sloppiness=pch_defines,time_macros
hash_dir=false
base_dir=/var/tmp/conda-bld
max_size = 6G

I'll document that when I update some pull request next.
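
The key detail in the setup above seems to be that `base_dir` in ccache.conf and the `--croot` directory are the same path, so build paths stay comparable between runs. A minimal sketch that parses the config text and checks this follows; the parser is a simplification written for this example, not ccache's own, and `parse_ccache_conf` is a made-up name.

```python
def parse_ccache_conf(text):
    """Parse `key = value` lines from ccache.conf text into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf


# The config quoted above, verbatim.
CONF = """\
compiler_check=none
compression=true
sloppiness=pch_defines,time_macros
hash_dir=false
base_dir=/var/tmp/conda-bld
max_size = 6G
"""

# The directory passed to `conda build --croot`.
CROOT = "/var/tmp/conda-bld"

conf = parse_ccache_conf(CONF)
print(conf["base_dir"] == CROOT)  # True
print(conf["max_size"])           # 6G
```

If the two paths ever diverge, this check prints False, which is a quick hint for why cache hits would drop.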
