"RuntimeError: HIP error: invalid device function" when running "mnist" on 7900XTX #1313

Open
@SuGotLand

Description

Context

  • Pytorch version: 2.6.0+rocm6.2.4
  • Operating System and version: Ubuntu 24.04.2 LTS x86_64

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: mnist
  • Link to code or data to repro [if any]: mnist

Expected Behavior

Train Epoch: 1 [0/60000 (0%)]	Loss: 2.326473
Train Epoch: 1 [640/60000 (1%)]	Loss: 1.377825
Train Epoch: 1 [1280/60000 (2%)]	Loss: 0.828890
Train Epoch: 1 [1920/60000 (3%)]	Loss: 0.623807
Train Epoch: 1 [2560/60000 (4%)]	Loss: 0.447925
Train Epoch: 1 [3200/60000 (5%)]	Loss: 0.293224
Train Epoch: 1 [3840/60000 (6%)]	Loss: 0.163648
Train Epoch: 1 [4480/60000 (7%)]	Loss: 0.633399
Train Epoch: 1 [5120/60000 (9%)]	Loss: 0.226126
Train Epoch: 1 [5760/60000 (10%)]	Loss: 0.226796
...

Current Behavior

Traceback (most recent call last):
  File "/home/USER/Desktop/PYTHON Document/examples/mnist/main.py", line 147, in <module>
    main()
  File "/home/USER/Desktop/PYTHON Document/examples/mnist/main.py", line 138, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/USER/Desktop/PYTHON Document/examples/mnist/main.py", line 45, in train
    output = model(data)
             ^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/examples/mnist/main.py", line 25, in forward
    x = self.conv1(x)
        ^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
    return F.conv2d(
           ^^^^^^^^^
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Possible Solution

export HIP_VISIBLE_DEVICES=1
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_ROCM_ARCH="gfx1100"

I tried these settings, but the error persists.
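To narrow down whether the variables above reach the process and which device PyTorch selects, a small diagnostic along the following lines may help. It is only a sketch: the helper names `describe_rocm_env` and `list_visible_gpus` are hypothetical, and `gcnArchName` is assumed to be present only on ROCm builds of PyTorch, so the code falls back to "unknown" when it is missing. Note that `HIP_VISIBLE_DEVICES` uses zero-based indices, so a value of 1 selects the second GPU, and that `PYTORCH_ROCM_ARCH` is, as far as I know, consulted when building PyTorch from source rather than by the prebuilt wheels.

```python
import os

# Variables mentioned under "Possible Solution".
ROCM_VARS = ("HIP_VISIBLE_DEVICES", "HSA_OVERRIDE_GFX_VERSION", "PYTORCH_ROCM_ARCH")


def describe_rocm_env(env=None):
    """Return the ROCm-related variables from env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in ROCM_VARS}


def list_visible_gpus():
    """Return (index, name, arch) for each GPU PyTorch can see, or [] if none."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    gpus = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        # Assumption: gcnArchName exists only on ROCm builds; fall back if absent.
        arch = getattr(props, "gcnArchName", "unknown")
        gpus.append((index, props.name, arch))
    return gpus


if __name__ == "__main__":
    for name, value in describe_rocm_env().items():
        print(f"{name}={value}")
    for index, name, arch in list_visible_gpus():
        print(f"device {index}: {name} ({arch})")
```

If the listed device is not the 7900XTX (for example, an integrated GPU), the value of `HIP_VISIBLE_DEVICES` would be the first thing to revisit.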

Steps to Reproduce

  1. Install the latest PyTorch with pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4
  2. Clone the examples repository and cd into it.
  3. Run python3 mnist/main.py
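Since the traceback ends in `F.conv2d` on the first layer, the failure can probably be reproduced without the full mnist example. The sketch below runs a single convolution on the GPU; the function name `conv_smoke_test` is hypothetical, and the layer shape (1 input channel, 32 output channels, 3x3 kernel, stride 1 on a 28x28 input) is an assumption meant to mirror `conv1` in the example model.

```python
def conv_smoke_test(device=None):
    """Run one Conv2d forward pass and return the output shape as a tuple.

    Returns None if PyTorch is not installed. When device is None, the GPU
    is used if available (ROCm devices are exposed as "cuda"), else the CPU.
    """
    try:
        import torch
    except ImportError:
        return None
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    # Assumed to match the first layer of the mnist example model.
    conv = torch.nn.Conv2d(1, 32, 3, 1).to(device)
    data = torch.randn(1, 1, 28, 28, device=device)
    with torch.no_grad():
        output = conv(data)
    return tuple(output.shape)


if __name__ == "__main__":
    print(conv_smoke_test())
```

If this raises the same "invalid device function" error on the GPU but succeeds with `conv_smoke_test("cpu")`, the problem lies in the GPU kernel path rather than in the example script.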

Failure Logs [if any]

Output of AMD_LOG_LEVEL=3 python main.py
AMD_LOG.log
