[torch-xla 2.9RC1] Neuron compiler crashed with "Instruction with id > INT_MAX: 21474836481 this is not intended behavior and might indicate a bug in the HLO proto serialization." #9685

@jeffhataws

🐛 Bug

The torch-xla 2.9 RC1 change 11590c1 includes the OpenXLA change 5bba026, which widens the HLO instruction ID from 32 bits to 64 bits. This causes the Neuron compiler to crash with the following error:

I0000 00:00:1760638353.839317  163922 hlo_instruction.cc:1355] Instruction with id > INT_MAX: 21474836481 this is not intended behavior and might indicate a bug in the HLO proto serialization.
E0000 00:00:1760638353.878123  163922 status_macros.cc:57] INTERNAL: RET_CHECK failure (external/xla/xla/hlo/ir/hlo_instruction.cc:353) absl::c_all_of(proto.operand_ids(), [&](int64_t id) { return instruction_map.contains(id); }) transpose.3 instruction contains invalid operand id(s)
*** Begin stack trace ***
        xla::PjRtCApiClient::CompileAndLoad(xla::XlaComputation const&, xla::CompileOptions)
        torch_xla::runtime::PjRtComputationClient::Compile(std::vector<torch_xla::runtime::ComputationClient::CompileInstance, std::allocator<torch_xla::runtime::ComputationClient::CompileInstance> >)
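
As a quick sanity check on the ID reported in the log, the sketch below splits it into its upper and lower 32-bit halves. The helper name `split_id` is hypothetical and used only for illustration. The arithmetic shows that the value exceeds the 32-bit signed maximum and has a non-zero upper half. Reading the upper half as a separate field (for example a computation ID) is an assumption about how the 64-bit ID is packed, which the log itself does not confirm.

```python
# Largest 32-bit signed integer: the INT_MAX the log message compares against.
INT32_MAX = 2**31 - 1


def split_id(instruction_id):
    """Split a 64-bit ID into (upper 32 bits, lower 32 bits).

    Hypothetical helper for illustration; not part of torch-xla or XLA.
    """
    return instruction_id >> 32, instruction_id & 0xFFFFFFFF


# Value copied from the "Instruction with id > INT_MAX" log line.
reported_id = 21474836481

high, low = split_id(reported_id)
print(reported_id > INT32_MAX)  # True
print(high, low)                # 5 1
```

A consumer that still stores instruction IDs in 32 bits would not be able to represent this value, which is consistent with the "invalid operand id(s)" check failing afterwards.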

It's currently not possible to fix this in time for PyTorch 2.9 support, so we are proposing to revert to a state either before torch-xla change 11590c1 or before OpenXLA change 5bba026.

To Reproduce

Reproducible with the Neuron SDK alpha, using any test that goes through PJRT.

Expected behavior

Environment

  • Reproducible on XLA backend [CPU/TPU]: Neuron
  • torch_xla version: 2.9

Additional context
