🐛 Bug
The torch-xla RC1 change 11590c1 includes OpenXLA change 5bba026, which widens the instruction ID from 32-bit to 64-bit. This causes the Neuron compiler to crash with the error:
I0000 00:00:1760638353.839317 163922 hlo_instruction.cc:1355] Instruction with id > INT_MAX: 21474836481 this is not intended behavior and might indicate a bug in the HLO proto serialization.
E0000 00:00:1760638353.878123 163922 status_macros.cc:57] INTERNAL: RET_CHECK failure (external/xla/xla/hlo/ir/hlo_instruction.cc:353) absl::c_all_of(proto.operand_ids(), [&](int64_t id) { return instruction_map.contains(id); }) transpose.3 instruction contains invalid operand id(s)
*** Begin stack trace ***
xla::PjRtCApiClient::CompileAndLoad(xla::XlaComputation const&, xla::CompileOptions)
torch_xla::runtime::PjRtComputationClient::Compile(std::vector<torch_xla::runtime::ComputationClient::CompileInstance, std::allocator<torch_xla::runtime::ComputationClient::CompileInstance> >)
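As a quick sanity check on the log above, the reported ID 21474836481 is larger than INT_MAX and splits into a high 32-bit half of 5 and a low 32-bit half of 1. That is consistent with the 64-bit ID carrying extra information in its upper bits, which a consumer still expecting 32-bit IDs would not handle. What the upper half actually encodes is not stated in this report, and the helper name `split_id` is hypothetical, used only for illustration:

```python
from typing import Tuple

INT32_MAX = 2**31 - 1  # the INT_MAX referenced by the log message


def split_id(instr_id: int) -> Tuple[int, int]:
    """Split a 64-bit ID into its (high 32 bits, low 32 bits) halves.

    Hypothetical helper for illustration; not part of torch-xla or XLA.
    """
    return instr_id >> 32, instr_id & 0xFFFFFFFF


logged_id = 21474836481  # value taken from the log line above
high, low = split_id(logged_id)

print(f"exceeds INT_MAX: {logged_id > INT32_MAX}")
print(f"high 32 bits: {high}, low 32 bits: {low}")
```

Truncating this ID to 32 bits would leave only the low half, so distinct 64-bit IDs could collide or fail to resolve, which matches the "invalid operand id(s)" check failing.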
It's currently not possible to fix this in time for PyTorch 2.9 support, so we propose reverting to a state before either torch-xla change 11590c1 or OpenXLA change 5bba026.
To Reproduce
Reproducible with the Neuron SDK alpha by running any test that uses PJRT.
Expected behavior
Environment
- Reproducible on XLA backend [CPU/TPU]: Neuron
- torch_xla version: 2.9
Additional context