@deven-amd

No description provided.

@whchung
Collaborator

whchung commented May 15, 2018

@deven-amd thanks for this PR. This PR only contains a patch to Eigen. Could you help review whether other parts of TF need to be amended? In particular, the MIOpen kernel call sites for fp16.

Collaborator

@whchung whchung left a comment

LGTM

@whchung whchung merged commit 965c75f into develop-upstream May 15, 2018
@deven-amd
Author

Yes, expect another PR for the TF side of the fp16 changes sometime later today.

iotamudelta pushed a commit to iotamudelta/tensorflow-upstream that referenced this pull request Jun 12, 2018
deven-amd pushed a commit that referenced this pull request Sep 18, 2018
parallelo pushed a commit that referenced this pull request Oct 26, 2018
deven-amd pushed a commit that referenced this pull request Dec 24, 2018
deven-amd pushed a commit that referenced this pull request Dec 31, 2018
Updated download script and instructions to contain modified files away from regular sources
parallelo pushed a commit that referenced this pull request Feb 27, 2019
When num_squeeze_dims is zero the data pointer should be nullptr.

This addresses the following error when running NNAPIDelegate.SqueezeSimpleTest:
ANeuralNetworksModel_setOperandValue setting operand 1 which has operand type that is not fully specified

PiperOrigin-RevId: 235772184
jerryyin pushed a commit that referenced this pull request Apr 11, 2019
whchung pushed a commit that referenced this pull request Apr 12, 2019
jerryyin pushed a commit that referenced this pull request Apr 16, 2019
Fixing small typo in comment
jeffdaily pushed a commit that referenced this pull request Jun 3, 2019
deven-amd pushed a commit that referenced this pull request Jun 28, 2019
whchung pushed a commit that referenced this pull request Aug 5, 2019
deven-amd pushed a commit that referenced this pull request Aug 29, 2019
Fix accidentally deleted #if line
deven-amd pushed a commit that referenced this pull request Nov 11, 2019
deven-amd pushed a commit that referenced this pull request Nov 26, 2019
deven-amd pushed a commit that referenced this pull request Jan 3, 2020
Added code examples tf.keras.backend.gather
deven-amd pushed a commit that referenced this pull request Jan 13, 2020
deven-amd pushed a commit that referenced this pull request Feb 11, 2020
* Add missing stdint.h include in KissFFT

* Transform header files in ESP examples sources

* Copy sdkconfig.defaults in ESP examples
deven-amd added a commit that referenced this pull request Nov 16, 2021
…uite

Related JIRA ticket - https://ontrack-internal.amd.com/browse/SWDEV-310531

The following unit-test (when run as part of the CPU testsuite) fails on some CI nodes

```
//tensorflow/c/eager:c_api_distributed_test                              FAILED in 10.8s

...
...
[----------] 6 tests from CAPI (1396 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (1396 ms total)
[  PASSED  ] 6 tests.

  YOU HAVE 1 DISABLED TEST

*** Received signal 11 ***
*** BEGIN MANGLED STACK TRACE ***
================================================================================

```

The failure seems to be server/node dependent, and I was able to consistently reproduce it on the `zt-dh170-07` node. When I load the core file from the crash in gdb, I see the following stack trace (running the test case under gdb makes it pass, hence the need to backtrace via the core file):

```
(gdb) where
#0  0x00007f9448612000 in ?? ()
#1  0x00007f9458a37934 in dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*, float*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#2  0x00007f9458a38531 in dnnl::impl::cpu::x64::jit_avx_gemm_f32(int, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#3  0x00007f9458b491bd in dnnl_status_t dnnl::impl::cpu::x64::gemm_driver<float, float, float>(char const*, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, float const*, long const*, float const*, float const*, float*, long const*, float const*, bool, dnnl::impl::cpu::x64::pack_type, dnnl::impl::cpu::x64::gemm_pack_storage_t*, bool) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#4  0x00007f945838ab46 in dnnl::impl::cpu::extended_sgemm(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*, bool) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#5  0x00007f9458024ad3 in dnnl_sgemm ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#6  0x000055f42a5df14b in Eigen::internal::TensorContractionKernel<float, float, float, long, Eigen::internal::blas_data_mapper<float, long, 0, 0, 1>, Eigen::internal::TensorContractionInputMapper<float, long, 1, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer>, Eigen::internal::TensorContractionInputMapper<float, long, 0, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer> >::invoke(Eigen::internal::blas_data_mapper<float, long, 0, 0, 1> const&, Eigen::internal::ColMajorBlock<float, long> const&, Eigen::internal::ColMajorBlock<float, long> const&, long, long, long, float, float) ()
#7  0x000055f42a5e52b6 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::kernel(long, long, long, bool) ()
#8  0x00007f9467160491 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest___Utensorflow/libtensorflow_framework.so.2
#9  0x00007f946715db83 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest___Utensorflow/libtensorflow_framework.so.2
#10 0x00007f946713fdf7 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest___Utensorflow/libtensorflow_framework.so.2
#11 0x00007f9454d536db in start_thread (arg=0x7f8fd37fe700) at pthread_create.c:463
#12 0x00007f945445671f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
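The core-file workflow described above can be sketched roughly as follows. This is a minimal sketch, not the author's actual procedure: the helper name `core_backtrace`, the binary path, and the core file name are all assumptions (the core file's name and location depend on the node's `core_pattern` setting).

```shell
# Hypothetical helper: print the backtrace stored in a core file.
#   $1 = path to the crashing test binary
#   $2 = path to the core file written by the crash
core_backtrace() {
  bin="$1"
  core="$2"
  if [ ! -f "$core" ]; then
    echo "no core file at $core" >&2
    return 1
  fi
  # -batch exits after running the commands given with -ex;
  # "where" prints the stack of the thread that received the signal.
  gdb -batch -ex "where" "$bin" "$core"
}

# Example session (paths are assumptions, adjust to your setup):
#   ulimit -c unlimited    # allow core files in this shell
#   bazel-bin/tensorflow/c/eager/c_api_distributed_test
#   core_backtrace bazel-bin/tensorflow/c/eager/c_api_distributed_test ./core
```

Because the test passes when started under gdb, the binary is run on its own first and gdb is only attached to the resulting core file afterwards.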

The crash seems to originate from MKL DNN code. I am guessing there is some MKL DNN implementation bug that manifests only on some of our CI nodes. Running the test with MKL DNN disabled for contractions (`--define=tensorflow_mkldnn_contraction_kernel=0`) makes this test pass.

Root-causing the point of failure within the MKL DNN implementation and fixing it is outside the scope of our work. Given that, there are two ways to work around this issue:
 * remove this test from test-list
 * run the CPU testsuite with the build option `--define=tensorflow_mkldnn_contraction_kernel=0`

For the time being, we will simply remove this test from the test list, as this bug only seems to affect one test. If more tests start failing due to this same issue, it may warrant running the CPU testsuite with `--define=tensorflow_mkldnn_contraction_kernel=0`.
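As an illustration, the two workarounds listed above might translate into bazel command lines like the ones this sketch prints. The function name `workaround_cmd` and the target pattern `//tensorflow/c/eager/...` are assumptions made for the example; the real CPU testsuite covers far more targets, and how the CI scripts maintain their test list is not shown here. The sketch only prints the commands and does not run bazel.

```shell
# Assumed target pattern, for illustration only.
TARGETS="//tensorflow/c/eager/..."
# The test named in the commit message above.
FLAKY_TEST="//tensorflow/c/eager:c_api_distributed_test"

# Hypothetical helper: print the bazel command for one of the two workarounds.
workaround_cmd() {
  case "$1" in
    skip-test)
      # A target prefixed with "-" after "--" is excluded from the run.
      printf 'bazel test -- %s -%s\n' "$TARGETS" "$FLAKY_TEST"
      ;;
    disable-kernel)
      # Build without the MKL DNN based Eigen contraction kernels.
      printf 'bazel test --define=tensorflow_mkldnn_contraction_kernel=0 -- %s\n' "$TARGETS"
      ;;
    *)
      echo "usage: workaround_cmd skip-test|disable-kernel" >&2
      return 1
      ;;
  esac
}

workaround_cmd skip-test
workaround_cmd disable-kernel
```

Excluding the single target keeps the MKL DNN contraction kernels active for every other test, which matches the choice made here while only one test is affected.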
deven-amd added a commit that referenced this pull request Dec 6, 2021
On some CI nodes (typically those with higher CPU core counts, 128/256), the `//tensorflow/c/eager:c_api_distributed_test_gpu` test fails intermittently.

When it does fail, the failure manifests as a segfault at the end of the test, with the stack dump shown at the end of this commit message. The stack dump points to a routine within the MKLDNN implementation. This is further confirmed by the observation that disabling the MKLDNN based Eigen contraction kernels (for ROCm) seems to make the crash go away.

Related JIRA ticket - https://ontrack-internal.amd.com/browse/SWDEV-313684

A previous commit disabled the `//tensorflow/c/eager:c_api_distributed_test` unit-test only in the CPU unit-tests CI job (for the same reason). That commit cannot be reverted, because this commit disables MKLDNN based Eigen contraction kernels *only* for the ROCm build.

```
Thread 191 "c_api_distribut" received signal SIGSEGV, Segmentation fault.
[Switching to thread 191 (Thread 0x7ffc777fe700 (LWP 159004))]
0x00007fff54530000 in ?? ()
(gdb) where
#0  0x00007fff54530000 in ?? ()
#1  0x00007fffd5d15ae4 in dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*, float*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#2  0x00007fffd5d166e1 in dnnl::impl::cpu::x64::jit_avx_gemm_f32(int, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#3  0x00007fffd5e277ed in dnnl_status_t dnnl::impl::cpu::x64::gemm_driver<float, float, float>(char const*, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, float const*, long const*, float const*, float const*, float*, long const*, float const*, bool, dnnl::impl::cpu::x64::pack_type, dnnl::impl::cpu::x64::gemm_pack_storage_t*, bool) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#4  0x00007fffd5665056 in dnnl::impl::cpu::extended_sgemm(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*, bool) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#5  0x00007fffd52fe983 in dnnl_sgemm ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#6  0x0000555557187b0b in Eigen::internal::TensorContractionKernel<float, float, float, long, Eigen::internal::blas_data_mapper<float, long, 0, 0, 1>, Eigen::internal::TensorContractionInputMapper<float, long, 1, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer>, Eigen::internal::TensorContractionInputMapper<float, long, 0, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer> >::invoke(Eigen::internal::blas_data_mapper<float, long, 0, 0, 1> const&, Eigen::internal::ColMajorBlock<float, long> const&, Eigen::internal::ColMajorBlock<float, long> const&, long, long, long, float, float) ()
#7  0x000055555718dc76 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::kernel(long, long, long, bool) ()
#8  0x000055555718f327 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::signal_kernel(long, long, long, bool, bool) ()
#9  0x00005555571904cb in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::pack_rhs(long, long) ()
#10 0x000055555718fd69 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::enqueue_packing_helper(long, long, long, bool) ()
#11 0x00007ffff6b607a1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#12 0x00007ffff6b5de93 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#13 0x00007ffff6b40107 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#14 0x00007fffd1ca86db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fffd00b471f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```
deven-amd added a commit that referenced this pull request Dec 7, 2021
deven-amd added a commit that referenced this pull request Dec 21, 2021
On some CI nodes (typically those with higher CPU core counts, 128/256), the `//tensorflow/c/eager:c_api_distributed_test_gpu` test fails intermittently.

When it does fail, the failure manifests as a segfault at the end of the test, with the stack dump shown at the end of this commit message. The stack dump points to a routine within the MKLDNN implementation. This is further confirmed by the observation that disabling the MKLDNN based Eigen contraction kernels (for ROCm) seems to make the crash go away.

Related JIRA ticket - https://ontrack-internal.amd.com/browse/SWDEV-313684

A previous commit disabled the `//tensorflow/c/eager:c_api_distributed_test` unit-test only in the CPU unit-tests CI job (for the same reason). That commit cannot be reverted, because this commit disables MKLDNN based Eigen contraction kernels *only* for the ROCm build.

```
Thread 191 "c_api_distribut" received signal SIGSEGV, Segmentation fault.
[Switching to thread 191 (Thread 0x7ffc777fe700 (LWP 159004))]
0x00007fff54530000 in ?? ()
(gdb) where
#0  0x00007fff54530000 in ?? ()
#1  0x00007fffd5d15ae4 in dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*, float*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#2  0x00007fffd5d166e1 in dnnl::impl::cpu::x64::jit_avx_gemm_f32(int, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#3  0x00007fffd5e277ed in dnnl_status_t dnnl::impl::cpu::x64::gemm_driver<float, float, float>(char const*, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, float const*, long const*, float const*, float const*, float*, long const*, float const*, bool, dnnl::impl::cpu::x64::pack_type, dnnl::impl::cpu::x64::gemm_pack_storage_t*, bool) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#4  0x00007fffd5665056 in dnnl::impl::cpu::extended_sgemm(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*, bool) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#5  0x00007fffd52fe983 in dnnl_sgemm ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#6  0x0000555557187b0b in Eigen::internal::TensorContractionKernel<float, float, float, long, Eigen::internal::blas_data_mapper<float, long, 0, 0, 1>, Eigen::internal::TensorContractionInputMapper<float, long, 1, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer>, Eigen::internal::TensorContractionInputMapper<float, long, 0, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer> >::invoke(Eigen::internal::blas_data_mapper<float, long, 0, 0, 1> const&, Eigen::internal::ColMajorBlock<float, long> const&, Eigen::internal::ColMajorBlock<float, long> const&, long, long, long, float, float) ()
#7  0x000055555718dc76 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::kernel(long, long, long, bool) ()
#8  0x000055555718f327 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::signal_kernel(long, long, long, bool, bool) ()
#9  0x00005555571904cb in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::pack_rhs(long, long) ()
#10 0x000055555718fd69 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::enqueue_packing_helper(long, long, long, bool) ()
#11 0x00007ffff6b607a1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#12 0x00007ffff6b5de93 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#13 0x00007ffff6b40107 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#14 0x00007fffd1ca86db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fffd00b471f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```
deven-amd added a commit that referenced this pull request Jan 31, 2022
On some CI nodes (typically those with higher CPU core counts, e.g. 128 or 256), the `//tensorflow/c/eager:c_api_distributed_test_gpu` test fails intermittently.

When it does fail, the failure manifests as a segfault at the end of the test, with the stack dump shown at the end of this commit message. The stack dump points to a routine within the MKLDNN implementation. This is further supported by the observation that disabling the MKLDNN-based Eigen contraction kernels (for ROCm) seems to make the crash go away.

related JIRA ticket - https://ontrack-internal.amd.com/browse/SWDEV-313684

A previous commit disabled the `//tensorflow/c/eager:c_api_distributed_test` unit test only in the CPU unit-tests CI job (for the same reason). That commit cannot be reverted, because this commit disables the MKLDNN-based Eigen contraction kernels *only* for the ROCm build.

```
Thread 191 "c_api_distribut" received signal SIGSEGV, Segmentation fault.
[Switching to thread 191 (Thread 0x7ffc777fe700 (LWP 159004))]
0x00007fff54530000 in ?? ()
(gdb) where
#0  0x00007fff54530000 in ?? ()
#1  0x00007fffd5d15ae4 in dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*, float*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#2  0x00007fffd5d166e1 in dnnl::impl::cpu::x64::jit_avx_gemm_f32(int, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#3  0x00007fffd5e277ed in dnnl_status_t dnnl::impl::cpu::x64::gemm_driver<float, float, float>(char const*, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, float const*, long const*, float const*, float const*, float*, long const*, float const*, bool, dnnl::impl::cpu::x64::pack_type, dnnl::impl::cpu::x64::gemm_pack_storage_t*, bool) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#4  0x00007fffd5665056 in dnnl::impl::cpu::extended_sgemm(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*, bool) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#5  0x00007fffd52fe983 in dnnl_sgemm ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#6  0x0000555557187b0b in Eigen::internal::TensorContractionKernel<float, float, float, long, Eigen::internal::blas_data_mapper<float, long, 0, 0, 1>, Eigen::internal::TensorContractionInputMapper<float, long, 1, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer>, Eigen::internal::TensorContractionInputMapper<float, long, 0, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer> >::invoke(Eigen::internal::blas_data_mapper<float, long, 0, 0, 1> const&, Eigen::internal::ColMajorBlock<float, long> const&, Eigen::internal::ColMajorBlock<float, long> const&, long, long, long, float, float) ()
#7  0x000055555718dc76 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::kernel(long, long, long, bool) ()
#8  0x000055555718f327 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::signal_kernel(long, long, long, bool, bool) ()
#9  0x00005555571904cb in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::pack_rhs(long, long) ()
#10 0x000055555718fd69 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::enqueue_packing_helper(long, long, long, bool) ()
#11 0x00007ffff6b607a1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#12 0x00007ffff6b5de93 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#13 0x00007ffff6b40107 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) ()
   from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#14 0x00007fffd1ca86db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fffd00b471f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```
weihanmines pushed a commit that referenced this pull request Feb 7, 2022
Today this pattern assumes:

  1. Input has rank 1
  2. The reshape reshapes it to a tensor of shape Nx1.

Neither of them always holds. For example, we could come up with something like:

  Equal(Reshape(X: tensor<4x4xi32>, [2,8,1]), [0,1,2]): tensor<2x8x3xi1>

tfl.one_hot() requires output_dims = indices_dim + 1 (see link [1]), so in this
case, we should produce something like:

  OneHot(Reshape(X: tensor<4x4xi32>, [2,8]), ...): tensor<2x8x3xi1>

But for the sake of being more conservative, for now we will only apply this
rewrite pattern when the reshape returns a tensor of rank 2; but we do remove
the original assumption #1 (X has rank 1) in this CL.

[1]: https://github.com/tensorflow/tensorflow/blob/8c96acb5b70a5e351931a4075cb0c40f7f04a832/tensorflow/lite/kernels/one_hot.cc#L49
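
The equivalence behind this rewrite can be checked with a small pure-Python sketch. This is only an illustration on nested lists, not the MLIR pattern itself; the helper names (`reshape`, `equal_last_dim`, `one_hot`) are hypothetical, and it assumes the comparison constant is exactly `[0, 1, ..., depth-1]`.

```python
def flatten(t):
    """Flatten an arbitrarily nested list into a flat list of scalars."""
    if not isinstance(t, list):
        return [t]
    return [v for sub in t for v in flatten(sub)]


def reshape(t, shape):
    """Row-major reshape of a nested list to `shape` (hypothetical helper)."""
    def build(values, dims):
        if not dims:
            return values[0]
        step = len(values) // dims[0]
        return [build(values[i * step:(i + 1) * step], dims[1:])
                for i in range(dims[0])]
    return build(flatten(t), list(shape))


def equal_last_dim(t, classes):
    """Equal(t, classes): broadcast t's trailing size-1 dim against `classes`."""
    if isinstance(t[0], list):
        return [equal_last_dim(sub, classes) for sub in t]
    (value,) = t  # the trailing dimension must have size 1
    return [value == c for c in classes]


def one_hot(indices, depth):
    """Boolean one-hot along a new last axis; out-of-range indices give all False."""
    if isinstance(indices, list):
        return [one_hot(sub, depth) for sub in indices]
    return [indices == c for c in range(depth)]


# X: a 4x4 integer tensor with values in 0..4 (3 and 4 are out of range for depth 3).
x = [[(4 * r + c) % 5 for c in range(4)] for r in range(4)]

# Original form: Equal(Reshape(X, [2, 8, 1]), [0, 1, 2]) -> shape 2x8x3
lhs = equal_last_dim(reshape(x, [2, 8, 1]), [0, 1, 2])

# Rewritten form: OneHot(Reshape(X, [2, 8]), depth=3) -> shape 2x8x3
rhs = one_hot(reshape(x, [2, 8]), 3)

assert lhs == rhs
```

Note that the one-hot input drops the trailing size-1 dimension (`[2, 8]` instead of `[2, 8, 1]`), which is what the `output_dims = indices_dim + 1` requirement demands.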

PiperOrigin-RevId: 426188277
Change-Id: I180490c30347caeabe6a3b14c138f577cc2a556c
ekuznetsov139 pushed a commit that referenced this pull request May 31, 2022
Added support to tf-mlir-translate and its Python wrapper for handling data types with subtypes
i-chaochen pushed a commit that referenced this pull request Dec 24, 2022
Error message:

```
[ RUN      ] Conv2DTransposeTest.testGradient
2022-12-22 10:21:44.842744: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:1041] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2022-12-22 10:21:44.843106: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:1041] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2022-12-22 10:21:44.843394: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:1041] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2022-12-22 10:21:44.843738: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:1041] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2022-12-22 10:21:44.844362: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:1041] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2022-12-22 10:21:44.844628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1614] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15366 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test_gpu.runfiles/org_tensorflow/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test.py:176: compute_gradient_error (from tensorflow.python.ops.gradient_checker) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.test.compute_gradient in 2.0, which has better support for functions. Note that the two versions have different usage, so code change is needed.
W1222 10:21:44.849361 140024911816512 deprecation.py:364] From /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test_gpu.runfiles/org_tensorflow/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test.py:176: compute_gradient_error (from tensorflow.python.ops.gradient_checker) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.test.compute_gradient in 2.0, which has better support for functions. Note that the two versions have different usage, so code change is needed.
WARNING:tensorflow:From /bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py:390: compute_gradient (from tensorflow.python.ops.gradient_checker) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.test.compute_gradient in 2.0, which has better support for functions. Note that the two versions have different usage, so code change is needed.
W1222 10:21:44.849497 140024911816512 deprecation.py:364] From /bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py:390: compute_gradient (from tensorflow.python.ops.gradient_checker) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.test.compute_gradient in 2.0, which has better support for functions. Note that the two versions have different usage, so code change is needed.
2022-12-22 10:21:44.858697: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:331] MLIR V1 optimization pass is not enabled
Fatal Python error: Segmentation fault

Thread 0x00007f5a1720e740 (most recent call first):
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1454 in _call_tf_sessionrun
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1361 in _run_fn
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1378 in _do_call
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1371 in _do_run
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1191 in _run
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 968 in run
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 2054 in run
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 112 in _compute_theoretical_jacobian
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 235 in _compute_gradient
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 261 in <listcomp>
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 261 in _compute_gradient_list
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 322 in compute_gradient
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py", line 371 in new_func
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 390 in compute_gradient_error
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py", line 371 in new_func
  File "/root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test_gpu.runfiles/org_tensorflow/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test.py", line 176 in testGradient
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 1624 in decorated
  File "/usr/lib/python3.8/unittest/case.py", line 633 in _callTestMethod
  File "/usr/lib/python3.8/unittest/case.py", line 676 in run
  File "/usr/lib/python3.8/unittest/case.py", line 736 in __call__
  File "/usr/lib/python3.8/unittest/suite.py", line 122 in run
  File "/usr/lib/python3.8/unittest/suite.py", line 84 in __call__
  File "/usr/lib/python3.8/unittest/suite.py", line 122 in run
  File "/usr/lib/python3.8/unittest/suite.py", line 84 in __call__
  File "/usr/lib/python3.8/unittest/runner.py", line 176 in run
  File "/usr/lib/python3.8/unittest/main.py", line 271 in runTests
  File "/usr/lib/python3.8/unittest/main.py", line 101 in __init__
  File "/bazel_pip/lib/python3.8/site-packages/absl/testing/absltest.py", line 2524 in _run_and_get_tests_result
  File "/bazel_pip/lib/python3.8/site-packages/absl/testing/absltest.py", line 2558 in run_tests
  File "/bazel_pip/lib/python3.8/site-packages/absl/testing/absltest.py", line 2152 in _run_in_app
  File "/bazel_pip/lib/python3.8/site-packages/absl/testing/absltest.py", line 2057 in main
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/platform/googletest.py", line 51 in g_main
  File "/bazel_pip/lib/python3.8/site-packages/absl/app.py", line 254 in _run_main
  File "/bazel_pip/lib/python3.8/site-packages/absl/app.py", line 308 in run
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/platform/googletest.py", line 60 in main_wrapper
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/platform/benchmark.py", line 486 in benchmarks_main
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/platform/googletest.py", line 62 in main
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/platform/test.py", line 56 in main
  File "/root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test_gpu.runfiles/org_tensorflow/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test.py", line 334 in <module>
*** Received signal 11 ***
*** BEGIN MANGLED STACK TRACE ***
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x17e5780)[0x7f599157e780]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f5a1741d090]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f5a1741d00b]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f5a1741d090]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(+0x3986ec5)[0x7f5691b7cec5]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(+0x395d0e9)[0x7f5691b530e9]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(+0x19ce755)[0x7f568fbc4755]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(_ZN5cudnn3cnn31ConvolutionForwardGroupedDirect21execute_internal_implERKNS_7backend11VariantPackEP11CUstream_st+0x2f6)[0x7f568fb6b536]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(_ZN5cudnn3cnn15EngineInterface7executeERKNS_7backend11VariantPackEP11CUstream_st+0xd5)[0x7f568f4440a5]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(_ZN5cudnn7backend7executeEP12cudnnContextRNS0_13ExecutionPlanERNS0_11VariantPackE+0x13cc)[0x7f568f4561fc]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(cudnnBackendExecute+0x111)[0x7f568f4565b1]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(cudnnBackendExecute+0x40)[0x7f5991abacf0]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZNK15stream_executor3gpu24CudnnExecutionPlanRunnerIFvNS_16DeviceMemoryBaseES2_S2_EEclEPNS_6StreamEPNS_3dnn13ProfileResultES2_S2_S2_S2_+0x690)[0x7f5991a93930]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZN10tensorflow8internal16AutotuneConvImplIZNS_19AutotuneUnfusedConvIfEEN3tsl8StatusOrINS_13AutotuneEntryIN15stream_executor3dnn6ConvOpEEEEEbPNS_11AutotuneMapINS_14ConvParametersES9_NS0_17AutotuneMapHasherISC_EEEERKSC_PNS_15OpKernelContextENS7_15ConvolutionKindERKNS7_15BatchDescriptorENS6_12DeviceMemoryIT_EERKNS7_16FilterDescriptorESR_RKNS7_21ConvolutionDescriptorESO_SR_lEUlPNS6_16ScratchAllocatorERKSt10unique_ptrIKNS7_8OpRunnerIFvNS6_16DeviceMemoryBaseES12_S12_EEESt14default_deleteIS15_EEPNS7_13ProfileResultEE_S13_EENS4_ISt6vectorINS_14AutotuneResultESaIS1F_EEEESK_RS1E_IS10_IKNS11_IT0_EES16_IS1L_EESaIS1N_EEbRKSQ_mRKNS6_16RedzoneAllocatorE+0x3d4)[0x7f599c1c02d4]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZN10tensorflow19AutotuneUnfusedConvIfEEN3tsl8StatusOrINS_13AutotuneEntryIN15stream_executor3dnn6ConvOpEEEEEbPNS_11AutotuneMapINS_14ConvParametersES7_NS_8internal17AutotuneMapHasherISA_EEEERKSA_PNS_15OpKernelContextENS5_15ConvolutionKindERKNS5_15BatchDescriptorENS4_12DeviceMemoryIT_EERKNS5_16FilterDescriptorESQ_RKNS5_21ConvolutionDescriptorESN_SQ_l+0x38d)[0x7f599c1c70bd]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZN10tensorflow18LaunchConv2DOpImplIfEEvPNS_15OpKernelContextEbbRKNS_6TensorES5_iiiiRKNS_7PaddingERKSt6vectorIlSaIlEEPS3_NS_12TensorFormatE+0x1095)[0x7f599be457e5]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZN10tensorflow8Conv2DOpIN5Eigen9GpuDeviceEfE7ComputeEPNS_15OpKernelContextE+0x1d9)[0x7f599be46ce9]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x329)[0x7f5991523259]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x16e450b)[0x7f599147d50b]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x16e6868)[0x7f599147f868]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZN5Eigen15ThreadPoolTemplIN3tsl6thread16EigenEnvironmentEE10WorkerLoopEi+0x2a5)[0x7f59a72af595]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZNSt17_Function_handlerIFvvEZN3tsl6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x43)[0x7f59a72ad2c3]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x157078b)[0x7f599130978b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f5a173bf609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f5a174f9133]
*** END MANGLED STACK TRACE ***

*** Begin stack trace ***
	tsl::CurrentStackTrace[abi:cxx11]()

	gsignal

	cudnn::cnn::ConvolutionForwardGroupedDirect::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*)
	cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*)
	cudnn::backend::execute(cudnnContext*, cudnn::backend::ExecutionPlan&, cudnn::backend::VariantPack&)
	cudnnBackendExecute
	cudnnBackendExecute
	stream_executor::gpu::CudnnExecutionPlanRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)>::operator()(stream_executor::Stream*, stream_executor::dnn::ProfileResult*, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase) const
	tsl::StatusOr<std::vector<tensorflow::AutotuneResult, std::allocator<std::vector> > > tensorflow::internal::AutotuneConvImpl<tensorflow::AutotuneUnfusedConv<float>(bool, tensorflow::AutotuneMap<tensorflow::ConvParameters, tensorflow::AutotuneEntry<stream_executor::dnn::ConvOp>, tensorflow::internal::AutotuneMapHasher<tensorflow::ConvParameters> >*, tensorflow::ConvParameters const&, tensorflow::OpKernelContext*, stream_executor::dnn::ConvolutionKind, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, long)::{lambda(stream_executor::ScratchAllocator*, std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const> > const&, stream_executor::dnn::ProfileResult*)#1}, void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)>(tensorflow::OpKernelContext*, tensorflow::AutotuneUnfusedConv<float>(bool, tensorflow::AutotuneMap<tensorflow::ConvParameters, tensorflow::AutotuneEntry<stream_executor::dnn::ConvOp>, tensorflow::internal::AutotuneMapHasher<tensorflow::ConvParameters> >*, tensorflow::ConvParameters const&, tensorflow::OpKernelContext*, stream_executor::dnn::ConvolutionKind, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, 
long)::{lambda(stream_executor::ScratchAllocator*, std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const> > const&, stream_executor::dnn::ProfileResult*)#1}<std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> > >, std::allocator<std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> > > >&, bool, tensorflow::AutotuneUnfusedConv<float>(bool, tensorflow::AutotuneMap<tensorflow::ConvParameters, tensorflow::AutotuneEntry<stream_executor::dnn::ConvOp>, tensorflow::internal::AutotuneMapHasher<tensorflow::ConvParameters> >*, tensorflow::ConvParameters const&, tensorflow::OpKernelContext*, stream_executor::dnn::ConvolutionKind, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, long)::{lambda(stream_executor::ScratchAllocator*, std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const> > const&, 
stream_executor::dnn::ProfileResult*)#1} const&, unsigned long, stream_executor::RedzoneAllocator const&)
	tsl::StatusOr<tensorflow::AutotuneEntry<stream_executor::dnn::ConvOp> > tensorflow::AutotuneUnfusedConv<float>(bool, tensorflow::AutotuneMap<tensorflow::ConvParameters, tensorflow::AutotuneEntry<stream_executor::dnn::ConvOp>, tensorflow::internal::AutotuneMapHasher<tensorflow::ConvParameters> >*, tensorflow::ConvParameters const&, tensorflow::OpKernelContext*, stream_executor::dnn::ConvolutionKind, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, long)
	void tensorflow::LaunchConv2DOpImpl<float>(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, std::vector<long, std::allocator<long> > const&, tensorflow::Tensor*, tensorflow::TensorFormat)
	tensorflow::Conv2DOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*)
	tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*)

	Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int)
	std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)

	clone
*** End stack trace ***
```

PiperOrigin-RevId: 497243188
rahulbatra85 pushed a commit that referenced this pull request Jan 24, 2023
This is mostly a 1:1 restructuring with the following changes:
1) Added simple snapshot recovery from on-disk state.
2) Removed all members tracking snapshot, stream, and source completion.  I think these may have been structured incorrectly, and either way they weren't tested or used.  I'll reevaluate when stream completion is implemented.
3) Removed some validations that weren't tested and/or were related to #1.  Will add back after addressing #1.
4) Renamed directory -> path.

PiperOrigin-RevId: 502934739
i-chaochen pushed a commit that referenced this pull request Feb 20, 2023
…3b0f6530e4d7

[Snyk] Security upgrade python from 3.9.0-buster to 3.10-buster
jayfurmanek pushed a commit that referenced this pull request Mar 27, 2023
…tual

   number of batch threads we have.  We are creating our own threadpool and
   passing it to ASBS; you'd think that ASBS would then ignore the
   num_batch_threads option, but in fact it still uses it as an indication of the
   max number of in-flight requests that are allowed.  We also need to cap the
   other in-flight parameters to be also no larger than num_batch_threads.

2. Set lower defaults for these ASBS options -- models can always override them
   but I think the current defaults are just too high.

The current settings are:
- actual threadpool size: 64 (or 4 in the second experiment)
- max in-flight batches: 64
- min in-flight batches: 16
- initial in-flight batches: 16

This is a lot more in-flight batches than we were getting with SBS where we had
2-8 batch threads, and cutting the actual threadpool size to 4 failed to adjust
things as we expected because of issue #1.

Proposed defaults are
- max in-flight batches: 64 (no change)
- min in-flight batches: 1
- initial in-flight batches: 2

PiperOrigin-RevId: 519142730
wenchenvincent pushed a commit that referenced this pull request Apr 25, 2023
1. Cap ASBS's num_batch_threads parameter to be no larger than the actual
   number of batch threads we have.  We are creating our own threadpool and
   passing it to ASBS; you'd think that ASBS would then ignore the
   num_batch_threads option, but in fact it still uses it as an indication of the
   max number of in-flight requests that are allowed.  We also need to cap the
   other in-flight parameters to be also no larger than num_batch_threads.

2. Set lower defaults for these ASBS options -- models can always override them
   but I think the current defaults are just too high.

The current settings are:
- actual threadpool size: 64 (or 4 in the second experiment)
- max in-flight batches: 64
- min in-flight batches: 16
- initial in-flight batches: 16

This is a lot more in-flight batches than we were getting with SBS where we had
2-8 batch threads, and cutting the actual threadpool size to 4 failed to adjust
things as we expected because of issue #1.

Proposed defaults are
- max in-flight batches: 64 (no change)
- min in-flight batches: 1
- initial in-flight batches: 2
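
The capping described in item 1 can be sketched as follows. This is only an illustration of the clamping order, not the ASBS code: `InFlightLimits` and `cap_limits` are hypothetical names, and keeping the initial value inside the [min, max] range is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InFlightLimits:
    """Hypothetical container for the in-flight batch options listed above."""
    num_batch_threads: int
    max_in_flight: int
    min_in_flight: int
    initial_in_flight: int


def cap_limits(requested: InFlightLimits, actual_pool_size: int) -> InFlightLimits:
    """Clamp every limit so that none exceeds the real thread pool size."""
    threads = min(requested.num_batch_threads, actual_pool_size)
    max_in_flight = min(requested.max_in_flight, threads)
    min_in_flight = min(requested.min_in_flight, max_in_flight)
    # Assumption: the initial value should stay within [min, max].
    initial = max(min(requested.initial_in_flight, max_in_flight), min_in_flight)
    return InFlightLimits(threads, max_in_flight, min_in_flight, initial)


# Current settings from the second experiment: a pool of 4 threads.
print(cap_limits(InFlightLimits(64, 64, 16, 16), actual_pool_size=4))
```

With the current settings every limit collapses to the pool size of 4, while the proposed defaults (min 1, initial 2) pass through the clamp unchanged.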

PiperOrigin-RevId: 524878260
weihanmines pushed a commit that referenced this pull request Aug 1, 2023
…-example

Updated Install libffi7 package step
i-chaochen pushed a commit that referenced this pull request Aug 29, 2023
NAVI31 was not detected properly due to a typo.
rahulbatra85 pushed a commit that referenced this pull request Nov 9, 2023
Imported from GitHub PR openxla/xla#6599

FP8 cublasLt matmul uses fast accumulation when the precision of both operands is DEFAULT. Otherwise it falls back to high-precision accumulation. Issue#openxla/xla#6168
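
The selection rule can be written as a one-line predicate. This is a minimal sketch, assuming precisions are passed as the strings `DEFAULT`, `HIGH`, and `HIGHEST`; the function name is hypothetical and this is not the code from the PR.

```python
def use_fast_accumulation(lhs_precision: str, rhs_precision: str) -> bool:
    """Fast accumulation only when both operands request DEFAULT precision."""
    return lhs_precision == "DEFAULT" and rhs_precision == "DEFAULT"


print(use_fast_accumulation("DEFAULT", "DEFAULT"))
print(use_fast_accumulation("DEFAULT", "HIGHEST"))
```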

This PR is closely related to Flax PR [#3416](google/flax#3416).
Copybara import of the project:

--
a4140da8ca08cd2d4796a7b8f032827867a361bc by shuw <shuw@nvidia.com>:

Add FP8 fast accumulation support for cublasLt.

--
96845683cc4b1e7b947bc919fbf97d8865abeac9 by shuw <shuw@nvidia.com>:

Improve based on review #1

--
e906d7620780d2cf1fe8433c933648dcb98dc61d by shuw <shuw@nvidia.com>:

Improve based on review #2

Merging this change closes tensorflow#6599

PiperOrigin-RevId: 578948593
zoranjovanovic-ns pushed a commit that referenced this pull request Dec 19, 2023
Imported from GitHub PR openxla/xla#7751

Due to fast accumulation being turned on in the forward mode, the cublasLt fp8 gemm with gelu epilogue can efficiently operate with a fused kernel. Compared against the XLA-generated gelu kernel on H100, the performance demonstrates some improvement for size of [8192, 4096] x [4096, 16384] + gelu:

Execution time for matmul using cublasLt and gelu (XLA): 1.28ms
Execution time for matmul_gelu using cublasLt: 1.25ms
Copybara import of the project:

--
e8abce3b41f68cae1bb625cdecd5885413a0781d by Shu Wang <shuw@nvidia.com>:

Support cublasLt Fp8 Approx Gelu epilogue fusion.

--
818127cf582af7ceba014d88bdf027857fc8f0e5 by shuw <shuw@nvidia.com>:

Remove F32 check

--
5ce3108a9bc8459e20456d23a3ae493ef7a6a387 by shuw <shuw@nvidia.com>:

Improve based on review #1

Merging this change closes tensorflow#7751

PiperOrigin-RevId: 591236441
hsharsha pushed a commit that referenced this pull request Mar 18, 2024
…execution scope

Instead of always constructing while operation conditional in the default scope use the scope of a while operation itself.

This generates correct CUDA graph: https://gist.github.com/ezhulenev/a84192fe8b46a4bf1a934a8baa08ea60

Memset operation launched in scope #1 is not synchronized with the initial condition handle update

PiperOrigin-RevId: 609742672
i-chaochen pushed a commit that referenced this pull request May 13, 2024
i-chaochen pushed a commit that referenced this pull request May 20, 2024
…uild phase to Initialize()

Imported from GitHub PR openxla/xla#12228

The first time that a NormThunk is executed, it will build a cudnn execution plan. This build step can hang if a NCCL collective is running at the same time. To fix this, I've moved the build step to take place during thunk initialization. We only observe this hang when using cudnn 9.
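
The general pattern, building the expensive plan during initialization rather than on first execution, might look like the sketch below. `NormThunkSketch` and the `build_plan` callable are hypothetical stand-ins for illustration; the real change lives in XLA's C++ thunk code.

```python
import threading


class NormThunkSketch:
    """Illustrative thunk that builds its plan eagerly in initialize()."""

    def __init__(self, build_plan):
        self._build_plan = build_plan  # hypothetical stand-in for the cuDNN plan build
        self._plan = None
        self._lock = threading.Lock()

    def initialize(self):
        # Runs before any execution starts, so the build cannot overlap
        # with work that is already in flight on the device.
        with self._lock:
            if self._plan is None:
                self._plan = self._build_plan()

    def execute_on_stream(self, x):
        # Execution only uses the prebuilt plan and never triggers a build.
        if self._plan is None:
            raise RuntimeError("initialize() must run before execute_on_stream()")
        return self._plan(x)


thunk = NormThunkSketch(lambda: (lambda x: x * 2))
thunk.initialize()
print(thunk.execute_on_stream(21))
```

Calling `initialize()` more than once is harmless here because the plan is built only the first time.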

Here's a backtrace from the hang that will be fixed:
```
Thread 585 (Thread 0x7fb9391ff640 (LWP 41364) "main.py"):
#0  0x00007fd3d17cffd9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd3d17da24f in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd070967dfe in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fd0709c928a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f1970d76102 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#5  0x00007f1970f2c999 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#6  0x00007f1970a7d4ab in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#7  0x00007f1970d0a9cb in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#8  0x00007fce60b2a98c in cudnn::backend::ExecutionPlan::finalize_internal() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#9  0x00007fce60aefbb1 in cudnn::backend::Descriptor::finalize() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#10 0x00007fce60b15bec in cudnnBackendFinalize () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#11 0x00007fd2521b8f39 in cudnn_frontend::ExecutionPlanBuilder_v8::build() () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#12 0x00007fd2521734ba in stream_executor::gpu::(anonymous namespace)::GetExecPlanFromHeuristics(cudnn_frontend::OperationGraph_v8&&, stream_executor::gpu::(anonymous namespace)::CudnnHandle const&, bool) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#13 0x00007fd25216ff9b in stream_executor::gpu::CudnnSupport::NormRunnerFromDesc(stream_executor::Stream*, stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormKind, double, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#14 0x00007fd24e36b88b in stream_executor::dnn::NormOp::RunnerFromAlgorithmDesc(stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#15 0x00007fd24e36ae37 in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}::operator()() const () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#16 0x00007fd24e36adbc in void absl::lts_20230802::base_internal::CallOnceImpl<stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}>(std::atomic<unsigned int>*, absl::lts_20230802::base_internal::SchedulingMode, stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}&&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#17 0x00007fd24e36a9bd in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#18 0x00007fd24e369d29 in xla::gpu::RunGpuNorm(xla::gpu::GpuNormConfig const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, stream_executor::DeviceMemoryBase const&, stream_executor::Stream*, xla::gpu::RunNormOptions) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#19 0x00007fd24e368be6 in xla::gpu::NormThunk::ExecuteOnStream(xla::gpu::Thunk::ExecuteParams const&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
```
Copybara import of the project:

--
f53533087ba1ddcf65ad7cc6268ee89de4690d15 by Trevor Morris <tmorris@nvidia.com>:

Fix hang with cudnn layer norm by moving cudnn init to Initialize()

Merging this change closes tensorflow#12228

PiperOrigin-RevId: 633220207
hsharsha pushed a commit that referenced this pull request Aug 15, 2024
name                                     old cpu/op   new cpu/op   delta
BM_SelectAndScatterF32/128/process_time   889µs ± 1%   740µs ± 3%  -16.70%
BM_SelectAndScatterF32/256/process_time  3.64ms ± 2%  3.00ms ± 1%  -17.64%
BM_SelectAndScatterF32/512/process_time  15.3ms ± 1%  13.1ms ± 3%  -14.61%

PiperOrigin-RevId: 658063846
alekstheod pushed a commit that referenced this pull request Jan 13, 2025
Fixes the following TSAN race:

```
WARNING: ThreadSanitizer: data race (pid=899472)
  Write of size 8 at 0x7f979e0f1cd8 by thread T69:
    #0 llvm::TargetRegistry::RegisterTargetMachine(llvm::Target&, llvm::TargetMachine* (*)(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool)) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:827:27 (xla_extension.so+0x9803668) (BuildId: 6fa88e3910a5eb04)
    #1 llvm::RegisterTargetMachine<llvm::X86TargetMachine>::RegisterTargetMachine(llvm::Target&) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:1250:5 (xla_extension.so+0x9803668)
    #2 LLVMInitializeX86Target /proc/self/cwd/external/llvm-project/llvm/lib/Target/X86/X86TargetMachine.cpp:69:43 (xla_extension.so+0x9803668)
    #3 llvm::InitializeNativeTarget() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/TargetSelect.h:123:5 (xla_extension.so+0x48d2358) (BuildId: 6fa88e3910a5eb04)
    #4 xla::cpu::JitCompiler::Create(llvm::TargetOptions, xla::cpu::JitCompiler::Options, absl::lts_20230802::AnyInvocable<void (std::function<void ()>)>)::$_0::operator()() const /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:113:5 (xla_extension.so+0x48d2358)
    #5 xla::cpu::JitCompiler::Create(llvm::TargetOptions, xla::cpu::JitCompiler::Options, absl::lts_20230802::AnyInvocable<void (std::function<void ()>)>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:112:34 (xla_extension.so+0x48d209b) (BuildId: 6fa88e3910a5eb04)
    #6 xla::cpu::CpuCompiler::CompileLegacyCpuExecutable(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1416:3 (xla_extension.so+0x2f716a0) (BuildId: 6fa88e3910a5eb04)
    #7 xla::cpu::CpuCompiler::RunBackend(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1730:3 (xla_extension.so+0x2f7ae18) (BuildId: 6fa88e3910a5eb04)
    #8 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:759:19 (xla_extension.so+0x2f12915) (BuildId: 6fa88e3910a5eb04)
    #9 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:847:3 (xla_extension.so+0x2f12915)

  Previous read of size 8 at 0x7f979e0f1cd8 by thread T66:
    #0 llvm::Target::createTargetMachine(llvm::StringRef, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) const /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:460:10 (xla_extension.so+0x94ba6db) (BuildId: 6fa88e3910a5eb04)
    #1 llvm::EngineBuilder::selectTarget(llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::SmallVectorImpl<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>> const&) /proc/self/cwd/external/llvm-project/llvm/lib/ExecutionEngine/TargetSelect.cpp:88:18 (xla_extension.so+0x94ba6db)
    #2 xla::cpu::JitCompiler::InferTargetMachine(llvm::TargetOptions const&, llvm::CodeGenOptLevel, std::optional<tsl::port::CPUFeature>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:88:12 (xla_extension.so+0x48d096f) (BuildId: 6fa88e3910a5eb04)
    #3 xla::cpu::CpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1017:3 (xla_extension.so+0x2f70857) (BuildId: 6fa88e3910a5eb04)
    #4 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:754:3 (xla_extension.so+0x2f12874) (BuildId: 6fa88e3910a5eb04)
    #5 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:847:3 (xla_extension.so+0x2f12874)
    #6 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:893:10 (xla_extension.so+0x2f13ef2) (BuildId: 6fa88e3910a5eb04)
```

PiperOrigin-RevId: 707701032
alekstheod pushed a commit that referenced this pull request Jan 13, 2025
…r RunBackend.

Both of these call into LLVM code that reads the compiler options.

Fixes the following race:

```
WARNING: ThreadSanitizer: data race (pid=869815)
  Read of size 1 at 0x7f8b24effc08 by thread T65:
    #0 llvm::cl::opt_storage<bool, false, false>::getValue() const /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1406:38 (xla_extension.so+0xa281417) (BuildId: 7f5d2098f168c4db)
    #1 llvm::cl::opt_storage<bool, false, false>::operator bool() const /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1410:38 (xla_extension.so+0xa281417)
    #2 llvm::CodeGenTargetMachineImpl::CodeGenTargetMachineImpl(llvm::Target const&, llvm::StringRef, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, llvm::Reloc::Model, llvm::CodeModel::Model, llvm::CodeGenOptLevel) /proc/self/cwd/external/llvm-project/llvm/lib/CodeGen/CodeGenTargetMachineImpl.cpp:97:7 (xla_extension.so+0xa281417)
    #3 llvm::X86TargetMachine::X86TargetMachine(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) /proc/self/cwd/external/llvm-project/llvm/lib/Target/X86/X86TargetMachine.cpp:236:7 (xla_extension.so+0x9803b80) (BuildId: 7f5d2098f168c4db)
    #4 llvm::RegisterTargetMachine<llvm::X86TargetMachine>::Allocator(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:1258:16 (xla_extension.so+0x980757a) (BuildId: 7f5d2098f168c4db)
    #5 llvm::Target::createTargetMachine(llvm::StringRef, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) const /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:462:12 (xla_extension.so+0x94ba529) (BuildId: 7f5d2098f168c4db)
    #6 llvm::EngineBuilder::selectTarget(llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::SmallVectorImpl<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>> const&) /proc/self/cwd/external/llvm-project/llvm/lib/ExecutionEngine/TargetSelect.cpp:88:18 (xla_extension.so+0x94ba529)
    #7 xla::cpu::JitCompiler::InferTargetMachine(llvm::TargetOptions const&, llvm::CodeGenOptLevel, std::optional<tsl::port::CPUFeature>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:88:12 (xla_extension.so+0x48d070f) (BuildId: 7f5d2098f168c4db)
    #8 xla::cpu::CpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1017:3 (xla_extension.so+0x2f6dc47) (BuildId: 7f5d2098f168c4db)
    #9 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:749:3 (xla_extension.so+0x2f127e2) (BuildId: 7f5d2098f168c4db)
    #10 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:842:3 (xla_extension.so+0x2f127e2)
    #11 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:888:10 (xla_extension.so+0x2f13da2) (BuildId: 7f5d2098f168c4db)
    #12 xla::ifrt::PjRtLoadedExecutable::Create(xla::ifrt::PjRtCompatibleClient*, mlir::ModuleOp, xla::CompileOptions, std::vector<tsl::RCReference<xla::ifrt::LoadedHostCallback>, std::allocator<tsl::RCReference<xla::ifrt::LoadedHostCallback>>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_executable.cc:258:3 (xla_extension.so+0xdd04d77) (BuildId: 7f5d2098f168c4db)
    #13 xla::ifrt::PjRtCompiler::Compile(std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_compiler.cc:97:10 (xla_extension.so+0xdcfd29b) (BuildId: 7f5d2098f168c4db)
    #14 xla::PyClient::CompileIfrtProgram(xla::nb_class_ptr<xla::PyClient>, std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:443:5 (xla_extension.so+0xc62a228) (BuildId: 7f5d2098f168c4db)
    #15 xla::PyClient::Compile(xla::nb_class_ptr<xla::PyClient>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, xla::CompileOptions, std::vector<nanobind::capsule, std::allocator<nanobind::capsule>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:466:10 (xla_extension.so+0xc62b514) (BuildId: 7f5d2098f168c4db)

  Previous write of size 1 at 0x7f8b24effc08 by thread T66 (mutexes: write M0):
    #0 void llvm::cl::opt_storage<bool, false, false>::setValue<bool>(bool const&, bool) /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1401:11 (xla_extension.so+0x100bace9) (BuildId: 7f5d2098f168c4db)
    #1 void llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::setDefaultImpl<bool, void>() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h (xla_extension.so+0x100bace9)
    #2 llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::setDefault() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1474:32 (xla_extension.so+0x100bace9)
    #3 llvm::cl::Option::reset() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:460:3 (xla_extension.so+0x100cac0e) (BuildId: 7f5d2098f168c4db)
    #4 (anonymous namespace)::CommandLineParser::ResetAllOptionOccurrences() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:1478:17 (xla_extension.so+0x100cac0e)
    #5 llvm::cl::ResetAllOptionOccurrences() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:2831:17 (xla_extension.so+0x100caa72) (BuildId: 7f5d2098f168c4db)
    #6 xla::llvm_ir::LLVMCommandLineOptionsLock::LLVMCommandLineOptionsLock(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>> const&) /proc/self/cwd/external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:70:5 (xla_extension.so+0x91d69f4) (BuildId: 7f5d2098f168c4db)
    #7 xla::cpu::CpuCompiler::RunBackend(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1727:39 (xla_extension.so+0x2f781c8) (BuildId: 7f5d2098f168c4db)
    #8 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:754:19 (xla_extension.so+0x2f12883) (BuildId: 7f5d2098f168c4db)
    #9 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:842:3 (xla_extension.so+0x2f12883)
    #10 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:888:10 (xla_extension.so+0x2f13da2) (BuildId: 7f5d2098f168c4db)
    #11 xla::ifrt::PjRtLoadedExecutable::Create(xla::ifrt::PjRtCompatibleClient*, mlir::ModuleOp, xla::CompileOptions, std::vector<tsl::RCReference<xla::ifrt::LoadedHostCallback>, std::allocator<tsl::RCReference<xla::ifrt::LoadedHostCallback>>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_executable.cc:258:3 (xla_extension.so+0xdd04d77) (BuildId: 7f5d2098f168c4db)
    #12 xla::ifrt::PjRtCompiler::Compile(std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_compiler.cc:97:10 (xla_extension.so+0xdcfd29b) (BuildId: 7f5d2098f168c4db)
    #13 xla::PyClient::CompileIfrtProgram(xla::nb_class_ptr<xla::PyClient>, std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:443:5 (xla_extension.so+0xc62a228) (BuildId: 7f5d2098f168c4db)
    #14 xla::PyClient::Compile(xla::nb_class_ptr<xla::PyClient>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, xla::CompileOptions, std::vector<nanobind::capsule, std::allocator<nanobind::capsule>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:466:10 (xla_extension.so+0xc62b514) (BuildId: 7f5d2098f168c4db)
```
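
In the report above, the writer holds mutex M0 while the reader holds no mutex. The usual remedy for that shape of race is to have every path that touches the shared state take the same lock. The sketch below shows that pattern only; `_options`, `compile_with_options`, and `read_options` are hypothetical names, and this is not XLA's `LLVMCommandLineOptionsLock`.

```python
import threading

_options_lock = threading.Lock()
_options = {"fast_math": False}  # hypothetical stand-in for process-wide compiler flags


def compile_with_options(overrides: dict) -> dict:
    """Reset the shared flags, apply overrides, and read them back atomically."""
    with _options_lock:
        _options.clear()
        _options.update({"fast_math": False})  # reset to defaults
        _options.update(overrides)
        return dict(_options)  # snapshot taken while the lock is still held


def read_options() -> dict:
    """Readers take the same lock, so they never observe a half-reset state."""
    with _options_lock:
        return dict(_options)


print(compile_with_options({"fast_math": True}))
```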

PiperOrigin-RevId: 707721170
ScXfjiang pushed a commit that referenced this pull request Mar 31, 2025
…ssume all dimensions are static.

Some callers call the `Shape(element_type, dimensions, dynamic_dimensions)` ctor with a non-empty `dimensions` and an empty `dynamic_dimensions`. This breaks the shape object's invariant that the two should have the same size.

We have two options for fixing this:

1. Force the caller to always provide a `dynamic_dimensions` whose size matches that of `dimensions`.
2. Provide a sensible default behavior when `dynamic_dimensions` is empty.

I chose #2 as:

1. #1 is more risky as it may cause the compiler to crash in production (e.g. if we don't have adequate test coverage).
2. It's very common for an array to have only static dimensions. Therefore it's good to optimize the user experience for this common case.
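
Option 2 can be sketched as follows. `ShapeSketch` is a hypothetical Python stand-in for the C++ `Shape` class, shown only to illustrate the defaulting rule and the size invariant.

```python
class ShapeSketch:
    """Illustrative shape whose dynamic_dimensions defaults to all-static."""

    def __init__(self, element_type, dimensions, dynamic_dimensions=()):
        self.element_type = element_type
        self.dimensions = tuple(dimensions)
        dynamic = tuple(dynamic_dimensions)
        if not dynamic:
            # Empty means "every dimension is static".
            dynamic = (False,) * len(self.dimensions)
        if len(dynamic) != len(self.dimensions):
            raise ValueError("dynamic_dimensions must match dimensions in size")
        self.dynamic_dimensions = dynamic


shape = ShapeSketch("f32", [2, 3])
print(shape.dynamic_dimensions)
```

A caller that passes a non-empty `dynamic_dimensions` of the wrong size still gets an error, so the invariant holds in both cases.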

PiperOrigin-RevId: 739197635
ScXfjiang pushed a commit that referenced this pull request Mar 31, 2025
Imported from GitHub PR openxla/xla#24114

Converting FP8 <-> FP8 fails because the Triton compiler does not support it.
The proposed fix will make the conversion go through FP16.
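
The routing idea can be sketched as a small planning function. The name `conversion_steps` is hypothetical, the type strings are taken from the IR in this description, and the sketch only models which conversions are emitted, not the emitter code itself.

```python
FP8_TYPES = {"f8E5M2", "f8E4M3FN"}


def conversion_steps(src: str, dst: str) -> list:
    """Return the (from, to) conversions to emit for src -> dst."""
    if src == dst:
        return []
    if src in FP8_TYPES and dst in FP8_TYPES:
        # Direct FP8 <-> FP8 is unsupported, so route through f16.
        return [(src, "f16"), ("f16", dst)]
    return [(src, dst)]


print(conversion_steps("f8E5M2", "f8E4M3FN"))
```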

Two questions:
1) Are there any better approaches to solving this?
2) I could not find a place to put unit tests for this, and in the code there is a comment saying:
    ```
        // TODO(b/266862493): Add end-to-end test once FP8 support lands in XLA as
        // we can't test the code below without patching the feature.
    ```
    Wondering if there is a place where I can add a test?

### Details
When converting FP8 types, the XLA compiler emits a `fp_to_fp` Triton instruction. If the source type is FP8, no rounding strategy is specified.

Concretely, this causes the following Triton to be emitted:
<details>
<summary>
<code>%24 = tt.fp_to_fp %20 : tensor<32x64xf8E5M2> -> tensor<32x64xf8E4M3FN></code>
</summary>

```
module {
  tt.func @gemm_fusion_dot_320_impl(%arg0: !tt.ptr<f8E4M3FN> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f8E5M2> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f8E4M3FN> {tt.divisibility = 16 : i32}) {
    %cst = arith.constant dense<0.000000e+00> : tensor<64x64xf8E4M3FN>
    %cst_0 = arith.constant dense<0.000000e+00> : tensor<32x64xf8E4M3FN>
    %c90_i32 = arith.constant 90 : i32
    %c32000_i64 = arith.constant 32000 : i64
    %c64_i32 = arith.constant 64 : i32
    %c90_i64 = arith.constant 90 : i64
    %c768_i64 = arith.constant 768 : i64
    %c0_i32 = arith.constant 0 : i32
    %c1_i64 = arith.constant 1 : i64
    %c32_i32 = arith.constant 32 : i32
    %c24_i32 = arith.constant 24 : i32
    %c8_i32 = arith.constant 8 : i32
    %c4000_i32 = arith.constant 4000 : i32
    %cst_1 = arith.constant dense<0.000000e+00> : tensor<32x64xf32>
    %0 = tt.get_program_id x : i32
    %1 = arith.divsi %0, %c4000_i32 : i32
    %2 = arith.muli %1, %c8_i32 : i32
    %3 = arith.subi %c24_i32, %2 : i32
    %4 = arith.cmpi slt, %3, %c8_i32 : i32
    %5 = arith.select %4, %3, %c8_i32 : i32
    %6 = arith.remsi %0, %5 : i32
    %7 = arith.addi %2, %6 : i32
    %8 = arith.remsi %0, %c4000_i32 : i32
    %9 = arith.divsi %8, %5 : i32
    %10 = arith.muli %7, %c32_i32 : i32
    %11 = tt.make_tensor_ptr %arg1, [%c768_i64, %c90_i64], [%c1_i64, %c768_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<32x64xf8E5M2>>
    %12 = tt.advance %11, [%10, %c0_i32] : <tensor<32x64xf8E5M2>>
    %13 = arith.muli %9, %c64_i32 : i32
    %14 = tt.make_tensor_ptr %arg0, [%c90_i64, %c32000_i64], [%c1_i64, %c90_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<64x64xf8E4M3FN>>
    %15 = tt.advance %14, [%c0_i32, %13] : <tensor<64x64xf8E4M3FN>>
    %16:3 = scf.for %arg3 = %c0_i32 to %c90_i32 step %c64_i32 iter_args(%arg4 = %12, %arg5 = %15, %arg6 = %cst_1) -> (!tt.ptr<tensor<32x64xf8E5M2>>, !tt.ptr<tensor<64x64xf8E4M3FN>>, tensor<32x64xf32>)  : i32 {
      %20 = tt.load %arg4 {boundaryCheck = array<i32: 1>, padding = 1 : i32} : !tt.ptr<tensor<32x64xf8E5M2>>
      %21 = tt.advance %arg4, [%c0_i32, %c64_i32] : <tensor<32x64xf8E5M2>>
      %22 = tt.load %arg5 {boundaryCheck = array<i32: 0>, padding = 1 : i32} : !tt.ptr<tensor<64x64xf8E4M3FN>>
      %23 = tt.advance %arg5, [%c64_i32, %c0_i32] : <tensor<64x64xf8E4M3FN>>
      %24 = tt.fp_to_fp %20 : tensor<32x64xf8E5M2> -> tensor<32x64xf8E4M3FN>
      %25 = arith.subi %c90_i32, %arg3 : i32
      %26 = arith.cmpi slt, %25, %c64_i32 : i32
      %27 = scf.if %26 -> (tensor<32x64xf8E4M3FN>) {
        %30 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
        %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %32 = tt.splat %25 : i32 -> tensor<1x64xi32>
        %33 = arith.cmpi slt, %31, %32 : tensor<1x64xi32>
        %34 = tt.broadcast %33 : tensor<1x64xi1> -> tensor<32x64xi1>
        %35 = arith.select %34, %24, %cst_0 : tensor<32x64xi1>, tensor<32x64xf8E4M3FN>
        scf.yield %35 : tensor<32x64xf8E4M3FN>
      } else {
        scf.yield %24 : tensor<32x64xf8E4M3FN>
      }
      %28 = scf.if %26 -> (tensor<64x64xf8E4M3FN>) {
        %30 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
        %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
        %32 = tt.splat %25 : i32 -> tensor<64x1xi32>
        %33 = arith.cmpi slt, %31, %32 : tensor<64x1xi32>
        %34 = tt.broadcast %33 : tensor<64x1xi1> -> tensor<64x64xi1>
        %35 = arith.select %34, %22, %cst : tensor<64x64xi1>, tensor<64x64xf8E4M3FN>
        scf.yield %35 : tensor<64x64xf8E4M3FN>
      } else {
        scf.yield %22 : tensor<64x64xf8E4M3FN>
      }
      %29 = tt.dot %27, %28, %arg6, inputPrecision = tf32 {maxNumImpreciseAcc = 2147483647 : i32} : tensor<32x64xf8E4M3FN> * tensor<64x64xf8E4M3FN> -> tensor<32x64xf32>
      scf.yield %21, %23, %29 : !tt.ptr<tensor<32x64xf8E5M2>>, !tt.ptr<tensor<64x64xf8E4M3FN>>, tensor<32x64xf32>
    }
    %17 = tt.fp_to_fp %16#2, rounding = rtne : tensor<32x64xf32> -> tensor<32x64xf8E4M3FN>
    %18 = tt.make_tensor_ptr %arg2, [%c768_i64, %c32000_i64], [%c1_i64, %c768_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<32x64xf8E4M3FN>>
    %19 = tt.advance %18, [%10, %13] : <tensor<32x64xf8E4M3FN>>
    tt.store %19, %17 : !tt.ptr<tensor<32x64xf8E4M3FN>>
    tt.return
  }
}
```
</details>

Which leads to a failing assertion:
```
#0  0x000073413786d9fc in pthread_kill () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000734137819476 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007341377ff7f3 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007341377ff71b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x0000734137810e96 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x000057d936b1777b in mlir::triton::gpu::(anonymous namespace)::FpToFpOpConversion::createDestOps (this=0x733d08425cc0, op=..., adaptor=..., rewriter=..., elemTy=..., operands=..., loc=...)
    at external/triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/ElementwiseOpToLLVM.cpp:500
#6  0x000057d936b17195 in mlir::triton::gpu::ElementwiseOpConversionBase<mlir::triton::FpToFpOp, mlir::triton::gpu::(anonymous namespace)::FpToFpOpConversion>::matchAndRewrite (this=0x733d08425cc0, op=..., adaptor=..., rewriter=...)
    at external/triton/include/triton/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVMBase.h:188
[...]
#29 0x000057d93fa6cade in mlir::PassManager::run (this=0x733e80fba158, op=0x733d080bbc20) at external/llvm-project/mlir/lib/Pass/Pass.cpp:885
#30 0x000057d9363f6b1b in xla::gpu::CompileTritonToLLVM (hlo_config=..., hlo_module_name="gemm_fusion_dot.320", device_info=..., block_level_parameters=..., triton_module=..., llvm_module=0x733d0816d6a0, mlir_context=..., is_xla_fusion=true, emit_kernel=true)
    at xla/backends/gpu/codegen/triton/fusion_emitter.cc:1627
#31 0x000057d9363f5a5d in xla::gpu::TritonWrapper (fn_name="gemm_fusion_dot_320_impl", fusion=0x733d080a31c0, cc=std::variant<stream_executor::CudaComputeCapability, stream_executor::RocmComputeCapability> [index 0] = {...}, device_info=..., block_level_parameters=...,
    llvm_module=0x733d0816d6a0, mlir_context=...) at xla/backends/gpu/codegen/triton/fusion_emitter.cc:1531
```

However, this fails Triton compilation:
* First, it hits an assertion that a rounding strategy must be specified when the destination type is FP8.
* Once a rounding strategy is added, it hits a second issue: no methods for converting FP8 <-> FP8 are defined.

To work around both issues, I propose going through FP16 when both the source and destination types are FP8.
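
As a rough illustration of the two-step idea (not the actual Triton/XLA lowering), the sketch below converts a single E5M2 byte to E4M3FN by first widening to FP16 and then narrowing with round-to-nearest-even. All function names are hypothetical. The handling of out-of-range values (saturate to ±448) and of infinities (map to NaN) is an assumption made for this example only; the real conversion may behave differently.

```python
import math
import struct


def e5m2_to_fp16(code: int) -> float:
    """Widen an E5M2 byte to FP16 (exact).

    E5M2 uses the same sign/exponent layout as FP16 with a shorter
    mantissa, so the byte is the upper half of the FP16 bit pattern.
    """
    return struct.unpack("<e", struct.pack("<H", (code & 0xFF) << 8))[0]


def e4m3fn_to_float(code: int) -> float:
    """Decode a non-negative, finite E4M3FN code (0x00..0x7E), bias 7."""
    exp, man = (code >> 3) & 0xF, code & 0x7
    if exp == 0:
        return man * 2.0 ** -9  # subnormal: (man / 8) * 2^(1 - 7)
    return (1 + man / 8) * 2.0 ** (exp - 7)


def fp16_to_e4m3fn(value: float) -> int:
    """Narrow an FP16 value to E4M3FN with round-to-nearest-even."""
    sign = 0x80 if math.copysign(1.0, value) < 0 else 0x00
    if math.isnan(value) or math.isinf(value):
        return sign | 0x7F  # assumption: E4M3FN has no inf, so use NaN
    mag = min(abs(value), 448.0)  # assumption: saturate to max finite
    # Brute-force nearest code; on a tie prefer the even code, which is
    # the candidate whose mantissa LSB is zero (ties-to-even).
    best = min(range(0x7F), key=lambda c: (abs(e4m3fn_to_float(c) - mag), c & 1))
    return sign | best


def fp8_e5m2_to_e4m3fn(code: int) -> int:
    """FP8 -> FP8 conversion routed through FP16."""
    return fp16_to_e4m3fn(e5m2_to_fp16(code))


print(hex(fp8_e5m2_to_e4m3fn(0x3C)))  # E5M2 encoding of 1.0
```

The widening step loses nothing because every E5M2 value is representable in FP16, so the only rounding happens once, in the FP16 -> E4M3FN step.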
Copybara import of the project:

--
afd3929099fc4d1045275ca3210e0bc727a2b906 by Kasper Nielsen <kasper0406@gmail.com>:

Fix fused fp8 <-> fp8 conversions

--
66340aa808f58e5dc6ab1c2e06790ceccde95540 by Kasper Nielsen <kasper0406@gmail.com>:

Add unit tests and refactor duplicated code

--
07ae307879eff24ad2f85607e94503deda1074e4 by Kasper Nielsen <kasper0406@gmail.com>:

Run clang-format

--
fe967ff94ffc5f34f07bff142b5d10d81d5e4dce by Kasper Nielsen <kasper0406@gmail.com>:

Fix support conversion tests

Merging this change closes tensorflow#24114

PiperOrigin-RevId: 741473648
pemeliya pushed a commit that referenced this pull request Apr 22, 2025
… allocator kernel rocm cu

Imported from GitHub PR openxla/xla#24898

Fix an issue reported by ASan while running the tests on ROCm CI:

```
==1718600==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x5030001d97f8 at pc 0x5647cfdda211 bp 0x7ffc9eb7eac0 sp 0x7ffc9eb7eab8
READ of size 8 at 0x5030001d97f8 thread T0
    #0 0x5647cfdda210 in absl::lts_20230802::container_internal::CommonFields::capacity() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:990:36
    #1 0x5647cfdda210 in void absl::lts_20230802::container_internal::InitializeSlots<std::allocator<char>, 8ul, 8ul>(absl::lts_20230802::container_internal::CommonFields&, std::allocator<char>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1403:24
    #2 0x7f066c2cfdde in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::resize(unsigned long) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9dde) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c)
    #3 0x7f066c2cfd97 in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::prepare_insert(unsigned long) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9d97) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c)
    #4 0x7f066c2cfcca in std::pair<unsigned long, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::find_or_prepare_insert<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>(std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const&) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9cca) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c)
    #5 0x7f066c2cf9c4 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, 
stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::EmplaceDecomposable::operator()<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, std::piecewise_construct_t const&, std::tuple<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>&>, std::tuple<stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>&&>>(std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const&, std::piecewise_construct_t const&, std::tuple<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>&>&&, std::tuple<stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>&&>&&) const (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x99c4) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c)
    #6 0x7f066c2cf0ad in stream_executor::GetComparisonKernel(stream_executor::StreamExecutor*, stream_executor::GpuAsmOpts) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x90ad) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c)
    #7 0x7f066c37ba93 in stream_executor::RedzoneAllocator::CheckRedzones() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/stream_executor/gpu/redzone_allocator.cc:272:3
    #8 0x7f06b31bb7e9 in absl::lts_20230802::StatusOr<xla::AutotuneResult> xla::gpu::(anonymous namespace)::GemmAutotuner::GetBestAlgorithm<long, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&>(xla::HloInstruction const*, absl::lts_20230802::Span<long const>, double, bool, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:328:7
    #9 0x7f06b31bb7e9 in xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:256:12
    #10 0x7f06b31bb7e9 in xla::gpu::(anonymous namespace)::GemmAutotuner::operator()(xla::HloInstruction const*, xla::gpu::AutotuneCacheKey const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:137:18
    #11 0x7f06b31b6760 in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0::operator()() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3
    #12 0x7f06b31b6760 in absl::lts_20230802::StatusOr<xla::AutotuneResult> std::__invoke_impl<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(std::__invoke_other, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14
    #13 0x7f06b31b6760 in std::enable_if<is_invocable_r_v<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>, absl::lts_20230802::StatusOr<xla::AutotuneResult>>::type std::__invoke_r<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:114:9
    #14 0x7f06b31b6760 in std::_Function_handler<absl::lts_20230802::StatusOr<xla::AutotuneResult> (), xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0>::_M_invoke(std::_Any_data const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:290:9
    #15 0x7f06b308670d in std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()>::operator()() const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:590:9
    #16 0x7f06b308670d in xla::gpu::AutotunerUtil::Autotune(xla::HloInstruction const*, xla::gpu::AutotuneConfig const&, std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/autotuner_util.cc:460:3
    #17 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3
    #18 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnComputation(xla::HloComputation*, xla::gpu::(anonymous namespace)::GemmAutotuner&, unsigned long*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:468:7
    #19 0x7f06b31b336e in xla::gpu::GemmAlgorithmPicker::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:495:5
    #20 0x7f06b30242f3 in xla::HloPassPipeline::RunHelper(xla::HloPassInterface*, xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_pipeline.h:150:5
    #21 0x7f06b3010bb9 in absl::lts_20230802::StatusOr<bool> xla::HloPassPipeline::RunPassesInternal<xla::HloModule>(xla::HloModule*, xla::DebugOptions const&, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:198:30
    #22 0x7f06b300f786 in xla::HloPassPipeline::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:338:10
    #23 0x5647cfd66945 in xla::HloPassInterface::Run(xla::HloModule*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_interface.h:85:12
    #24 0x7f06c2908be0 in xla::gpu::GpuCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1754:3
    #25 0x7f06c2a000f3 in xla::gpu::AMDGPUCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/amdgpu_compiler.cc:197:3
    #26 0x7f06c28f85e9 in xla::gpu::GpuCompiler::OptimizeHloModule(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1392:3
    #27 0x7f06c291250d in xla::gpu::GpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1824:3
    #28 0x5647cfd63784 in xla::Compiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, stream_executor::DeviceMemoryAllocator*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/service/compiler.h:177:12
    #29 0x7f06c339acba in xla::HloTestBase::GetOptimizedModule(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/hlo_test_base.cc:188:32
    #30 0x5647cfd89516 in xla::gpu::(anonymous namespace)::GpuCompilerTest_CollectivePermuteDecompositionAndPipelining_Test::TestBody() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler_test.cc:879:3
    #31 0x7f06c2c649dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10
    #32 0x7f06c2c649dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14
    #33 0x7f06c2c64708 in testing::Test::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5
    #34 0x7f06c2c6771b in testing::TestInfo::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11
    #35 0x7f06c2c6a5ab in testing::TestSuite::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30
    #36 0x7f06c2c96eba in testing::internal::UnitTestImpl::RunAllTests() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44
    #37 0x7f06c2c9579d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10
    #38 0x7f06c2c9579d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14
    #39 0x7f06c2c95203 in testing::UnitTest::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10
    #40 0x7f06c2d679b8 in RUN_ALL_TESTS() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73
    #41 0x7f06c2d679b8 in main /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10
    #42 0x7f064c0b3d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #43 0x7f064c0b3e3f in __libc_start_main csu/../csu/libc-start.c:392:3
    #44 0x5647cfc7b044 in _start (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/gpu_compiler_test_gpu_amd_any+0xff044) (BuildId: ef1ac485eb61840d0e2233a2cca69eec)

0x5030001d97f8 is located 8 bytes before 32-byte region [0x5030001d9800,0x5030001d9820)
allocated by thread T0 here:
    #0 0x5647cfd1527f in malloc (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/gpu_compiler_test_gpu_amd_any+0x19927f) (BuildId: ef1ac485eb61840d0e2233a2cca69eec)
    #1 0x7f064c39798b in operator new(unsigned long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xae98b) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2)
    #2 0x7f06b31bb5b7 in google::protobuf::Duration* google::protobuf::MessageLite::CreateMaybeMessage<google::protobuf::Duration>(google::protobuf::Arena*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_protobuf/src/google/protobuf/message_lite.h:425:12
    #3 0x7f06b31bb5b7 in xla::AutotuneResult::_internal_mutable_run_time() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/autotuning.pb.h:3079:15
    #4 0x7f06b31bb5b7 in xla::AutotuneResult::mutable_run_time() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/autotuning.pb.h:3085:45
    #5 0x7f06b31bb5b7 in absl::lts_20230802::StatusOr<xla::AutotuneResult> xla::gpu::(anonymous namespace)::GemmAutotuner::GetBestAlgorithm<long, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&>(xla::HloInstruction const*, absl::lts_20230802::Span<long const>, double, bool, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:321:15
    #6 0x7f06b31bb5b7 in xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:256:12
    #7 0x7f06b31bb5b7 in xla::gpu::(anonymous namespace)::GemmAutotuner::operator()(xla::HloInstruction const*, xla::gpu::AutotuneCacheKey const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:137:18
    #8 0x7f06b31b6760 in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0::operator()() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3
    #9 0x7f06b31b6760 in absl::lts_20230802::StatusOr<xla::AutotuneResult> std::__invoke_impl<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(std::__invoke_other, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14
    #10 0x7f06b31b6760 in std::enable_if<is_invocable_r_v<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>, absl::lts_20230802::StatusOr<xla::AutotuneResult>>::type std::__invoke_r<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:114:9
    #11 0x7f06b31b6760 in std::_Function_handler<absl::lts_20230802::StatusOr<xla::AutotuneResult> (), xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0>::_M_invoke(std::_Any_data const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:290:9
    #12 0x7f06b308670d in std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()>::operator()() const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:590:9
    #13 0x7f06b308670d in xla::gpu::AutotunerUtil::Autotune(xla::HloInstruction const*, xla::gpu::AutotuneConfig const&, std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/autotuner_util.cc:460:3
    #14 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3
    #15 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnComputation(xla::HloComputation*, xla::gpu::(anonymous namespace)::GemmAutotuner&, unsigned long*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:468:7
    #16 0x7f06b31b336e in xla::gpu::GemmAlgorithmPicker::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:495:5
    #17 0x7f06b30242f3 in xla::HloPassPipeline::RunHelper(xla::HloPassInterface*, xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_pipeline.h:150:5
    #18 0x7f06b3010bb9 in absl::lts_20230802::StatusOr<bool> xla::HloPassPipeline::RunPassesInternal<xla::HloModule>(xla::HloModule*, xla::DebugOptions const&, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:198:30
    #19 0x7f06b300f786 in xla::HloPassPipeline::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:338:10
    #20 0x5647cfd66945 in xla::HloPassInterface::Run(xla::HloModule*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_interface.h:85:12
    #21 0x7f06c2908be0 in xla::gpu::GpuCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1754:3
    #22 0x7f06c2a000f3 in xla::gpu::AMDGPUCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/amdgpu_compiler.cc:197:3
    #23 0x7f06c28f85e9 in xla::gpu::GpuCompiler::OptimizeHloModule(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1392:3
    #24 0x7f06c291250d in xla::gpu::GpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1824:3
    #25 0x5647cfd63784 in xla::Compiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, stream_executor::DeviceMemoryAllocator*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/service/compiler.h:177:12
    #26 0x7f06c339acba in xla::HloTestBase::GetOptimizedModule(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/hlo_test_base.cc:188:32
    #27 0x5647cfd89516 in xla::gpu::(anonymous namespace)::GpuCompilerTest_CollectivePermuteDecompositionAndPipelining_Test::TestBody() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler_test.cc:879:3
    #28 0x7f06c2c649dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10
    #29 0x7f06c2c649dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14
    #30 0x7f06c2c64708 in testing::Test::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5
    #31 0x7f06c2c6771b in testing::TestInfo::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11
    #32 0x7f06c2c6a5ab in testing::TestSuite::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30
    #33 0x7f06c2c96eba in testing::internal::UnitTestImpl::RunAllTests() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44
    #34 0x7f06c2c9579d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10
    #35 0x7f06c2c9579d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14
    #36 0x7f06c2c95203 in testing::UnitTest::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10
    #37 0x7f06c2d679b8 in RUN_ALL_TESTS() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73
    #38 0x7f06c2d679b8 in main /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10
    #39 0x7f064c0b3d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16

SUMMARY: AddressSanitizer: heap-buffer-overflow /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:990:36 in absl::lts_20230802::container_internal::CommonFields::capacity() const
Shadow bytes around the buggy address:
  0x5030001d9500: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa
  0x5030001d9580: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd
  0x5030001d9600: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa
  0x5030001d9680: fd fd fd fd fa fa fd fd fd fa fa fa fd fd fd fa
  0x5030001d9700: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd
=>0x5030001d9780: fd fa fa fa 00 00 00 fa fa fa 00 00 00 00 fa[fa]
  0x5030001d9800: 00 00 00 00 fa fa 00 00 00 00 fa fa fd fd fd fd
  0x5030001d9880: fa fa fd fd fd fd fa fa fd fd fd fa fa fa fd fd
  0x5030001d9900: fd fd fa fa fd fd fd fd fa fa fd fd fd fd fa fa
  0x5030001d9980: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa
  0x5030001d9a00: fa fa fd fd fd fa fa fa fd fd fd fd fa fa fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1718600==ABORTING
```
Copybara import of the project:

--
9a75d26eb9aab4226a690658d254a057fc59f22c by alekstheod <atheodor@amd.com>:

Fix ASan memory access issue in redzone_allocator_kernel_rocm.cu

Merging this change closes tensorflow#24898

PiperOrigin-RevId: 745563669
pemeliya pushed a commit that referenced this pull request Apr 22, 2025
Imported from GitHub PR openxla/xla#24900

Fix ASan memory access violation:

```
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //xla/service:elemental_ir_emitter_test_gpu_amd_any
-----------------------------------------------------------------------------
Running test /home/atheodor/projects/tmp/xla_asan/execroot/xla/bazel-out/k8-opt/bin/xla/service/elemental_ir_emitter_test_gpu_amd_any.runfiles/xla/xla/service/elemental_ir_emitter_test_gpu_amd_any --gtest_shuffle --gtest_fail_if_no_test_linked on GPU 3
Note: Randomizing tests' orders with a seed of 19906 .
[==========] Running 118 tests from 13 test suites.
[----------] Global test environment set-up.
[----------] 10 tests from ElementalIrEmitterExecutionTypedTest/7, where TypeParam = ml_dtypes::float8_internal::float8_e5m2
[ RUN      ] ElementalIrEmitterExecutionTypedTest/7.ConvertFloatsToFloat
=================================================================
==2457579==ERROR: AddressSanitizer: use-after-poison on address 0x506000843a08 at pc 0x7f401151be6a bp 0x7ffd1e3c3410 sp 0x7ffd1e3c3408
READ of size 8 at 0x506000843a08 thread T0
    #0 0x7f401151be69 in stream_executor::gpu::RocmExecutor::UnloadGpuBinary(stream_executor::ModuleHandle) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:596:23
    #1 0x7f401151b036 in stream_executor::gpu::RocmExecutor::UnloadModule(stream_executor::ModuleHandle) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:496:10
    #2 0x7f405dee713b in stream_executor::ScopedModuleHandle::~ScopedModuleHandle() /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/stream_executor/scoped_module_handle.h:48:7
    #3 0x7f405dee713b in std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>::~pair() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_iterator.h:2488:12
    #4 0x7f405dee713b in void __gnu_cxx::new_allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>::destroy<std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>>(std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/new_allocator.h:168:10
    #5 0x7f405dee713b in void std::allocator_traits<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>::destroy<std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>&, std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:535:8
    #6 0x7f405dee713b in void absl::lts_20230802::container_internal::map_slot_policy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>::destroy<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>*, absl::lts_20230802::container_internal::map_slot_type<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/container_memory.h:419:7
    #7 0x7f405dee713b in void absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>::destroy<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>*, absl::lts_20230802::container_internal::map_slot_type<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/flat_hash_map.h:578:5
    #8 0x7f405dee713b in void absl::lts_20230802::container_internal::common_policy_traits<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>, void>::destroy<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>*, absl::lts_20230802::container_internal::map_slot_type<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/common_policy_traits.h:50:5
    #9 0x7f405dee713b in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Hash, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Eq, std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>::destroy_slots() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1946:9
    #10 0x7f405dee713b in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Hash, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Eq, std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>::~raw_hash_set() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1885:5
    #11 0x7f405dee8580 in xla::gpu::GpuExecutable::~GpuExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:155:1
    #12 0x7f405dee8d4d in xla::gpu::GpuExecutable::~GpuExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:151:33
    #13 0x7f407b818b3f in std::default_delete<xla::Executable>::operator()(xla::Executable*) const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85:2
    #14 0x7f407b818b3f in std::unique_ptr<xla::Executable, std::default_delete<xla::Executable>>::~unique_ptr() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:361:4
    #15 0x7f407b818b3f in xla::(anonymous namespace)::HloRunnerExecutable::~HloRunnerExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:69:7
    #16 0x7f407b818b3f in xla::(anonymous namespace)::HloRunnerExecutable::~HloRunnerExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:69:7
    #17 0x7f407b7e6503 in std::default_delete<xla::OpaqueExecutable>::operator()(xla::OpaqueExecutable*) const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85:2
    #18 0x7f407b7e6503 in std::unique_ptr<xla::OpaqueExecutable, std::default_delete<xla::OpaqueExecutable>>::~unique_ptr() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:361:4
    #19 0x7f407b7e6503 in xla::HloRunner::ExecuteWithMovedDeviceBuffersAndBufferAssignment(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, xla::BufferAssignmentProto const*, std::vector<xla::ScopedShapedBuffer, std::allocator<xla::ScopedShapedBuffer>>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:400:1
    #20 0x7f407b7e57c3 in xla::HloRunner::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:221:3
    #21 0x55b8a3cb4622 in xla::HloRunnerInterface::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/service/hlo_runner_interface.h:244:12
    #22 0x55b8a3cb4622 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompareInternal(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, bool, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:238:5
    #23 0x55b8a3cbf766 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:94:9
    #24 0x55b8a3cbf235 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&, std::optional<long>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:140:12
    #25 0x55b8a3cceda8 in xla::(anonymous namespace)::ElementalIrEmitterExecutionTest::RunTypeConversionTest(std::basic_string_view<char, std::char_traits<char>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:76:5
    #26 0x55b8a3cd8cf3 in xla::(anonymous namespace)::ElementalIrEmitterExecutionTypedTest_ConvertFloatsToFloat_Test<ml_dtypes::float8_internal::float8_e5m2>::TestBody() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:472:36
    #27 0x7f407b2f09dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10
    #28 0x7f407b2f09dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14
    #29 0x7f407b2f0708 in testing::Test::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5
    #30 0x7f407b2f371b in testing::TestInfo::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11
    #31 0x7f407b2f65ab in testing::TestSuite::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30
    #32 0x7f407b322eba in testing::internal::UnitTestImpl::RunAllTests() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44
    #33 0x7f407b32179d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10
    #34 0x7f407b32179d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14
    #35 0x7f407b321203 in testing::UnitTest::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10
    #36 0x7f407b3f59b8 in RUN_ALL_TESTS() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73
    #37 0x7f407b3f59b8 in main /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10
    #38 0x7f4004766d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #39 0x7f4004766e3f in __libc_start_main csu/../csu/libc-start.c:392:3
    #40 0x55b8a3b9be44 in _start (/home/atheodor/projects/tmp/xla_asan/execroot/xla/bazel-out/k8-opt/bin/xla/service/elemental_ir_emitter_test_gpu_amd_any+0x10ce44) (BuildId: 1c37d17e488373aad7bf33204cb4234e)

0x506000843a08 is located 40 bytes inside of 56-byte region [0x5060008439e0,0x506000843a18)
allocated by thread T0 here:
    #0 0x55b8a3c3607f in malloc (/home/atheodor/projects/tmp/xla_asan/execroot/xla/bazel-out/k8-opt/bin/xla/service/elemental_ir_emitter_test_gpu_amd_any+0x1a707f) (BuildId: 1c37d17e488373aad7bf33204cb4234e)
    #1 0x7f4004a4a98b in operator new(unsigned long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xae98b) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2)
    #2 0x7f40115449aa in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::initialize_slots() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2505:5
    #3 0x7f40115449aa in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::resize(unsigned long) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2515:5
    #4 0x7f40115443fa in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::prepare_insert(unsigned long) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2672:7
    #5 0x7f40115442df in std::pair<unsigned long, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::find_or_prepare_insert<stream_executor::ModuleHandle>(stream_executor::ModuleHandle const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2659:13
    #6 0x7f4011524701 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::try_emplace_impl<stream_executor::ModuleHandle const&>(stream_executor::ModuleHandle const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_map.h:202:22
    #7 0x7f4011524701 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::try_emplace<stream_executor::ModuleHandle, 0>(stream_executor::ModuleHandle const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_map.h:139:12
    #8 0x7f4011524701 in decltype(absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>::value(std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>* std::addressof<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>(std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>&)(decltype(__declval<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>(0)) std::declval<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>&>()()))) absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::operator[]<stream_executor::ModuleHandle, absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>>(stream_executor::ModuleHandle const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_map.h:184:28
    #9 0x7f4011524701 in stream_executor::gpu::RocmExecutor::LoadModuleFromHsaco(char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:717:39
    #10 0x7f4011524387 in stream_executor::gpu::RocmExecutor::LoadModule(stream_executor::MultiModuleLoaderSpec const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:705:12
    #11 0x7f405deeae34 in xla::gpu::GpuExecutable::ResolveConstantGlobals(stream_executor::Stream*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:499:5
    #12 0x7f405def050a in xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl(xla::ServiceExecutableRunOptions const*, std::variant<absl::lts_20230802::Span<xla::ShapedBuffer const* const>, absl::lts_20230802::Span<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:703:5
    #13 0x7f405deefc6f in xla::gpu::GpuExecutable::ExecuteAsyncOnStream(xla::ServiceExecutableRunOptions const*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:661:10
    #14 0x7f401607a78e in xla::Executable::ExecuteAsyncOnStreamWrapper(xla::ServiceExecutableRunOptions const*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/executable.cc:229:7
    #15 0x7f4016079fd3 in xla::Executable::ExecuteOnStreamWrapper(xla::ServiceExecutableRunOptions const*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/executable.cc:153:7
    #16 0x7f407b7ea78b in xla::HloRunner::ExecuteWithExecutionInputs(xla::Executable*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:448:3
    #17 0x7f407b7ecde2 in xla::HloRunner::ExecuteWithMovedDeviceBuffers(xla::Executable*, std::vector<xla::ScopedShapedBuffer, std::allocator<xla::ScopedShapedBuffer>>, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:415:3
    #18 0x7f407b7e642a in xla::HloRunner::ExecuteWithMovedDeviceBuffersAndBufferAssignment(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, xla::BufferAssignmentProto const*, std::vector<xla::ScopedShapedBuffer, std::allocator<xla::ScopedShapedBuffer>>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:398:10
    #19 0x7f407b7e57c3 in xla::HloRunner::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:221:3
    #20 0x55b8a3cb4622 in xla::HloRunnerInterface::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/service/hlo_runner_interface.h:244:12
    #21 0x55b8a3cb4622 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompareInternal(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, bool, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:238:5
    #22 0x55b8a3cbf766 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:94:9
    #23 0x55b8a3cbf235 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&, std::optional<long>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:140:12
    #24 0x55b8a3cceda8 in xla::(anonymous namespace)::ElementalIrEmitterExecutionTest::RunTypeConversionTest(std::basic_string_view<char, std::char_traits<char>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:76:5
    #25 0x55b8a3cd8cf3 in xla::(anonymous namespace)::ElementalIrEmitterExecutionTypedTest_ConvertFloatsToFloat_Test<ml_dtypes::float8_internal::float8_e5m2>::TestBody() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:472:36
    #26 0x7f407b2f09dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10
    #27 0x7f407b2f09dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14
    #28 0x7f407b2f0708 in testing::Test::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5
    #29 0x7f407b2f371b in testing::TestInfo::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11
    #30 0x7f407b2f65ab in testing::TestSuite::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30
    #31 0x7f407b322eba in testing::internal::UnitTestImpl::RunAllTests() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44
    #32 0x7f407b32179d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10
    #33 0x7f407b32179d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14
    #34 0x7f407b321203 in testing::UnitTest::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10
    #35 0x7f407b3f59b8 in RUN_ALL_TESTS() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73
    #36 0x7f407b3f59b8 in main /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10
    #37 0x7f4004766d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16

SUMMARY: AddressSanitizer: use-after-poison /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:596:23 in stream_executor::gpu::RocmExecutor::UnloadGpuBinary(stream_executor::ModuleHandle)
Shadow bytes around the buggy address:
  0x506000843780: fa fa fa fa fd fd fd fd fd fd fd fa fa fa fa fa
  0x506000843800: fd fd fd fd fd fd fd fa fa fa fa fa fd fd fd fd
  0x506000843880: fd fd fd fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x506000843900: fa fa fa fa 00 00 00 00 00 00 00 fa fa fa fa fa
  0x506000843980: fd fd fd fd fd fd fd fa fa fa fa fa 00 00 00 00
=>0x506000843a00: f7[f7]f7 fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x506000843a80: fa fa fa fa 00 00 00 00 00 00 00 fa fa fa fa fa
  0x506000843b00: 00 00 00 00 00 00 00 fa fa fa fa fa 00 00 00 00
  0x506000843b80: 00 00 00 fa fa fa fa fa 00 00 00 00 00 00 00 fa
  0x506000843c00: fa fa fa fa 00 00 00 00 00 00 00 fa fa fa fa fa
  0x506000843c80: 00 00 00 00 00 00 00 fa fa fa fa fa fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==2457579==ABORTING

```
Copybara import of the project:

--
8f74d4c822d951b5a213500ea9396ed7b160871d by alekstheod <atheodor@amd.com>:

Fix asan report memory access violation in rocm_executor

Merging this change closes tensorflow#24900

PiperOrigin-RevId: 745592235
pemeliya pushed a commit that referenced this pull request Apr 22, 2025
…r file

Imported from GitHub PR openxla/xla#25269

Reported issue:
```
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //xla/service:compiler_test_gpu_amd_any
-----------------------------------------------------------------------------
Running test /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-dbg/bin/xla/service/compiler_test_gpu_amd_any.runfiles/xla/xla/service/compiler_test_gpu_amd_any --gtest_shuffle --gtest_fail_if_no_test_linked on GPU 0
=================================================================
==168009==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x50400002c1c0 at pc 0x7f59e50b52e7 bp 0x7ffc8c2358d0 sp 0x7ffc8c2358c8
READ of size 8 at 0x50400002c1c0 thread T0
    #0 0x7f59e50b52e6 in absl::lts_20230802::container_internal::CommonFields::capacity() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:990:36
    #1 0x7f59e50b52e6 in absl::lts_20230802::container_internal::probe(absl::lts_20230802::container_internal::CommonFields const&, unsigned long) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1298:41
    #2 0x7f59e50b52e6 in std::pair<unsigned long, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::find_or_prepare_insert<std::tuple<std::type_index, void*>>(std::tuple<std::type_index, void*> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2645:16
    #3 0x7f59e50af8a8 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable::operator()<std::tuple<std::type_index, void*>, std::piecewise_construct_t const&, std::tuple<std::tuple<std::type_index, void*>&&>, std::tuple<stream_executor::MultiKernelLoaderSpec&&>>(std::tuple<std::type_index, void*> const&, std::piecewise_construct_t const&, std::tuple<std::tuple<std::type_index, void*>&&>&&, std::tuple<stream_executor::MultiKernelLoaderSpec&&>&&) const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2459:20
    #4 0x7f59e50af8a8 in decltype(std::declval<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable>()(std::declval<std::tuple<std::type_index, void*>&& const&>(), std::piecewise_construct, std::declval<std::tuple<std::tuple<std::type_index, void*>&&>>(), std::declval<std::tuple<stream_executor::MultiKernelLoaderSpec&&>>())) absl::lts_20230802::container_internal::memory_internal::DecomposePairImpl<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable, std::tuple<std::type_index, void*>&&, std::tuple<stream_executor::MultiKernelLoaderSpec&&>>(absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable&&, std::pair<std::tuple<std::tuple<std::type_index, void*>&&>, std::tuple<stream_executor::MultiKernelLoaderSpec&&>>) 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/container_memory.h:140:10
    #5 0x7f59e50af8a8 in decltype(memory_internal::DecomposePairImpl(std::forward<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable>(fp), PairArgs(std::forward<std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>>(fp0)))) absl::lts_20230802::container_internal::DecomposePair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable, std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>>(absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable&&, std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>&&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/container_memory.h:207:10
    #6 0x7f59e50af8a8 in decltype(absl::container_internal::DecomposePair(std::declval<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable>(), std::declval<std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>>())) absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>::apply<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable, std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>>(absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable&&, std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>&&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/flat_hash_map.h:591:12
    #7 0x7f59e50af8a8 in decltype(absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>::apply(std::forward<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable>(fp), std::forward<std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>>(fp0))) absl::lts_20230802::container_internal::hash_policy_traits<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, void>::apply<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable, std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>>(absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> 
const, stream_executor::MultiKernelLoaderSpec>>>::EmplaceDecomposable&&, std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>&&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/hash_policy_traits.h:134:12
    #8 0x7f59e50af8a8 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::emplace<std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, 0>(std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>&&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2064:12
    #9 0x7f59e50af8a8 in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>, absl::lts_20230802::hash_internal::Hash<std::tuple<std::type_index, void*>>, std::equal_to<std::tuple<std::type_index, void*>>, std::allocator<std::pair<std::tuple<std::type_index, void*> const, stream_executor::MultiKernelLoaderSpec>>>::insert(std::pair<std::tuple<std::type_index, void*>, stream_executor::MultiKernelLoaderSpec>&&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1991:12
    #10 0x7f59e50af8a8 in stream_executor::gpu::GpuKernelRegistry::RegisterKernel(std::type_info const&, void*, stream_executor::MultiKernelLoaderSpec const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/stream_executor/gpu/gpu_kernel_registry.cc:67:45
    #11 0x7f59e50d1982 in absl::lts_20230802::Status stream_executor::gpu::GpuKernelRegistry::RegisterKernel<stream_executor::gpu::MakeBatchPointersKernel>(void*, stream_executor::MultiKernelLoaderSpec const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/stream_executor/gpu/gpu_kernel_registry.h:86:12
    #12 0x7f59e50d1982 in RegisterKernelMakeBatchPointersKernelRocmImpl() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/stream_executor/rocm/make_batch_pointers_kernel_rocm.cu.cc:35:1
    #13 0x7f59e50d1982 in 'lambda'()::operator()() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/stream_executor/rocm/make_batch_pointers_kernel_rocm.cu.cc:35:1
    #14 0x7f59e50d1982 in 'lambda'()::__invoke() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/stream_executor/rocm/make_batch_pointers_kernel_rocm.cu.cc:35:1
    #15 0x7f59e50d1982 in stream_executor::port::Initializer::Initializer(void (*)()) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/stream_executor/platform/default/initialize.h:26:42
    #16 0x7f59e50d1982 in __cxx_global_var_init.1 /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/stream_executor/rocm/make_batch_pointers_kernel_rocm.cu.cc:35:1
    #17 0x7f59e50d1982 in _GLOBAL__sub_I_make_batch_pointers_kernel_rocm.cu.cc /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/stream_executor/rocm/make_batch_pointers_kernel_rocm.cu.cc
    #18 0x7f5a5b27a47d in call_init elf/dl-init.c:70:3
    #19 0x7f5a5b27a567 in call_init elf/dl-init.c:33:6
    #20 0x7f5a5b27a567 in _dl_init elf/dl-init.c:117:5
    #21 0x7f5a5b2942c9  (/lib64/ld-linux-x86-64.so.2+0x202c9) (BuildId: e4de036b19e4768e7591b596c4be9f9015f2d28a)

0x50400002c1c0 is located 8 bytes after 40-byte region [0x50400002c190,0x50400002c1b8)
allocated by thread T0 here:
    #0 0x557d0f77fcdf in malloc (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-dbg/bin/xla/service/compiler_test_gpu_amd_any+0x1e8cdf) (BuildId: e96972f8c7f880083ff6ad5985d3c06d)
    #1 0x7f59d733098b in operator new(unsigned long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xae98b) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2)

SUMMARY: AddressSanitizer: heap-buffer-overflow /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:990:36 in absl::lts_20230802::container_internal::CommonFields::capacity() const
Shadow bytes around the buggy address:
  0x50400002bf00: fa fa fd fd fd fd fd fd fa fa fd fd fd fd fd fa
  0x50400002bf80: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fa
  0x50400002c000: fa fa fd fd fd fd fd fa fa fa 00 00 00 00 00 fa
  0x50400002c080: fa fa fd fd fd fd fd fd fa fa 00 00 00 00 00 00
  0x50400002c100: fa fa 00 00 00 00 00 00 fa fa 00 00 00 00 00 00
=>0x50400002c180: fa fa 00 00 00 00 00 fa[fa]fa fa fa fa fa fa fa
  0x50400002c200: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x50400002c280: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x50400002c300: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x50400002c380: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x50400002c400: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==168009==ABORTING
```

Why this fixes the issue:
* If this class is compiled into several different .so files, the function gets inlined into each of them, so each one ends up with its own instance even though we want a single singleton.
* The ROCm compiler wrapper script does not yet support sanitizer flags, so our cu.cc files are not instrumented while our regular .cc files are. This might cause a memory misalignment when running with ASan (this is only a theory).
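
The first point can be illustrated with a minimal sketch. It assumes a hypothetical `KernelRegistry` class (the name, members, and file split are illustrative and are not taken from the actual patch): the accessor is only declared in the header and defined in exactly one .cc file, so it cannot be inlined into several shared objects with a separate static instance in each.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// --- kernel_registry.h (hypothetical header) ---
class KernelRegistry {
 public:
  // Declaration only. Because the body is not visible in the header, the
  // compiler cannot inline it (and its static storage) into every .so.
  static KernelRegistry& Get();

  // Returns true if the name was newly registered, false if it already existed.
  bool Register(const std::string& name, int kernel_id) {
    return kernels_.emplace(name, kernel_id).second;
  }

  std::size_t size() const { return kernels_.size(); }

 private:
  KernelRegistry() = default;
  std::map<std::string, int> kernels_;
};

// --- kernel_registry.cc (hypothetical, compiled into exactly one library) ---
KernelRegistry& KernelRegistry::Get() {
  // Function-local static: created on first use, so it is also safe to call
  // from static initializers that run in other translation units.
  static KernelRegistry* registry = new KernelRegistry();
  return *registry;
}
```

Every caller that goes through `KernelRegistry::Get()` then resolves to the same out-of-line definition, and registrations made from static initializers in different translation units land in one map.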

Copybara import of the project:

--
ffcd58918137191cdba6db571e0e5af0e57de2e1 by alekstheod <atheodor@amd.com>:

Fix asan issue due to a singleton in header file

Merging this change closes tensorflow#25269

PiperOrigin-RevId: 747823659
mmakevic-amd pushed a commit that referenced this pull request Aug 25, 2025
Fixes jax-ml/jax#30517

Example traceback of crash:
```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
  * frame #0: 0x0000000150679fe4 libjax_common.dylib`xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>::Clear() + 164
    frame #1: 0x000000015067dc20 libjax_common.dylib`std::__1::__shared_ptr_emplace<xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>, std::__1::allocator<xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>>>::__on_zero_shared() + 32
    frame #2: 0x000000015067a0e4 libjax_common.dylib`std::__1::unique_ptr<jax::(anonymous namespace)::PjitFunctionCache::Value, std::__1::default_delete<jax::(anonymous namespace)::PjitFunctionCache::Value>>::reset[abi:ne180100](jax::(anonymous namespace)::PjitFunctionCache::Value*) + 104
    frame #3: 0x000000015067e230 libjax_common.dylib`_object* nanobind::detail::func_create<true, true, jax::(anonymous namespace)::PjitFunctionCache::Lookup(xla::nb_class_ptr<jax::(anonymous namespace)::PjitFunctionCache>, nanobind::handle, nanobind::object)::$_1, void, nanobind::handle, 0ul>(jax::(anonymous namespace)::PjitFunctionCache::Lookup(xla::nb_class_ptr<jax::(anonymous namespace)::PjitFunctionCache>, nanobind::handle, nanobind::object)::$_1&&, void (*)(nanobind::handle), std::__1::integer_sequence<unsigned long, 0ul>)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) + 336
    frame #4: 0x000000015394ba58 libjax_common.dylib`nanobind::detail::nb_func_vectorcall_simple_1(_object*, _object* const*, unsigned long, _object*) + 156
    frame #5: 0x0000000106356620 libpython3.12.dylib`PyObject_CallOneArg + 116
    frame #6: 0x0000000106421144 libpython3.12.dylib`PyObject_ClearWeakRefs + 340
    frame #7: 0x0000000106377f78 libpython3.12.dylib`func_dealloc + 352
    frame #8: 0x00000001506714c8 libjax_common.dylib`PjitFunction_tp_dealloc + 504
    frame #9: 0x0000000106420dd8 libpython3.12.dylib`PyDict_DelItem + 668
    frame #10: 0x00000001063cbc38 libpython3.12.dylib`_PyEval_EvalFrameDefault + 26328
    frame #11: 0x0000000106356620 libpython3.12.dylib`PyObject_CallOneArg + 116
    frame #12: 0x0000000106421144 libpython3.12.dylib`PyObject_ClearWeakRefs + 340
    frame #13: 0x0000000106377f78 libpython3.12.dylib`func_dealloc + 352
    frame #14: 0x0000000150676a4c libjax_common.dylib`jax::ArgumentSignature::~ArgumentSignature() + 172
    frame #15: 0x0000000150679c88 libjax_common.dylib`jax::CallSignature::~CallSignature() + 456
    frame #16: 0x0000000150679fb0 libjax_common.dylib`xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>::Clear() + 112
    frame #17: 0x0000000150672400 libjax_common.dylib`jax::(anonymous namespace)::PjitFunctionCache::Clear() + 44
    frame #18: 0x000000015067a340 libjax_common.dylib`_object* nanobind::detail::func_create<false, true, void nanobind::cpp_function_def<jax::(anonymous namespace)::PjitFunctionCache, void, jax::(anonymous namespace)::PjitFunctionCache, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::lock_self>(void (jax::(anonymous namespace)::PjitFunctionCache::*)(), nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::lock_self const&)::'lambda'(jax::(anonymous namespace)::PjitFunctionCache*), void, jax::(anonymous namespace)::PjitFunctionCache*, 0ul, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::lock_self>(jax::(anonymous namespace)::PjitFunctionCache&&, void (*)(nanobind::scope, nanobind::name, nanobind::is_method, nanobind::lock_self), std::__1::integer_sequence<unsigned long, 0ul>, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::lock_self const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) + 80
    frame #19: 0x000000015394ba58 libjax_common.dylib`nanobind::detail::nb_func_vectorcall_simple_1(_object*, _object* const*, unsigned long, _object*) + 156
    frame #20: 0x00000001063ed3a8 libpython3.12.dylib`_PyEval_EvalFrameDefault + 163400
    frame #21: 0x00000001064e3450 libpython3.12.dylib`atexit_callfuncs.llvm.13196908868581062239 + 96
    frame #22: 0x00000001064ece28 libpython3.12.dylib`Py_FinalizeEx + 96
    frame #23: 0x000000010650366c libpython3.12.dylib`Py_Exit + 20
    frame #24: 0x000000010650364c libpython3.12.dylib`handle_system_exit + 32
    frame #25: 0x0000000106503330 libpython3.12.dylib`_PyErr_PrintEx.llvm.12194046240795210664 + 52
    frame #26: 0x000000010650de00 libpython3.12.dylib`_PyRun_SimpleFileObject + 464
    frame #27: 0x00000001065051e4 libpython3.12.dylib`_PyRun_AnyFileObject + 80
    frame #28: 0x00000001065045a0 libpython3.12.dylib`pymain_run_file_obj + 164
    frame #29: 0x0000000106503c00 libpython3.12.dylib`pymain_run_file + 72
    frame #30: 0x0000000106501e04 libpython3.12.dylib`Py_RunMain + 1120
    frame #31: 0x0000000106501808 libpython3.12.dylib`pymain_main + 456
    frame #32: 0x0000000106501634 libpython3.12.dylib`Py_BytesMain + 36
    frame #33: 0x00000001951fab98 dyld`start + 6076
```

PiperOrigin-RevId: 788532094
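
The traceback above shows `LRUCache::Clear()` at frame #16 and again at frame #0, which suggests that destroying a cache entry re-entered the cache while it was still being cleared. The following is a minimal sketch of one common way to make a clear operation tolerate that kind of re-entrancy. It is not necessarily the fix applied in this commit, and the `Entry` and `Cache` types are hypothetical, far simpler than the real `xla::LRUCache`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <utility>

// Hypothetical entry whose destructor runs an arbitrary callback, standing in
// for a cache key that releases Python references when destroyed.
struct Entry {
  std::function<void()> on_destroy;
  Entry(std::function<void()> cb) : on_destroy(std::move(cb)) {}
  Entry(const Entry&) = delete;
  ~Entry() {
    if (on_destroy) on_destroy();
  }
};

class Cache {
 public:
  void Add(std::function<void()> on_destroy) {
    entries_.emplace_back(std::move(on_destroy));
  }

  void Clear() {
    // Move all entries into a local list first. The cache is then empty and
    // in a consistent state before any entry destructor runs, so a
    // destructor that calls back into Clear() or Add() sees valid state.
    std::list<Entry> doomed;
    doomed.swap(entries_);
    // `doomed` is destroyed on return, outside any traversal of entries_.
  }

  std::size_t size() const { return entries_.size(); }

 private:
  std::list<Entry> entries_;
};
```

This sketch only covers a callback that calls back into the same live cache. In the traceback the re-entrant path also releases the cache object itself, which would additionally require keeping the cache alive for the duration of the call.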
mmakevic-amd pushed a commit that referenced this pull request Sep 17, 2025