forked from tensorflow/tensorflow
Porting FP16 related changes from the ROCm TF 1.3 stream #1
Merged
Conversation
@deven-amd thanks for this PR. This PR only contains a patch to Eigen. Could you help review whether other parts of TF need to be amended, in particular the MIOpen kernel call sites for fp16?
whchung approved these changes on May 15, 2018
LGTM
Yes, expect another PR for the TF side of the fp16 changes sometime later today.
iotamudelta pushed a commit to iotamudelta/tensorflow-upstream that referenced this pull request on Jun 12, 2018
deven-amd pushed a commit that referenced this pull request on Dec 24, 2018
deven-amd pushed a commit that referenced this pull request on Dec 31, 2018
Updated download script and instructions to keep modified files separate from the regular sources
parallelo pushed a commit that referenced this pull request on Feb 27, 2019
When num_squeeze_dims is zero, the data pointer should be nullptr. This addresses the following error when running NNAPIDelegate.SqueezeSimpleTest:

```
ANeuralNetworksModel_setOperandValue setting operand 1 which has operand type that is not fully specified
```

PiperOrigin-RevId: 235772184
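As background, here is a small Python sketch (illustrative only; the commit itself changes the C++ NNAPI delegate, not this API) of the op-level behavior that the num_squeeze_dims == 0 case corresponds to:

```python
import tensorflow as tf

x = tf.zeros([1, 2, 1, 3])

# Explicit squeeze dims: remove only axis 0.
print(tf.squeeze(x, axis=[0]).shape)  # (2, 1, 3)

# No squeeze dims given (num_squeeze_dims == 0 at the op level):
# all size-1 dimensions are removed.
print(tf.squeeze(x).shape)            # (2, 3)
```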
jerryyin pushed a commit that referenced this pull request on Apr 16, 2019
Fixing small typo in comment
deven-amd pushed a commit that referenced this pull request on Jun 28, 2019
deven-amd pushed a commit that referenced this pull request on Aug 29, 2019
Fix accidentally deleted #if line
deven-amd pushed a commit that referenced this pull request on Nov 11, 2019
deven-amd pushed a commit that referenced this pull request on Nov 26, 2019
deven-amd pushed a commit that referenced this pull request on Jan 3, 2020
Added code examples for tf.keras.backend.gather
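For reference, a minimal usage sketch of the documented API: tf.keras.backend.gather(reference, indices) selects slices of `reference` along its first axis.

```python
import tensorflow as tf

ref = tf.constant([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
idx = tf.constant([0, 2])

# Gathers rows 0 and 2 of `ref` along the first axis.
print(tf.keras.backend.gather(ref, idx).numpy())
# [[1. 2.]
#  [5. 6.]]
```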
deven-amd pushed a commit that referenced this pull request on Jan 13, 2020
deven-amd pushed a commit that referenced this pull request on Feb 11, 2020
* Add missing stdint.h include in KissFFT
* Transform header files in ESP examples sources
* Copy sdkconfig.defaults in ESP examples
rsanthanam-amd pushed a commit that referenced this pull request on Feb 8, 2021
- This CL instructs the dynamic padder to insert implicit broadcasts into the graph when a binary operation is performed on two dynamic tensors.
- Optimization #1: the implicit broadcast is only inserted when we can't prove that two dynamic dimensions are the same.
- Optimization #2: added a simplification pass that allows us to simplify operations on dynamic dimensions; this opens up more opportunities for optimization #1.

PiperOrigin-RevId: 355539597
Change-Id: I7753550a6057155c3f436c6b51b356cb48c945e6
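At the user level, the situation this CL addresses looks like the following sketch (an assumed minimal example, not code from the CL): two inputs whose dimensions are both dynamic, so the compiler cannot statically prove they match.

```python
import tensorflow as tf

@tf.function(input_signature=[
    tf.TensorSpec([None], tf.float32),  # dynamic dimension
    tf.TensorSpec([None], tf.float32),  # another dynamic dimension
])
def add(a, b):
  # When lowered to XLA with dynamic shapes, the compiler cannot prove
  # that the two dynamic dimensions are equal, so the dynamic padder must
  # account for a possible implicit broadcast here.
  return a + b
```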
rsanthanam-amd pushed a commit that referenced this pull request on Feb 15, 2021
- This CL instructs the dynamic padder to insert implicit broadcasts into the graph when a binary operation is performed on two dynamic tensors.
- Optimization #1: the implicit broadcast is only inserted when we can't prove that two dynamic dimensions are the same.
- Optimization #2: added a simplification pass that allows us to simplify operations on dynamic dimensions; this opens up more opportunities for optimization #1.

PiperOrigin-RevId: 356407626
Change-Id: I980477ee6f3ccb42342226afaab03b4b09549360
deven-amd pushed a commit that referenced this pull request on Mar 9, 2021
…ic shapes.

- This CL instructs the dynamic padder to insert implicit broadcasts into the graph when a binary operation is performed on two dynamic tensors.
- Optimization #1: the implicit broadcast is only inserted when we can't prove that two dynamic dimensions are the same.
- Optimization #2: added a simplification pass that allows us to simplify operations on dynamic dimensions; this opens up more opportunities for optimization #1.

PiperOrigin-RevId: 361684367
Change-Id: I2ce2a62932273f81b8d969de6012c40599f7d1d9
ekuznetsov139 pushed a commit that referenced this pull request on Apr 13, 2021
A prototype showed significant dispatch performance improvements from the new backend. This is the first of a series of commits to add a new PJRT backend, with the intention of eventually replacing the existing StreamExecutor-based CPU backend.

PiperOrigin-RevId: 367514967
Change-Id: I16c9523b604445015125ad2e42fd8822ec0c38c5
rsanthanam-amd pushed a commit that referenced this pull request on Jul 21, 2021
… ops with dynamic shapes.

This occurred when an outside compiled op with non-dynamic output came before another outside compiled op with a dynamic output. For example:

```
%7 = "tf.E"(%4, %3) {_xla_outside_compilation = "auto"} : (tensor<?xi32>, tensor<2xi32>) -> (tensor<2xi32>)
%8 = "tf.F"(%7) : (tensor<2xi32>) -> (tensor<2xi32>)
%9 = "tf.G"(%8, %4, %3) {_xla_outside_compilation = "auto"} : (tensor<2xi32>, tensor<?xi32>, tensor<2xi32>) -> (tensor<?xi32>)
```

In this case, the second op should have all of its inputs sent to the host through the XlaHostCompute op so that shape inference works correctly. With the bug, the statically shaped inputs were provided by communication for the earlier outside compilation op. This led to overwriting the value in the earlier op, leading to an "operand #1 does not dominate this use" error.

PiperOrigin-RevId: 385813020
Change-Id: Ibe65bbf0e667d25b179af3353fa981bf91177f84
stevenireeves pushed a commit that referenced this pull request on Aug 12, 2021
It makes the analyzer output more useful. Example:

```
Your TFLite model has '1' subgraph(s). In the subgraph description below,
T# represents the Tensor numbers. For example, in Subgraph#0, the RESHAPE op takes
tensor #0 and tensor #1 as input and produces tensor #5 as output.

Subgraph#0 main(T#0) -> [T#9]
  Op#0 RESHAPE(T#0, T#1) -> [T#5]
  Op#1 STRIDED_SLICE(T#5, T#2, T#2, T#3) -> [T#6]
  Op#2 RESIZE_BILINEAR(T#6, T#4) -> [T#7]
  Op#3 RESIZE_BILINEAR(T#6, T#4) -> [T#8]
  Op#4 ADD(T#7, T#8) -> [T#9]

Tensors of Subgraph#0
  T#0(image) shape:[5, 5], type:FLOAT32
  T#1(strided_slice) shape:[4], type:INT32
  T#2(strided_slice1) shape:[4], type:INT32
  T#3(strided_slice2) shape:[4], type:INT32
  T#4(ResizeBilinear/size) shape:[2], type:INT32
  T#5(strided_slice3) shape:[1, 5, 1, 5], type:FLOAT32
  T#6(strided_slice4) shape:[1, 5, 5, 1], type:FLOAT32
  T#7(ResizeBilinear) shape:[1, 2, 2, 1], type:FLOAT32
  T#8(ResizeBilinear_1) shape:[1, 2, 2, 1], type:FLOAT32
  T#9(Identity) shape:[1, 2, 2, 1], type:FLOAT32
```

PiperOrigin-RevId: 389795468
Change-Id: I0fda5bb74568c68459359a8a39f1627b459b7a4b
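A report like the one above comes from the TFLite model analyzer; a minimal invocation looks like this sketch (the model path is a placeholder):

```python
import tensorflow as tf

# Prints a per-subgraph report of ops and tensors like the one quoted above.
# "model.tflite" is a placeholder path to an existing flatbuffer model.
tf.lite.experimental.Analyzer.analyze(model_path="model.tflite")
```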
deven-amd pushed a commit that referenced this pull request on Aug 16, 2021
It looks better for models with multiple subgraphs. Example:

```
Your TFLite model has '3' subgraph(s). In the subgraph description below,
T# represents the Tensor numbers. For example, in Subgraph#0, the WHILE op takes
tensor #1 and tensor #1 and tensor #0 as input and produces tensor #2 and tensor #3 and tensor #4 as output.

Subgraph#0 main(T#0) -> [T#4, T#3]
  Op#0 WHILE(T#1, T#1, T#0) -> [T#2, T#3, T#4]

Tensors of Subgraph#0
  T#0(serving_default_inp:0) shape:[], type:FLOAT32
  T#1(Const) shape:[], type:INT32
  T#2(while) shape:[], type:INT32
  T#3(PartitionedCall:0) shape:[], type:INT32
  T#4(PartitionedCall:1) shape:[], type:FLOAT32

Subgraph#1 while_cond(T#0, T#1, T#2) -> [T#4]
  Op#0 LESS(T#1, T#3) -> [T#4]

Tensors of Subgraph#1
  T#0(arg0) shape:[], type:INT32
  T#1(arg1) shape:[], type:INT32
  T#2(arg2) shape:[], type:FLOAT32
  T#3(while/Less/y) shape:[], type:INT32
  T#4(while/Less) shape:[], type:BOOL

Subgraph#2 while_body(T#0, T#1, T#2) -> [T#7, T#5, T#6]
  Op#0 ADD(T#1, T#3) -> [T#5]
  Op#1 MUL(T#2, T#4) -> [T#6]
  Op#2 ADD(T#0, T#3) -> [T#7]

Tensors of Subgraph#2
  T#0(arg0) shape:[], type:INT32
  T#1(arg1) shape:[], type:INT32
  T#2(arg2) shape:[], type:FLOAT32
  T#3(while/Add/y) shape:[], type:INT32
  T#4(while/scalar_mul/scalar) shape:[], type:FLOAT32
  T#5(while/Add) shape:[], type:INT32
  T#6(while/scalar_mul) shape:[], type:FLOAT32
  T#7(while/add_1) shape:[], type:INT32
```

PiperOrigin-RevId: 390499926
Change-Id: If2dd3fc5b9f4404c83268d56f3eccb1bc1271d0f
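A model of this shape can be produced with tf.while_loop; the following is an assumed reconstruction for illustration (not the model from the commit), and converting tf.while_loop to the TFLite WHILE op requires a recent TF version:

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec([], tf.float32)])
def model(inp):
  # A scalar while loop: lowers to a WHILE op plus while_cond/while_body
  # subgraphs in the converted flatbuffer.
  i = tf.constant(0)
  cond = lambda i, x: tf.less(i, 10)
  body = lambda i, x: (tf.add(i, 1), x * 1.5)
  return tf.while_loop(cond, body, [i, inp])

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [model.get_concrete_function()])
tflite_model = converter.convert()
tf.lite.experimental.Analyzer.analyze(model_content=tflite_model)
```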
deven-amd added a commit that referenced this pull request on Nov 16, 2021
…uite

Related JIRA ticket - https://ontrack-internal.amd.com/browse/SWDEV-310531

The following unit test (when run as part of the CPU testsuite) fails on some CI nodes:

```
//tensorflow/c/eager:c_api_distributed_test FAILED in 10.8s
...
...
[----------] 6 tests from CAPI (1396 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (1396 ms total)
[  PASSED  ] 6 tests.

  YOU HAVE 1 DISABLED TEST

*** Received signal 11 ***
*** BEGIN MANGLED STACK TRACE ***
================================================================================
```

The failure seems to be server/node dependent, and I was able to consistently reproduce it on the `zt-dh170-07` node.

When I bring up the core file from the crash in gdb, I see the following stack trace (running the testcase under gdb makes it pass, hence the need to back-trace via the core file):

```
(gdb) where
#0  0x00007f9448612000 in ?? ()
#1  0x00007f9458a37934 in dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*, float*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#2  0x00007f9458a38531 in dnnl::impl::cpu::x64::jit_avx_gemm_f32(int, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#3  0x00007f9458b491bd in dnnl_status_t dnnl::impl::cpu::x64::gemm_driver<float, float, float>(char const*, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, float const*, long const*, float const*, float const*, float*, long const*, float const*, bool, dnnl::impl::cpu::x64::pack_type, dnnl::impl::cpu::x64::gemm_pack_storage_t*, bool) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#4  0x00007f945838ab46 in dnnl::impl::cpu::extended_sgemm(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*, bool) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#5  0x00007f9458024ad3 in dnnl_sgemm () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#6  0x000055f42a5df14b in Eigen::internal::TensorContractionKernel<float, float, float, long, Eigen::internal::blas_data_mapper<float, long, 0, 0, 1>, Eigen::internal::TensorContractionInputMapper<float, long, 1, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer>, Eigen::internal::TensorContractionInputMapper<float, long, 0, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer> >::invoke(Eigen::internal::blas_data_mapper<float, long, 0, 0, 1> const&, Eigen::internal::ColMajorBlock<float, long> const&, Eigen::internal::ColMajorBlock<float, long> const&, long, long, long, float, float) ()
#7  0x000055f42a5e52b6 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::kernel(long, long, long, bool) ()
#8  0x00007f9467160491 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest___Utensorflow/libtensorflow_framework.so.2
#9  0x00007f946715db83 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest___Utensorflow/libtensorflow_framework.so.2
#10 0x00007f946713fdf7 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_k8/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest___Utensorflow/libtensorflow_framework.so.2
#11 0x00007f9454d536db in start_thread (arg=0x7f8fd37fe700) at pthread_create.c:463
#12 0x00007f945445671f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

The crash seems to be originating from MKL DNN code... guessing there is some MKL DNN implementation bug which manifests only on some of the CI nodes that we have. Running the test with MKL DNN disabled for contractions (`--define=tensorflow_mkldnn_contraction_kernel=0`) makes this test pass.

Root-causing the point of failure within the MKL DNN implementation and fixing it is outside the scope of our work. There are two ways to work around this issue:

* remove this test from the test-list
* run the CPU testsuite with the build option `--define=tensorflow_mkldnn_contraction_kernel=0`

For the time being, we will simply remove this test from the test-list, as this bug only seems to affect one test. If more tests start failing due to this same issue, it may warrant running the CPU testsuite with `--define=tensorflow_mkldnn_contraction_kernel=0`.
deven-amd added a commit that referenced this pull request on Dec 6, 2021
On some CI nodes (typically those with higher CPU core counts, 128/256), the `//tensorflow/c/eager:c_api_distributed_test_gpu` test fails on an intermittent basis. When it does fail, the failure manifests as a segfault at the end of the test, with the stack dump shown at the end of this commit message. The stack dump points the finger at a routine within the MKLDNN implementation. This is further confirmed by the observation that disabling the MKLDNN based Eigen contraction kernels (for ROCm) seems to make the crash go away.

Related JIRA ticket - https://ontrack-internal.amd.com/browse/SWDEV-313684

A previous commit disabled the `//tensorflow/c/eager:c_api_distributed_test` unit test only in the CPU unit-tests CI job (for the same reason). That commit cannot be reverted, because this commit disables MKLDNN based Eigen contraction kernels *only* for the ROCm build.

```
Thread 191 "c_api_distribut" received signal SIGSEGV, Segmentation fault.
[Switching to thread 191 (Thread 0x7ffc777fe700 (LWP 159004))]
0x00007fff54530000 in ?? ()
(gdb) where
#0  0x00007fff54530000 in ?? ()
#1  0x00007fffd5d15ae4 in dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*, float*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#2  0x00007fffd5d166e1 in dnnl::impl::cpu::x64::jit_avx_gemm_f32(int, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#3  0x00007fffd5e277ed in dnnl_status_t dnnl::impl::cpu::x64::gemm_driver<float, float, float>(char const*, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, float const*, long const*, float const*, float const*, float*, long const*, float const*, bool, dnnl::impl::cpu::x64::pack_type, dnnl::impl::cpu::x64::gemm_pack_storage_t*, bool) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#4  0x00007fffd5665056 in dnnl::impl::cpu::extended_sgemm(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*, bool) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#5  0x00007fffd52fe983 in dnnl_sgemm () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#6  0x0000555557187b0b in Eigen::internal::TensorContractionKernel<float, float, float, long, Eigen::internal::blas_data_mapper<float, long, 0, 0, 1>, Eigen::internal::TensorContractionInputMapper<float, long, 1, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer>, Eigen::internal::TensorContractionInputMapper<float, long, 0, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer> >::invoke(Eigen::internal::blas_data_mapper<float, long, 0, 0, 1> const&, Eigen::internal::ColMajorBlock<float, long> const&, Eigen::internal::ColMajorBlock<float, long> const&, long, long, long, float, float) ()
#7  0x000055555718dc76 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::kernel(long, long, long, bool) ()
#8  0x000055555718f327 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::signal_kernel(long, long, long, bool, bool) ()
#9  0x00005555571904cb in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::pack_rhs(long, long) ()
#10 0x000055555718fd69 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::enqueue_packing_helper(long, long, long, bool) ()
#11 0x00007ffff6b607a1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#12 0x00007ffff6b5de93 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#13 0x00007ffff6b40107 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#14 0x00007fffd1ca86db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fffd00b471f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```
deven-amd added a commit that referenced this pull request on Dec 7, 2021
deven-amd added a commit that referenced this pull request on Dec 21, 2021
deven-amd added a commit that referenced this pull request on Jan 31, 2022
weihanmines pushed a commit that referenced this pull request on Feb 7, 2022
Today this pattern assumes:

1. The input has rank 1.
2. The reshape reshapes it to a tensor of shape Nx1.

Neither of these always holds. For example, we could come up with something like:

```
Equal(Reshape(X: tensor<4x4xi32>, [2,8,1]), [0,1,2]): tensor<2x8x3xi1>
```

tfl.one_hot() requires output_dims = indices_dim + 1 (see link [1]), so in this case we should produce something like:

```
OneHot(Reshape(X: tensor<4x4xi32>, [2,8]), ...): tensor<2x8x3xi1>
```

To be conservative, for now we only apply this rewrite pattern when the reshape returns a tensor of rank 2, but we do remove the original assumption #1 (that X has rank 1) in this CL.

[1]: https://github.com/tensorflow/tensorflow/blob/8c96acb5b70a5e351931a4075cb0c40f7f04a832/tensorflow/lite/kernels/one_hot.cc#L49

PiperOrigin-RevId: 426188277
Change-Id: I180490c30347caeabe6a3b14c138f577cc2a556c
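For intuition, here is a small Python sketch (an assumed illustration, not code from the CL) of why comparing reshaped indices against a range is equivalent to a one-hot encoding:

```python
import tensorflow as tf

x = tf.constant([0, 2, 1, 3])  # rank-1 indices
depth = 3

# The pattern being matched: Equal(Reshape(x, Nx1), [0..depth)).
eq = tf.equal(tf.reshape(x, [-1, 1]), tf.range(depth))  # shape [4, 3], bool

# The rewrite target: one_hot produces the same boolean mask
# (out-of-range indices such as 3 yield an all-False row in both).
oh = tf.cast(tf.one_hot(x, depth), tf.bool)             # shape [4, 3], bool

assert bool(tf.reduce_all(eq == oh))
```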
ekuznetsov139 pushed a commit that referenced this pull request on May 31, 2022
Added support to tf-mlir-translate and its Python wrapper for handling data types with subtypes
i-chaochen pushed a commit that referenced this pull request on Dec 24, 2022
Error message:

```
[ RUN      ] Conv2DTransposeTest.testGradient
2022-12-22 10:21:44.842744: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:1041] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2022-12-22 10:21:44.843106: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:1041] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2022-12-22 10:21:44.843394: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:1041] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2022-12-22 10:21:44.843738: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:1041] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2022-12-22 10:21:44.844362: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:1041] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2022-12-22 10:21:44.844628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1614] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15366 MB memory: -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test_gpu.runfiles/org_tensorflow/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test.py:176: compute_gradient_error (from tensorflow.python.ops.gradient_checker) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.test.compute_gradient in 2.0, which has better support for functions. Note that the two versions have different usage, so code change is needed.
W1222 10:21:44.849361 140024911816512 deprecation.py:364] From /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test_gpu.runfiles/org_tensorflow/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test.py:176: compute_gradient_error (from tensorflow.python.ops.gradient_checker) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.test.compute_gradient in 2.0, which has better support for functions. Note that the two versions have different usage, so code change is needed.
WARNING:tensorflow:From /bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py:390: compute_gradient (from tensorflow.python.ops.gradient_checker) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.test.compute_gradient in 2.0, which has better support for functions. Note that the two versions have different usage, so code change is needed.
W1222 10:21:44.849497 140024911816512 deprecation.py:364] From /bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py:390: compute_gradient (from tensorflow.python.ops.gradient_checker) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.test.compute_gradient in 2.0, which has better support for functions. Note that the two versions have different usage, so code change is needed.
2022-12-22 10:21:44.858697: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:331] MLIR V1 optimization pass is not enabled
Fatal Python error: Segmentation fault

Thread 0x00007f5a1720e740 (most recent call first):
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1454 in _call_tf_sessionrun
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1361 in _run_fn
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1378 in _do_call
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1371 in _do_run
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1191 in _run
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 968 in run
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 2054 in run
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 112 in _compute_theoretical_jacobian
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 235 in _compute_gradient
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 261 in <listcomp>
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 261 in _compute_gradient_list
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 322 in compute_gradient
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py", line 371 in new_func
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/ops/gradient_checker.py", line 390 in compute_gradient_error
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py", line 371 in new_func
  File "/root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test_gpu.runfiles/org_tensorflow/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test.py", line 176 in testGradient
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 1624 in decorated
  File "/usr/lib/python3.8/unittest/case.py", line 633 in _callTestMethod
  File "/usr/lib/python3.8/unittest/case.py", line 676 in run
  File "/usr/lib/python3.8/unittest/case.py", line 736 in __call__
  File "/usr/lib/python3.8/unittest/suite.py", line 122 in run
  File "/usr/lib/python3.8/unittest/suite.py", line 84 in __call__
  File "/usr/lib/python3.8/unittest/suite.py", line 122 in run
  File "/usr/lib/python3.8/unittest/suite.py", line 84 in __call__
  File "/usr/lib/python3.8/unittest/runner.py", line 176 in run
  File "/usr/lib/python3.8/unittest/main.py", line 271 in runTests
  File "/usr/lib/python3.8/unittest/main.py", line 101 in __init__
  File "/bazel_pip/lib/python3.8/site-packages/absl/testing/absltest.py", line 2524 in _run_and_get_tests_result
  File "/bazel_pip/lib/python3.8/site-packages/absl/testing/absltest.py", line 2558 in run_tests
  File "/bazel_pip/lib/python3.8/site-packages/absl/testing/absltest.py", line 2152 in _run_in_app
  File "/bazel_pip/lib/python3.8/site-packages/absl/testing/absltest.py", line 2057 in main
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/platform/googletest.py", line 51 in g_main
  File "/bazel_pip/lib/python3.8/site-packages/absl/app.py", line 254 in _run_main
  File "/bazel_pip/lib/python3.8/site-packages/absl/app.py", line 308 in run
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/platform/googletest.py", line 60 in main_wrapper
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/platform/benchmark.py", line 486 in benchmarks_main
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/platform/googletest.py", line 62 in main
  File "/bazel_pip/lib/python3.8/site-packages/tensorflow/python/platform/test.py", line 56 in main
  File "/root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test_gpu.runfiles/org_tensorflow/bazel_pip/tensorflow/python/kernel_tests/nn_ops/conv2d_transpose_test.py", line 334 in <module>

*** Received signal 11 ***
*** BEGIN MANGLED STACK TRACE ***
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x17e5780)[0x7f599157e780]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f5a1741d090]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f5a1741d00b]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f5a1741d090]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(+0x3986ec5)[0x7f5691b7cec5]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(+0x395d0e9)[0x7f5691b530e9]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(+0x19ce755)[0x7f568fbc4755]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(_ZN5cudnn3cnn31ConvolutionForwardGroupedDirect21execute_internal_implERKNS_7backend11VariantPackEP11CUstream_st+0x2f6)[0x7f568fb6b536]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(_ZN5cudnn3cnn15EngineInterface7executeERKNS_7backend11VariantPackEP11CUstream_st+0xd5)[0x7f568f4440a5]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(_ZN5cudnn7backend7executeEP12cudnnContextRNS0_13ExecutionPlanERNS0_11VariantPackE+0x13cc)[0x7f568f4561fc]
/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8(cudnnBackendExecute+0x111)[0x7f568f4565b1]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(cudnnBackendExecute+0x40)[0x7f5991abacf0]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZNK15stream_executor3gpu24CudnnExecutionPlanRunnerIFvNS_16DeviceMemoryBaseES2_S2_EEclEPNS_6StreamEPNS_3dnn13ProfileResultES2_S2_S2_S2_+0x690)[0x7f5991a93930]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZN10tensorflow8internal16AutotuneConvImplIZNS_19AutotuneUnfusedConvIfEEN3tsl8StatusOrINS_13AutotuneEntryIN15stream_executor3dnn6ConvOpEEEEEbPNS_11AutotuneMapINS_14ConvParametersES9_NS0_17AutotuneMapHasherISC_EEEERKSC_PNS_15OpKernelContextENS7_15ConvolutionKindERKNS7_15BatchDescriptorENS6_12DeviceMemoryIT_EERKNS7_16FilterDescriptorESR_RKNS7_21ConvolutionDescriptorESO_SR_lEUlPNS6_16ScratchAllocatorERKSt10unique_ptrIKNS7_8OpRunnerIFvNS6_16DeviceMemoryBaseES12_S12_EEESt14default_deleteIS15_EEPNS7_13ProfileResultEE_S13_EENS4_ISt6vectorINS_14AutotuneResultESaIS1F_EEEESK_RS1E_IS10_IKNS11_IT0_EES16_IS1L_EESaIS1N_EEbRKSQ_mRKNS6_16RedzoneAllocatorE+0x3d4)[0x7f599c1c02d4]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZN10tensorflow19AutotuneUnfusedConvIfEEN3tsl8StatusOrINS_13AutotuneEntryIN15stream_executor3dnn6ConvOpEEEEEbPNS_11AutotuneMapINS_14ConvParametersES7_NS_8internal17AutotuneMapHasherISA_EEEERKSA_PNS_15OpKernelContextENS5_15ConvolutionKindERKNS5_15BatchDescriptorENS4_12DeviceMemoryIT_EERKNS5_16FilterDescriptorESQ_RKNS5_21ConvolutionDescriptorESN_SQ_l+0x38d)[0x7f599c1c70bd]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZN10tensorflow18LaunchConv2DOpImplIfEEvPNS_15OpKernelContextEbbRKNS_6TensorES5_iiiiRKNS_7PaddingERKSt6vectorIlSaIlEEPS3_NS_12TensorFormatE+0x1095)[0x7f599be457e5]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZN10tensorflow8Conv2DOpIN5Eigen9GpuDeviceEfE7ComputeEPNS_15OpKernelContextE+0x1d9)[0x7f599be46ce9]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x329)[0x7f5991523259]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x16e450b)[0x7f599147d50b]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x16e6868)[0x7f599147f868]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZN5Eigen15ThreadPoolTemplIN3tsl6thread16EigenEnvironmentEE10WorkerLoopEi+0x2a5)[0x7f59a72af595]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_cc.so.2(_ZNSt17_Function_handlerIFvvEZN3tsl6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x43)[0x7f59a72ad2c3]
/bazel_pip/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x157078b)[0x7f599130978b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f5a173bf609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f5a174f9133]
*** END MANGLED STACK TRACE ***

*** Begin stack trace ***
	tsl::CurrentStackTrace[abi:cxx11]()
	gsignal
	cudnn::cnn::ConvolutionForwardGroupedDirect::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*)
	cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*)
	cudnn::backend::execute(cudnnContext*, cudnn::backend::ExecutionPlan&, cudnn::backend::VariantPack&)
	cudnnBackendExecute
	cudnnBackendExecute
	stream_executor::gpu::CudnnExecutionPlanRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)>::operator()(stream_executor::Stream*, stream_executor::dnn::ProfileResult*, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase) const
```
tsl::StatusOr<std::vector<tensorflow::AutotuneResult, std::allocator<std::vector> > > tensorflow::internal::AutotuneConvImpl<tensorflow::AutotuneUnfusedConv<float>(bool, tensorflow::AutotuneMap<tensorflow::ConvParameters, tensorflow::AutotuneEntry<stream_executor::dnn::ConvOp>, tensorflow::internal::AutotuneMapHasher<tensorflow::ConvParameters> >*, tensorflow::ConvParameters const&, tensorflow::OpKernelContext*, stream_executor::dnn::ConvolutionKind, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, long)::{lambda(stream_executor::ScratchAllocator*, std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const> > const&, stream_executor::dnn::ProfileResult*)#1}, void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)>(tensorflow::OpKernelContext*, tensorflow::AutotuneUnfusedConv<float>(bool, tensorflow::AutotuneMap<tensorflow::ConvParameters, tensorflow::AutotuneEntry<stream_executor::dnn::ConvOp>, tensorflow::internal::AutotuneMapHasher<tensorflow::ConvParameters> >*, tensorflow::ConvParameters const&, tensorflow::OpKernelContext*, stream_executor::dnn::ConvolutionKind, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, long)::{lambda(stream_executor::ScratchAllocator*, std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const> > const&, stream_executor::dnn::ProfileResult*)#1}<std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> > >, std::allocator<std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> > > >&, bool, tensorflow::AutotuneUnfusedConv<float>(bool, tensorflow::AutotuneMap<tensorflow::ConvParameters, tensorflow::AutotuneEntry<stream_executor::dnn::ConvOp>, tensorflow::internal::AutotuneMapHasher<tensorflow::ConvParameters> >*, tensorflow::ConvParameters const&, tensorflow::OpKernelContext*, stream_executor::dnn::ConvolutionKind, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, 
stream_executor::DeviceMemory<float>, long)::{lambda(stream_executor::ScratchAllocator*, std::unique_ptr<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const, std::default_delete<stream_executor::dnn::OpRunner<void (stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase, stream_executor::DeviceMemoryBase)> const> > const&, stream_executor::dnn::ProfileResult*)#1} const&, unsigned long, stream_executor::RedzoneAllocator const&)
tsl::StatusOr<tensorflow::AutotuneEntry<stream_executor::dnn::ConvOp> > tensorflow::AutotuneUnfusedConv<float>(bool, tensorflow::AutotuneMap<tensorflow::ConvParameters, tensorflow::AutotuneEntry<stream_executor::dnn::ConvOp>, tensorflow::internal::AutotuneMapHasher<tensorflow::ConvParameters> >*, tensorflow::ConvParameters const&, tensorflow::OpKernelContext*, stream_executor::dnn::ConvolutionKind, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, long)
void tensorflow::LaunchConv2DOpImpl<float>(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, std::vector<long, std::allocator<long> > const&, tensorflow::Tensor*, tensorflow::TensorFormat)
tensorflow::Conv2DOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*)
tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*)
Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int)
std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
clone
*** End stack trace ***
```
PiperOrigin-RevId: 497243188
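For context, the crashing testGradient boils down to a numeric-vs-analytic gradient check of tf.nn.conv2d_transpose. A minimal TF2-style sketch of the same kind of check, using the tf.test.compute_gradient API the deprecation warning above points to (shapes are illustrative, not the test's exact ones):

```python
import tensorflow as tf

def conv2d_transpose_fn(x):
    # conv2d_transpose filter layout: [height, width, out_channels, in_channels].
    filt = tf.ones([3, 3, 3, 2])
    return tf.nn.conv2d_transpose(
        x, filt, output_shape=[1, 8, 8, 3], strides=[1, 2, 2, 1], padding="SAME")

x = tf.random.normal([1, 4, 4, 2])
# Returns (theoretical, numerical) Jacobians; the test asserts they agree.
theoretical, numerical = tf.test.compute_gradient(conv2d_transpose_fn, [x])
max_error = float(tf.reduce_max(tf.abs(theoretical[0] - numerical[0])))
print(max_error)  # expected below a small tolerance when the GPU kernel is healthy
```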
rahulbatra85 pushed a commit that referenced this pull request Jan 24, 2023
This is mostly a 1:1 restructuring with the following changes:

1) Added simple snapshot recovery from on-disk state.
2) Removed all members tracking snapshot, stream, and source completion. I think these may have been structured incorrectly, and either way they weren't tested or used. I'll reevaluate when stream completion is implemented.
3) Removed some validations that weren't tested and/or were related to #1. Will add back after addressing #1.
4) Renamed directory -> path.

PiperOrigin-RevId: 502934739
i-chaochen pushed a commit that referenced this pull request Feb 20, 2023
…3b0f6530e4d7 [Snyk] Security upgrade python from 3.9.0-buster to 3.10-buster
jayfurmanek pushed a commit that referenced this pull request Mar 27, 2023
…tual number of batch threads we have. We are creating our own threadpool and passing it to ASBS; you'd think that ASBS would then ignore the num_batch_threads option, but in fact it still uses it as an indication of the max number of in-flight requests that are allowed. We also need to cap the other in-flight parameters so they are no larger than num_batch_threads.

2. Set lower defaults for these ASBS options -- models can always override them, but I think the current defaults are just too high. The current settings are:
- actual threadpool size: 64 (or 4 in the second experiment)
- max in-flight batches: 64
- min in-flight batches: 16
- initial in-flight batches: 16

This is a lot more in-flight batches than we were getting with SBS, where we had 2-8 batch threads, and cutting the actual threadpool size to 4 failed to adjust things as we expected because of issue #1. Proposed defaults are:
- max in-flight batches: 64 (no change)
- min in-flight batches: 1
- initial in-flight batches: 2

PiperOrigin-RevId: 519142730
wenchenvincent pushed a commit that referenced this pull request Apr 25, 2023
1. Cap ASBS's num_batch_threads parameter to be no larger than the actual number of batch threads we have. We are creating our own threadpool and passing it to ASBS; you'd think that ASBS would then ignore the num_batch_threads option, but in fact it still uses it as an indication of the max number of in-flight requests that are allowed. We also need to cap the other in-flight parameters so they are no larger than num_batch_threads.

2. Set lower defaults for these ASBS options -- models can always override them, but I think the current defaults are just too high. The current settings are:
- actual threadpool size: 64 (or 4 in the second experiment)
- max in-flight batches: 64
- min in-flight batches: 16
- initial in-flight batches: 16

This is a lot more in-flight batches than we were getting with SBS, where we had 2-8 batch threads, and cutting the actual threadpool size to 4 failed to adjust things as we expected because of issue #1. Proposed defaults are:
- max in-flight batches: 64 (no change)
- min in-flight batches: 1
- initial in-flight batches: 2

PiperOrigin-RevId: 524878260
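The capping rule in point 1 is simple enough to sketch. The following is plain Python with illustrative names, not the actual ASBS/TF Serving API; the defaults mirror the values proposed above:

```python
from dataclasses import dataclass

@dataclass
class AsbsOptions:
    # Illustrative stand-ins for the ASBS options discussed above.
    num_batch_threads: int = 64
    max_in_flight_batches: int = 64
    min_in_flight_batches: int = 1
    initial_in_flight_batches: int = 2

def cap_to_thread_pool(opts: AsbsOptions, actual_threads: int) -> AsbsOptions:
    # ASBS treats num_batch_threads as the ceiling on in-flight requests,
    # so every in-flight parameter must also respect the real pool size.
    cap = min(opts.num_batch_threads, actual_threads)
    return AsbsOptions(
        num_batch_threads=cap,
        max_in_flight_batches=min(opts.max_in_flight_batches, cap),
        min_in_flight_batches=min(opts.min_in_flight_batches, cap),
        initial_in_flight_batches=min(opts.initial_in_flight_batches, cap),
    )

print(cap_to_thread_pool(AsbsOptions(), actual_threads=4))
```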
weihanmines pushed a commit that referenced this pull request Aug 1, 2023
…-example Updated Install libffi7 package step
i-chaochen pushed a commit that referenced this pull request Aug 29, 2023
NAVI31 was not detected properly due to a typo.
rahulbatra85 pushed a commit that referenced this pull request Nov 9, 2023
Imported from GitHub PR openxla/xla#6599

FP8 cublasLt matmul uses fast accumulation when the precision of both operands is DEFAULT; otherwise it falls back to high-precision accumulation. Issue: openxla/xla#6168. This PR is closely related to Flax PR google/flax#3416.

Copybara import of the project:

-- a4140da8ca08cd2d4796a7b8f032827867a361bc by shuw <shuw@nvidia.com>: Add FP8 fast accumulation support for cublasLt.
-- 96845683cc4b1e7b947bc919fbf97d8865abeac9 by shuw <shuw@nvidia.com>: Improve based on review #1
-- e906d7620780d2cf1fe8433c933648dcb98dc61d by shuw <shuw@nvidia.com>: Improve based on review #2

Merging this change closes tensorflow#6599

PiperOrigin-RevId: 578948593
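The accumulation-mode rule described here reduces to a check on the two dot operands. A minimal sketch with illustrative names, not the XLA implementation:

```python
def use_fast_accumulation(lhs_precision: str, rhs_precision: str) -> bool:
    # cublasLt FP8 fast accumulation is requested only when BOTH dot
    # operands use DEFAULT precision; anything higher falls back to
    # high-precision accumulation.
    return lhs_precision == "DEFAULT" and rhs_precision == "DEFAULT"

assert use_fast_accumulation("DEFAULT", "DEFAULT")
assert not use_fast_accumulation("DEFAULT", "HIGHEST")
```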
zoranjovanovic-ns pushed a commit that referenced this pull request Dec 19, 2023
Imported from GitHub PR openxla/xla#7751

With fast accumulation turned on in forward mode, the cublasLt FP8 GEMM with a GELU epilogue can run as a single fused kernel. Compared against the XLA-generated GELU kernel on H100, this shows a modest improvement for a [8192, 4096] x [4096, 16384] matmul + GELU:

Execution time for matmul using cublasLt and gelu (XLA): 1.28ms
Execution time for matmul_gelu using cublasLt: 1.25ms

Copybara import of the project:

-- e8abce3b41f68cae1bb625cdecd5885413a0781d by Shu Wang <shuw@nvidia.com>: Support cublasLt Fp8 Approx Gelu epilogue fusion.
-- 818127cf582af7ceba014d88bdf027857fc8f0e5 by shuw <shuw@nvidia.com>: Remove F32 check
-- 5ce3108a9bc8459e20456d23a3ae493ef7a6a387 by shuw <shuw@nvidia.com>: Improve based on review #1

Merging this change closes tensorflow#7751

PiperOrigin-RevId: 591236441
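For reference, a NumPy sketch of the unfused baseline this fusion replaces: a matmul followed by the tanh-approximation GELU (the "approx gelu" variant named in the commits). Shapes are kept small here rather than the benchmarked [8192, 4096] x [4096, 16384]:

```python
import numpy as np

def gelu_approx(x):
    # tanh approximation of GELU:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

a = np.random.randn(8, 16).astype(np.float32)
b = np.random.randn(16, 32).astype(np.float32)
out = gelu_approx(a @ b)  # two separate steps; the cublasLt epilogue fuses them
```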
hsharsha pushed a commit that referenced this pull request Mar 18, 2024
…execution scope

Instead of always constructing the while operation's conditional in the default scope, use the scope of the while operation itself. This generates a correct CUDA graph: https://gist.github.com/ezhulenev/a84192fe8b46a4bf1a934a8baa08ea60

A memset operation launched in scope #1 is not synchronized with the initial condition handle update.

PiperOrigin-RevId: 609742672
i-chaochen pushed a commit that referenced this pull request May 13, 2024
PiperOrigin-RevId: 629853006
i-chaochen pushed a commit that referenced this pull request May 20, 2024
…uild phase to Initialize()

Imported from GitHub PR openxla/xla#12228

The first time that a NormThunk is executed, it will build a cudnn execution plan. This build step can hang if a NCCL collective is running at the same time. To fix this, I've moved the build step to take place during thunk initialization. We only observe this hang when using cudnn 9.

Here's a backtrace from the hang that will be fixed:
```
Thread 585 (Thread 0x7fb9391ff640 (LWP 41364) "main.py"):
#0 0x00007fd3d17cffd9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fd3d17da24f in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fd070967dfe in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fd0709c928a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f1970d76102 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#5 0x00007f1970f2c999 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#6 0x00007f1970a7d4ab in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#7 0x00007f1970d0a9cb in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#8 0x00007fce60b2a98c in cudnn::backend::ExecutionPlan::finalize_internal() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#9 0x00007fce60aefbb1 in cudnn::backend::Descriptor::finalize() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#10 0x00007fce60b15bec in cudnnBackendFinalize () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#11 0x00007fd2521b8f39 in cudnn_frontend::ExecutionPlanBuilder_v8::build() () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#12 0x00007fd2521734ba in stream_executor::gpu::(anonymous namespace)::GetExecPlanFromHeuristics(cudnn_frontend::OperationGraph_v8&&, stream_executor::gpu::(anonymous namespace)::CudnnHandle const&, bool) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#13 0x00007fd25216ff9b in stream_executor::gpu::CudnnSupport::NormRunnerFromDesc(stream_executor::Stream*, stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormKind, double, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#14 0x00007fd24e36b88b in stream_executor::dnn::NormOp::RunnerFromAlgorithmDesc(stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#15 0x00007fd24e36ae37 in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}::operator()() const () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#16 0x00007fd24e36adbc in void absl::lts_20230802::base_internal::CallOnceImpl<stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}>(std::atomic<unsigned int>*, absl::lts_20230802::base_internal::SchedulingMode, stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}&&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#17 0x00007fd24e36a9bd in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#18 0x00007fd24e369d29 in xla::gpu::RunGpuNorm(xla::gpu::GpuNormConfig const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, stream_executor::DeviceMemoryBase const&, stream_executor::Stream*, xla::gpu::RunNormOptions) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#19 0x00007fd24e368be6 in xla::gpu::NormThunk::ExecuteOnStream(xla::gpu::Thunk::ExecuteParams const&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
```

Copybara import of the project:

-- f53533087ba1ddcf65ad7cc6268ee89de4690d15 by Trevor Morris <tmorris@nvidia.com>: Fix hang with cudnn layer norm by moving cudnn init to Initialize()

Merging this change closes tensorflow#12228

PiperOrigin-RevId: 633220207
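Independent of the cudnn specifics, the fix is a standard restructuring: do the blocking build eagerly in Initialize(), before any collective can be in flight, instead of lazily on first execution. A minimal Python sketch of that shape, with illustrative names rather than the XLA Thunk API:

```python
def build_execution_plan(config):
    # Stand-in for the cudnn-frontend plan build/finalize that can block.
    return ("plan", config)

class NormThunk:
    def __init__(self, config):
        self._config = config
        self._plan = None

    def initialize(self):
        # Build eagerly, before any NCCL collective is running concurrently;
        # building lazily inside execute_on_stream() is what could hang.
        self._plan = build_execution_plan(self._config)

    def execute_on_stream(self, stream):
        assert self._plan is not None, "initialize() must run before execution"
        print(f"executing {self._plan} on {stream}")

thunk = NormThunk(config={"kind": "layer_norm"})
thunk.initialize()
thunk.execute_on_stream(stream="stream0")
```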
hsharsha pushed a commit that referenced this pull request Aug 15, 2024
```
name                                      old cpu/op   new cpu/op   delta
BM_SelectAndScatterF32/128/process_time   889µs ± 1%   740µs ± 3%   -16.70%
BM_SelectAndScatterF32/256/process_time   3.64ms ± 2%  3.00ms ± 1%  -17.64%
BM_SelectAndScatterF32/512/process_time   15.3ms ± 1%  13.1ms ± 3%  -14.61%
```
PiperOrigin-RevId: 658063846
alekstheod pushed a commit that referenced this pull request Jan 13, 2025
Fixes the following TSAN race:
```
WARNING: ThreadSanitizer: data race (pid=899472)
Write of size 8 at 0x7f979e0f1cd8 by thread T69:
#0 llvm::TargetRegistry::RegisterTargetMachine(llvm::Target&, llvm::TargetMachine* (*)(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool)) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:827:27 (xla_extension.so+0x9803668) (BuildId: 6fa88e3910a5eb04)
#1 llvm::RegisterTargetMachine<llvm::X86TargetMachine>::RegisterTargetMachine(llvm::Target&) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:1250:5 (xla_extension.so+0x9803668)
#2 LLVMInitializeX86Target /proc/self/cwd/external/llvm-project/llvm/lib/Target/X86/X86TargetMachine.cpp:69:43 (xla_extension.so+0x9803668)
#3 llvm::InitializeNativeTarget() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/TargetSelect.h:123:5 (xla_extension.so+0x48d2358) (BuildId: 6fa88e3910a5eb04)
#4 xla::cpu::JitCompiler::Create(llvm::TargetOptions, xla::cpu::JitCompiler::Options, absl::lts_20230802::AnyInvocable<void (std::function<void ()>)>)::$_0::operator()() const /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:113:5 (xla_extension.so+0x48d2358)
#5 xla::cpu::JitCompiler::Create(llvm::TargetOptions, xla::cpu::JitCompiler::Options, absl::lts_20230802::AnyInvocable<void (std::function<void ()>)>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:112:34 (xla_extension.so+0x48d209b) (BuildId: 6fa88e3910a5eb04)
#6 xla::cpu::CpuCompiler::CompileLegacyCpuExecutable(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1416:3 (xla_extension.so+0x2f716a0) (BuildId: 6fa88e3910a5eb04)
#7 xla::cpu::CpuCompiler::RunBackend(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1730:3 (xla_extension.so+0x2f7ae18) (BuildId: 6fa88e3910a5eb04)
#8 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:759:19 (xla_extension.so+0x2f12915) (BuildId: 6fa88e3910a5eb04)
#9 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:847:3 (xla_extension.so+0x2f12915)

Previous read of size 8 at 0x7f979e0f1cd8 by thread T66:
#0 llvm::Target::createTargetMachine(llvm::StringRef, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) const /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:460:10 (xla_extension.so+0x94ba6db) (BuildId: 6fa88e3910a5eb04)
#1 llvm::EngineBuilder::selectTarget(llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::SmallVectorImpl<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>> const&) /proc/self/cwd/external/llvm-project/llvm/lib/ExecutionEngine/TargetSelect.cpp:88:18 (xla_extension.so+0x94ba6db)
#2 xla::cpu::JitCompiler::InferTargetMachine(llvm::TargetOptions const&, llvm::CodeGenOptLevel, std::optional<tsl::port::CPUFeature>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:88:12 (xla_extension.so+0x48d096f) (BuildId: 6fa88e3910a5eb04)
#3 xla::cpu::CpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1017:3 (xla_extension.so+0x2f70857) (BuildId: 6fa88e3910a5eb04)
#4 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:754:3 (xla_extension.so+0x2f12874) (BuildId: 6fa88e3910a5eb04)
#5 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:847:3 (xla_extension.so+0x2f12874)
#6 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:893:10 (xla_extension.so+0x2f13ef2) (BuildId: 6fa88e3910a5eb04)
```
PiperOrigin-RevId: 707701032
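The race here is two compile threads hitting LLVM's process-wide target registration concurrently. The commit message doesn't show the fix itself; a common remedy for this class of race is to funnel the global initialization through a once-guard, sketched below under that assumption (plain Python, illustrative names):

```python
import threading

_llvm_init_once = threading.Lock()
_llvm_initialized = False

def register_targets():
    # Stand-in for LLVMInitializeX86Target / InitializeNativeTarget().
    print("LLVM targets registered")

def initialize_native_target_once():
    # Serialize process-wide registration so concurrent Compile() calls
    # cannot race on the target registry's writable globals.
    global _llvm_initialized
    with _llvm_init_once:
        if not _llvm_initialized:
            register_targets()
            _llvm_initialized = True
```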
alekstheod pushed a commit that referenced this pull request Jan 13, 2025
…r RunBackend. Both of these call into LLVM code that reads the compiler options.

Fixes the following race:
```
WARNING: ThreadSanitizer: data race (pid=869815)
Read of size 1 at 0x7f8b24effc08 by thread T65:
#0 llvm::cl::opt_storage<bool, false, false>::getValue() const /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1406:38 (xla_extension.so+0xa281417) (BuildId: 7f5d2098f168c4db)
#1 llvm::cl::opt_storage<bool, false, false>::operator bool() const /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1410:38 (xla_extension.so+0xa281417)
#2 llvm::CodeGenTargetMachineImpl::CodeGenTargetMachineImpl(llvm::Target const&, llvm::StringRef, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, llvm::Reloc::Model, llvm::CodeModel::Model, llvm::CodeGenOptLevel) /proc/self/cwd/external/llvm-project/llvm/lib/CodeGen/CodeGenTargetMachineImpl.cpp:97:7 (xla_extension.so+0xa281417)
#3 llvm::X86TargetMachine::X86TargetMachine(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) /proc/self/cwd/external/llvm-project/llvm/lib/Target/X86/X86TargetMachine.cpp:236:7 (xla_extension.so+0x9803b80) (BuildId: 7f5d2098f168c4db)
#4 llvm::RegisterTargetMachine<llvm::X86TargetMachine>::Allocator(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:1258:16 (xla_extension.so+0x980757a) (BuildId: 7f5d2098f168c4db)
#5 llvm::Target::createTargetMachine(llvm::StringRef, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) const /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:462:12 (xla_extension.so+0x94ba529) (BuildId: 7f5d2098f168c4db)
#6 llvm::EngineBuilder::selectTarget(llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::SmallVectorImpl<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>> const&) /proc/self/cwd/external/llvm-project/llvm/lib/ExecutionEngine/TargetSelect.cpp:88:18 (xla_extension.so+0x94ba529)
#7 xla::cpu::JitCompiler::InferTargetMachine(llvm::TargetOptions const&, llvm::CodeGenOptLevel, std::optional<tsl::port::CPUFeature>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:88:12 (xla_extension.so+0x48d070f) (BuildId: 7f5d2098f168c4db)
#8 xla::cpu::CpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1017:3 (xla_extension.so+0x2f6dc47) (BuildId: 7f5d2098f168c4db)
#9 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:749:3 (xla_extension.so+0x2f127e2) (BuildId: 7f5d2098f168c4db)
#10 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:842:3 (xla_extension.so+0x2f127e2)
#11 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:888:10 (xla_extension.so+0x2f13da2) (BuildId: 7f5d2098f168c4db)
#12 xla::ifrt::PjRtLoadedExecutable::Create(xla::ifrt::PjRtCompatibleClient*, mlir::ModuleOp, xla::CompileOptions, std::vector<tsl::RCReference<xla::ifrt::LoadedHostCallback>, std::allocator<tsl::RCReference<xla::ifrt::LoadedHostCallback>>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_executable.cc:258:3 (xla_extension.so+0xdd04d77) (BuildId: 7f5d2098f168c4db)
#13 xla::ifrt::PjRtCompiler::Compile(std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_compiler.cc:97:10 (xla_extension.so+0xdcfd29b) (BuildId: 7f5d2098f168c4db)
#14 xla::PyClient::CompileIfrtProgram(xla::nb_class_ptr<xla::PyClient>, std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:443:5 (xla_extension.so+0xc62a228) (BuildId: 7f5d2098f168c4db)
#15 xla::PyClient::Compile(xla::nb_class_ptr<xla::PyClient>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, xla::CompileOptions, std::vector<nanobind::capsule, std::allocator<nanobind::capsule>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:466:10 (xla_extension.so+0xc62b514) (BuildId: 7f5d2098f168c4db)

Previous write of size 1 at 0x7f8b24effc08 by thread T66 (mutexes: write M0):
#0 void llvm::cl::opt_storage<bool, false, false>::setValue<bool>(bool const&, bool) /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1401:11 (xla_extension.so+0x100bace9) (BuildId: 7f5d2098f168c4db)
#1 void llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::setDefaultImpl<bool, void>() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h (xla_extension.so+0x100bace9)
#2 llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::setDefault() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1474:32 (xla_extension.so+0x100bace9)
#3 llvm::cl::Option::reset() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:460:3 (xla_extension.so+0x100cac0e) (BuildId: 7f5d2098f168c4db)
#4 (anonymous namespace)::CommandLineParser::ResetAllOptionOccurrences() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:1478:17 (xla_extension.so+0x100cac0e)
#5 llvm::cl::ResetAllOptionOccurrences() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:2831:17 (xla_extension.so+0x100caa72) (BuildId: 7f5d2098f168c4db)
#6 xla::llvm_ir::LLVMCommandLineOptionsLock::LLVMCommandLineOptionsLock(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>> const&) /proc/self/cwd/external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:70:5 (xla_extension.so+0x91d69f4) (BuildId: 7f5d2098f168c4db)
#7 xla::cpu::CpuCompiler::RunBackend(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1727:39 (xla_extension.so+0x2f781c8) (BuildId: 7f5d2098f168c4db)
#8 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:754:19 (xla_extension.so+0x2f12883) (BuildId: 7f5d2098f168c4db)
#9 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:842:3 (xla_extension.so+0x2f12883)
#10 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:888:10 (xla_extension.so+0x2f13da2) (BuildId: 7f5d2098f168c4db)
#11 xla::ifrt::PjRtLoadedExecutable::Create(xla::ifrt::PjRtCompatibleClient*, mlir::ModuleOp, xla::CompileOptions, std::vector<tsl::RCReference<xla::ifrt::LoadedHostCallback>, std::allocator<tsl::RCReference<xla::ifrt::LoadedHostCallback>>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_executable.cc:258:3 (xla_extension.so+0xdd04d77) (BuildId: 7f5d2098f168c4db)
#12 xla::ifrt::PjRtCompiler::Compile(std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_compiler.cc:97:10 (xla_extension.so+0xdcfd29b) (BuildId: 7f5d2098f168c4db)
#13 xla::PyClient::CompileIfrtProgram(xla::nb_class_ptr<xla::PyClient>, std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:443:5 (xla_extension.so+0xc62a228) (BuildId: 7f5d2098f168c4db)
#14 xla::PyClient::Compile(xla::nb_class_ptr<xla::PyClient>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, xla::CompileOptions, std::vector<nanobind::capsule, std::allocator<nanobind::capsule>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:466:10 (xla_extension.so+0xc62b514) (BuildId: 7f5d2098f168c4db)
```
PiperOrigin-RevId: 707721170