Energy APIs segfault with strange pthread_mutex_init issue on Lassen #568

Closed
tpatki opened this issue Jul 30, 2024 · 3 comments · Fixed by #571

tpatki commented Jul 30, 2024

This was first discovered in Tre's testing of the updated integration of Variorum and Caliper when we encountered a segmentation fault.

Note that this bug applies only to Lassen, where we use pthreads to sample power in the energy APIs. Thanks @tjeter for finding this sneaky issue!

  • When we run the variorum-get-energy-json-example example without -fsanitize=address, it works correctly in our Variorum tests but fails in the Caliper integration.
  • When we add -fsanitize=address to the CMakeLists.txt in src/examples, we get the same error that we see in the Caliper integration PR and can reproduce it on our end on Lassen. Strangely, the error does not occur on our end without the address sanitizer.

Currently, it seems like the fix is to declare the pthread_mutex_t mlock as static here. We need to discuss this more to see why this error occurs and where the memory is getting corrupted. @rountree and I spent significant time debugging with gdb and were unable to find the source of corruption.

@tpatki's guess is that there is a naming conflict involving the name mlock, as the error goes away if we either (1) rename this variable to any other name or (2) declare it as static. More investigation is needed to understand this.
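
As a minimal sketch of the two workarounds above, the following combines them: the mutex gets internal linkage via static, and its name does not match libc's mlock() function. The identifiers p9_energy_lock, energy_lock_init, energy_locked_add, and energy_lock_destroy are hypothetical and are not taken from Power9.c; this only illustrates the idea and is not the actual fix.

```c
#include <pthread.h>

/* Hypothetical name. `static` gives the mutex internal linkage, so the
 * symbol is not exported from this translation unit and cannot be confused
 * with a same-named symbol from another library at load time. */
static pthread_mutex_t p9_energy_lock;

/* Initialize the file-local mutex. Returns 0 on success. */
int energy_lock_init(void)
{
    return pthread_mutex_init(&p9_energy_lock, NULL);
}

/* Add `delta` to `*accum` while holding the mutex, as a sampling thread
 * might do when accumulating readings. Returns 0 on success. */
int energy_locked_add(long *accum, long delta)
{
    int rc = pthread_mutex_lock(&p9_energy_lock);
    if (rc != 0)
    {
        return rc;
    }
    *accum += delta;
    return pthread_mutex_unlock(&p9_energy_lock);
}

/* Destroy the mutex once sampling has stopped. Returns 0 on success. */
int energy_lock_destroy(void)
{
    return pthread_mutex_destroy(&p9_energy_lock);
}
```

A caller would invoke energy_lock_init() once before starting the sampling thread and energy_lock_destroy() after joining it.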

(Lassen output)
$ ./variorum-get-energy-json-example 
AddressSanitizer:DEADLYSIGNAL
=================================================================
==99435==ERROR: AddressSanitizer: SEGV on unknown address 0x2000001313e0 (pc 0x200000dcebc0 bp 0x7fffffffad30 sp 0x7fffffffad30 T0)
==99435==The signal is caused by a UNKNOWN memory access.
    #0 0x200000dcebc0 in __memset_power8 (/lib64/libc.so.6+0xaebc0)
    #1 0x200000100fb4 in memset (/lib64/libasan.so.6+0x40fb4)
    #2 0x200000f7b5cc in __GI___pthread_mutex_init /usr/src/debug/glibc-2.17-c758a686/nptl/pthread_mutex_init.c:83
    #3 0x200000e5bbdc in pthread_mutex_init /usr/src/debug/glibc-2.17-c758a686/nptl/forward.c:188
    #4 0x200000b1a890 in ibm_cpu_p9_get_node_energy_json /g/g90/patki1/src/var-clean/variorum/src/variorum/IBM/Power9.c:834
    #5 0x200000b179c4 in variorum_get_energy_json /g/g90/patki1/src/var-clean/variorum/src/variorum/variorum.c:1741
    #6 0x100010b4 in main /g/g90/patki1/src/var-clean/variorum/src/examples/variorum-get-energy-json-example.c:56
    #7 0x200000d452fc in generic_start_main ../csu/libc-start.c:266

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib64/libc.so.6+0xaebc0) in __memset_power8
==99435==ABORTING
@tpatki added this to the Minor Update: v0.8.1 milestone on Jul 30, 2024
@tpatki changed the title from "Get Energy/Print Energy API set faults with strange pthread_mutex_init issue on Lassen" to "Energy APIs segfault with strange pthread_mutex_init issue on Lassen" on Jul 30, 2024

tpatki commented Jul 30, 2024

@rountree Comments from debugging on Jul 30.

Todo:

  • Test different compiler versions on Lassen
  • Test on Octomore or another Intel system

Notes:

  • Same error on gcc/12.2.1. Only occurs when we build with -fsanitize=address. Here's an ldd dump from that build.
	linux-vdso64.so.1 =>  (0x0000200000050000)
	libasan.so.8 => /lib64/libasan.so.8 (0x00002000000c0000)
	libvariorum.so => /g/g90/patki1/src/var-clean/variorum/build-lassen-gcc12/variorum/libvariorum.so (0x00002000007e0000)
	libhwloc.so.15 => /usr/WS2/variorum/2024-flux-power-monitor-testing/local/lib/libhwloc.so.15 (0x0000200000820000)
	libjansson.so.4 => /lib64/libjansson.so.4 (0x00002000008d0000)
	libm.so.6 => /lib64/libm.so.6 (0x0000200000900000)
	libc.so.6 => /lib64/libc.so.6 (0x00002000009f0000)
	libdl.so.2 => /lib64/libdl.so.2 (0x0000200000be0000)
	librt.so.1 => /lib64/librt.so.1 (0x0000200000c10000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x0000200000c40000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x0000200000c80000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000200000e10000)
	libgfortran.so.5 => /lib64/libgfortran.so.5 (0x0000200000e50000)
	libquadmath.so.0 => /lib64/libquadmath.so.0 (0x0000200001040000)
	libudev.so.1 => /lib64/libudev.so.1 (0x00002000010b0000)
	libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00002000010f0000)
	libcudart.so.12 => /usr/tce/packages/cuda/cuda-12.0.0/lib64/libcudart.so.12 (0x0000200001120000)
	libnvidia-ml.so.1 => /lib64/libnvidia-ml.so.1 (0x0000200001200000)
	libxml2.so.2 => /lib64/libxml2.so.2 (0x0000200001fe0000)
	/lib64/ld64.so.2 (0x0000200000000000)
	libz.so.1 => /lib64/libz.so.1 (0x0000200002220000)
	libcap.so.2 => /lib64/libcap.so.2 (0x0000200002260000)
	libdw.so.1 => /lib64/libdw.so.1 (0x0000200002290000)
	liblzma.so.5 => /lib64/liblzma.so.5 (0x0000200002310000)
	libattr.so.1 => /lib64/libattr.so.1 (0x0000200002370000)
	libelf.so.1 => /lib64/libelf.so.1 (0x00002000023a0000)
	libbz2.so.1 => /lib64/libbz2.so.1 (0x00002000023e0000)
  • Same error on gcc/8.3.1 as well, with the older (system default) hwloc. ldd dump below.
	linux-vdso64.so.1 =>  (0x0000200000050000)
	libasan.so.5 => /lib64/libasan.so.5 (0x00002000000c0000)
	libvariorum.so => /g/g90/patki1/src/var-clean/variorum/build-lassen-gcc8/variorum/libvariorum.so (0x0000200000b70000)
	libhwloc.so.5 => /lib64/libhwloc.so.5 (0x0000200000bb0000)
	libjansson.so.4 => /lib64/libjansson.so.4 (0x0000200000c20000)
	libm.so.6 => /lib64/libm.so.6 (0x0000200000c50000)
	libc.so.6 => /lib64/libc.so.6 (0x0000200000d40000)
	libdl.so.2 => /lib64/libdl.so.2 (0x0000200000f30000)
	librt.so.1 => /lib64/librt.so.1 (0x0000200000f60000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x0000200000f90000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x0000200000fd0000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000200001160000)
	libgfortran.so.5 => /lib64/libgfortran.so.5 (0x00002000011a0000)
	libquadmath.so.0 => /lib64/libquadmath.so.0 (0x0000200001390000)
	libnuma.so.1 => /lib64/libnuma.so.1 (0x0000200001400000)
	libltdl.so.7 => /lib64/libltdl.so.7 (0x0000200001430000)
	/lib64/ld64.so.2 (0x0000200000000000)
	libz.so.1 => /lib64/libz.so.1 (0x0000200001460000)

@lee218llnl

Don’t ask me to explain this, but if I declare your mlock as static in Power9.c, it appears to work:

static pthread_mutex_t mlock;

rountree@lassen709 ~/w/lassen/source/hw$ gcc -fsanitize=address -O0 -DVARIORUM_ONLY hw.c -Wall -Wextra -lvariorum -I${HOME}/w/lassen/install/variorum/include -L/usr/WS1/rountree/lassen/source/variorum/lee218build/variorum -Wl,-rpath=/usr/WS1/rountree/lassen/source/variorum/lee218build/variorum -o lee218 -g

rountree@lassen709 ~/w/lassen/source/hw$ ./lee218
Hello, world!
Failed to open occ_inband_sensors file
Failed to open occ_inband_sensors file

=================================================================
==18838==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 100 byte(s) in 1 object(s) allocated from:
#0 0x2000001f7f68 in __interceptor_malloc (/lib64/libasan.so.5+0x137f68)
#1 0x200001187360 (/lib64/libjansson.so.4+0x7360)
#2 0x20000118742c (/lib64/libjansson.so.4+0x742c)
#3 0x2000011874ac (/lib64/libjansson.so.4+0x74ac)
#4 0x20000118363c in json_dumps (/lib64/libjansson.so.4+0x363c)
#5 0x200000b83034 in variorum_get_energy_json /g/g24/rountree/w/lassen/source/variorum/src/variorum/variorum.c:1746
#6 0x10000c50 in main /g/g24/rountree/w/lassen/source/hw/hw.c:23
#7 0x200000be52fc in generic_start_main ../csu/libc-start.c:266
#8 0x200000be54f0 in __libc_start_main ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:81

SUMMARY: AddressSanitizer: 100 byte(s) leaked in 1 allocation(s).


tpatki commented Aug 1, 2024

Thanks @lee218llnl. I confirm that using static or renaming the variable fixes this issue on Lassen.

To be clear, we have discovered at this stage that this is not a Variorum bug. There seems to be a strange naming conflict with libc on Lassen: if we give a variable the same name as one that is declared in libc, we get a segfault (as tested by @rountree). This confirms my initial suspicion from 3 days ago.
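
One way to check whether a candidate global name is already taken in a process is to look it up in the global symbol scope at run time. The sketch below does this with dlopen(NULL, ...) and dlsym(). The helper name symbol_in_global_scope is hypothetical, and this only illustrates the kind of clash described above; it is not the procedure used in the debugging session. On older glibc versions the program needs to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `name` resolves to a symbol exported by
 * the main program or one of its loaded dependencies, 0 if it does not, and
 * -1 if the main program handle cannot be obtained. */
int symbol_in_global_scope(const char *name)
{
    /* A NULL filename yields a handle for the main program; lookups on it
     * search the program and the shared objects loaded with it. */
    void *self = dlopen(NULL, RTLD_LAZY);
    if (self == NULL)
    {
        return -1;
    }

    dlerror(); /* Clear any stale error state before the lookup. */
    void *addr = dlsym(self, name);
    int found = (addr != NULL);

    dlclose(self);
    return found;
}
```

In a dynamically linked glibc program, symbol_in_global_scope("mlock") is expected to return 1, because libc exports an mlock() function.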

Note that we get this segfault only on Lassen, either when building with -fsanitize=address or when linking with Caliper. We were not able to reproduce this segfault on Octomore.

I tried to reproduce with clang on Lassen based on asan issues reported in these posts (this, this, and this), but could not find a compatible libasan on Lassen.

My suggestion at this point is to just rename the mlock variable or declare it as static and push a fix to Variorum. That way we can make progress on the actual Caliper port with @tjeter, which is a more meaningful use of our time.
