Data race detected in libfabric shm provider with multi-threaded client-server setup #10528
Comments
@piotrchmiel Thanks for reporting! Any chance you have an existing reproducer you could send so I don't have to try to implement it?
@aingerson I prepared a reproducer:
I'm using clang 19.1.2 (https://github.com/llvm/llvm-project/tree/llvmorg-19.1.2). Thread sanitizer logs from the reproducer:
The reproducer also shows another issue that appears more often:
@aingerson I just wanted to kindly follow up and ask if you’ve had a chance to take a look at the reproducer I shared for the issue.
@piotrchmiel Hello again! I was having issues building libfabric with the thread sanitizer (I've never used it before). After doing some research on the thread sanitizer, I learned that it doesn't do well with code that uses atomics for serialization (which is what shm uses) and that it is likely to throw a lot of false positives because of this. I've taken a look at the backtraces and code you've provided and don't see how there could be a race there, since those accesses are managed by atomics. So I think it is just a false positive, and it moved down on my priority list because of the release. I can revisit, but I think the thread sanitizer will most likely struggle to analyze shm effectively.
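For context on this kind of false positive: ThreadSanitizer does model C11/C++11 atomic loads and stores, but one commonly cited limitation is that it does not treat standalone memory fences (atomic_thread_fence) as establishing happens-before, so serialization built on relaxed or plain accesses plus explicit fences can be flagged even when the ordering is correct. The sketch below is a generic, hypothetical illustration of such a pattern, not code taken from libfabric:

```c
/* Hypothetical pattern that is correctly synchronized under the C11 memory
 * model, but that ThreadSanitizer is commonly reported to flag, because it
 * does not model standalone fences as creating a happens-before edge
 * between the relaxed flag accesses. */
#include <stdatomic.h>

static int        payload;   /* plain data, handed off via the flag */
static atomic_int ready;     /* flag accessed only with relaxed ops */

void producer(void)
{
	payload = 42;                                          /* plain write */
	atomic_thread_fence(memory_order_release);             /* fence-based ordering */
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

void consumer(void)
{
	while (!atomic_load_explicit(&ready, memory_order_relaxed))
		;
	atomic_thread_fence(memory_order_acquire);
	int v = payload;                     /* TSan may report this read as a race */
	(void)v;
}
```

Whether this particular limitation is what triggers the report here would depend on how the shm atomics are actually compiled in this build; the example is only meant to show why "uses atomics" and "TSan-clean" are not the same thing.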
Thanks for the update and the detailed explanation regarding the ThreadSanitizer behavior with atomics. I understand that handling these cases can be tricky and prone to false positives. However, could you perhaps provide a short explanation, with references to the relevant sections of the code, describing how the data race warning is avoided in this case? Specifically, I’m interested in the synchronization approach that ensures a data race should never occur between, for example, smr_name_compare and smr_send_name. Understanding the synchronization mechanism in play, such as mutexes or atomic operations, would help clarify why this warning might be a false positive. Any insights or pointers to the relevant parts of the code would be greatly appreciated.
@piotrchmiel Absolutely! This flow happens on every single send/receive, including the startup flow, which involves exchanging name information in order to do the shm mapping (if needed). This second part is where send_name and name_compare come into play. The sender claims a command, copies its own name into a local inject buffer, and then commits it to the receiver's command queue. So by the time we get to this commit in send_name, we have claimed a unique buffer which no one else can claim (see the freestack implementation for that control).

The receiver will read that committed command and then insert it and map it here. The receiver has a map of names->region, so it needs to look in that map and see if it already knows about this peer (that's where the name_compare happens). Once that's all done, it releases the tx_buffer and the command.

So in your backtrace, it looks like the thread sanitizer is complaining that the tx_buf (which holds the name being sent, and compared on the receive side) is racy on the read and write. But the flow is claim (atomics), write, commit (atomics), poll (atomics), read, release (atomics), so we should never be reading and writing to that buffer at the same time, because the command queue atomics control access to it.
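To make that ordering concrete, here is a simplified, hypothetical sketch of the claim/write/commit/poll/read/release handoff using C11 atomics. It is not the actual smr command-queue code (that lives in prov/shm/src/smr_util.c and smr_ep.c); cmd_slot, sender_commit_name, and receiver_consume_name are made-up names for illustration:

```c
/* Simplified, hypothetical sketch of the claim -> write -> commit ->
 * poll -> read -> release handoff described above, using C11 atomics.
 * NOT the actual smr command-queue implementation. */
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

struct cmd_slot {
	atomic_bool committed;  /* commit flag: set by sender, cleared by receiver */
	char        name[256];  /* payload, e.g. the sender's shm name             */
};

/* Sender: the slot has already been claimed exclusively (freestack-style),
 * so the plain write to name[] cannot race with any other writer. */
static void sender_commit_name(struct cmd_slot *slot, const char *name)
{
	strncpy(slot->name, name, sizeof(slot->name) - 1);       /* write   */
	slot->name[sizeof(slot->name) - 1] = '\0';
	atomic_store_explicit(&slot->committed, true,
			      memory_order_release);              /* commit  */
}

/* Receiver: only touches name[] after observing the commit, so the acquire
 * load orders its read after the sender's write. */
static bool receiver_consume_name(struct cmd_slot *slot, char *out, size_t len)
{
	if (!atomic_load_explicit(&slot->committed,
				  memory_order_acquire))          /* poll    */
		return false;
	strncpy(out, slot->name, len - 1);                        /* read    */
	out[len - 1] = '\0';
	atomic_store_explicit(&slot->committed, false,
			      memory_order_release);              /* release */
	return true;
}
```

Under this ordering the receiver's read of name[] is forced to happen after the sender's write by the release/acquire pair on committed, which is the property the shm command-queue atomics are described as providing.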
Describe the bug
When using the shm provider in a simple setup with one client thread and one server thread within a single process, a data race is detected when compiled with clang-19 and run with ThreadSanitizer. The client performs one fi_send, and the server performs one fi_recv, with a message size of 1000 bytes. The data race appears during the fi_cq_read operation on the server side and the fi_send operation on the client side.
To Reproduce
Observe that ThreadSanitizer reports a data race during execution
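For reference, a minimal sketch of the two-thread structure described above, assuming the usual libfabric setup (fi_getinfo with FI_EP_RDM and prov_name "shm", followed by fi_fabric/fi_domain/fi_cq_open/fi_endpoint/fi_av_insert) has already been done for each side. struct peer_res, server_thread, and client_thread are illustrative names, not the reporter's actual reproducer:

```c
/* Illustrative sketch only: endpoint/CQ/AV setup is elided and assumed to
 * be filled into struct peer_res by the caller. */
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_errno.h>

#define MSG_SIZE 1000

struct peer_res {
	struct fid_ep *ep;        /* enabled RDM endpoint (setup elided) */
	struct fid_cq *cq;        /* CQ bound to the endpoint            */
	fi_addr_t      peer_addr; /* remote address from fi_av_insert    */
};

static void wait_comp(struct fid_cq *cq)
{
	struct fi_cq_entry comp;

	/* Busy-poll the completion queue; shm progress happens here. */
	while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
		;
}

/* Server thread: post one receive, then poll for its completion.
 * The TSan report points at fi_cq_read -> smr_name_compare here. */
static void *server_thread(void *arg)
{
	struct peer_res *res = arg;
	static char rx_buf[MSG_SIZE];

	fi_recv(res->ep, rx_buf, sizeof(rx_buf), NULL, FI_ADDR_UNSPEC, NULL);
	wait_comp(res->cq);
	return NULL;
}

/* Client side (the main thread in the reported backtrace): send one
 * 1000-byte message.  The TSan report points at fi_send -> smr_send_name. */
static void *client_thread(void *arg)
{
	struct peer_res *res = arg;
	static char tx_buf[MSG_SIZE];

	memset(tx_buf, 'x', sizeof(tx_buf));
	fi_send(res->ep, tx_buf, sizeof(tx_buf), NULL, res->peer_addr, NULL);
	wait_comp(res->cq);
	return NULL;
}
```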
Expected behavior
No data race should be detected when performing simple send and receive operations between a client and server thread within the same process using the shm provider.
Output
WARNING: ThreadSanitizer: data race (pid=543759)
Server thread:
Read of size 8 at 0x7febb28f3000 by thread T4 (mutexes: write M0, write M1, write M2):
#0 strncmp /home/piotrchmiel/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors.inc:487:3 (test+0x9d66d)
#1 smr_name_compare /home/piotrchmiel/test/third_party/libfabric/prov/shm/src/smr_util.c:351:9 (libfabric.so.1+0xe5a33)
Client thread:
Previous write of size 8 at 0x7febb28f3000 by main thread:
#0 memcpy /home/piotrchmiel/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115:5 (test+0x8e9de)
#1 smr_send_name /home/piotrchmiel/test/third_party/libfabric/prov/shm/src/smr_ep.c:206:2 (libfabric.so.1+0xdcfcd)
Environment:
Additional context
The data race occurs specifically on memory accesses in smr_name_compare (during fi_cq_read on the server side) and in smr_send_name (during fi_send on the client side).