tsan warning: lock-order-inversion (potential deadlock) #14
Hi @larsonmpdx, Rahul
Thanks. tsan normally has a low false positive rate, but this kind of deadlock warning can trigger on code that makes sense. Usually the fix is to reduce the number of mutexes used, or to use std::lock to acquire multiple mutexes at once, since it guarantees no deadlocks (http://en.cppreference.com/w/cpp/thread/lock). There are some minor asan/ubsan warnings too, probably nothing, and some basic clang warnings when using this warning set: -Wall -Wextra -Wpedantic -Wpointer-arith -Wno-unused-parameter -Wcast-qual -Wcast-align. I'm trying to get my whole project to run clean on tsan/asan/ubsan with no warnings, so it would be great to get rid of these even if they're false positives. I'm busy till next week, but I can provide some short instructions to reproduce on Monday or Tuesday.
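For reference, here is a minimal, self-contained sketch of the std::lock / std::scoped_lock approach mentioned above. The mutex and function names are hypothetical stand-ins, not the SDK's actual members; the point is that both orderings are safe because the locking algorithm avoids deadlock.

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-ins for two mutexes that must sometimes be held together.
std::mutex response_lock;
std::mutex ack_map_lock;

void worker_a() {
    // std::scoped_lock (C++17) acquires both mutexes with a deadlock-avoidance
    // algorithm, so the order they are listed in does not matter.
    std::scoped_lock both(response_lock, ack_map_lock);
    // ... touch data guarded by both locks ...
}

void worker_b() {
    // Pre-C++17 equivalent: std::lock acquires both, then lock_guard adopts them.
    std::lock(ack_map_lock, response_lock);
    std::lock_guard<std::mutex> g1(ack_map_lock, std::adopt_lock);
    std::lock_guard<std::mutex> g2(response_lock, std::adopt_lock);
    // ... touch data guarded by both locks ...
}

int main() {
    // Running both paths concurrently is safe even though they name the
    // mutexes in opposite orders.
    std::thread t1(worker_a);
    std::thread t2(worker_b);
    t1.join();
    t2.join();
}
```

Compile with -std=c++17 -pthread; on pre-C++17 toolchains only the std::lock form in worker_b is available.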
See #15.
Hi all, from what I can see, and supported by the traces:
We are at this stage unable to publish with QoS 1, as Publish() eventually locks up. Could you please advise, @chaurah? Any pointers or a resolution would be appreciated.
Hi @lerebel103, Rahul
Thanks @chaurah,
I haven't spent time understanding your existing mutexes, but I'll add two things:
I'm considering using this code for a project, and after a quick analysis it's clear that the locking here needs rework. I visually traced at least one deadlock. Even simplifying to a single lock would still be problematic, as there would be recursive locking paths, and I feel recursive_mutex is generally not the best fix for those. To make the locking design safer, the easiest approach I've found, and the one that's easiest to reason about, is to be stricter about the interface design. On any team I'm involved with, this design would be sent back for refactoring with the following recommendations:
Many of these can be addressed in a more significant refactoring of the class where you break out each lock and its associated data and manipulation methods into its own class that is used by the parent. This makes things much clearer and less error prone when the discipline isn't there to make it clear through naming and grouping conventions. ClientCoreState here could definitely be refactored into multiple classes around the locking.
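To illustrate the "each lock and its data in its own class" recommendation, here is a rough sketch assuming the ack bookkeeping is the guarded data. AckRegistry and its members are invented for illustration, not the SDK's actual types; the point is that the mutex never escapes the class and no method takes a second lock.

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical monitor-style wrapper: the mutex, the data it guards, and every
// operation on that data live in one small class, so no caller can touch the
// map without the lock and no method here ever needs a second mutex.
class AckRegistry {
public:
    void Add(uint16_t packet_id, std::chrono::steady_clock::time_point expiry) {
        std::lock_guard<std::mutex> guard(lock_);
        pending_[packet_id] = expiry;
    }

    bool Remove(uint16_t packet_id) {
        std::lock_guard<std::mutex> guard(lock_);
        return pending_.erase(packet_id) > 0;
    }

    // Deletes expired entries; never calls out to user code while holding lock_.
    size_t DeleteExpired(std::chrono::steady_clock::time_point now) {
        std::lock_guard<std::mutex> guard(lock_);
        size_t removed = 0;
        for (auto it = pending_.begin(); it != pending_.end();) {
            if (it->second <= now) {
                it = pending_.erase(it);
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

private:
    std::mutex lock_;  // guards pending_ only
    std::map<uint16_t, std::chrono::steady_clock::time_point> pending_;
};
```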
Hi @larsonmpdx and @rwightman,
@larsonmpdx: I will look into simplifying the locks if I end up simply fixing this design. I think a read and a write lock at the network layer plus a sync/async op lock at the client core may make this simpler, barring any of the complications that @rwightman mentions. If you have any specific thoughts, I would love to hear them.
@rwightman: Thank you again for the excellent list of recommendations.
I know this has been open for a long time, but I will address it. Your feedback is extremely important in helping me decide what the next step should be, and I really appreciate you both taking the time to provide it. As I have said before, please do keep it coming. I will have further updates soon :)
Rahul
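For what it's worth, a hypothetical sketch of the "read and write lock at the network layer" idea (class and member names invented, not the SDK's actual network connection code): one mutex serializes socket reads, another serializes socket writes, so a read and a write can overlap but two operations in the same direction cannot.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Invented sketch: separate read and write locks at the network layer.
class NetworkConnectionSketch {
public:
    size_t Read(std::vector<unsigned char> &buffer) {
        std::lock_guard<std::mutex> guard(read_lock_);   // one in-flight read
        // ... read from the underlying socket into buffer ...
        return buffer.size();
    }

    size_t Write(const std::vector<unsigned char> &buffer) {
        std::lock_guard<std::mutex> guard(write_lock_);  // one in-flight write
        // ... write buffer to the underlying socket ...
        return buffer.size();
    }

private:
    std::mutex read_lock_;
    std::mutex write_lock_;
};
```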
Hi @chaurah,
Hi,
I think I'm seeing this in production now, but our message rates are fairly low, so it's every 2-4 days.
Hello Amazon Web Services, As noted,
Hi @lerebel103,
Thanks @vareddy, we appreciate that this is actively being resolved, as it is critical from a device point of view.
Hi @lerebel103,
The SDK itself does not call Could you provide a code snippet if possible?
Hi @vareddy, See:
DeleteExpiredAcks() is called internally from ClientCoreState::ProcessOutboundActionQueue(). Please note that for brevity I only provided a subset of the traces for discussion.
However, by observing how
Re code snippet: I'm afraid we're not using the SDK in a particularly unusual manner, other than attempting to send regular JSON payloads on a single topic over a persistent client instance/connection (cellular).
Will.
Hi @lerebel103,
Edit: Another question, are you using sync publish or async publish in your application?
Hi @lerebel103 and @larsonmpdx,
We have created another branch with major changes to the locking system we currently use in the SDK. We ran it through thread sanitizer and it threw no warnings. We did not change the public APIs themselves, so there should ideally not be any compatibility issues between the current SDK (on mainline and release) and this branch (locking-fixes). Could you please test this branch with your setups?
Any feedback you provide will be valuable!
Thanks!
Varun
Thank you,
That's good news, and I'm glad to see that thread sanitizer is now happy.
For sure, we’ll run this branch and report back.
Regards,
Will.
I missed your previous question; we use sync publishing.
Thanks.
Thank you. I don't think I actually saw this problem in the end (it was the connection stuff in #80); I just suspected it when I saw the tsan warnings. I'll give this branch a test this week just to look for regressions. Good job fixing all the warnings.
Thanks for the fixes; we've tried the branch and so far I haven't seen a deadlock. We'll report back in a few days.
Thanks for the update @lerebel103. Please let us know if anything comes up in your extended testing.
Hi guys,
We're not planning on fixing this since doing so would basically be a rewrite of the entire SDK. We have been putting our effort into a new version of this SDK that should avoid all of these problems. It's currently in dev preview if you'd like to test with that: github.com/awslabs/aws-iot-device-sdk-cpp-v2
Reopening, as this is still a known issue in v1.4.0. If you are experiencing this issue, we have a proposed fix in a branch (locking-fixes) that we encourage you to try. If you find success with this branch, please let us know by commenting on this thread. The long-term fix for this will be in the new C++ IoT Device SDK, which is currently available in developer preview.
Greetings! Sorry to say, but this is a very old issue that is probably not getting as much attention as it deserves. We encourage you to try V2, and if you find that this is still a problem, please feel free to open a new issue there.
Found this when running the LLVM thread sanitizer on my program, which has control flow similar to the PubSub sample:
https://github.com/aws/aws-iot-device-sdk-cpp/blob/master/samples/PubSub/PubSub.cpp
tsan warns about this mutex sequence:
first sync_action_response_lock_ here:
https://github.com/aws/aws-iot-device-sdk-cpp/blob/master/src/ClientCoreState.cpp#L98
then ack_map_lock_ here:
https://github.com/aws/aws-iot-device-sdk-cpp/blob/master/src/ClientCoreState.cpp#L197
then sync_action_response_lock_ again here:
https://github.com/aws/aws-iot-device-sdk-cpp/blob/master/src/ClientCoreState.cpp#L112
Before I investigate further, is this something you've seen in testing, or maybe fixed in an upcoming release?
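For anyone unfamiliar with this class of warning, here is a minimal reproduction of the general pattern tsan calls lock-order-inversion (the names below are stand-ins, not the SDK's code): one path locks A then B, another locks B then A.

```cpp
#include <mutex>
#include <thread>

// Stand-ins for the two SDK mutexes named above.
std::mutex response_lock;
std::mutex ack_map_lock;

// Path 1: response_lock -> ack_map_lock
void handle_response() {
    std::lock_guard<std::mutex> a(response_lock);
    std::lock_guard<std::mutex> b(ack_map_lock);
    // ... work under both locks ...
}

// Path 2: ack_map_lock -> response_lock (opposite order)
void expire_acks() {
    std::lock_guard<std::mutex> b(ack_map_lock);
    std::lock_guard<std::mutex> a(response_lock);
    // ... work under both locks ...
}

int main() {
    std::thread t1(handle_response);
    std::thread t2(expire_acks);
    t1.join();
    t2.join();
}
```

Built with clang++ -fsanitize=thread -g -pthread, tsan reports "lock-order-inversion (potential deadlock)" because it tracks the lock acquisition graph, even on runs where no deadlock actually happens.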