potential deadlock in roudi Mon+Discover and IPC-msg-process thread #1546
Comments
@elBoberido It seems this is a different issue. Instead of calling it a deadlock, it is more accurate to say that the Mon+Discover and IPC-msg-process threads are starving on the lock, since the publisher is in an infinite loop while it holds the lock.
@qclzdh What version of iceoryx are you using? Could you please tell us what you mean when you say that the publisher is stuck in an infinite loop? Do you suspect it to be caught in the while loop of ResizeableLockFreeQueue::pushImpl?
@elfenpiff I am using the iox-2.0 release. Yes, I think the publisher is in the while loop in ResizeableLockFreeQueue::pushImpl. From my analysis, the sequence in which this issue happened is:
@qclzdh Thanks for discovering this issue. The short version is that I think a buffer cell in ResizeableLockFreeQueue can be lost (i.e. no one owns it and it will leak in some way) in case a publisher app crashes. As a consequence, ResizeableLockFreeQueue::pushImpl calls in other publisher apps may run into an endless loop, but I need to check the details to explain why. If true, this is actually very tricky to fix properly. I will look into it to give a more detailed reason, then we can try to reproduce it and work on a fix.

Assuming it is as I suspect, it shows that it is hard to get the following guarantee, which is somewhat stronger than that of regular lock-free code: a crashing application/thread cannot block any other thread. Inside an application in the C++ thread model this is not a problem, as an individual thread that crashes causes the application itself to crash. But if each thread is in its own process this is not true anymore, and we arrive at this problem.

TLDR: I will look into the details.
@qclzdh I looked at the problem and you are right. While the problem occurs in the ResizeableLockFreeQueue, it is similar in the basic LockFreeQueue. There is an invariant that each index is always either owned by the queue itself or by some application. The problem here is that applications can crash and the queue will not reclaim the index. However, since we cannot write arbitrarily large types atomically, we need a mechanism like this (at least I know of no other way).

Assume the queue has capacity 3 and some application crashes during push or pop while holding an index. For simplicity, assume it is a subscribing application and the only one. Then eventually there are no free indices anymore as the publishing application fills the queue, i.e. it contains 2 elements now. The semantics of pushImpl is that it tries to obtain a free index. If it fails, it concludes the queue is (temporarily) full and tries to obtain the least recently used (FIFO) index from the used ones, but ONLY if the queue is still full. This is to keep the invariant that a full queue from which nothing is popped remains full at all times, and that we do not unnecessarily lose elements. Now the queue can never become full again (full is considered 3 elements in this case), as the index is lost. So pushImpl again tries to obtain a free index and the process repeats. This means that, if an index is lost, no push call can make progress unless there is some pop call. But there does not have to be one, as there may not be a subscriber. In a strict sense this means that the queue is NOT lock-free, as one suspended thread/application can block others from progressing.

Now what can be done about it?

Non-solution

Observation

Solution strategy

Observation

Observation

As a pseudo fix we can have something that detects a potential problem after a predetermined number of unsuccessful loop runs. However, the reaction is unclear and it is not a nice solution (the queue is still not lock-free). Currently we have to assume that nothing crashes AND the operations are scheduled fairly. If these conditions are met, the queue operations should not block indefinitely. The first requirement is fairly restrictive though, and I would like to avoid it. Note that in a single process with fairly scheduled threads this would not be an issue, as a crashing thread will crash the whole application in C++. Due to fair scheduling, the situation described above cannot occur in a single process.

FYI @elBoberido I will try to come up with a solution, but I think this is a difficult problem. However, my honor as an algorithm designer demands it and hence I will succeed (eventually ....) :-)
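To illustrate the failure mode described above, here is a minimal, heavily simplified sketch of such a push loop (hypothetical helper names, not the actual ResizeableLockFreeQueue code): the loop only terminates if it either obtains a free index or can recycle the oldest used index of a genuinely full queue, so a single index lost by a crashed process keeps every later push spinning.

```cpp
#include <cstdint>
#include <optional>

// Simplified sketch of the push loop described above. The helper functions are
// hypothetical stand-ins and intentionally left undefined; only the control
// flow of the retry loop matters here.
struct LockFreeQueueSketch
{
    // take an index from the pool of free buffer cells, if any
    std::optional<uint64_t> tryGetFreeIndex();
    // true if the queue currently holds 'capacity' elements
    bool isFull() const;
    // steal the least recently used (FIFO) index from the used cells
    std::optional<uint64_t> tryGetOldestUsedIndex();
    // write the payload into the cell and hand the index back to the queue
    void writeAndPublish(uint64_t index, uint64_t value);

    void push(uint64_t value)
    {
        while (true)
        {
            // 1) normal path: grab a free cell
            if (auto index = tryGetFreeIndex())
            {
                writeAndPublish(*index, value);
                return;
            }
            // 2) no free cell: overwrite the oldest element, but ONLY if the
            //    queue is really still full, so that a full queue nobody pops
            //    from stays full (the invariant mentioned above)
            if (isFull())
            {
                if (auto index = tryGetOldestUsedIndex())
                {
                    writeAndPublish(*index, value);
                    return;
                }
            }
            // 3) neither succeeded: retry. If a crashed process took an index
            //    with it, the queue never reports full again and no free index
            //    ever appears, so the loop spins forever unless someone pops.
        }
    }
};
```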
I thought further about this and I think we need to distinguish two crash cases. Assume we have publishers A and B on some topic and subscribers X and Y on the same topic. Note that the queues that can block as described above belong to/are associated with the subscribers.

Subscriber crash

Edit: the final claim is wrong and can be ignored. The other subscriber uses its own queue, and the queue of the crashed subscriber cannot be cleaned up until the publish returns (which it will not in the scenario above, hence we have a deadlock).

Publisher crash

Conclusion

In the publisher crash case we need a stronger recovery mechanism to recover the index OR we need to make sure that this kind of index loss that blocks other publishers cannot happen. I think the former is easier but still fairly difficult. With the latter approach we have the problem that we cannot have atomic writes of arbitrary size and have to simulate them, as is done with the ownership of the index. I do not see a way around this without severely restricting the data types of the queue (to 64 bit). The good news is that the IndexQueue itself is not compromised and can still be assumed to be lock-free.
Another problem is that the queue cannot be cleaned up, as the publish happens under mutex protection and the cleanup needs the same mutex. Now the cleanup cannot obtain the lock since publish does not return. The mutex is only contended during cleanup or, more generally, when queues are added or removed (i.e. rarely). So we need some solution to leave the loop with reasonable semantics even in case of error (otherwise the whole cleanup must be reconsidered).
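A minimal sketch of that lock interaction, with assumed names rather than the real iceoryx API: the publisher delivers to all subscriber queues while holding the mutex that the cleanup path also needs, so a push that never returns keeps the cleanup blocked as well.

```cpp
#include <mutex>
#include <vector>

// Sketch of the mutex interaction described above (assumed names, not the
// actual iceoryx classes). pushMayNeverReturn() stands for the queue push
// that can spin forever when an index was lost.
struct DistributorSketch
{
    std::mutex queuesMutex;    // protects the container of subscriber queues
    std::vector<int> queueIds; // stand-in for the per-subscriber queues

    void pushMayNeverReturn(int queueId, int chunk); // potentially non-terminating

    void deliverToAllQueues(int chunk)
    {
        // held for the entire delivery, i.e. across every queue push
        std::lock_guard<std::mutex> lock(queuesMutex);
        for (int id : queueIds)
        {
            pushMayNeverReturn(id, chunk); // if this spins, the mutex is never released
        }
    }

    void removeQueueOfCrashedSubscriber(int queueId)
    {
        // the cleanup path needs the very same mutex, so it blocks for as long
        // as a publisher is stuck inside deliverToAllQueues()
        std::lock_guard<std::mutex> lock(queuesMutex);
        // ... erase 'queueId' from 'queueIds' ...
        (void)queueId;
    }
};
```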
The problem with the lost indices was known from the beginning. If a process crashes while it has ownership of a memory area, i.e. the index to it, the index is lost and with it the memory for the data.
@GerdHirsch Yes, a collection of SPSC queues is a solution but requires an extensive redesign. @elBoberido is considering it, but it has design implications. If MPMC queues are to be used, I considered the following questions:
With some arguments that I will not fully state here for brevity, I came to the conclusion that 1. is to be answered with ... This leaves considering 2. and 3., which I think can be answered positively. It is, however, quite intricate to come up with a lock-free detection and recovery mechanism. The code for this kind of mechanism will not be elegant, at least not the way I am considering it now.
@MatthiasKillat @GerdHirsch Yes, an SPSC queue has some advantages but also some disadvantages. The biggest problem is that it might result in quite a big refactoring, but it also opens the door for other simplifications, e.g. around memory allocations. I think at some point we need to do some brainstorming and decide on how to proceed. cc @budrus
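For reference, a rough structural comparison of the two layouts under discussion (type and member names are assumptions, not the actual iceoryx design): today each subscriber owns one MPMC queue shared by all publishers, while the alternative would give each publisher/subscriber pair its own SPSC queue.

```cpp
#include <vector>

// Placeholder element and queue types, only to show the structural difference.
struct Chunk { };
template <typename T> struct MpmcQueue { }; // many producers, many consumers
template <typename T> struct SpscQueue { }; // exactly one producer, one consumer

// Current layout: one MPMC queue per subscriber, pushed into by all publishers
// of the topic. A publisher stuck in this shared queue affects everyone.
struct SubscriberWithMpmcQueue
{
    MpmcQueue<Chunk> queue;
};

// Alternative layout: one SPSC queue per (publisher, subscriber) pair. Each
// queue has a single well-defined producer, which simplifies cell ownership,
// but the number of queues grows and the bigger refactoring mentioned above
// becomes necessary.
struct SubscriberWithSpscQueues
{
    std::vector<SpscQueue<Chunk>> onePerPublisher;
};
```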
We also have this problem.

(gdb) bt

(gdb) p *(pthread_mutex_t *)0xffff90ccd200

The roudi thread Mon+Discover waits on a mutex, but this mutex is held by a crashed process, so the roudi wait never returns ....
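As background for the gdb inspection above, on Linux/glibc (an assumption about the platform used here) the holder's kernel TID is stored inside the pthread_mutex_t itself, which is what makes it possible to see that the owner belongs to a process that no longer exists. A small sketch:

```cpp
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <cstdio>

// On glibc, pthread_mutex_t exposes its internals via the __data member; the
// __owner field holds the kernel TID of the current holder. Printing the mutex
// in gdb ('p *(pthread_mutex_t *) <address>') shows the same fields, so an
// owner TID that belongs to a dead process indicates the situation described
// above.
int main()
{
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&mutex);

    std::printf("my tid:    %ld\n", static_cast<long>(syscall(SYS_gettid)));
    std::printf("owner tid: %d\n", mutex.__data.__owner); // matches the TID above

    pthread_mutex_unlock(&mutex);
    return 0;
}
```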
@wangxuemin unfortunately a bigger refactoring is required to fix this and I'm quite busy with my startup and currently don't have that much time for iceoryx development. The startup offers paid support in case this is critical for you :) |
Thanks for your reply~ |
Required information
Operating system:
E.g. Ubuntu 20.04 LTS
Compiler version:
E.g. GCC 9.x
Observed result or behaviour:
The publisher send thread and the roudi Mon+Discover and IPC-msg-process threads are deadlocked. No more pub/sub/req/res can register to roudi; all registrations will fail, since IPC-msg-process is waiting for the lock.
Expected result or behaviour:
a. pub send process won't block
b. roudi thread won't deadlock
Conditions where it occurred / Performed steps:
During a stress test, we found that in some cases the publisher and roudi become deadlocked, and no more clients can register to roudi. This issue happened when some subscribers crashed for some unexpected reason.
I know we should avoid this kind of crash, but I also want to report this issue to you; maybe you know how to make roudi and the publisher fail-safe. Fail-safe in this case means:
a. the publisher won't block in any case, even if a subscriber crashes
b. iox-roudi won't deadlock, even if a subscriber crashes
The following is the dump of the publisher:
The following is the dump of the roudi thread Mon+Discover:
And this is the dump of the roudi thread IPC-msg-process: