
Fix #455, #587: qos2 cce and deadlock #588

Merged

Conversation

Collaborator

@hylkevds hylkevds commented Feb 25, 2021

Changes the way PUBREC is managed: instead of returning the lease on the inflight window and requesting it again, the lease is kept to complete the flow, rather than giving precedence to enqueued messages. The principle is to "complete the work we are doing" instead of opening as many flows as we can.


This PR builds on top of #584

#455: resendInflightNotAcked() assumed all messages in the inflightWindow are
PublishedMessage, but they can also be PubRelMarker. This caused a
ClassCastException.

#587: When sending PubRelMessages, never put them on the queue, since this
deadlocks the system. Since PubRelMessages are never put on the queue,
drainQueueToConnection can be simplified.
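
As a hedged illustration of the #455 fix, a minimal self-contained Java sketch follows. The names inflightWindow, PublishedMessage and PubRelMarker mirror the Session internals this PR touches, but the class bodies and the resendPublish/resendPubRel helpers are hypothetical stand-ins, not the actual Moquette code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SessionSketch {
    // The inflight window mixes two entry types, so a blind cast to
    // PublishedMessage throws ClassCastException when a PubRelMarker is met.
    static class EnqueuedMessage {}
    static class PublishedMessage extends EnqueuedMessage {}
    static class PubRelMarker extends EnqueuedMessage {}

    private final Map<Integer, EnqueuedMessage> inflightWindow = new ConcurrentHashMap<>();

    void resendInflightNotAcked() {
        for (Map.Entry<Integer, EnqueuedMessage> entry : inflightWindow.entrySet()) {
            final EnqueuedMessage msg = entry.getValue();
            if (msg instanceof PublishedMessage) {
                // PUBLISH not yet acknowledged: resend it with the DUP flag set.
                resendPublish(entry.getKey(), (PublishedMessage) msg);
            } else if (msg instanceof PubRelMarker) {
                // QoS 2 flow waiting for PUBCOMP: resend PUBREL, never cast.
                resendPubRel(entry.getKey());
            }
        }
    }

    void resendPublish(int packetId, PublishedMessage msg) { /* write PUBLISH, DUP=1 */ }

    void resendPubRel(int packetId) { /* write PUBREL for packetId */ }
}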

@hylkevds hylkevds force-pushed the fix_455_587_QOS2_CCE_and_Deadlock branch from cf788ba to 617ce22 on May 9, 2021 16:36
@hylkevds hylkevds force-pushed the fix_455_587_QOS2_CCE_and_Deadlock branch 2 times, most recently from fe8d365 to 4055630 on July 2, 2021 13:09
Collaborator

@andsel andsel left a comment


Thanks @hylkevds, could you please rebase this PR onto the master branch so that the difference becomes easier to spot and read? Many thanks.

@hylkevds
Collaborator Author

hylkevds commented Jul 4, 2021

The force-push I did Friday was a rebase on the master branch.

moquette-io#455: resendInflightNotAcked() assumed all messages in the inflightWindow are
PublishedMessage, but they can also be PubRelMarker. This caused a
ClassCastException.

moquette-io#587: When sending PubRelMessages, never put them on the queue, since this
deadlocks the system. Since the queue can not contain PubRelMessages,
drainQueueToConnection can be simplified.
@hylkevds hylkevds force-pushed the fix_455_587_QOS2_CCE_and_Deadlock branch from 4055630 to 0a81d90 on July 4, 2021 14:22
@hylkevds
Collaborator Author

hylkevds commented Jul 4, 2021

The change looks big because it adds and removes an if statement. That makes the entire content of the if show as "changed" even though it doesn't really change.

Collaborator

@andsel andsel left a comment


I've left a comment to describe why this generated a deadlock.

@@ -196,18 +196,12 @@ public void processPubRec(int pubRecPacketId) {
return;
}

inflightSlots.incrementAndGet();
if (canSkipQueue()) {
Collaborator


@hylkevds could you elaborate more on the reason why this if and the management of the slots created a deadlock or any other problem?

Collaborator Author


(assuming I understand the inner workings correctly)

Let's say we have 10 inflight slots taken by QoS2 messages, and there are 1000 QoS2 messages in the queue waiting to be sent after a burst of activity.
We've just received a message in the QoS2 workflow, so the workflow of this message is taking up an inflight slot.
If we give up our slot, we're not getting it back, since the messageQueue is not empty: canSkipQueue will return false.

The last step in one of our first 10 QoS2 workflows will now end up as message 1001 on the queue. We are going to try to start 1000 more QoS2 workflows before finishing the 10 we're already working on! But the client is most likely not going to respond to most of those 1000 new workflows, since the client also already has 10 workflows open (assuming the client also has 10 inflight slots).
So we're going to have to wait for all 1000 new workflows to time out before we get a free slot again to finish the first 10 workflows.

But:
The fact that we received this message means we have an inflight slot for this workflow.
The fact that we received this message also means we want to send a message and that the connection is open.
So why give up this slot, which we know we need for a message that the client requires to close an open workflow? It's better to finish a workflow than to start a new one, so responses to messages in open workflows should have priority over messages that start new workflows, and should therefore not be put on the queue.
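
To make this concrete, here is a hedged sketch of the resulting processPubRec() behavior, continuing the hypothetical SessionSketch from the PR description above (sendPubRel is a stand-in name, not the actual Moquette API): the flow keeps its slot and PUBREL bypasses the queue entirely, which is also why drainQueueToConnection no longer needs to handle PubRelMessages.

void processPubRec(int pubRecPacketId) {
    // Receiving PUBREC proves the connection is open and that this flow
    // already owns an inflight slot, so the slot is kept rather than released.
    // Replace the acknowledged PUBLISH with a PubRelMarker; the slot stays
    // occupied until PUBCOMP closes the QoS 2 flow.
    inflightWindow.put(pubRecPacketId, new PubRelMarker());
    // Write PUBREL straight to the connection instead of enqueuing it: the
    // old canSkipQueue() path could park this PUBREL behind every queued
    // message, stalling the open flow until all of them timed out.
    sendPubRel(pubRecPacketId);
}

void sendPubRel(int packetId) { /* write a PUBREL frame for packetId */ }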

Collaborator


Thanks for the explanation, and good catch! I'll try to figure out how to create a test for this case.

Collaborator


This condition is difficult to unit test.

@andsel andsel self-requested a review July 11, 2021 12:22
Collaborator

@andsel andsel left a comment


LGTM

@andsel andsel merged commit 073eaac into moquette-io:master Jul 11, 2021
@hylkevds hylkevds deleted the fix_455_587_QOS2_CCE_and_Deadlock branch July 20, 2021 07:17