
Fix #455, #587: qos2 cce and deadlock #588

Merged

Conversation

Collaborator

@hylkevds hylkevds commented Feb 25, 2021

Changes the way PUBREC is managed: instead of returning the lease on the inflight window and requesting it again, the lease is kept to complete the flow, rather than giving precedence to enqueued messages. The principle is to "complete the work we are doing" instead of opening as many flows as we can.


This PR builds on top of #584

#455: resendInflightNotAcked() assumed all messages in the inflightWindow are
PublishedMessage, but they can also be PubRelMarker. This caused a
ClassCastException.

#587: When sending PubRelMessages, never put them on the queue, since this
deadlocks the system. Since PubRelMessages are never put on the queue,
drainQueueToConnection can be simplified.
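
As a hedged illustration of the #455 fix, a minimal self-contained Java sketch follows. The names inflightWindow, PublishedMessage and PubRelMarker mirror the Session internals this PR touches, but the class bodies and the resendPublish/resendPubRel helpers are hypothetical stand-ins, not the actual Moquette code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SessionSketch {
    // The inflight window mixes two entry types, so a blind cast to
    // PublishedMessage throws ClassCastException when a PubRelMarker is met.
    static class EnqueuedMessage {}
    static class PublishedMessage extends EnqueuedMessage {}
    static class PubRelMarker extends EnqueuedMessage {}

    private final Map<Integer, EnqueuedMessage> inflightWindow = new ConcurrentHashMap<>();

    void resendInflightNotAcked() {
        for (Map.Entry<Integer, EnqueuedMessage> entry : inflightWindow.entrySet()) {
            final EnqueuedMessage msg = entry.getValue();
            if (msg instanceof PublishedMessage) {
                // PUBLISH not yet acknowledged: resend it with the DUP flag set.
                resendPublish(entry.getKey(), (PublishedMessage) msg);
            } else if (msg instanceof PubRelMarker) {
                // QoS 2 flow waiting for PUBCOMP: resend PUBREL, never cast.
                resendPubRel(entry.getKey());
            }
        }
    }

    void resendPublish(int packetId, PublishedMessage msg) { /* write PUBLISH, DUP=1 */ }

    void resendPubRel(int packetId) { /* write PUBREL for packetId */ }
}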

@hylkevds hylkevds force-pushed the fix_455_587_QOS2_CCE_and_Deadlock branch from cf788ba to 617ce22 on May 9, 2021 16:36
@hylkevds hylkevds force-pushed the fix_455_587_QOS2_CCE_and_Deadlock branch 2 times, most recently from fe8d365 to 4055630 on July 2, 2021 13:09
Collaborator

@andsel andsel left a comment


Thanks @hylkevds, could you please rebase this PR onto the master branch so that the difference becomes easier to spot and read? Many thanks.

@hylkevds
Collaborator Author

hylkevds commented Jul 4, 2021

The force-push I did Friday was a rebase on the master branch.

moquette-io#455: resendInflightNotAcked() assumed all messages in the inflightWindow are
PublishedMessage, but they can also be PubRelMarker. This caused a
ClassCastException.

moquette-io#587: When sending PubRelMessages, never put them on the queue, since this
deadlocks the system. Since the queue can not contain PubRelMessages,
drainQueueToConnection can be simplified.
@hylkevds hylkevds force-pushed the fix_455_587_QOS2_CCE_and_Deadlock branch from 4055630 to 0a81d90 on July 4, 2021 14:22
@hylkevds
Collaborator Author

hylkevds commented Jul 4, 2021

The change looks big because it adds and removes an if statement. That makes the entire content of the if show as "changed" even though it doesn't really change.

Collaborator

@andsel andsel left a comment


I've left a comment to describe why this generated a deadlock.

@@ -196,18 +196,12 @@ public void processPubRec(int pubRecPacketId) {
return;
}

inflightSlots.incrementAndGet();
if (canSkipQueue()) {
Collaborator


@hylkevds could you elaborate more on the reason why this if and the management of the slots created a deadlock or any other problem?

Collaborator Author


(assuming I understand the inner workings correctly)

Let's say we have 10 inflight slots taken by QoS2 messages, and there are 1000 QoS2 messages in the queue waiting to be sent after a burst of activity.
We've just received a message in the QoS2 workflow, so the workflow of this message is taking up an inflight slot.
If we give up our slot, we're not getting it back, since the messageQueue is not empty: canSkipQueue will return false.

The last step in one of our first 10 QoS2 workflows will now end up as message 1001 on the queue. We are going to try to start 1000 more QoS2 workflows before finishing the 10 we're already working on! But the client is most likely not going to respond to most of those 1000 new workflows, since the client also already has 10 workflows open (assuming the client also has 10 inflight slots).
So we're going to have to wait for all 1000 new workflows to time out before we get a free slot again to finish the first 10 workflows.

But:
The fact that we received this message means we have an inflight slot for this workflow.
The fact that we received this message also means we want to send a message and that the connection is open.
So why give up this slot, which we know we need for a message that the client requires to close an open workflow? It's better to finish a workflow than to start a new one, so responses to messages in open workflows should have priority over messages that start new workflows, and should therefore not be put on the queue.
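
To make this concrete, here is a hedged sketch of the resulting processPubRec() behavior, continuing the hypothetical SessionSketch from the PR description above (sendPubRel is a stand-in name, not the actual Moquette API): the flow keeps its slot and PUBREL bypasses the queue entirely, which is also why drainQueueToConnection no longer needs to handle PubRelMessages.

void processPubRec(int pubRecPacketId) {
    // Receiving PUBREC proves the connection is open and that this flow
    // already owns an inflight slot, so the slot is kept rather than released.
    // Replace the acknowledged PUBLISH with a PubRelMarker; the slot stays
    // occupied until PUBCOMP closes the QoS 2 flow.
    inflightWindow.put(pubRecPacketId, new PubRelMarker());
    // Write PUBREL straight to the connection instead of enqueuing it: the
    // old canSkipQueue() path could park this PUBREL behind every queued
    // message, stalling the open flow until all of them timed out.
    sendPubRel(pubRecPacketId);
}

void sendPubRel(int packetId) { /* write a PUBREL frame for packetId */ }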

Collaborator


Thanks for the explanation, and good catch! I'll try to figure out how to create a test for this case.

Collaborator


This condition is difficult to unit test.

@andsel andsel self-requested a review July 11, 2021 12:22
Collaborator

@andsel andsel left a comment


LGTM

@andsel andsel merged commit 073eaac into moquette-io:master Jul 11, 2021
@hylkevds hylkevds deleted the fix_455_587_QOS2_CCE_and_Deadlock branch July 20, 2021 07:17