-
Notifications
You must be signed in to change notification settings - Fork 821
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #455, #587: qos2 cce and deadlock #588
Fix #455, #587: qos2 cce and deadlock #588
Conversation
cf788ba
to
617ce22
Compare
fe8d365
to
4055630
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank's @hylkevds please could you rebase this PR to master
branch so that the becomes difference becomes more easy to spot and read. May thanks
The force-push I did Friday was a rebase on the master branch. |
moquette-io#455: resendInflightNotAcked() assumed all messages in the inflightWindow are PublishedMessage, but they can also be PubRelMarker. This caused a ClassCastException. moquette-io#587: When sending PubRelMessages, never put them on the queue, since this deadlocks the system. Since the queue can not contain PubRelMessages, drainQueueToConnection can be simplified.
4055630
to
0a81d90
Compare
The change looks big, because it adds and removes an if statement. That makes the entire content of the if "change" even though it doesn't really change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ive left a comment to describe why this generated a deadlock
@@ -196,18 +196,12 @@ public void processPubRec(int pubRecPacketId) { | |||
return; | |||
} | |||
|
|||
inflightSlots.incrementAndGet(); | |||
if (canSkipQueue()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hylkevds could elaborate more on the reason why this if and the management of the slots created deadlock or any other problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(assuming I understand the inner workings correctly)
Lets say we have 10 inflight slots taken by QoS2 messages, and there are 1000 QoS2 messages in the queue waiting to be sent after a burst of activity.
We've just received a message in the QoS2 workflow, so the workflow of this message is taking up a message slot.
If we give up our slot, we're not getting it back, since the messageQueue is not empty. canSkipQueue will return false.
The last step in one of our first 10 QoS2 workflows will now end up as message 1001 on the queue. We are going to try to start 1000 more QoS2 workflows before finishing the 10 we're already working on! But the client is most likely not going to respond to most of those 1000 new workflows, since the client also already has 10 workflows open (assuming the client also has 10 inflight slots).
So we're going to have to wait for all 1000 new workflows to time out before we get a free slot again to finish the first 10 workflows.
But:
The fact that we received this message means we have an inflight slot for this workflow.
The fact that we received this message also means we want to send a message and that the connection is open.
So why give up this slot, that we know we need for a message that the client really needs to close an open workflow? It's better to finish a workflow then to start a new one, so responses on messages in open workflows should have priority over messages that start new workflows and should thus not be put on the queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explanation and good catch! I'll try to figure out hot to create a test for this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition is difficult to test unit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Changes the way PUBREC is managed, instead of return the lease on inflight and request it again, the lease is kept to complete the flow instead of give precedence to enqueued messages. This is to in principle to "complete the work we are doing" instead of open as many flows as we can.
This PR builds on top of #584
#455: resendInflightNotAcked() assumed all messages in the inflightWindow are
PublishedMessage, but they can also be PubRelMarker. This caused a
ClassCastException.
#587: When sending PubRelMessages, never put them on the queue, since this
deadlocks the system. Since PubRelMessages are never put on the queue,
drainQueueToConnection can be simplified.