[FIXED] MQTT: rapid load-balanced (re-)CONNECT to cluster causes races #4734
Conversation
Actually, the new test still failed when I ran it with --count=X.
While testing with --count=X I saw that persistent sessions were hitting a similar race condition when the session persist notification was not processed in time; it was causing invalid seq errors. The persist notification used to be processed in a separate goroutine fed from an ipQueue so that it could load session messages for inspection. However, the only data we need from the loaded message is the client ID (hash), which we can add to the JS ACK subject and use directly. This is what the last commit does. This PR shortens the notification processing time by cutting out a goroutine/queue and, more importantly, a loadMsg call from this processing.
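For illustration only, a minimal sketch of the idea described above (assumed helper names and a self-contained layout, not the actual nats-server code): embed the client-ID hash as its own token in the session persist reply subject, so the notification handler can read it straight from the subject tokens instead of loading the persisted message.

```go
package main

import (
	"fmt"
	"strings"
)

// sessPersistReplySubject builds a reply subject of the (assumed) form
// "$MQTT.JSA.<js-id>.SP.<client-id-hash>.<uuid>", i.e. the
// `$MQTT.JSA.{js-id}.SP.{client-id-hash}.{uuid}` layout discussed below.
func sessPersistReplySubject(jsID, clientIDHash, uuid string) string {
	return "$MQTT.JSA." + jsID + ".SP." + clientIDHash + "." + uuid
}

// clientIDHashFromReply extracts the client-ID hash directly from the
// subject, so the handler never needs a loadMsg call just to learn which
// client the persisted session belongs to.
func clientIDHashFromReply(subject string) (string, bool) {
	toks := strings.Split(subject, ".")
	if len(toks) != 6 || toks[3] != "SP" {
		return "", false
	}
	return toks[4], true
}

func main() {
	subj := sessPersistReplySubject("srvA", "h4sh", "uuid-1")
	if h, ok := clientIDHashFromReply(subj); ok {
		fmt.Println("client ID hash:", h) // h4sh
	}
}
```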
Question about interoperability, some nits, and a recommendation about error checking.
  as.processJSAPIReplies, &sid, &subs); err != nil {
  	return nil, err
  }

  // We will listen for replies to session persist requests so that we can
  // detect the use of a session with the same client ID anywhere in the cluster.
- if err := as.createSubscription(mqttJSARepliesPrefix+"*."+mqttJSASessPersist+".*",
+ // `$MQTT.JSA.{js-id}.SP.{client-id-hash}.{uuid}`
+ if err := as.createSubscription(mqttJSARepliesPrefix+"*."+mqttJSASessPersist+".*.*",
So does that mean that a server with this fix will not be able to co-operate with other servers (say, current v2.10.3)? That is, a current server would persist a session with a reply subject of $MQTT.JSA.<serverId>.SP.<nuid>, which would not be received by a server with this fix. Maybe it's ok, but I am just raising this to make sure that you thought about it.
Right, thanks for raising it. I did consider the compatibility issue and kinda punted on it, because of the "edge condition" nature of the use case. I considered serializing the client ID into the uuid token using a different separator, but that felt too hacky for a permanent solution to a temporary edge case. Nothing else that I could think of would make this PR backwards-compatible, i.e. broadcast to the <2.10.(N) servers in a way that they'd understand. I could easily add another listening subscription to pick up their messages, but that would be one-way only.
All in all, 1/5: leave as is and require that all servers in an MQTT cluster are upgraded/downgraded at approximately the same time. (Note for others: this is not affecting the session store itself, just the ACK change notifications.)
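To make the token-count point concrete, here is a minimal illustration (a deliberately simplified matcher with made-up subjects, not the server's sublist code): a `*` wildcard matches exactly one token, so the 5-token reply subject published by a pre-fix server can never match the new 6-token filter, which is the interoperability gap being discussed.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesFilter is a simplified matcher: "*" matches exactly one token and
// there is no ">" handling. It only exists to show the token-count mismatch
// between old and new reply subjects.
func matchesFilter(subject, filter string) bool {
	st, ft := strings.Split(subject, "."), strings.Split(filter, ".")
	if len(st) != len(ft) {
		return false
	}
	for i, f := range ft {
		if f != "*" && f != st[i] {
			return false
		}
	}
	return true
}

func main() {
	newFilter := "$MQTT.JSA.*.SP.*.*"
	oldReply := "$MQTT.JSA.srvA.SP.nuid123"         // 5 tokens, pre-fix servers
	newReply := "$MQTT.JSA.srvA.SP.h4sh.uuid-1"     // 6 tokens, this PR
	fmt.Println(matchesFilter(oldReply, newFilter)) // false
	fmt.Println(matchesFilter(newReply, newFilter)) // true
}
```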
Having said that, I think using, say, $MQTT.JSA.{js-id}.SP.{client-id-hash}_{uuid} would work just fine.
@kozlovic you agree with ^^? (leaving as is?)
The advantage is that you could revert some of the createSubscription changes to keep the same number of tokens, but you would need to do more processing to extract the client ID from the last token. Up to you.
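A minimal sketch of that `{client-id-hash}_{uuid}` alternative (hypothetical token values, and not necessarily what was ultimately merged): pack both values into the existing trailing token and split them back out in the handler; the "more processing" is exactly the parsing shown here.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Keep the original subject token count by packing
	// "<client-id-hash>_<uuid>" into a single trailing token.
	lastToken := "h4sh_9f2c4d" // hypothetical "<hash>_<uuid>" token
	hash, uuid, ok := strings.Cut(lastToken, "_")
	if !ok {
		fmt.Println("malformed token")
		return
	}
	fmt.Println("hash:", hash, "uuid:", uuid)
}
```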
LGTM
Force-pushed from b7ff15f to 6626091. Commits: Inline persistent sess notification processing; PR feedback: nit _EMPTY_; PR feedback: more robust error handling, _EMPTY_; PR feedback: error handling.
The purpose of this test was to illustrate and verify #4734. Due to its timing sensitivity it has never been 100% reliable, prone to flaking on an occasional MQTT Connect failure (API timeout?). Skip for now, to reduce the flaky noise. Signed-off-by: Lev Brouk <lev@synadia.com>
The tests explain the condition. TL;DR: a rapid, load-balanced sequence of connects/disconnects from the same client to a cluster was causing failures.
This PR:
@derekcollison please note the highly variable test run times, O(1ms) - O(100ms), including when targeting the same server. Not sure what's going on there (`go test -v ./server --run TestMQTTClusterConnectDisconnect` prints them out).
Same test failing in `main`: