Skip to content

[KAFKA-13848] Clients remain connected after SASL re-authentication f…#12179

Merged
showuon merged 2 commits intoapache:trunkfrom
acsaki:trunk
Jun 10, 2022
Merged

[KAFKA-13848] Clients remain connected after SASL re-authentication f…#12179
showuon merged 2 commits intoapache:trunkfrom
acsaki:trunk

Conversation

@acsaki
Copy link
Contributor

@acsaki acsaki commented May 19, 2022

Clients remain connected and able to produce or consume despite an expired OAUTHBEARER token.

The problem can be reproduced using the https://github.com/acsaki/kafka-sasl-reauth project by starting the embedded OAuth2 server and Kafka, then running the long running consumer in OAuthBearerTest and then killing the OAuth2 server thus making the client unable to re-authenticate.

Root cause seems to be SaslServerAuthenticator#calcCompletionTimesAndReturnSessionLifetimeMs failing to set ReauthInfo#sessionExpirationTimeNanos when tokens have already expired (when session life time goes negative), in turn causing KafkaChannel#serverAuthenticationSessionExpired returning false and finally SocketServer not closing the channel.

The issue is observed with OAUTHBEARER but seems to have a wider impact on SASL re-authentication.

@tombentley
Copy link
Member

@rajinisivaram, @rondagostino please could you take a look?

Copy link
Member

@tombentley tombentley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth a comment explaining that sessionExpirationTimeNanos should always be set to a non-null value.

Copy link
Member

@showuon showuon May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some investigation, I found the null value is indicating the re-auth is disabled:
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java#L535-L543

So, maybe here, we should make sure we need re-auth (that is, we did have expiration time, if it's 0, which means it's expired, not re-auth disabled). ex:

if (retvalSessionLifetimeMs > 0L || credentialExpirationMs != null || connectionsMaxReauthMs != null)
                    sessionExpirationTimeNanos = authenticationEndNanos + 1000 * 1000 * retvalSessionLifetimeMs;

So, if the retvalSessionLifetimeMs==0, and we don't have credential expiration ms and connection max reauth ms, we should keep it as null
Does that make sense?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding || credentialExpirationMs != null || connectionsMaxReauthMs != null to the if condition adds no value as we already know one of them has to be true to reach the if guard in general.

I think the correct fix is to delay the initialisation of retvalSessionLifetimeMs so that we can use a >= test in the expiry case and 0 in the non expiry cases while still allowing sessionExpirationTimeNanos being null to signal that re-auth is disabled.

@showuon
Copy link
Member

showuon commented May 30, 2022

@acsaki , we haven't get your response for some days, do you need help on it? We can co-author with you to address the comments and fix the tests. Please let me know. Thank you.

@acsaki
Copy link
Contributor Author

acsaki commented May 31, 2022

@acsaki , we haven't get your response for some days, do you need help on it? We can co-author with you to address the comments and fix the tests. Please let me know. Thank you.

Hi @showuon , Thank you! I wasn't working for a few days.. Thanks for all the suggestions (@SamBarker and @tombentley too), I'm writing some more tests to capture the intended behavior, which is as I understood:

  • when reauth is disabled (when max reauth ms is NOT set), leave ReauthInfo#sessionExpirationTimeNanos as null but return millis until the token expires in SaslAuthenticateResponse's sessionLifetimeMs
  • when reauth is enabled and the token expires after max reauth time, set sessionExpirationTimeNanos to when the reauth is due and also return sessionLifetimeMs in the SaslAuthenticateResponse accordingly
  • when reauth is enabled and the token expires before max reauth time, set sessionExpirationTimeNanos to when the token expires and also return sessionLifetimeMs in the SaslAuthenticateResponse accordingly.

SaslServerAuthenticatorTest does not lend itself easily to testing all these, oh well. :)

Copy link

@SamBarker SamBarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that testing is painful and you are doing a great job improving it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if its worth extracting a new test class SaslServerAuthenticatorSessionExpiryTest or similar and that would allow some of the common setup code to actually be extracted to a @BeforeEach and thus make it easier to reason about what's happening in the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've cleaned up tests a bit to make them more readable but leave them as explicit as they can be. I usually don't mind a bit relaxed DRY if tests can be understood as is. Also, MockedStatics are supposed to be closed shorty after they did their thing so using them in a try-with-resources block is usually considered best. Any remaining redundancies in your opinion that should be dealt with?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree I don't tend to be overly bothered by DRY in tests and its about read ability thus being able to easily see what the test is trying to prove.

I think the tests are fine as is, anything else is just polish and style changes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when reauth is disabled (when max reauth ms is NOT set), leave ReauthInfo#sessionExpirationTimeNanos as null but return millis until the token expires in SaslAuthenticateResponse's sessionLifetimeMs

I think I have a different understanding. My mental model was that if either of credentialExpirationMs != null || connectionsMaxReauthMs != null was true we have re-auth enabled. The re-auth disabled case I was thinking of was when both were false. Now that I think about that I'm not sure that is a valid scenario.

I think your original change to make setting sessionExpirationTimeNanos un-conditional was the right one, my >=0 test is effectively the same as making it un-conditional. Therefore my suggestion of delayed initialisation becomes moot as we don't need to worry about excluding the default value from the test.
I was wondering if the property change was what was actually causing the test failures that you found rather than making it un-conditional.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@showuon what is your thinking on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if the property change was what was actually causing the test failures that you found rather than making it un-conditional.

@SamBarker ,I don't understand this. Which property are you referring to?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, for @SamBarker , I think we can discuss the property change in a separate thread since it's related to tests.

For this change:

if (connectionsMaxReauthMs != null) {
       sessionExpirationTimeNanos = authenticationEndNanos + 1000 * 1000 * retvalSessionLifetimeMs;
}

I understand why you did this @acsaki . It's because you think:

when reauth is disabled (when max reauth ms is NOT set), leave ReauthInfo#sessionExpirationTimeNanos as null but return millis until the token expires in SaslAuthenticateResponse's sessionLifetimeMs

I think it's correct, IF the sasl client did close the connection after the sessionLifetimeMs. But I don't think we should have this optimistic assumption for this "potential" security issue. I agree with @SamBarker about your original version of "removing the if condition" is a good fix.:

...
else
      retvalSessionLifetimeMs = zeroIfNegative(Math.min(credentialExpirationMs - authenticationEndMs, 

sessionExpirationTimeNanos = authenticationEndNanos + 1000 * 1000 * retvalSessionLifetimeMs;

That is, no matter the reauth is enabled or not, we set the sessionExpirationTimeNanos, to inform the server, too. So that we can make sure when the session expired, either server or client will kill this connection. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the debug line in L698 confused me a little. When credentialExpirationMs != null && sessionExpirationTimeNanos == null, we'll log:

"Authentication complete; session max lifetime from broker config={} ms, credential expiration={} ({} ms); no session expiration, sending 0 ms to client"

The point I'm confused here is no session expiration. Why no session expiration? Since credentialExpirationMs != null, there must be some credential expiration time, so in this case, sending 0 to client doesn't make sense to me, right? I think the log should also be updated. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rondagostino @rajinisivaram , do you have any comments? Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@showuon this is what I've also found rather confusing. I agree that after the token expires the connection should be closed sooner or later which isn't going to happen when sessionExpirationTimeNanos is not set. While there is the Authenticator interface where comments suggest that #serverSessionExpirationTimeNanos should be left as null when re-authentication is "disabled". Does it make sense for reauth to be disabled? Or rather there are some clients or SASL mechanisms where we don't expect reauthentication to ever happen?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While there is the Authenticator interface where comments suggest that #serverSessionExpirationTimeNanos should be left as null when re-authentication is "disabled"

Is this javadoc what you mean? From the javadoc, I don't see it said we should return null for re-auth disabled. I think it's OK to return the value when re-auth disabled.

here are some clients or SASL mechanisms where we don't expect reauthentication to ever happen?

Yes, it is. You can check this for reference.

Thanks.

Copy link
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tests! Thank you. Left some comments.
Also, I found there are many tests failed with the error:

org.opentest4j.AssertionFailedError: Topic [__consumer_offsets] metadata not propagated after 60000 ms

ref: https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12179/4/#showFailuresLink

I guess maybe it's because we use mock time, instead of system time now? Please help check them. Thank you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, for @SamBarker , I think we can discuss the property change in a separate thread since it's related to tests.

For this change:

if (connectionsMaxReauthMs != null) {
       sessionExpirationTimeNanos = authenticationEndNanos + 1000 * 1000 * retvalSessionLifetimeMs;
}

I understand why you did this @acsaki . It's because you think:

when reauth is disabled (when max reauth ms is NOT set), leave ReauthInfo#sessionExpirationTimeNanos as null but return millis until the token expires in SaslAuthenticateResponse's sessionLifetimeMs

I think it's correct, IF the sasl client did close the connection after the sessionLifetimeMs. But I don't think we should have this optimistic assumption for this "potential" security issue. I agree with @SamBarker about your original version of "removing the if condition" is a good fix.:

...
else
      retvalSessionLifetimeMs = zeroIfNegative(Math.min(credentialExpirationMs - authenticationEndMs, 

sessionExpirationTimeNanos = authenticationEndNanos + 1000 * 1000 * retvalSessionLifetimeMs;

That is, no matter the reauth is enabled or not, we set the sessionExpirationTimeNanos, to inform the server, too. So that we can make sure when the session expired, either server or client will kill this connection. WDYT?

Comment on lines +784 to -791
when(selectionKey.attachment()).thenReturn(kafkaChannel);

selectionKey.attach(kafkaChannel);
Set<SelectionKey> selectionKeys = Utils.mkSet(selectionKey);
selector.pollSelectionKeys(selectionKeys, false, System.nanoTime());

assertFalse(selector.connected().contains(kafkaChannel.id()));
assertTrue(selector.disconnected().containsKey(kafkaChannel.id()));
assertNull(selectionKey.attachment());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we did this change? Did we change anything affect this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test failed after my change to use mockitoInline to mock statics. The selectionKey.attach line called the mock's method directly (pointless?) while it looked like the actual intention was selectionKey.attachment() to return the kafkaChannel. That's how it worked actually. Calling the mock method directly in the test and later asserting on attachment being null seemed confusing to me and the assert actually fails too. (with #attachment returning the kafkaChannel). Maybe this was sort of overbearing, is there a simpler way to fix the test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I agree with the change. Make sense.

Comment on lines +978 to +982
when(selectionKey.isReadable()).thenReturn(true);
when(selectionKey.readyOps()).thenReturn(SelectionKey.OP_READ);
selectionKey.attach(channel);
when(selectionKey.attachment())
.thenReturn(channel)
.thenReturn(null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

testImplementation libs.bcpkix
testImplementation libs.junitJupiter
testImplementation libs.mockitoCore
testImplementation libs.mockitoInline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why we need MockitoInline here? Any method only exists in mockitoInline?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed this to mock out ChannelBuilders#createPrincipalBuilder and Sasl#createSaslServer. It's a petty there are these static calls, hard to test them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChannelBuilders is our code, we should avoid mocking static methods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@acsaki , as @ijuma suggested, would you file another small PR to update the tests to avoid mocking the static methods? For the ChannelBuilders#createPrincipalBuilder, it looks like we actually don't need it for these tests, right? (the tests still pass after I removed them)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump, I second we should aim to remove this

@acsaki
Copy link
Contributor Author

acsaki commented Jun 3, 2022

Nice tests! Thank you. Left some comments. Also, I found there are many tests failed with the error:

org.opentest4j.AssertionFailedError: Topic [__consumer_offsets] metadata not propagated after 60000 ms

ref: https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12179/4/#showFailuresLink

I guess maybe it's because we use mock time, instead of system time now? Please help check them. Thank you.

Thank you, haven't thought of MockTime. Indeed, lot of tests are failing yet, I've found out that in some test connections.max.reauth.ms gets set to 0 instead of being left as null, so my current code sets the session expiration eventually and clients seem to get disconnected.

@showuon
Copy link
Member

showuon commented Jun 9, 2022

@acsaki , is there any update for this PR? Do you need any help?

…ails

Reauthentication should be considered unset when connections.max.reauth.ms is at its default 0L value.
@acsaki
Copy link
Contributor Author

acsaki commented Jun 9, 2022

Thanks @showuon , I really like the idea that we should set sessionExpirationTimeNanos regardless whether max reauth is set or not, otherwise clients probably won't get disconnected after their tokens expire. Maybe the original intention was to still allow the clients to re-authenticate but nothing else? Feels safer this way.

Regarding whether max reauth is in play, I realized that a null check might not be not enough, we should probably treat the default 0L the same. (I think max reauth = 0 doesn't really make sense, does it?)
So I've updated the PR and now it checks connectionsMaxReauthMs != null && connectionsMaxReauthMs > 0 where there was only a null check. Tons of failing tests turned green instantly. Does this make sense?

Copy link
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@acsaki , thanks for the update. LGTM! Could you convert this PR into "non-draft" so that I can merge it?
I'm triggering another jenkins build to make sure the tests passed.
Thanks.

Comment on lines +784 to -791
when(selectionKey.attachment()).thenReturn(kafkaChannel);

selectionKey.attach(kafkaChannel);
Set<SelectionKey> selectionKeys = Utils.mkSet(selectionKey);
selector.pollSelectionKeys(selectionKeys, false, System.nanoTime());

assertFalse(selector.connected().contains(kafkaChannel.id()));
assertTrue(selector.disconnected().containsKey(kafkaChannel.id()));
assertNull(selectionKey.attachment());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I agree with the change. Make sense.

.getNegotiatedProperty(SaslInternalConfigs.CREDENTIAL_LIFETIME_MS_SASL_NEGOTIATED_PROPERTY_KEY);
Long connectionsMaxReauthMs = connectionsMaxReauthMsByMechanism.get(saslMechanism);
if (credentialExpirationMs != null || connectionsMaxReauthMs != null) {
boolean maxReauthSet = connectionsMaxReauthMs != null && connectionsMaxReauthMs > 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boolean maxReauthSet = connectionsMaxReauthMs != null && connectionsMaxReauthMs > 0;

Make sense to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Some tests are failing on my machine also but pass when I run them in the IDE. :( Might be some flakiness but I'm looking into it right now.

@acsaki acsaki marked this pull request as ready for review June 10, 2022 08:34
@showuon
Copy link
Member

showuon commented Jun 10, 2022

Failed tests are unrelated.

    Build / JDK 17 and Scala 2.13 / org.apache.kafka.controller.QuorumControllerTest.testFenceMultipleBrokers()
    Build / JDK 11 and Scala 2.13 / kafka.api.ConsumerBounceTest.testCloseDuringRebalance()
    Build / JDK 8 and Scala 2.12 / org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.testOffsetSyncsTopicsOnTarget()

@showuon showuon merged commit 3d5b41e into apache:trunk Jun 10, 2022
@showuon
Copy link
Member

showuon commented Jun 10, 2022

@acsaki , thanks for your good finding and the fix! And thanks for @tombentley and @SamBarker 's review!

@acsaki
Copy link
Contributor Author

acsaki commented Jun 10, 2022

Thank you @showuon, @tombentley and @SamBarker for the review, guidance and help! I'm really happy to see my first contribution merged, thank you!

rajinisivaram added a commit to confluentinc/kafka that referenced this pull request Jun 12, 2022
…-2022

* apache/trunk: (52 commits)
  KAFKA-13967: Document guarantees for producer callbacks on transaction commit (apache#12264)
  [KAFKA-13848] Clients remain connected after SASL re-authentication f… (apache#12179)
  KAFKA-10000: Zombie fencing logic (apache#11779)
  KAFKA-13947: Use %d formatting for integers rather than %s (apache#12267)
  KAFKA-13929: Replace legacy File.createNewFile() with NIO.2 Files.createFile() (apache#12197)
  KAFKA-13780: Generate OpenAPI file for Connect REST API (apache#12067)
  KAFKA-13917: Avoid calling lookupCoordinator() in tight loop (apache#12180)
  KAFKA-10199: Implement removing active and standby tasks from the state updater (apache#12270)
  MINOR: Update Scala to 2.13.8 in gradle.properties (apache#12273)
  MINOR: add java 8/scala 2.12 deprecation info in doc (apache#12261)
  ...

 Conflicts:
	gradle.properties
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants