[CI] KerberosAuthenticationIT testLoginByUsernamePassword failing #89324

Closed
mark-vieira opened this issue Aug 12, 2022 · 24 comments · Fixed by #89788
Labels: :Security/Authentication, Team:Security, >test-failure

Comments

@mark-vieira
Contributor

This and a couple of other Kerberos tests are failing on Amazon Linux 2022. This OS was recently added to our testing matrix so it's likely there's something funny with that CI image that these tests don't like. Since this seems to only reproduce on that system, I was unable to reproduce locally.

We may need to spin up an Amazon 2022 machine on AWS to reproduce this.

Build scan:
https://gradle-enterprise.elastic.co/s/3tsrwmkn3as6k/tests/:x-pack:qa:kerberos-tests:javaRestTest/org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT/testLoginByUsernamePassword

Reproduction line:
./gradlew ':x-pack:qa:kerberos-tests:javaRestTest' --tests "org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.testLoginByUsernamePassword" -Dtests.seed=96CAAA1E169FAF62 -Dtests.locale=es-MX -Dtests.timezone=Australia/West -Druntime.java=17

Applicable branches:
main, 8.4, 8.3, 7.17

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT&tests.test=testLoginByUsernamePassword

Failure excerpt:

java.security.PrivilegedActionException: (No message provided)

  at __randomizedtesting.SeedInfo.seed([96CAAA1E169FAF62:AEEBA327C743A7FB]:0)
  at java.security.AccessController.doPrivileged(AccessController.java:716)
  at javax.security.auth.Subject.doAsPrivileged(Subject.java:584)
  at org.elasticsearch.xpack.security.authc.kerberos.SpnegoHttpClientConfigCallbackHandler.lambda$doAsPrivilegedWrapper$2(SpnegoHttpClientConfigCallbackHandler.java:207)
  at java.security.AccessController.doPrivileged(AccessController.java:569)
  at org.elasticsearch.xpack.security.authc.kerberos.SpnegoHttpClientConfigCallbackHandler.doAsPrivilegedWrapper(SpnegoHttpClientConfigCallbackHandler.java:207)
  at org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.executeRequestAndVerifyResponse(KerberosAuthenticationIT.java:176)
  at org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.testLoginByUsernamePassword(KerberosAuthenticationIT.java:120)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

  Caused by: org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:34083], URI [/_security/_authenticate], status line [HTTP/1.1 401 Unauthorized]
  {"error":{"root_cause":[{"type":"security_exception","reason":"missing authentication credentials for REST request [/_security/_authenticate]","header":{"WWW-Authenticate":["Basic realm=\"security\" charset=\"UTF-8\"","Negotiate","Bearer realm=\"security\"","ApiKey"]}}],"type":"security_exception","reason":"missing authentication credentials for REST request [/_security/_authenticate]","header":{"WWW-Authenticate":["Basic realm=\"security\" charset=\"UTF-8\"","Negotiate","Bearer realm=\"security\"","ApiKey"]}},"status":401}

    at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:313)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:288)
    at org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.lambda$executeRequestAndVerifyResponse$0(KerberosAuthenticationIT.java:178)
    at java.security.AccessController.doPrivileged(AccessController.java:712)
    at javax.security.auth.Subject.doAsPrivileged(Subject.java:584)
    at org.elasticsearch.xpack.security.authc.kerberos.SpnegoHttpClientConfigCallbackHandler.lambda$doAsPrivilegedWrapper$2(SpnegoHttpClientConfigCallbackHandler.java:207)
    at java.security.AccessController.doPrivileged(AccessController.java:569)
    at org.elasticsearch.xpack.security.authc.kerberos.SpnegoHttpClientConfigCallbackHandler.doAsPrivilegedWrapper(SpnegoHttpClientConfigCallbackHandler.java:207)
    at org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.executeRequestAndVerifyResponse(KerberosAuthenticationIT.java:176)
    at org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.testLoginByUsernamePassword(KerberosAuthenticationIT.java:120)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:568)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
    at java.lang.Thread.run(Thread.java:833)

@mark-vieira mark-vieira added :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) >test-failure Triaged test failures from CI labels Aug 12, 2022
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Aug 12, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@droberts195
Contributor

This error message is also in the output:

>>>KRBError:
	 cTime is Mon Mar 14 10:41:49 AQTT 2022 1647236509000
	 sTime is Mon Aug 15 11:32:47 AQTT 2022 1660545167000
	 suSec is 311990
	 error code is 7
	 error Message is Server not found in Kerberos database
	 crealm is BUILD.ELASTIC.CO
	 cname is george@BUILD.ELASTIC.CO
	 sname is HTTP/127.0.0.1@BUILD.ELASTIC.CO
	 msgType is 30
KrbException: Server not found in Kerberos database (7) - LOOKING_UP_SERVER
	at java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:72)
	at java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:224)
	at java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:235)
	at java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCredsSingle(CredentialsUtil.java:477)
	at java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:340)
	at java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:314)
	at java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:169)
	at java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:493)
	at java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:700)
	at java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
	at java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
	at java.security.jgss/sun.security.jgss.spnego.SpNegoContext.GSS_initSecContext(SpNegoContext.java:883)
	at java.security.jgss/sun.security.jgss.spnego.SpNegoContext.initSecContext(SpNegoContext.java:315)
	at java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
	at java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
	at org.apache.http.impl.auth.GGSSchemeBase.generateGSSToken(GGSSchemeBase.java:123)
	at org.apache.http.impl.auth.SPNegoScheme.generateToken(SPNegoScheme.java:95)
	at org.apache.http.impl.auth.GGSSchemeBase.authenticate(GGSSchemeBase.java:221)
	at org.apache.http.impl.auth.SPNegoScheme.authenticate(SPNegoScheme.java:85)
	at org.apache.http.impl.auth.HttpAuthenticator.doAuth(HttpAuthenticator.java:233)
	at org.apache.http.impl.auth.HttpAuthenticator.generateAuthResponse(HttpAuthenticator.java:198)
	at org.apache.http.impl.nio.client.MainClientExec.generateRequest(MainClientExec.java:224)
	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.generateRequest(DefaultClientExchangeHandlerImpl.java:134)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.requestReady(HttpAsyncRequestExecutor.java:193)
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.produceOutput(DefaultNHttpClientConnection.java:287)
	at org.apache.http.impl.nio.client.InternalIODispatch.onOutputReady(InternalIODispatch.java:86)
	at org.apache.http.impl.nio.client.InternalIODispatch.onOutputReady(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.outputReady(AbstractIODispatch.java:145)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.writable(BaseIOReactor.java:187)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:341)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: KrbException: Identifier doesn't match expected value (906)
	at java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
	at java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
	at java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
	at java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:54)
	... 34 more

@mark-vieira
Contributor Author

FYI, there is some urgency on tracking this down. We want to certify that Elasticsearch is compatible with Amazon Linux 2022 so that it can be included in the official launch partner program. We have until 31 August to show proof of validation to Amazon for this.

@slobodanadamovic slobodanadamovic self-assigned this Aug 15, 2022
@slobodanadamovic
Contributor

@mark-vieira
The only thing which looks odd to me is the fact that Kerberos is searching for HTTP/127.0.0.1@BUILD.ELASTIC.CO server principal, but in our test es.keytab file we have only one entry defined for the principal HTTP/localhost@BUILD.ELASTIC.CO.

I'm suspecting this is caused by the missing mapping for 127.0.0.1 to localhost in /etc/hosts. I might be wrong, but maybe it's worth checking.

@mark-vieira
Contributor Author

> I'm suspecting this is caused by the missing mapping for 127.0.0.1 to localhost in /etc/hosts. I might be wrong, but maybe it's worth checking.

That's a reasonable assumption. I'll reach out to the infra team and see what's happening here.

@mark-vieira
Contributor Author

mark-vieira commented Aug 15, 2022

@slobodanadamovic The Kerberos server is running as a fixture in a Docker container. Are we perhaps missing an extra_hosts entry here:

- "kerberos.build.elastic.co:127.0.0.1"

I'm not familiar with how that principal gets resolved and what would determine localhost vs 127.0.0.1.

@mark-vieira
Contributor Author

Ok, that compose file fix didn't work and I've checked /etc/hosts and it includes a proper entry for localhost

127.0.0.1	localhost	localhost.localdomain	localhost4	localhost4.localdomain4

@slobodanadamovic do you mind perhaps syncing up to work through this? I have a remote environment that I can reproduce this error on.

@slobodanadamovic
Contributor

@mark-vieira Yeah, let's sync when you're online and we can debug it together.

@slobodanadamovic
Contributor

slobodanadamovic commented Aug 16, 2022

> Ok, that compose file fix didn't work and I've checked /etc/hosts and it includes a proper entry for localhost

@mark-vieira I did a bit more digging and I think the problem here is on the client side (Amazon Linux). Is it possible to check /etc/hosts on the Amazon Linux image?

The reason I think it's caused by DNS misconfiguration is because I was able to reproduce the error by changing the buildHttpHost method in KerberosAuthenticationIT to always return 127.0.0.1:port.

127.0.0.1 is defined in build.gradle:

setting 'http.host', '127.0.0.1'

This IP is then resolved to a host name by calling InetAddress.getCanonicalHostName(). If it fails to resolve a host name, getCanonicalHostName returns the textual form of the IP address it was given.

protected HttpHost buildHttpHost(String host, int port) {
    try {
        InetAddress inetAddress = InetAddress.getByName(host);
        return super.buildHttpHost(inetAddress.getCanonicalHostName(), port);
    } catch (UnknownHostException e) {
        assumeNoException("failed to resolve host [" + host + "]", e);
    }
    throw new IllegalStateException("DNS not resolved and assume did not trip");
}

The hostname resolved above is then used in the tests to form a principal by prefixing it with HTTP/:

(PrivilegedExceptionAction<GSSName>) () -> gssManager.createName("HTTP/" + serviceHost, null)
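
For anyone following along, here is a minimal standalone sketch of the chain described above: reverse-resolve the configured address, then prefix the result with HTTP/. The class and method names are hypothetical and are not taken from the Elasticsearch test code; only InetAddress.getByName, getCanonicalHostName and getHostAddress are real JDK calls.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical illustration class, not part of the Elasticsearch codebase.
final class ServicePrincipalSketch {

    // Mirrors the "HTTP/" + serviceHost concatenation quoted above.
    static String servicePrincipal(String serviceHost) {
        return "HTTP/" + serviceHost;
    }

    // True when reverse resolution handed back the textual IP address
    // instead of a host name, which is the symptom seen in the KRBError output.
    static boolean fellBackToIpLiteral(String ipLiteral, String canonicalHostName) {
        return ipLiteral.equals(canonicalHostName);
    }

    public static void main(String[] args) throws UnknownHostException {
        // An IP literal is parsed directly, without a name-service lookup.
        InetAddress loopback = InetAddress.getByName("127.0.0.1");
        // This call does consult the platform resolver, so the result varies by machine.
        String canonical = loopback.getCanonicalHostName();
        System.out.println("canonical host name: " + canonical);
        System.out.println("fell back to IP literal: "
            + fellBackToIpLiteral(loopback.getHostAddress(), canonical));
        System.out.println("service principal: " + servicePrincipal(canonical));
    }
}
```

Running main on the affected CI image and on a working machine should make the difference visible; the printed values depend entirely on the local resolver configuration.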

@mark-vieira
Contributor Author

> @mark-vieira I did a bit more digging and I think the problem here is on the client side (Amazon Linux). Is it possible to check /etc/hosts on the Amazon Linux image?

The snippet I showed above is of the /etc/hosts file on the Amazon host, not the Docker image. Sorry if that was confusing.

@slobodanadamovic
Contributor

slobodanadamovic commented Aug 17, 2022

> Sorry if that was confusing.

No worries. I saw a merge request adding extra_hosts to Kerberos' Docker compose file and assumed it's from docker image.

> The snippet I showed above is of the /etc/hosts file on the Amazon host, not the Docker image.

The hosts file looks okay. Would it be possible to get access to the remote environment where the error is reproducible?

@valeriy42
Contributor

Another failure here https://gradle-enterprise.elastic.co/s/we7tftzqd4eos

  • KerberosAuthenticationIT » testLoginByUsernamePassword FAILED
  • KerberosAuthenticationIT » testGetOauth2TokenInExchangeForKerberosTickets FAILED
  • KerberosAuthenticationIT » testLoginByKeytab FAILED

@slobodanadamovic
Contributor

slobodanadamovic commented Aug 31, 2022

I was able to reproduce the issue, and the problem seems to be caused by the /etc/hosts file. I have not yet figured out why, but the order of entries matters.

The first line of /etc/hosts file on Amazon Linux is:

::1	localhost6	localhost6.localdomain6

Somehow this causes 127.0.0.1 to resolve to 127.0.0.1 itself rather than to localhost.

It all works as expected when the first line of /etc/hosts is:

127.0.0.1	localhost	localhost.localdomain	localhost4	localhost4.localdomain4

I've tested this locally on my Ubuntu machine and it behaves the same.

@mark-vieira
Can we influence the order of entries in /etc/hosts file? How is /etc/hosts file generated?

@mark-vieira
Contributor Author

mark-vieira commented Aug 31, 2022

> @mark-vieira
> Can we influence the order of entries in /etc/hosts file? How is /etc/hosts file generated?

I'm not sure, this is probably a question for @elastic/ci-systems.

@droberts195
Contributor

I think the answer is that we could change the file for our own CI systems. But should we? If the out-of-the-box configuration causes our production code a problem as well as the integration test then users are going to run into it. Do we know that?

But even if this is purely a test problem, isn't the code pasted in #89324 (comment) in our test code? Couldn't that be changed to use something different if it initially decides on HTTP/127.0.0.1? For the CI team to change the machine setup will involve them writing some code, so it seems like we're asking another team to write some code to save the work of changing the code of a test. And if the way /etc/hosts is set up on Amazon Linux 2022 is indicative of how it will be set up on many future Linux distributions, then changing the test code would be a one-off, but changing the machine setup would need to be done for all those future distributions too.

@mark-vieira
Contributor Author

mark-vieira commented Aug 31, 2022

Thanks for the insight @droberts195. I agree completely. We should adapt our tests to work with this scenario as it's likely to be encountered again. Although it's possible we are unintentionally mitigating this on our other CI images, perhaps by disabling IPv6 networking altogether? But again, if possible, we should make our tests more robust here.

@slobodanadamovic
Contributor

slobodanadamovic commented Sep 1, 2022

I agree that the test should be made more robust so that it does not depend on the platform's DNS configuration. The Kerberos integration test already assumes that 127.0.0.1 will always resolve to localhost, since it defines a principal with localhost in a keytab file. I'll try to change this so that we no longer depend on the OS to resolve the 127.0.0.1 address.

FWIW the out-of-the-box /etc/hosts file on Amazon Linux lists the IPv4 loopback address first and looks like this:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost6 localhost6.localdomain6

On our test instance, /etc/hosts has been changed (I assume by Ansible) and looks like this:

::1	localhost6	localhost6.localdomain6

127.0.1.1	localhost6.localdomain6	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD0



127.0.0.1	localhost	localhost.localdomain	localhost4	localhost4.localdomain4
127.0.0.1	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD1	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD1
127.0.0.1	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD2	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD2
127.0.0.1	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD0	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD0
127.0.0.1	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD3	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD3
127.0.0.1	elasticsearch-ci-immutable-amazonlinux-2022-1234567890.fq.domain.name	elasticsearch-ci-immutable-amazonlinux-2022-1234567890

I added the /etc/hosts here for reference in case we come across other test failures in the future (which I don't expect). I only took a brief look in the codebase and I don't see that any other integration test depends on DNS resolution but in case we see some strange failures this might help to narrow down the issue.

@slobodanadamovic
Contributor

@mark-vieira @droberts195
I've raised a PR (#89788) which will fall back to localhost in case the loopback address cannot be resolved. It's a bit hacky but it should avoid problems with resolution of 127.0.0.1.
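
To illustrate the idea of the fall-back (this is a rough sketch under my own assumptions, not the code from the PR): if reverse resolution returns the IP literal unchanged and the address is a loopback address, substitute localhost. The class name and both helper methods below are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical sketch of a loopback fall-back; the actual change in the PR may differ.
final class LoopbackFallbackSketch {

    // Parses an IP literal; literals are parsed without any name-service lookup.
    static InetAddress parseLiteral(String ipLiteral) {
        try {
            return InetAddress.getByName(ipLiteral);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not a valid IP literal: " + ipLiteral, e);
        }
    }

    // Chooses the host part of the service principal. The canonical host name is
    // passed in explicitly so the decision can be exercised without touching DNS.
    static String hostForPrincipal(InetAddress address, String canonicalHostName) {
        // getCanonicalHostName() returns the textual IP when it cannot resolve a name.
        boolean unresolved = address.getHostAddress().equals(canonicalHostName);
        if (unresolved && address.isLoopbackAddress()) {
            return "localhost";
        }
        return canonicalHostName;
    }

    public static void main(String[] args) {
        InetAddress loopback = parseLiteral("127.0.0.1");
        String canonical = loopback.getCanonicalHostName();
        System.out.println("HTTP/" + hostForPrincipal(loopback, canonical));
    }
}
```

Non-loopback addresses are deliberately left alone in this sketch, since substituting localhost for them would hide a real misconfiguration.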

Also, I'd like us to consider adjusting our CI systems as well, so that they keep the original order of entries in the /etc/hosts file and only append new ones at the bottom. The order matters and implies the priority of each entry. It took some time to find the cause, and if we can avoid similar problems in the future it might be worth doing.

@droberts195
Contributor

> Also, I'd like us to consider adjusting our CI systems as well, so that they keep the original order of entries in the /etc/hosts file and only append new ones at the bottom.

From the example you posted I think the bug in the CI system setup is that there are multiple lines for 127.0.0.1. This is discussed in https://unix.stackexchange.com/questions/102660/hosts-file-is-it-incorrect-to-have-the-same-ip-address-on-multiple-lines and it seems to be a violation of the spec of that file, in particular:

> For each host a single line should be present

So in that sense it's fair enough to ask the CI systems team to modify the images.

However, I think it's also worth asking them if what they're doing is common practice, for example some standard way of using Packer or Ansible or whatever. If it is then it might also be worth adding something to the docs to say Elasticsearch doesn't work with multiple entries for the same IP address in /etc/hosts. For example, other OS gotchas are documented in https://www.elastic.co/guide/en/elasticsearch/reference/current/system-config.html.

It's easy to fall into the mindset of thinking we're employed to make all the tests pass, whereas in reality we're employed to make sure the software works for end users and the tests are just a tool to help us do that. So if we fix a test by changing the CI machine setup then we need to always think about whether end users need to do the same thing. For example, Elasticsearch needs a lot of file descriptors, and we avoid test failures caused by file descriptor exhaustion by making sure all our CI machines are configured to have at least 65536 file descriptors. But we also tell end users they need to configure their machines that way, otherwise we'd still get a load of support cases due to insufficient file descriptors despite all the tests passing.

@slobodanadamovic
Contributor

@droberts195

Thank you for engaging and looking into this issue as well. I will reach out to the CI team to get their opinion on this issue.

Regarding:

> For each host a single line should be present

This was something I noticed as well, but it seems to be allowed because resolution stops at the first matching line.
As part of this investigation I also realised that there is a hard limit on the length of each line in /etc/hosts. On Linux this limit seems to be hardcoded to 1024 characters. I assume the CI team splits the entries across multiple lines to stay under this limit, since that behaves much like one long entry with all the aliases.

> It's easy to fall into the mindset of thinking we're employed to make all the tests pass, whereas in reality we're employed to make sure the software works for end users and the tests are just a tool to help us do that.

I must shamefully admit that I fell into this mindset. :( I'll keep this in mind next time.

@mark-vieira
Contributor Author

> This was something I noticed as well, but it seems to be allowed because resolution stops at the first matching line.
> As part of this investigation I also realised that there is a hard limit on the length of each line in /etc/hosts. On Linux this limit seems to be hardcoded to 1024 characters. I assume the CI team splits the entries across multiple lines to stay under this limit, since that behaves much like one long entry with all the aliases.

Is there any scenario where the second entry is used though?

@slobodanadamovic
Contributor

> Is there any scenario where the second entry is used though?

The second entry is used to resolve the address based on the hostname.
Some use this as a way for blocking ads.
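
A toy model of the two lookup directions may make this clearer. It is a deliberate simplification under my own assumptions (first matching line wins for address-to-name, any line that lists the name is used for name-to-address) and does not reproduce what glibc or the JDK actually do; in particular it does not explain the failure on the CI image. All names are hypothetical, and packer-host is a shortened stand-in for the real host names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model of /etc/hosts lookups; real resolvers are more involved.
final class HostsFileModel {

    // One non-comment line: an address followed by one or more names.
    static final class Entry {
        final String address;
        final List<String> names;

        Entry(String address, List<String> names) {
            this.address = address;
            this.names = names;
        }
    }

    static List<Entry> parse(String hostsFile) {
        List<Entry> entries = new ArrayList<>();
        for (String line : hostsFile.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and full-line comments
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length < 2) {
                continue; // an address without any name is ignored
            }
            entries.add(new Entry(fields[0], Arrays.asList(fields).subList(1, fields.length)));
        }
        return entries;
    }

    // Address to name: the first line for the address wins, later lines are never reached.
    static Optional<String> nameFor(List<Entry> entries, String address) {
        for (Entry entry : entries) {
            if (entry.address.equals(address)) {
                return Optional.of(entry.names.get(0));
            }
        }
        return Optional.empty();
    }

    // Name to address: any line that lists the name can answer, including later ones.
    static Optional<String> addressFor(List<Entry> entries, String name) {
        for (Entry entry : entries) {
            if (entry.names.contains(name)) {
                return Optional.of(entry.address);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Entry> entries = parse(
            "127.0.0.1\tlocalhost\tlocalhost.localdomain\n"
                + "127.0.0.1\tpacker-host\tpacker-host\n");
        System.out.println(nameFor(entries, "127.0.0.1"));
        System.out.println(addressFor(entries, "packer-host"));
    }
}
```

In this model the second 127.0.0.1 line never affects address-to-name lookups, but it is the only line that can answer a lookup for packer-host.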

@mark-vieira
Contributor Author

Ok, it's still not completely clear to me whether the hosts file on this system is indeed wrong and whether we should fix it. To me this seems like a brittle assumption:

> The Kerberos integration test already assumes that 127.0.0.1 will always resolve to localhost

Realistically, whoever is setting up Kerberos auth has to take DNS resolution into consideration, so I don't think this is an actual product problem; it's a problem with test fixture assumptions. It seems #89788 should address this and make the test more robust to this scenario. I don't think changing the CI agent is a solution here; as you say, the multiple entries are likely a workaround for the line length limit.

So I'm apt to say that merging the linked PR is "good enough" here.

slobodanadamovic added a commit that referenced this issue Sep 8, 2022
Implemented a fall-back to `localhost` when FQDN for
loopback address (`127.0.0.1`) cannot be resolved.
This can happen if test platform's DNS resolution
is not properly configured.

Closes #89324
slobodanadamovic added a commit to slobodanadamovic/elasticsearch that referenced this issue Sep 8, 2022
…9788)

Implemented a fall-back to `localhost` when FQDN for
loopback address (`127.0.0.1`) cannot be resolved.
This can happen if test platform's DNS resolution
is not properly configured.

Closes elastic#89324
slobodanadamovic added a commit to slobodanadamovic/elasticsearch that referenced this issue Sep 8, 2022
…9788)

Implemented a fall-back to `localhost` when FQDN for
loopback address (`127.0.0.1`) cannot be resolved.
This can happen if test platform's DNS resolution
is not properly configured.

Closes elastic#89324
elasticsearchmachine pushed a commit that referenced this issue Sep 8, 2022
…89899)

Implemented a fall-back to `localhost` when FQDN for
loopback address (`127.0.0.1`) cannot be resolved.
This can happen if test platform's DNS resolution
is not properly configured.

Closes #89324
elasticsearchmachine pushed a commit that referenced this issue Sep 8, 2022
…89898)

Implemented a fall-back to `localhost` when FQDN for
loopback address (`127.0.0.1`) cannot be resolved.
This can happen if test platform's DNS resolution
is not properly configured.

Closes #89324
6 participants