[CI] KerberosAuthenticationIT testLoginByUsernamePassword failing #89324

mark-vieira · 2022-08-12T17:36:59Z

This and a couple of other Kerberos tests are failing on Amazon Linux 2022. This OS was recently added to our testing matrix so it's likely there's something funny with that CI image that these tests don't like. Since this seems to only reproduce on that system, I was unable to reproduce locally.

We may need to spin up an Amazon 2022 machine on AWS to reproduce this.

Build scan:
https://gradle-enterprise.elastic.co/s/3tsrwmkn3as6k/tests/:x-pack:qa:kerberos-tests:javaRestTest/org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT/testLoginByUsernamePassword

Reproduction line:
./gradlew ':x-pack:qa:kerberos-tests:javaRestTest' --tests "org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.testLoginByUsernamePassword" -Dtests.seed=96CAAA1E169FAF62 -Dtests.locale=es-MX -Dtests.timezone=Australia/West -Druntime.java=17

Applicable branches:
main, 8.4, 8.3, 7.17

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT&tests.test=testLoginByUsernamePassword

Failure excerpt:

java.security.PrivilegedActionException: (No message provided)

  at __randomizedtesting.SeedInfo.seed([96CAAA1E169FAF62:AEEBA327C743A7FB]:0)
  at java.security.AccessController.doPrivileged(AccessController.java:716)
  at javax.security.auth.Subject.doAsPrivileged(Subject.java:584)
  at org.elasticsearch.xpack.security.authc.kerberos.SpnegoHttpClientConfigCallbackHandler.lambda$doAsPrivilegedWrapper$2(SpnegoHttpClientConfigCallbackHandler.java:207)
  at java.security.AccessController.doPrivileged(AccessController.java:569)
  at org.elasticsearch.xpack.security.authc.kerberos.SpnegoHttpClientConfigCallbackHandler.doAsPrivilegedWrapper(SpnegoHttpClientConfigCallbackHandler.java:207)
  at org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.executeRequestAndVerifyResponse(KerberosAuthenticationIT.java:176)
  at org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.testLoginByUsernamePassword(KerberosAuthenticationIT.java:120)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

  Caused by: org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:34083], URI [/_security/_authenticate], status line [HTTP/1.1 401 Unauthorized]
  {"error":{"root_cause":[{"type":"security_exception","reason":"missing authentication credentials for REST request [/_security/_authenticate]","header":{"WWW-Authenticate":["Basic realm=\"security\" charset=\"UTF-8\"","Negotiate","Bearer realm=\"security\"","ApiKey"]}}],"type":"security_exception","reason":"missing authentication credentials for REST request [/_security/_authenticate]","header":{"WWW-Authenticate":["Basic realm=\"security\" charset=\"UTF-8\"","Negotiate","Bearer realm=\"security\"","ApiKey"]}},"status":401}

    at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:313)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:288)
    at org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.lambda$executeRequestAndVerifyResponse$0(KerberosAuthenticationIT.java:178)
    at java.security.AccessController.doPrivileged(AccessController.java:712)
    at javax.security.auth.Subject.doAsPrivileged(Subject.java:584)
    at org.elasticsearch.xpack.security.authc.kerberos.SpnegoHttpClientConfigCallbackHandler.lambda$doAsPrivilegedWrapper$2(SpnegoHttpClientConfigCallbackHandler.java:207)
    at java.security.AccessController.doPrivileged(AccessController.java:569)
    at org.elasticsearch.xpack.security.authc.kerberos.SpnegoHttpClientConfigCallbackHandler.doAsPrivilegedWrapper(SpnegoHttpClientConfigCallbackHandler.java:207)
    at org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.executeRequestAndVerifyResponse(KerberosAuthenticationIT.java:176)
    at org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.testLoginByUsernamePassword(KerberosAuthenticationIT.java:120)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:568)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
    at java.lang.Thread.run(Thread.java:833)


elasticsearchmachine · 2022-08-12T17:37:22Z

Pinging @elastic/es-security (Team:Security)

droberts195 · 2022-08-15T08:29:15Z

This error message is also in the output:

>>>KRBError:
	 cTime is Mon Mar 14 10:41:49 AQTT 2022 1647236509000
	 sTime is Mon Aug 15 11:32:47 AQTT 2022 1660545167000
	 suSec is 311990
	 error code is 7
	 error Message is Server not found in Kerberos database
	 crealm is BUILD.ELASTIC.CO
	 cname is george@BUILD.ELASTIC.CO
	 sname is HTTP/127.0.0.1@BUILD.ELASTIC.CO
	 msgType is 30
KrbException: Server not found in Kerberos database (7) - LOOKING_UP_SERVER
	at java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:72)
	at java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:224)
	at java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:235)
	at java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCredsSingle(CredentialsUtil.java:477)
	at java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:340)
	at java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:314)
	at java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:169)
	at java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:493)
	at java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:700)
	at java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
	at java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
	at java.security.jgss/sun.security.jgss.spnego.SpNegoContext.GSS_initSecContext(SpNegoContext.java:883)
	at java.security.jgss/sun.security.jgss.spnego.SpNegoContext.initSecContext(SpNegoContext.java:315)
	at java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
	at java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
	at org.apache.http.impl.auth.GGSSchemeBase.generateGSSToken(GGSSchemeBase.java:123)
	at org.apache.http.impl.auth.SPNegoScheme.generateToken(SPNegoScheme.java:95)
	at org.apache.http.impl.auth.GGSSchemeBase.authenticate(GGSSchemeBase.java:221)
	at org.apache.http.impl.auth.SPNegoScheme.authenticate(SPNegoScheme.java:85)
	at org.apache.http.impl.auth.HttpAuthenticator.doAuth(HttpAuthenticator.java:233)
	at org.apache.http.impl.auth.HttpAuthenticator.generateAuthResponse(HttpAuthenticator.java:198)
	at org.apache.http.impl.nio.client.MainClientExec.generateRequest(MainClientExec.java:224)
	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.generateRequest(DefaultClientExchangeHandlerImpl.java:134)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.requestReady(HttpAsyncRequestExecutor.java:193)
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.produceOutput(DefaultNHttpClientConnection.java:287)
	at org.apache.http.impl.nio.client.InternalIODispatch.onOutputReady(InternalIODispatch.java:86)
	at org.apache.http.impl.nio.client.InternalIODispatch.onOutputReady(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.outputReady(AbstractIODispatch.java:145)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.writable(BaseIOReactor.java:187)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:341)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: KrbException: Identifier doesn't match expected value (906)
	at java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
	at java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
	at java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
	at java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:54)
	... 34 more

mark-vieira · 2022-08-15T15:56:15Z

FYI, there is some urgency on tracking this down. We want to certify that Elasticsearch is compatible with Amazon Linux 2022 to be included in the official launch partner program. We have until 31 August to show proof of validation to Amazon for this.

slobodanadamovic · 2022-08-15T19:35:03Z

@mark-vieira
The only thing which looks odd to me is the fact that Kerberos is searching for HTTP/127.0.0.1@BUILD.ELASTIC.CO server principal, but in our test es.keytab file we have only one entry defined for the principal HTTP/localhost@BUILD.ELASTIC.CO.

I'm suspecting this is caused by the missing mapping for 127.0.0.1 to localhost in /etc/hosts. I might be wrong, but maybe it's worth checking.

mark-vieira · 2022-08-15T19:36:41Z

> I'm suspecting this is caused by the missing mapping for 127.0.0.1 to localhost in /etc/hosts. I might be wrong, but maybe it's worth checking.

That's a reasonable assumption. I'll reach out to the infra team and see what's happening here.

mark-vieira · 2022-08-15T19:40:57Z

@slobodanadamovic The Kerberos server is running as a fixture in a Docker container. Are we perhaps missing an extra_hosts entry here:

elasticsearch/test/fixtures/krb5kdc-fixture/docker-compose.yml

Line 9 in 6bc4b56

- "kerberos.build.elastic.co:127.0.0.1"

I'm not familiar with how that principal gets resolved and what would determine localhost vs 127.0.0.1.

mark-vieira · 2022-08-15T23:36:30Z

Ok, that compose file fix didn't work and I've checked /etc/hosts and it includes a proper entry for localhost

127.0.0.1	localhost	localhost.localdomain	localhost4	localhost4.localdomain4

@slobodanadamovic do you mind perhaps syncing up to work through this? I have a remote environment that I can reproduce this error on.

slobodanadamovic · 2022-08-16T08:24:09Z

@mark-vieira Yeah, let's sync when you're online and we can debug it together.

slobodanadamovic · 2022-08-16T10:24:45Z

> Ok, that compose file fix didn't work and I've checked /etc/hosts and it includes a proper entry for localhost

@mark-vieira I did a bit more digging and I think the problem here is on the client side (Amazon Linux). Is it possible to check /etc/hosts on the Amazon Linux image?

The reason I think it's caused by DNS misconfiguration is because I was able to reproduce the error by changing the buildHttpHost method in KerberosAuthenticationIT to always return 127.0.0.1:port.

127.0.0.1 is defined in build.gradle:

elasticsearch/x-pack/qa/kerberos-tests/build.gradle

Line 19 in e4ff839

setting 'http.host', '127.0.0.1'

This IP is then resolved to a host name by calling InetAddress.getCanonicalHostName(). The getCanonicalHostName method returns the textual IP address it was given if it fails to resolve a host name.

elasticsearch/x-pack/qa/kerberos-tests/src/javaRestTest/java/org/elasticsearch/xpack/security/authc/kerberos/KerberosAuthenticationIT.java

Lines 158 to 166 in e4ff839

    
    protected HttpHost buildHttpHost(String host, int port) {
        try {
            InetAddress inetAddress = InetAddress.getByName(host);
            return super.buildHttpHost(inetAddress.getCanonicalHostName(), port);
        } catch (UnknownHostException e) {
            assumeNoException("failed to resolve host [" + host + "]", e);
        }
        throw new IllegalStateException("DNS not resolved and assume did not trip");
    }

Resolved hostname from above is then used in tests to form a principal by prefixing it with HTTP/:

elasticsearch/x-pack/qa/kerberos-tests/src/javaRestTest/java/org/elasticsearch/xpack/security/authc/kerberos/SpnegoHttpClientConfigCallbackHandler.java

Line 354 in f87ce07

    
           (PrivilegedExceptionAction<GSSName>) () -> gssManager.createName("HTTP/" + serviceHost, null)
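To check what a given machine would produce, the two steps above can be combined into a small standalone probe. This is only a sketch: the class name `PrincipalProbe` and its helper methods are hypothetical and not part of the Elasticsearch test code. It resolves a host the same way `buildHttpHost` does and prints the service name that would be built from the result.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic helper, not part of the Elasticsearch code base.
final class PrincipalProbe {

    // Mirrors the "HTTP/" + serviceHost concatenation quoted above.
    static String servicePrincipalName(String serviceHost) {
        return "HTTP/" + serviceHost;
    }

    // getCanonicalHostName() returns the textual IP address when no
    // host name can be resolved for the address.
    static String canonicalHost(String host) throws UnknownHostException {
        return InetAddress.getByName(host).getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        String canonical = canonicalHost(host);
        // The output depends on the resolver configuration of the machine.
        System.out.println(host + " -> " + canonical + " -> " + servicePrincipalName(canonical));
    }
}
```

If the last field printed is HTTP/127.0.0.1 rather than HTTP/localhost, the machine would request the principal reported in the KRBError output earlier in this thread.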

mark-vieira · 2022-08-16T18:33:33Z

> @mark-vieira I did a bit more digging and I think the problem here is on the client side (Amazon Linux). Is it possible to check /etc/hosts on the Amazon Linux image?

The snippet I showed above is of the /etc/hosts file on the Amazon host, not the Docker image. Sorry if that was confusing.

slobodanadamovic · 2022-08-17T09:37:14Z

> Sorry if that was confusing.

No worries. I saw a merge request adding extra_hosts to the Kerberos Docker compose file and assumed the snippet was from the Docker image.

> The snippet I showed above is of the /etc/hosts file on the Amazon host, not the Docker image.

The hosts file looks okay. Would it be possible to get access to the remote environment where the error is reproducible?

valeriy42 · 2022-08-29T08:29:44Z

Another failure here https://gradle-enterprise.elastic.co/s/we7tftzqd4eos

KerberosAuthenticationIT » testLoginByUsernamePassword FAILED
KerberosAuthenticationIT » testGetOauth2TokenInExchangeForKerberosTickets FAILED
KerberosAuthenticationIT » testLoginByKeytab FAILED

csoulios · 2022-08-30T14:21:18Z

Failed again today:
https://gradle-enterprise.elastic.co/s/7xeqkizdx7rji/

slobodanadamovic · 2022-08-31T16:09:32Z

I was able to reproduce the issue and the problem seems to be caused by the /etc/hosts file. I have not yet figured out why, but the order of entries matters.

The first line of /etc/hosts file on Amazon Linux is:

::1	localhost6	localhost6.localdomain6

This somehow causes 127.0.0.1 to be resolved not to localhost but to the literal 127.0.0.1.

It all works as expected when the first line of /etc/hosts is:

127.0.0.1	localhost	localhost.localdomain	localhost4	localhost4.localdomain4

I've tested this locally on my Ubuntu machine and it behaves the same.

@mark-vieira
Can we influence the order of entries in /etc/hosts file? How is /etc/hosts file generated?

mark-vieira · 2022-08-31T16:54:02Z

> @mark-vieira
> Can we influence the order of entries in /etc/hosts file? How is /etc/hosts file generated?

I'm not sure, this is probably a question for @elastic/ci-systems.

droberts195 · 2022-08-31T17:11:32Z

I think the answer is that we could change the file for our own CI systems. But should we? If the out-of-the-box configuration causes our production code a problem as well as the integration test then users are going to run into it. Do we know that?

But even if this is purely a test problem, isn't the code pasted in #89324 (comment) in our test code? Couldn't that be changed to use something different if it initially decides on HTTP/127.0.0.1? For the CI team to change the machine setup will involve them writing some code, so it seems like we're asking another team to write some code to save the work of changing the code of a test. And if the way /etc/hosts is set up on Amazon Linux 2022 is indicative of how it will be set up on many future Linux distributions then changing the test code would be a one-off but changing the machine setup would need to be done for all those future distributions too.

mark-vieira · 2022-08-31T17:14:51Z

Thanks for the insight @droberts195. I agree completely. We should adapt our tests to work with this scenario as it's likely to be encountered again. Although it's possible we are unintentionally mitigating this on our other CI images, perhaps by disabling IPv6 networking altogether? But again, if possible, we should make our tests more robust here.

slobodanadamovic · 2022-09-01T10:36:14Z

I agree that the test should be made more robust so that it does not depend on the platform's DNS configuration. The Kerberos integration test is anyway assuming that 127.0.0.1 will always be resolved to localhost since it has defined a principal with localhost in a keytab file. I'll try to change this to avoid depending on the OS to resolve the 127.0.0.1 address.

FWIW the out-of-the-box /etc/hosts file on Amazon Linux is defined with IPv4 loopback address first and looks like this:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost6 localhost6.localdomain6

on our test instance /etc/hosts is changed (I assume by ansible) and looks like this:

::1	localhost6	localhost6.localdomain6

127.0.1.1	localhost6.localdomain6	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD0



127.0.0.1	localhost	localhost.localdomain	localhost4	localhost4.localdomain4
127.0.0.1	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD1	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD1
127.0.0.1	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD2	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD2
127.0.0.1	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD0	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD0
127.0.0.1	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD3	packer-111111111-AAAA-BBBB-CCCC-DDDDDDDDDDD3
127.0.0.1	elasticsearch-ci-immutable-amazonlinux-2022-1234567890.fq.domain.name	elasticsearch-ci-immutable-amazonlinux-2022-1234567890

I added the /etc/hosts here for reference in case we come across other test failures in the future (which I don't expect). I only took a brief look in the codebase and I don't see that any other integration test depends on DNS resolution but in case we see some strange failures this might help to narrow down the issue.

slobodanadamovic · 2022-09-02T10:11:57Z

@mark-vieira @droberts195
I've raised a PR (#89788) which will fall back to localhost if the loopback address cannot be resolved. It's a bit hacky but it should avoid problems with resolution of 127.0.0.1.
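The fallback idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from #89788: the class `LoopbackFallback` and the method `serviceHost` are hypothetical names, and "unresolved" is taken to mean that the canonical name equals the textual IP address, which is what getCanonicalHostName returns when resolution fails.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical sketch of the fallback idea; not the code merged in the PR.
final class LoopbackFallback {

    // Returns the host name to use when building the service principal.
    static String serviceHost(InetAddress address, String canonicalName) {
        // A canonical name equal to the textual IP means no name was resolved.
        boolean unresolved = canonicalName.equals(address.getHostAddress());
        if (address.isLoopbackAddress() && unresolved) {
            // The test keytab only contains a principal for localhost.
            return "localhost";
        }
        return canonicalName;
    }

    public static void main(String[] args) throws UnknownHostException {
        // An IP literal is parsed directly, so getByName does no lookup here.
        InetAddress loopback = InetAddress.getByName("127.0.0.1");
        System.out.println(serviceHost(loopback, loopback.getCanonicalHostName()));
    }
}
```

Keeping the decision in a function that takes the already-resolved name makes it checkable without depending on the resolver configuration of the machine running it.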

Also, I'd like us to consider adjusting our CI systems as well and keep the original order of entries in the /etc/hosts file by only appending the new ones at the bottom of the file. The order matters and implies the priority of each entry. It took some time to find the cause and if we can avoid similar problems in the future it might be worth doing it.

droberts195 · 2022-09-05T08:44:06Z

> Also, I'd like us to consider adjusting our CI systems as well and keep the original order of entries in the /etc/hosts file by only appending the new ones at the bottom of the file.

From the example you posted I think the bug in the CI system setup is that there are multiple lines for 127.0.0.1. This is discussed in https://unix.stackexchange.com/questions/102660/hosts-file-is-it-incorrect-to-have-the-same-ip-address-on-multiple-lines and it seems to be a violation of the spec of that file, in particular:

> For each host a single line should be present

So in that sense it's fair enough to ask the CI systems team to modify the images.

However, I think it's also worth asking them if what they're doing is common practice, for example some standard way of using Packer or Ansible or whatever. If it is then it might also be worth adding something to the docs to say Elasticsearch doesn't work with multiple entries for the same IP address in /etc/hosts. For example, other OS gotchas are documented in https://www.elastic.co/guide/en/elasticsearch/reference/current/system-config.html.

It's easy to fall into the mindset of thinking we're employed to make all the tests pass, whereas in reality we're employed to make sure the software works for end users and the tests are just a tool to help us do that. So if we fix a test by changing the CI machine setup then we need to always think about whether end users need to do the same thing. For example, Elasticsearch needs a lot of file descriptors, and we avoid test failures caused by file descriptor exhaustion by making sure all our CI machines are configured to have at least 65536 file descriptors. But we also tell end users they need to configure their machines that way, otherwise we'd still get a load of support cases due to insufficient file descriptors despite all the tests passing.

slobodanadamovic · 2022-09-06T14:05:10Z

@droberts195

Thank you for engaging and looking into this issue as well. I will reach out to the CI team to get their opinion on this issue.

Regarding:

> For each host a single line should be present

This was something I've noticed as well but it seems it's allowed because the resolution stops at the first matching line.
As part of this bug investigation I've also realised that there is a hard limit on the size each line in /etc/hosts can have. For Linux this limit seems to be hardcoded to 1024 characters. I'm assuming the CI team is splitting it into multiple lines to avoid this limit, as that behaves similarly to one long entry with all aliases.

> It's easy to fall into the mindset of thinking we're employed to make all the tests pass, whereas in reality we're employed to make sure the software works for end users and the tests are just a tool to help us do that.

I must shamefully admit that I fell into this mindset. :( I'll keep this in mind next time.

mark-vieira · 2022-09-06T16:20:55Z

> This was something I've noticed as well but it seems it's allowed because the resolution stops at the first matching line.
> As part of this bug investigation I've also realised that there is a hard limit on the size each line in /etc/hosts can have. For Linux this limit seems to be hardcoded to 1024 characters. I'm assuming the CI team is splitting it into multiple lines to avoid this limit, as that behaves similarly to one long entry with all aliases.

Is there any scenario where the second entry is used though?

slobodanadamovic · 2022-09-07T08:35:35Z

> Is there any scenario where the second entry is used though?

The second entry is used to resolve the address based on the hostname.
Some use this as a way for blocking ads.
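A simplified model of the behaviour described in the last few comments, assuming that address-to-name lookups stop at the first line for an address while name-to-address lookups can match any line. The class `HostsFileModel` and the `ci-agent-1` host name are made up for illustration. Real resolvers apply additional rules, and this model does not reproduce the ordering failure reported above; it only shows why a second line for the same address can still be useful.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model of /etc/hosts lookups; not a real resolver.
final class HostsFileModel {

    // One parsed line: an address followed by a canonical name and aliases.
    record Entry(String address, List<String> names) {}

    static List<Entry> parse(String hostsText) {
        List<Entry> entries = new ArrayList<>();
        for (String line : hostsText.split("\n")) {
            String trimmed = line.strip();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comment lines
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length < 2) {
                continue; // an address without any name is ignored
            }
            entries.add(new Entry(fields[0], List.of(Arrays.copyOfRange(fields, 1, fields.length))));
        }
        return entries;
    }

    // Address -> name: the first line for the address wins.
    static Optional<String> nameFor(List<Entry> entries, String address) {
        return entries.stream()
            .filter(e -> e.address().equals(address))
            .map(e -> e.names().get(0))
            .findFirst();
    }

    // Name -> address: any line that lists the name can match.
    static Optional<String> addressFor(List<Entry> entries, String name) {
        return entries.stream()
            .filter(e -> e.names().contains(name))
            .map(Entry::address)
            .findFirst();
    }

    public static void main(String[] args) {
        List<Entry> entries = parse(String.join("\n",
            "127.0.0.1 localhost localhost.localdomain",
            "127.0.0.1 ci-agent-1 ci-agent-1"));
        System.out.println(nameFor(entries, "127.0.0.1"));
        System.out.println(addressFor(entries, "ci-agent-1"));
    }
}
```

In this model the second 127.0.0.1 line never affects the name returned for 127.0.0.1, but it is what makes `ci-agent-1` resolve to an address.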

mark-vieira · 2022-09-07T17:41:45Z

Ok, it's still not completely clear to me if the hosts file on this system is indeed wrong and that we should fix it. To me this seems like a brittle assumption:

> Kerberos integration test is anyway assuming that 127.0.0.1 will always be resolved to localhost

Realistically, whoever is setting up kerberos auth has to take into consideration DNS resolution, so I don't think this is an actual problem, it's a problem with test fixture assumptions. It seems #89788 should address this, and make the test more robust to this scenario. I don't think changing the CI agent is a solution here, as you say, the multiple entries are likely a workaround to the line length limits.

So I'm apt to say that merging the linked PR is "good enough" here.

Implemented a fall-back to `localhost` when FQDN for loopback address (`127.0.0.1`) cannot be resolved. This can happen if test platform's DNS resolution is not properly configured. Closes #89324


mark-vieira added the :Security/Authentication and >test-failure labels Aug 12, 2022

elasticsearchmachine added the Team:Security label Aug 12, 2022

slobodanadamovic self-assigned this Aug 15, 2022

mark-vieira mentioned this issue Aug 15, 2022

Add explicit host entry for localhost to kerberos test fixture #89350

Closed

slobodanadamovic mentioned this issue Aug 17, 2022

[CI] KerberosAuthenticationIT.testGetOauth2TokenInExchangeForKerberosTickets #89426

Closed

csoulios mentioned this issue Aug 30, 2022

[7.17] Mute flaky KerberosAuthenticationIT tests #89732

Closed

slobodanadamovic mentioned this issue Sep 2, 2022

Make hostname resolution for loopback address more robust. #89788

Merged

slobodanadamovic closed this as completed in #89788 Sep 8, 2022
