[CI] KerberosAuthenticationIT testLoginByUsernamePassword failing #89324
Comments
Pinging @elastic/es-security (Team:Security) |
This error message is also in the output:
|
FYI, there is some urgency on tracking this down. We want to certify that Elasticsearch is compatible with Amazon Linux 2022 so that it can be included in the official launch partner program. We have until 31 August to show Amazon proof of validation. |
@mark-vieira I'm suspecting this is caused by the missing mapping for |
That's a reasonable assumption. I'll reach out to the infra team and see what's happening here. |
@slobodanadamovic The kerberos server is running as a fixture in a Docker container. Are we perhaps missing a
I'm not familiar with how that principal gets resolved and what would determine |
Ok, that compose file fix didn't work and I've checked
@slobodanadamovic would you mind syncing up to work through this? I have a remote environment on which I can reproduce this error. |
@mark-vieira Yeah, let's sync when you're online and we can debug it together. |
@mark-vieira I did a bit more digging and I think the problem here is on the client side (Amazon Linux). Is it possible to check The reason I think it's caused by DNS misconfiguration is because I was able to reproduce the error by changing the
This IP is then resolved to a host name by calling Lines 158 to 166 in e4ff839
Resolved hostname from above is then used in tests to form a principal by prefixing it with Line 354 in f87ce07
|
The snippet I showed above is of the |
No worries. I saw a merge request adding extra_hosts to the Kerberos Docker compose file and assumed it came from the Docker image.
The hosts file looks okay. Would it be possible to get access to the remote environment where the error is reproducible? |
Another failure here https://gradle-enterprise.elastic.co/s/we7tftzqd4eos
|
I was able to reproduce the issue and the problem seems to be caused by the The first line of
This is somehow causing that the resolution of It all works as expected when a first line of
I've tested this locally on my Ubuntu machine and it behaves the same. @mark-vieira |
I'm not sure, this is probably a question for @elastic/ci-systems. |
I think the answer is that we could change the file for our own CI systems. But should we? If the out-of-the-box configuration causes our production code a problem as well as the integration test then users are going to run into it. Do we know that? But even if this is purely a test problem, isn't the code pasted in #89324 (comment) in our test code? Couldn't that be changed to use something different if it initially decides on |
Thanks for the insight @droberts195. I agree completely. We should adapt our tests to work with this scenario as it's likely to be encountered again. Although it's possible we are unintentionally mitigating this on our other CI images, perhaps by disabling IPv6 networking altogether? But again, if possible, we should make our tests more robust here. |
I agree that the test should be made more robust so that it does not depend on the platform's DNS configuration. The Kerberos integration test is anyway assuming that FWIW the out-of-the-box
on our test instance
I added the |
@mark-vieira @droberts195 Also, I'd like us to consider adjusting our CI systems as well, keeping the original order of entries in the |
From the example you posted I think the bug in the CI system setup is that there are multiple lines for
So in that sense it's fair enough to ask the CI systems team to modify the images. However, I think it's also worth asking them if what they're doing is common practice, for example some standard way of using Packer or Ansible or whatever. If it is then it might also be worth adding something to the docs to say Elasticsearch doesn't work with multiple entries for the same IP address in
It's easy to fall into the mindset of thinking we're employed to make all the tests pass, whereas in reality we're employed to make sure the software works for end users and the tests are just a tool to help us do that. So if we fix a test by changing the CI machine setup then we need to always think about whether end users need to do the same thing. For example, Elasticsearch needs a lot of file descriptors, and we avoid test failures caused by file descriptor exhaustion by making sure all our CI machines are configured to have at least 65536 file descriptors. But we also tell end users they need to configure their machines that way, otherwise we'd still get a load of support cases due to insufficient file descriptors despite all the tests passing. |
Thank you for engaging and looking into this issue as well. I will reach out to the CI team to get their opinion on this issue. Regarding:
This was something I've noticed as well but it seems it's allowed because the resolution stops at the first matching line.
I must shamefully admit that I fell into this mindset. :( I'll keep this in mind next time. |
Is there any scenario where the second entry is used though? |
The second entry is used to resolve the address based on the hostname. |
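To make the two directions of lookup discussed above concrete, here is a minimal sketch of a hosts-style table in which the same IP address appears on two lines. It only models the behavior described in this thread (reverse lookup stops at the first matching line, forward lookup can still hit a later line); it is not the platform resolver, and the class name `HostsFileModel`, the host name `ci-agent.example.test`, and the sample file contents are hypothetical, since the actual hosts file was not preserved here.

```java
import java.util.List;
import java.util.Optional;

// Simplified, hypothetical model of hosts-file lookups; not the real resolver.
class HostsFileModel {

    /** Reverse lookup: stop at the first line whose address matches and return its first name. */
    static Optional<String> reverse(List<String> lines, String ip) {
        for (String line : lines) {
            String[] f = fields(line);
            if (f.length >= 2 && f[0].equals(ip)) {
                return Optional.of(f[1]);
            }
        }
        return Optional.empty();
    }

    /** Forward lookup: return the address of the first line that lists the given name. */
    static Optional<String> forward(List<String> lines, String name) {
        for (String line : lines) {
            String[] f = fields(line);
            for (int i = 1; i < f.length; i++) {
                if (f[i].equals(name)) {
                    return Optional.of(f[0]);
                }
            }
        }
        return Optional.empty();
    }

    /** Strips a trailing '#' comment and splits the rest on whitespace. */
    private static String[] fields(String line) {
        int hash = line.indexOf('#');
        String content = (hash >= 0 ? line.substring(0, hash) : line).trim();
        return content.isEmpty() ? new String[0] : content.split("\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical contents: two lines for the same loopback address.
        List<String> hosts = List.of(
            "127.0.0.1 ci-agent.example.test",
            "127.0.0.1 localhost localhost.localdomain");

        // The first matching line wins for address-to-name resolution.
        System.out.println(reverse(hosts, "127.0.0.1").orElse("<none>"));
        // The second line is still reachable for name-to-address resolution.
        System.out.println(forward(hosts, "localhost").orElse("<none>"));
    }
}
```

In this model, swapping the two lines changes what the reverse lookup returns but not what the forward lookup returns, which is why the order of the entries matters for a test that derives a host name from `127.0.0.1`.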
Ok, it's still not completely clear to me whether the hosts file on this system is indeed wrong and whether we should fix it. To me this seems like a brittle assumption:
Realistically, whoever is setting up Kerberos auth has to take DNS resolution into consideration, so I don't think this is an actual problem; it's a problem with test fixture assumptions. It seems #89788 should address this and make the test more robust to this scenario. I don't think changing the CI agent is a solution here; as you say, the multiple entries are likely a workaround for the line length limits. So I'm inclined to say that merging the linked PR is "good enough" here. |
Implemented a fall-back to `localhost` when FQDN for loopback address (`127.0.0.1`) cannot be resolved. This can happen if test platform's DNS resolution is not properly configured. Closes #89324
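For readers who want a picture of what such a fall-back can look like, the following is a minimal sketch and not the code from the linked PR. It assumes that an unusable reverse lookup shows up as `null`, a blank string, or the IP literal echoed back (`InetAddress.getCanonicalHostName()` returns the textual IP address when it cannot resolve a name). The class name `LoopbackHostnameFallback` and both method names are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.Function;

// Hypothetical sketch of a "fall back to localhost" helper; not the PR's implementation.
class LoopbackHostnameFallback {

    /**
     * Returns the host name to use for the loopback address. If the reverse
     * lookup yields nothing usable (null, blank, or the IP literal itself),
     * "localhost" is returned instead.
     */
    static String resolveOrFallback(String loopbackIp, Function<String, String> reverseLookup) {
        String resolved = reverseLookup.apply(loopbackIp);
        if (resolved == null || resolved.isBlank() || resolved.equals(loopbackIp)) {
            return "localhost";
        }
        return resolved;
    }

    /** Reverse lookup backed by the platform resolver; returns null on failure. */
    static String platformLookup(String ip) {
        try {
            return InetAddress.getByName(ip).getCanonicalHostName();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Output depends on the machine's DNS and hosts configuration.
        System.out.println(resolveOrFallback("127.0.0.1", LoopbackHostnameFallback::platformLookup));
    }
}
```

Passing the lookup in as a function keeps the decision logic testable without touching the machine's DNS setup, which is the kind of platform dependence this issue ran into.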
This and a couple of other Kerberos tests are failing on Amazon Linux 2022. This OS was recently added to our testing matrix so it's likely there's something funny with that CI image that these tests don't like. Since this seems to only reproduce on that system, I was unable to reproduce locally.
We may need to spin up an Amazon Linux 2022 machine on AWS to reproduce this.
Build scan:
https://gradle-enterprise.elastic.co/s/3tsrwmkn3as6k/tests/:x-pack:qa:kerberos-tests:javaRestTest/org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT/testLoginByUsernamePassword
Reproduction line:
./gradlew ':x-pack:qa:kerberos-tests:javaRestTest' --tests "org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT.testLoginByUsernamePassword" -Dtests.seed=96CAAA1E169FAF62 -Dtests.locale=es-MX -Dtests.timezone=Australia/West -Druntime.java=17
Applicable branches:
main, 8.4, 8.3, 7.17
Reproduces locally?:
Didn't try
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.security.authc.kerberos.KerberosAuthenticationIT&tests.test=testLoginByUsernamePassword
Failure excerpt: