:test:fixtures:krb5kdc-fixture:composeUp failure #48027
Pinging @elastic/es-core-infra (:Core/Infra/Build)
This seems like an issue with the underlying Docker image we are trying to start. @jbaiera can you have a look?
This particular failure is on ubuntu-1804. I know we've had similar issues on SLES and maybe Fedora 29; we pulled those from the general worker pool, but didn't get to the bottom of the issue.
Another one happened in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+part2/1688/consoleText:
The worker this happened on was elasticsearch-ci-immutable-ubuntu-1804-1571225771493903523. As it was ephemeral it's gone now.
This also just happened on a CI build on one of my PRs: https://gradle-enterprise.elastic.co/s/aexbtvi3nivxo/console-log?task=:test:fixtures:krb5kdc-fixture:composeUp
Both of those are on Ubuntu 18.04, so I'm starting to believe there's something specific to that platform, possibly a newer version of something (kernel or Docker) that we picked up with this version of Ubuntu. I have a set of PRs that change how we gather logs once the build completes, so that system logs are included.
Definitely seems related to Ubuntu 18.04. This Debian 10 build also failed for the same reason, which would make sense given that Ubuntu is based on Debian.
New failures should have a "GCP upload" link in the build scan. We have to wait for one of those to get additional insight on this.
@atorok I just took a look at the logs uploaded to GCP for this failure and the only thing in there was the Gradle daemon log which is effectively identical to the console output. What kind of Docker-specific logging were we expecting to capture here?
@atorok it looks like we aren't getting a
I opened #48276 to address that
Another failure here: https://gradle-enterprise.elastic.co/s/iaherl5mpkitc
After discussing: since this looks to be somewhat related to Docker volumes, one option would be to use a different storage driver here. We've already switched to using
@fatmcgav What would be the implications of ☝️? In https://github.com/elastic/infra/pull/15395 we conditionally apply this to only one image; what would be involved in doing this by default? What else would this affect (ideally only elasticsearch-ci images)?
@mark-vieira As to wider implications, I'm not sure... I'll have a look at what options exist... With regards to making the change across the board for
@fatmcgav Makes sense. My main concern is that changes in those ansible scripts are "global" and would apply to other project CI images as well. Is that accurate?
Another occurrence: https://gradle-enterprise.elastic.co/s/gyjdbjuetvz66
One concern I have with changing the storage driver to devicemapper is that we'd be switching away from Docker's default on many distributions, which is currently
I wonder if the specific issue is related to moby/moby#39475; do we have any
There have been some fixes in moby/moby#37993 that landed in Docker 18.09 that could be useful to read.
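For reference (and not a claim about how the CI images are actually configured), the storage driver is normally selected through the Docker daemon configuration, typically /etc/docker/daemon.json, and changing it requires a daemon restart. A minimal sketch of such an override, with the driver value purely illustrative:

```json
{
  "storage-driver": "devicemapper"
}
```

Whether devicemapper, overlay2, or something else is the right choice for these images is exactly the open question above.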
We'll get a full system log, including the kernel log, once elastic/infra#15520 merges. We might have it for some platforms already in the additional logs linked from the build scan. I think @dliappis brings up a good point: we need to make this change for elasticsearch-ci only and not packaging-ci, since we have packaging tests that run on GCP these days, not only the Vagrant ones.
The
I'm also starting to doubt whether this is necessarily related to the storage driver, since on SLES it's trying to create a
I saw this today and also on one of my PRs:
It seems that the new images where the logs would be included were not picked up by gobld.
What's indicating this? If this were the case wouldn't we get a completely empty
Also, I think we should consider blacklisting these images for "general-purpose" builds soon. This is currently the #1 non-test cause of build failures, and it's getting quite disruptive.
I take that back, since that wouldn't help with the platform matrix builds 😞
The console output would have:
The newer images do have
@dliappis we indeed have
Thanks for that, Zach. Yeah, the delay is introduced for all builds, so the theory there was that something on the host wasn't properly initialized and introducing the delay would help. I think that theory is busted at this point. Also, the fact that some of the compose tasks succeed and others do not keeps pointing me towards some kind of race condition in concurrently calling
I think experimenting with not running any of these tasks in parallel is worthwhile, even if it's a shitty solution. There's definitely a build performance penalty associated with it.
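For illustration, a rough sketch of how that experiment could be wired up from the root build script, assuming the fixture tasks are all named composeUp (this is only a hypothetical sketch, not the change that later went in as #51439):

```groovy
// Rough sketch: serialize all composeUp tasks so no two run concurrently.
// This only adds ordering constraints; it does not introduce new task dependencies.
gradle.projectsEvaluated {
    def composeUpTasks = allprojects
        .collect { it.tasks.findByName('composeUp') }
        .findAll { it != null }
    composeUpTasks.eachWithIndex { task, index ->
        if (index > 0) {
            task.mustRunAfter(composeUpTasks[index - 1])
        }
    }
}
```

Because mustRunAfter only orders tasks that are already scheduled, builds that touch a single fixture keep their current behavior; the cost shows up only when several compose fixtures land in the same invocation.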
Still seeing failures even after merging #51439. Starting to run out of ideas here.
I still think this might be somewhat related, though. The number of occurrences is definitely lower in
Here's something interesting: you can see in this example (which is missing a failure message of any kind) that calling
I'm wondering if there's also an issue with interacting with other concurrently running
It failed today on master: https://gradle-enterprise.elastic.co/s/aj77f6ib2zoy2
I'm not sure this is the exact same issue, but since a fix was merged a week ago you might find it interesting, @mark-vieira:
Yes, I have seen this continue to happen on master, as mentioned in #48027 (comment). The specific error differs between Linux distributions, but I'm quite certain they are all related; the root cause still evades me. I'm at my wit's end with this issue, to be honest, and frankly out of ideas. We need to consider simply disabling any tests that rely on Docker test fixtures on these known "bad" Linux distributions.
That said, it seems that since merging my "fix" into master, only SLES remains a culprit here, so that might make it easier to justify blacklisting that distro. To verify this I'm going to go ahead and backport that PR to
So it does indeed seem like it's just SLES that's failing with this now, after that change. It seems we already blacklist SLES for other Docker-related build steps in https://github.com/elastic/elasticsearch/blob/0a3a7d6179aabe21d1f51a88284fa183b2769d76/.ci/dockerOnLinuxExclusions#L16-L15, so it only makes sense to do so for
Here's a failure for SLES 15 on 6.8.
And a release test with the same error https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.6+release-tests/104/console
Grrr, looks like Debian 10 is still having occasional issues as well.
Another SLES 15 failure, this time in
Fix for this is coming.
Another failure, on 7.6 with SLES 12 for task
This should be addressed by #52736, as we'll no longer try to run Docker tasks on these known "bad" operating systems.
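As a rough illustration of that mechanism (hypothetical code, not the actual implementation in #52736), a build could read the host's /etc/os-release and skip Docker-backed fixtures whenever the id-version pair appears in an exclusions file such as .ci/dockerOnLinuxExclusions, assuming one "id-version" entry per line with '#' comments:

```groovy
// Hypothetical helper: decide whether Docker test fixtures may run on this host
// by comparing the OS id/version from /etc/os-release against an exclusions list.
static boolean dockerFixturesAllowed(File osRelease, File exclusions) {
    if (!osRelease.exists()) {
        return true // non-Linux hosts are handled elsewhere
    }
    Map<String, String> props = osRelease.readLines()
        .findAll { it.contains('=') }
        .collectEntries { line ->
            def (key, value) = line.split('=', 2)
            [(key.trim()): value.trim().replaceAll(/^"|"$/, '')]
        }
    String current = "${props['ID']}-${props['VERSION_ID']}"
    List<String> excluded = exclusions.readLines()
        .collect { it.trim() }
        .findAll { it && !it.startsWith('#') }
    return !excluded.contains(current)
}
```

A check like this can then gate the compose fixture tasks with onlyIf conditions, keeping all Docker usage behind the same exclusion list that the other Docker-related build steps already consult.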
@mark-vieira while investigating this failure: https://gradle-enterprise.elastic.co/s/5skdsekiggbmm (branch 7.6),
this task shouldn't execute; any idea what went wrong here?
CI failed on 7.x with this error:
Build scan
Looks like docker-compose had trouble creating an hdfs container. I'm not sure if this issue is more appropriate for Elasticsearch or Infra.