
:test:fixtures:krb5kdc-fixture:composeUp failure #48027

Closed
gwbrown opened this issue Oct 14, 2019 · 69 comments
Assignees
Labels
:Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@gwbrown
Contributor

gwbrown commented Oct 14, 2019

CI failed on 7.x with this error:

ERROR: for a5c3c1596d62b7b1842fcb2469c0c711_krb5kdc-fixture__hdfs_1  Cannot create container for service hdfs: error creating overlay mount to /var/lib/docker/overlay2/e40001620e6f9a60ea6001580ccd075835a2d8234c75c50bf056f66d9ba1cba7/merged: device or resource busy

Build scan

Looks like docker-compose had trouble creating an hdfs container. I'm not sure if this issue is more appropriate for Elasticsearch or Infra.
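Not from the report itself, but as a diagnostic sketch: the overlay2 layer directory can be pulled out of the compose error so it can be cross-checked on a live worker (e.g. against /proc/*/mountinfo) to see what holds it busy. The error string below is copied from above; the parsing is purely illustrative.

```shell
# Sketch: extract the overlay2 directory from the compose error message.
# On a live worker one could then check what is holding it, e.g.
#   grep "$dir" /proc/*/mountinfo
err='Cannot create container for service hdfs: error creating overlay mount to /var/lib/docker/overlay2/e40001620e6f9a60ea6001580ccd075835a2d8234c75c50bf056f66d9ba1cba7/merged: device or resource busy'
dir=$(printf '%s\n' "$err" | grep -o '/var/lib/docker/overlay2/[^:]*')
echo "$dir"
```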

@gwbrown gwbrown added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI labels Oct 14, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Build)

@rjernst
Member

rjernst commented Oct 14, 2019

This seems like an issue with the underlying docker image we are trying to start. @jbaiera can you have a look?

@alpar-t
Contributor

alpar-t commented Oct 15, 2019

This particular failure is on ubuntu-1804. I know we've had similar issues on SLES and maybe Fedora 29, and we pulled those images from the general worker pool, but we didn't get to the bottom of the issue.

@droberts195
Contributor

Another one happened in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+part2/1688/consoleText:

Creating network "a5c3c1596d62b7b1842fcb2469c0c711_krb5kdc-fixture__default" with the default driver
Creating a5c3c1596d62b7b1842fcb2469c0c711_krb5kdc-fixture__hdfs_1 ... 
Creating a5c3c1596d62b7b1842fcb2469c0c711_krb5kdc-fixture__peppa_1 ... 
Creating a5c3c1596d62b7b1842fcb2469c0c711_krb5kdc-fixture__peppa_1 ... error

ERROR: for a5c3c1596d62b7b1842fcb2469c0c711_krb5kdc-fixture__peppa_1  Cannot create container for service peppa: error creating overlay mount to /var/lib/docker/overlay2/9d2967ca570b8c46643a30a2e801bcdde60e0dfbbed40c7f41936e41d75f6f57/merged: device or resource busy
Creating a5c3c1596d62b7b1842fcb2469c0c711_krb5kdc-fixture__hdfs_1  ... done

ERROR: for peppa  Cannot create container for service peppa: error creating overlay mount to /var/lib/docker/overlay2/9d2967ca570b8c46643a30a2e801bcdde60e0dfbbed40c7f41936e41d75f6f57/merged: device or resource busy
Encountered errors while bringing up the project.
Stopping a5c3c1596d62b7b1842fcb2469c0c711_krb5kdc-fixture__hdfs_1 ... 

The worker this happened on was elasticsearch-ci-immutable-ubuntu-1804-1571225771493903523. As it was ephemeral it's gone now.

@alpar-t
Contributor

alpar-t commented Oct 16, 2019

Both of those are on Ubuntu 18, so I'm starting to believe there's something specific to that platform, possibly specific to a newer version of something (kernel or Docker) that shipped in this version of Ubuntu.

I have a set of PRs that changes how we gather logs once the build completes to include system logs.
I'm hoping there will be something useful in there. I'm holding off on removing the platform from the general pool for now.
I tried to reproduce something similar on SLES but it never failed, so it might also have to do with the fact that the worker had just started.

@alpar-t alpar-t self-assigned this Oct 16, 2019
@alpar-t
Contributor

alpar-t commented Oct 18, 2019

New failures should have a "GCP upload" link in the build scan. We have to wait for one of those to get additional insight on this.

@mark-vieira
Contributor

@atorok I just took a look at the logs uploaded to GCP for this failure and the only thing in there was the Gradle daemon log which is effectively identical to the console output. What kind of Docker-specific logging were we expecting to capture here?

@mark-vieira
Contributor

@atorok it looks like we aren't getting a journalctl log because we need to call supply() here:

@alpar-t
Contributor

alpar-t commented Oct 21, 2019

I opened #48276 to address that

@romseygeek
Contributor

Another failure here: https://gradle-enterprise.elastic.co/s/iaherl5mpkitc

@mark-vieira
Contributor

After discussing, since this looks to be somehow related to Docker volumes, one option would be to use a different storage driver here. We've already switched to using devicemapper on CentOS 8 due to another issue; perhaps we could try using devicemapper across the board.

@fatmcgav What would be the implications of ☝️? In https://github.com/elastic/infra/pull/15395 we conditionally apply this to only one image; what would be involved in doing this by default? What else would this affect (ideally only elasticsearch-ci images)?

@fatmcgav

@mark-vieira As to wider implications, I'm not sure... I'll have a look at what options exist...

With regards to making the change across the board for elasticsearch-ci images, that's also something we could look at doing... We'd need to handle the sysv/systemd behaviour differences, but we've already got systemd supported via https://github.com/elastic/infra/pull/15395

@mark-vieira
Contributor

mark-vieira commented Oct 23, 2019

@fatmcgav Makes sense. My main concern is that changes in those ansible scripts are "global" and would apply to other project CI images as well. Is that accurate?

@dliappis
Contributor

Another occurrence: https://gradle-enterprise.elastic.co/s/gyjdbjuetvz66

One concern I have with changing the storage driver to devicemapper is that we'd be switching away from Docker's default on many distributions, which is currently overlay2. Perhaps this is not a huge issue as e.g. the Docker packaging tests are performed inside Vagrant, but I'd still feel more confident doing Docker-related things using the default storage driver elected for the distro.
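For reference, the storage-driver switch discussed here is a daemon-level setting; a minimal /etc/docker/daemon.json sketch would look like this (requires a daemon restart, and existing overlay2 images/containers are not migrated):

```json
{
  "storage-driver": "devicemapper"
}
```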

I wonder if the specific issue is related to moby/moby#39475; do we have any dmesg output? I'd look for something like overlayfs: lowerdir is in-use as upperdir/workdir there.

There have been some fixes in moby/moby#37993 that landed in Docker 18.09 that could be useful to read.

@alpar-t
Contributor

alpar-t commented Oct 24, 2019

We'll get a full system log, including the kernel log, once elastic/infra#15520 merges. We might have it for some platforms already in the additional logs linked from the build scan.

I think @dliappis brings up a good point. We need to make this change for elasticsearch-ci only and not packaging-ci; we have packaging tests that run on GCP these days, not only the Vagrant ones.

@mark-vieira
Contributor

The journalctl logs haven't provided any insight. On SLES the log is completely empty and on Ubuntu there are no entries.

-- Logs begin at Tue 2019-10-29 04:09:52 UTC, end at Tue 2019-10-29 04:42:35 UTC. --
-- No entries --

I'm also starting to doubt whether this is necessarily related to the storage driver, as on SLES it's trying to create a tmpfs mount, not overlay2, yet encountering the same error.


@alpar-t
Contributor

alpar-t commented Oct 30, 2019

It seems that the new images where the logs would be included were not picked up by gobld.

@mark-vieira
Contributor

It seems that the new images where the logs would be included were not picked up by gobld.

What's indicating this? If this were the case wouldn't we get a completely empty journalctl.log file? Checking again today I'm seeing the same in the log.

-- Logs begin at Wed 2019-10-30 04:09:46 UTC, end at Wed 2019-10-30 04:42:29 UTC. --
-- No entries --

@mark-vieira
Contributor

Also, I think we should consider blacklisting these images for "general-purpose" builds soon. This is currently the #1 non-test failure cause of build failures. It's getting quite disruptive.

@mark-vieira
Contributor

Also, I think we should consider blacklisting these images for "general-purpose" builds soon. This is currently the #1 non-test failure cause of build failures. It's getting quite disruptive.

I take that back, since that wouldn't solve the platform matrix builds 😞

@alpar-t
Contributor

alpar-t commented Oct 31, 2019

The console output would have:

Hint: You are currently not seeing messages from other users and the system.
      Users in groups 'adm', 'systemd-journal' can see all messages.
      Pass -q to turn off this notice.

The newer images do have jenkins added to the systemd-journal group, and manually starting one does show the journal.

@alpar-t
Contributor

alpar-t commented Oct 31, 2019

@dliappis we indeed have overlayfs: lowerdir is in-use as upperdir/workdir in the logs, but we are running a newer Docker and an older kernel:

docker version
...
 Engine:
  Version:          19.03.4
...
jenkins@alpar-ci-ubuntu-1804-lts-1030-1809:/dev/shm/elasticsearch$ uname -a 
Linux alpar-ci-ubuntu-1804-lts-1030-1809 5.0.0-1021-gcp #21~18.04.1-Ubuntu SMP Mon Oct 7 04:51:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

@mark-vieira
Contributor

Thanks for that, Zach. Yeah, the delay is introduced for all builds, so the theory was that something on the host wasn't properly initialized and introducing the delay would help.

I think that theory is busted at this point. Also, the fact that some of the compose tasks succeed and others do not keeps leaning me towards some kind of race condition in concurrently calling docker-compose.

I think experimenting with not running any of these tasks in parallel is a worthwhile experiment, even if it's a shitty solution. There's definitely a build performance penalty associated with it.
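One crude way to serialize the invocations, sketched outside of Gradle: wrap each docker-compose call in a file lock. Everything here is hypothetical (the lock path, the function name, and the echoed stand-in for the real `docker-compose up` call); the actual experiment would go through the build's task scheduling instead.

```shell
# Hypothetical sketch: serialize docker-compose invocations with flock(1) so no
# two compose projects are brought up concurrently. The echo stands in for the
# real "docker-compose up" call; /tmp/compose.lock is an arbitrary lock file.
compose_up() {
  flock /tmp/compose.lock -c "echo docker-compose -p $1 up -d"
}
compose_up fixture-a
compose_up fixture-b
```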

@mark-vieira
Contributor

Still seeing failures even after merging #51439. Starting to run out of ideas here.

@mark-vieira
Contributor

I still think this might be somewhat related though. The number of occurrences is definitely lower in master after the "fix", which leads me to believe there is some kind of race condition, but perhaps there are other processes in play here, not just our composeUp tasks.

@mark-vieira
Contributor

Here's something interesting: you can see in this example (which is missing a failure message of any kind) that calling composeUp first tries to remove some existing containers, as if the task has already been run or there is some kind of stale state on the machine. Given this is an ephemeral worker, this makes no sense to me.

https://gradle-enterprise.elastic.co/s/4uw3yxiq6ep4g/console-log?task=:x-pack:test:idp-fixture:composeUp

@mark-vieira
Contributor

I'm wondering if there's also an issue with interaction between concurrently running composeUp and composeDown tasks? Perhaps all of these should be restricted to run serially to avoid any funniness. My test build only runs composeUp, which would explain why I've no longer seen any issues after merging #51439.

@tlrx
Member

tlrx commented Feb 6, 2020

It failed today on master: https://gradle-enterprise.elastic.co/s/aj77f6ib2zoy2

I'm not sure this is the exact same issue, but since a fix was merged a week ago you might find it interesting @mark-vieira :
Caused by: java.lang.RuntimeException: Exit-code 1 when calling /usr/local/bin/docker-compose, stdout: N/A

@mark-vieira
Contributor

Yes, I have seen this continue to happen on master, as mentioned in #48027 (comment). The specific error differs between Linux distributions, but I'm quite certain they are all related; the root cause evades me.

I'm at my wits' end with this issue, to be honest, and I'm frankly out of ideas. We need to consider simply disabling any tests that rely on Docker test fixtures on these known "bad" Linux distributions.

@mark-vieira
Contributor

mark-vieira commented Feb 6, 2020

That said, it seems that since merging my "fix" into master, only SLES remains a culprit here, which might make it easier to justify blacklisting that distro. To verify this I'm going to go ahead and backport that PR to 7.x and earlier to see if I notice the same pattern.

@mark-vieira
Contributor

So it does indeed seem like it's just SLES that's failing with this now after that change. We already blacklist SLES for other Docker-related build steps in https://github.com/elastic/elasticsearch/blob/0a3a7d6179aabe21d1f51a88284fa183b2769d76/.ci/dockerOnLinuxExclusions#L16-L15, so it only makes sense to do so for docker-compose as well. We have a lot of Docker-related build logic spread across several plugins; I think it makes sense to finally align this stuff, so that when we "blacklist docker" we do so across the board, not just for some operations. I'm going to do that in a PR to address this issue.
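An exclusions file like this is essentially a list of OS identifiers, one per line. The matching logic can be sketched as below; the identifiers and the inline list here are illustrative, not the real contents of .ci/dockerOnLinuxExclusions.

```shell
# Illustrative sketch of a dockerOnLinuxExclusions-style check: the file holds
# one OS key per line (values below are examples, not the real file), and docker
# tasks are skipped when the current worker's key appears in the list.
exclusions='sles-12.5
sles-15
debian-10'
current='sles-15'   # on a real worker this would be derived from /etc/os-release
if printf '%s\n' "$exclusions" | grep -qx "$current"; then
  echo "docker tasks disabled on $current"
else
  echo "docker tasks enabled on $current"
fi
```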

@davidkyle
Member

So it does indeed seem like it's just SLES that's failing with this now after that change.

Here's a failure for SLES 15 on 6.8.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.8+multijob-unix-compatibility/os=sles-15&&immutable/562/console

ERROR: for elasticsearch-oss Cannot start service elasticsearch-oss: unable to remount dir as readonly: mount tmpfs:/var/lib/docker/containers/d4f77774c82c36f495d186c3f3e615d51aaf7176456bd5853cd6c7677c001cca/mounts/secrets, flags: 0x21, data: uid=0,gid=0: device or resource busy

@davidkyle
Member

And a release test with the same error

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.6+release-tests/104/console

ERROR: for d8865decd592dfc16994307e75357a63_idp-fixture__shibboleth-idp_1  Cannot create container for service shibboleth-idp: error creating overlay mount to /var/lib/docker/overlay2/ed1a5dbf1008faaa838bf73387ba916a20e90ccd43fca7a3bbdc73cea62b4646/merged: device or resource busy

@mark-vieira
Contributor

Grrr, looks like Debian 10 is still having occasional issues as well.

@pugnascotia
Contributor

Another SLES 15 failure, this time in :qa:remote-clusters:composeUp 😢

https://gradle-enterprise.elastic.co/s/ew2jsmzctud2q

@mark-vieira
Contributor

Fix for this is coming.

@astefan
Contributor

astefan commented Feb 17, 2020

Another failure, on 7.6 with SLES 12, for task :test:fixtures:minio-fixture:composeUp, with an identical error message:

01:26:49 > Task :test:fixtures:minio-fixture:composeUp
01:26:49 Creating 2b75b65dda7279078f50b53ea2c3c31f_minio-fixture__minio-fixture_1 ... 
01:26:49 Creating 2b75b65dda7279078f50b53ea2c3c31f_minio-fixture__minio-fixture-for-snapshot-tool_1 ... 
01:26:49 Creating 2b75b65dda7279078f50b53ea2c3c31f_minio-fixture__minio-fixture_1                   ... error
01:26:49 
01:26:49 ERROR: for 2b75b65dda7279078f50b53ea2c3c31f_minio-fixture__minio-fixture_1  Cannot start service minio-fixture: unable to remount dir as readonly: mount tmpfs:/var/lib/docker/containers/932871950a6887081a1b8db70e49b456571fd208b17478d6c150a5072440f7e1/mounts/secrets, flags: 0x21, data: uid=0,gid=0: device or resource busy
01:26:49 Creating 2b75b65dda7279078f50b53ea2c3c31f_minio-fixture__minio-fixture-for-snapshot-tool_1 ... done
01:26:49 
01:26:49 ERROR: for minio-fixture  Cannot start service minio-fixture: unable to remount dir as readonly: mount tmpfs:/var/lib/docker/containers/932871950a6887081a1b8db70e49b456571fd208b17478d6c150a5072440f7e1/mounts/secrets, flags: 0x21, data: uid=0,gid=0: device or resource busy
01:26:49 Encountered errors while bringing up the project.
01:26:49 Stopping 2b75b65dda7279078f50b53ea2c3c31f_minio-fixture__minio-fixture-for-snapshot-tool_1 ... 

@mark-vieira
Contributor

This should be addressed by #52736 as we'll no longer try to run docker tasks on these known "bad" operating systems.

@dliappis
Contributor

@mark-vieira while investigating this failure: https://gradle-enterprise.elastic.co/s/5skdsekiggbmm (branch 7.6),
I still see krb5kdc-fixture attempting to create docker networks (see below) on opensuse-15.1. Based on #42736, this task shouldn't execute; any idea what went wrong here?

> Task :test:fixtures:krb5kdc-fixture:composeUp
09:28:47 Building hdfs
09:28:48 /bin/sh: /tmp/_MEIVjssgb/libreadline.so.7: no version information available (required by /bin/sh)
09:28:48 Creating network "8e4ad254157fd4e9add1175f6536a0ad_krb5kdc-fixture__default" with the default driver
09:28:48 Creating 8e4ad254157fd4e9add1175f6536a0ad_krb5kdc-fixture__peppa_1 ... 
09:28:48 Creating 8e4ad254157fd4e9add1175f6536a0ad_krb5kdc-fixture__hdfs_1  ... 
09:28:49 Creating 8e4ad254157fd4e9add1175f6536a0ad_krb5kdc-fixture__peppa_1 ... done
09:28:49 Creating 8e4ad254157fd4e9add1175f6536a0ad_krb5kdc-fixture__hdfs_1  ... done
09:28:49 
09:28:49 > Task :distribution:bwc:maintenance:fetchLatest
09:28:51 
09:28:51 > Task :test:fixtures:krb5kdc-fixture:composeUp
09:28:51 /bin/sh: /tmp/_MEI5Ycyaj/libreadline.so.7: no version information available (required by /bin/sh)
09:28:51 /bin/sh: /tmp/_MEIHZ3YZN/libreadline.so.7: no version information available (required by /bin/sh)
09:28:51 Will use 172.18.0.1 (network 8e4ad254157fd4e9add1175f6536a0ad_krb5kdc-fixture__default) as host of peppa
09:28:51 Will use 172.18.0.1 (network 8e4ad254157fd4e9add1175f6536a0ad_krb5kdc-fixture__default) as host of hdfs
...

@dliappis dliappis reopened this Apr 10, 2020
@mark-vieira
Contributor

Thanks for reporting this @dliappis. It seems the reported identifier on opensuse has changed slightly. I've updated this in f2c0ad6 so these docker tasks will be properly skipped on this platform.
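The identifier in question comes from /etc/os-release. A sketch of deriving such a key from a sample openSUSE entry: ID and VERSION_ID are standard os-release fields, but the sample values and the exact "ID-VERSION_ID" key format assumed here are illustrative, not taken from the build code.

```shell
# Sketch: derive an "ID-VERSION_ID" style key from a sample /etc/os-release.
# The sample content is illustrative; on a worker one would read the real file.
sample='NAME="openSUSE Leap"
ID="opensuse-leap"
VERSION_ID="15.1"'
id=$(printf '%s\n' "$sample" | sed -n 's/^ID=//p' | tr -d '"')
ver=$(printf '%s\n' "$sample" | sed -n 's/^VERSION_ID=//p' | tr -d '"')
echo "${id}-${ver}"
```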
