
Cloudstor EBS Volume Recreated on Service Restart, Original Volume Destroyed #176

dviator commented Oct 8, 2018

Expected behavior

When a swarm service crashes and restarts, it mounts the same cloudstor EBS volume it was using before it restarted.
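
For context, the service mounts the volume through the cloudstor:aws volume driver; our setup looks roughly like the following (the service and volume names here are illustrative, not our exact ones):

~ $ docker service create \
      --name jenkins \
      --mount type=volume,volume-driver=cloudstor:aws,source=jenkins-data,target=/var/jenkins_home \
      jenkins/jenkins:lts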

Actual behavior

A new EBS volume was created in AWS with the same CloudstorVolumeName tag. The new volume was mounted in the restarted service, a Jenkins master. As a result, the service lost access to its configuration data and came up as what appeared to be an entirely fresh instance.

At this point, the original volume was listed in AWS as 'available', while the newly created volume was listed as 'in-use'.

Unfortunately, while investigating, we restarted the service. This triggered the issue again, causing yet another new volume to be created and mounted by the service: the third volume in total.

At that point, I happened to see in the AWS console that the original volume containing our actual data had been destroyed and had disappeared from the console. The two EBS volumes with the same Cloudstor name are now both listed as 'in-use'. I am not entirely sure which one is actually mounted in the container, though I suspect it is the latest one.
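
Checking from the AWS side, a describe-volumes call filtered on the CloudstorVolumeName tag shows each duplicate's state and attachment (the tag value below is illustrative):

~ $ aws ec2 describe-volumes \
      --filters "Name=tag:CloudstorVolumeName,Values=jenkins-data" \
      --query "Volumes[].{Id:VolumeId,State:State,Attached:Attachments[0].InstanceId}" \
      --output table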

Information

~ $ docker-diagnose
OK hostname=ip-172-31-2-55-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-38-198-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-16-88-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-28-72-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-33-252-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-13-102-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-33-37-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
Done requesting diagnostics.
Your diagnostics session ID is 1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3

~ $ docker version
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:05:03 2017
OS/Arch: linux/amd64

Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:12:30 2017
OS/Arch: linux/amd64
Experimental: true

It may also be relevant that our swarm appears to be affected by the issue where docker events are never received in the swarm, described here: moby/moby#36834
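
For what it's worth, that symptom is easy to check on a manager node; something like the following stays silent for us even while services restart:

~ $ docker events --filter type=service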

Steps to reproduce the behavior

As this is our production swarm, and this issue may result in losing a data volume, I am very reluctant to reproduce it here. Until now, Cloudstor has behaved as expected in both our staging and production swarms.

I have used the workaround from issue #122 to take manual backup snapshots, which now appear to avoid deletion, but I'd like to take some time to be certain that these snapshots will stick around and be restorable in case this issue happens to any more of our volumes.
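
The snapshots themselves are ordinary EBS snapshots of the data volume, taken along these lines (the volume ID below is a placeholder):

~ $ aws ec2 create-snapshot \
      --volume-id vol-0123456789abcdef0 \
      --description "manual cloudstor backup: jenkins-data"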

Happy to provide more info as needed; I am not sure where to find the swarm system logs that would help deduce what exactly went wrong here.
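
My best guess for where to look is the engine journal and the managed plugin on each node, e.g.:

~ $ journalctl -u docker.service | grep -i cloudstor
~ $ docker plugin inspect cloudstor:aws

but pointers to anything more useful would be appreciated.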
