
Cloudstor EBS Volume Recreated on Service Restart, Original Volume Destroyed #176

dviator commented Oct 8, 2018

Expected behavior

When a swarm service crashes and restarts, it mounts the same cloudstor EBS volume it was using before it restarted.
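
For context, the service mounts the volume through the cloudstor:aws volume driver; our setup looks roughly like the following (the service and volume names here are illustrative, not our exact ones):

~ $ docker service create \
      --name jenkins \
      --mount type=volume,volume-driver=cloudstor:aws,source=jenkins-data,target=/var/jenkins_home \
      jenkins/jenkins:lts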

Actual behavior

A new EBS volume was created in AWS with the same CloudstorVolumeName tag. The new volume was mounted in the restarted service, a Jenkins master. As a result, the service lost access to its configuration data and came up as what appeared to be an entirely fresh instance.

At this point, the original volume was listed in AWS as 'available', while the newly created volume was listed as 'in-use'.

Unfortunately, while investigating, we restarted the service. This triggered the issue again, causing yet another new volume to be created and mounted by the service: the third volume in total.

At that point, I happened to see in the AWS console that the original volume containing our actual data had been destroyed and had disappeared from the console. The two EBS volumes with the same Cloudstor name are now both listed as 'in-use'. I am not entirely sure which one is actually mounted in the container, though I suspect it is the latest one.
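
Checking from the AWS side, a describe-volumes call filtered on the CloudstorVolumeName tag shows each duplicate's state and attachment (the tag value below is illustrative):

~ $ aws ec2 describe-volumes \
      --filters "Name=tag:CloudstorVolumeName,Values=jenkins-data" \
      --query "Volumes[].{Id:VolumeId,State:State,Attached:Attachments[0].InstanceId}" \
      --output table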

Information

~ $ docker-diagnose
OK hostname=ip-172-31-2-55-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-38-198-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-16-88-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-28-72-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-33-252-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-13-102-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-33-37-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
Done requesting diagnostics.
Your diagnostics session ID is 1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3

~ $ docker version
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:05:03 2017
OS/Arch: linux/amd64

Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:12:30 2017
OS/Arch: linux/amd64
Experimental: true

It may also be relevant that our swarm appears to be affected by the issue where docker events are never received in the swarm, described here: moby/moby#36834
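
For what it's worth, that symptom is easy to check on a manager node; something like the following stays silent for us even while services restart:

~ $ docker events --filter type=service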

Steps to reproduce the behavior

As this is our production swarm, and this issue may result in losing a data volume, I am very reluctant to reproduce it here. Until now, Cloudstor has behaved as expected in both our staging and production swarms.

I have used the workaround from issue #122 to take manual backup snapshots, which now appear to avoid deletion, but I'd like to take some time to be certain that these snapshots will stick around and be restorable in case this issue happens to any more of our volumes.
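
The snapshots themselves are ordinary EBS snapshots of the data volume, taken along these lines (the volume ID below is a placeholder):

~ $ aws ec2 create-snapshot \
      --volume-id vol-0123456789abcdef0 \
      --description "manual cloudstor backup: jenkins-data"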

Happy to provide more info as needed; I am not sure where to find the swarm system logs that would help deduce what exactly went wrong here.
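
My best guess for where to look is the engine journal and the managed plugin on each node, e.g.:

~ $ journalctl -u docker.service | grep -i cloudstor
~ $ docker plugin inspect cloudstor:aws

but pointers to anything more useful would be appreciated.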
