Expected behavior
When a swarm service crashes and restarts, it mounts the same cloudstor EBS volume it was using before it restarted.
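For context, the volume and service were created along roughly these lines (a minimal sketch only; the volume name, size, and image are illustrative placeholders, not our exact configuration):
~ $ docker volume create -d "cloudstor:aws" \
      --opt backing=relocatable --opt size=50 --opt ebstype=gp2 \
      jenkins_home
~ $ docker service create --name jenkins \
      --mount type=volume,volume-driver=cloudstor:aws,source=jenkins_home,target=/var/jenkins_home \
      jenkins/jenkins:lts
The expectation is that when the task dies and is rescheduled, it reattaches the same relocatable EBS volume rather than creating a new one.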
Actual behavior
A new EBS volume was created in AWS with the same CloudstorVolumeName as the original. The new volume was mounted into the restarted service, which happens to be a Jenkins master. As a result, the service lost access to its configuration data and came up as what appeared to be an entirely fresh instance.
At this point, the original volume was listed in AWS as 'available', while the newly created volume was listed as 'in-use'.
Unfortunately, while investigating the issue, we restarted the service. This reproduced the problem and caused yet another new volume to be created and mounted by the service, bringing the total to three volumes.
At this point I happened to see in the AWS console that the original volume, containing our actual data, had been destroyed and had disappeared from the console. The two remaining EBS volumes with the same cloudstor name are now both listed as 'in-use'. I am not entirely sure which one is actually mounted in the container, though I suspect it is the latest one.
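For reference, this is roughly how I have been comparing the volumes on the AWS side (a sketch; it assumes the CloudstorVolumeName shown in the console is an EBS tag, and the value is a placeholder for our real volume name):
~ $ aws ec2 describe-volumes \
      --filters "Name=tag:CloudstorVolumeName,Values=jenkins_home" \
      --query "Volumes[*].{Id:VolumeId,State:State,Created:CreateTime}" \
      --output table
Right now a query like this lists two volumes with the same name, both 'in-use', and no longer shows the original.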
Information
~ $ docker-diagnose
OK hostname=ip-172-31-2-55-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-38-198-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-16-88-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-28-72-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-33-252-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-13-102-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-33-37-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
Done requesting diagnostics.
Your diagnostics session ID is 1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
~ $ docker version
Client:
 Version:       17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built:         Wed Dec 27 20:05:03 2017
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:12:30 2017
  OS/Arch:      linux/amd64
  Experimental: true
It may also be relevant that our swarm appears to be suffering from the issue where we cannot receive docker events in the swarm, described here: moby/moby#36834.
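(For completeness, the symptom on our side is simply that something like the following returns nothing on a manager node, even while tasks are being started and stopped; the filter and time window are just an example:)
~ $ docker events --filter type=service --since 24h --until 1s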
Steps to reproduce the behavior
As this is our production swarm, and this issue may result in losing a data volume, I am very reluctant to try to reproduce it there. Until now, cloudstor has behaved as expected in both our staging and production swarms.
I have used the workaround from #122 to take manual backup snapshots, which so far appear to be escaping deletion, but I'd like some time to be certain that these snapshots will stick around and be restorable in case this issue hits any more of our volumes.
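Concretely, the manual backups are just plain EC2 snapshots of the underlying EBS volumes, something along these lines (a sketch with a placeholder volume ID; #122 has the actual workaround details):
~ $ aws ec2 create-snapshot \
      --volume-id vol-0123456789abcdef0 \
      --description "manual cloudstor backup: jenkins_home"
~ $ aws ec2 describe-snapshots --owner-ids self \
      --filters "Name=volume-id,Values=vol-0123456789abcdef0"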
I am happy to provide more info as needed; I am not sure where to find swarm system-related logs to deduce what exactly went wrong here.
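If it helps narrow things down, these are the kinds of things I can collect on request (assuming a typical systemd-based host; I am not sure how much of this applies to the Moby Linux nodes Docker for AWS uses, and 'jenkins' is a placeholder for our service name):
~ $ docker plugin ls                        # confirm the cloudstor:aws plugin is installed and enabled
~ $ docker service ps jenkins --no-trunc    # task history, including errors from rescheduling/restarts
~ $ journalctl -u docker --since "1 hour ago"   # engine and managed-plugin logs on systemd hosts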