
Disk Corruption #51

Closed
MMartyn opened this issue Sep 19, 2018 · 12 comments

Comments

MMartyn commented Sep 19, 2018

Not sure if this is the proper place to open an issue, since I'm unclear where the problem is actually happening, but we are experiencing disk corruption on our worker nodes. We have tried AMI versions v20, v23, and v24, and the root volumes showed corruption within anywhere from a couple of hours to four days.

$ df -hT /dev/xvda1
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/xvda1     xfs    20G  -64T   65T    - /
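For anyone who wants to catch this early, a rough detection check (just a sketch, assuming GNU df and that the corruption always surfaces as a negative Used value like the one above):

# Flag any filesystem whose reported "Used" bytes have gone negative
df -B1 --output=source,used,target | awk 'NR > 1 && $2 < 0 {print "negative usage:", $1, "mounted at", $3}'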

What ends up happening is that we start to see errors like this in the logs:

kubelet: W0919 14:54:08.962476    5823 image_gc_manager.go:285] available 70376773459968 is larger than capacity 21462233088

Then eventually the node runs out of disk space, fails instance status checks, and it gets replaced by the auto scaling group.
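For reference, the numbers in that kubelet warning line up with the bogus df output above (rough conversion, assuming both values are in bytes):

echo "70376773459968 / 2^40" | bc -l   # ~64 TiB "available", matching the nonsensical 65T Avail / -64T Used
echo "21462233088 / 2^30" | bc -l      # ~20 GiB capacity, matching the real 20G root volume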

Any help would be appreciated.

Thanks.

deiwin commented Sep 19, 2018

Same experience here. Haven't seen this on v22 so far, however.

@turbodawg

> Same experience here. Haven't seen this on v22 so far, however.

@deiwin How long have you been running v22 without encountering this issue?

deiwin commented Sep 20, 2018

A bit over a month. An illustration:
[screenshot: screen shot 2018-09-20 at 08 33 42]

deiwin commented Sep 20, 2018

Although v22 isn't all roses either: we've seen a somewhat related issue of Docker freezing up (node NotReady with "PLEG is not healthy", docker ps not responding, a Docker/node restart fixes it), with an accompanying overlayfs kernel panic akin to this one. Another illustration:
[screenshot: screen shot 2018-09-20 at 09 08 05]

We're currently experimenting with a different Docker version (18.06.1-ce) on v22 to see if we can somehow get a stable worker.
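For context, this is roughly what we do when a node gets into that state (a sketch; the node name and exact log wording will vary):

kubectl describe node "$NODE" | grep -i pleg               # confirm the "PLEG is not healthy" condition
sudo journalctl -k --since "1 hour ago" | grep -i overlay  # look for the accompanying overlayfs kernel trace
sudo systemctl restart docker                              # restarting Docker (or rebooting the node) clears it for us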

@brycecarman

What is running on the affected worker nodes? Does this only happen on heavily loaded nodes, or would it also happen on an idle node? Which instance types? Are there any messages in dmesg?
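For anyone gathering that information on an affected node, a quick sketch of the relevant commands:

curl -s http://169.254.169.254/latest/meta-data/instance-type   # instance type
uptime                                                          # rough load
dmesg -T | tail -n 100                                          # recent kernel messages
df -hT /                                                        # current root filesystem numbers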

MMartyn commented Oct 2, 2018

@brycecarman It does seem to happen more frequently on nodes that have more traffic and more churn in deployment images. Instance types are r4.xlarge and r4.2xlarge. Nothing of note in dmesg.

@micahhausler
Member

I'm wondering if it could be related to xfs, but ideally I'd like to reproduce this first. Do you have any reliable reproduction for this?

MMartyn commented Oct 10, 2018

I don't have a reliable way to test, which makes it hard to confirm both the problem and a fix. However, I converted the root fs to ext4 to see if I would still have issues, and my nodes are much more stable, so I believe it is xfs related. Regarding that link, I did check d_type and it was set correctly as far as I could tell.
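For reference, a couple of ways to check the d_type prerequisite mentioned in that link (sketch; exact output fields vary by version):

xfs_info / | grep ftype                    # ftype=1 means the filesystem records d_type, as overlay2 requires
docker info 2>/dev/null | grep -i d_type   # Docker reports "Supports d_type: true|false" for its storage driver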

@micahhausler
Member

@deiwin are you seeing any difference with ext4? Do you have a repro for this?

deiwin commented Oct 11, 2018

We also believe it has something to do with xfs, mostly because of this mail thread, which describes a very similar issue.

I haven't tried ext4 and don't have a repro, although it does fail reliably in our setup. We're currently using v22 of the AMI + Docker 18.06.1-ce, which has been a stable combination thus far. However, I have provided information about specific instances and occurrences to AWS support. They told me that the EKS team has passed the issue on to the EC2 team. Waiting to see where that gets us.

For what it's worth, according to this, a patch should be available in newer kernel versions (4.19) and should fix the effect, although the cause isn't known.
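A quick way to check what the nodes are actually running against that 4.19 target (sketch):

kubectl get nodes -o wide   # the KERNEL-VERSION column shows each node's kernel
uname -r                    # or check directly on a node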

kinghajj pushed a commit to AdvMicrogrid/amazon-eks-ami that referenced this issue on Oct 18, 2018:
> Because of issue awslabs#51, we may want to use a distro that defaults to ext4 instead of xfs, since the former tends to be more stable.

@lachlancooper

We have seen a similar disk corruption issue occur roughly once every two weeks in a cluster of six m5.xlarge nodes. The nodes are lightly loaded: ~5% CPU usage, ~50% memory usage, ~50 pods. We haven't been able to track down any causal pattern or unusual system logs; the corruption appears to occur at random.

This doesn't immediately impact service, only monitoring, so while we lack a repro we have obtained a snapshot of an affected volume. We use 100G root volumes, which after corruption are consistently reported as -16T Used and 17T Avail:

# df -hT /dev/nvme0n1p1
Filesystem      Type  Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  xfs   100G  -16T   17T    - /

Attaching the affected volume to a separate instance and running an xfs_repair fixes the corruption, with the relevant output being:

        - scan filesystem freespace and inode maps...
sb_fdblocks 4304471112, counted 9503816

Comparing those two numbers in binary:

sb_fdblocks 4304471112 = 100000000100100010000010001001000
counted        9503816 = 000000000100100010000010001001000

This looks like a bit-flip in the superblock free data block count, just as described in the above mail thread.
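A rough check that the difference really is a single flipped bit:

echo $(( 4304471112 - 9503816 ))   # 4294967296
echo $(( 1 << 32 ))                # also 4294967296, i.e. bit 32 of sb_fdblocks was flipped on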

The origin of that thread is https://phabricator.wikimedia.org/T199198, in which the authors appear to have narrowed down the issue to the free inode btree (finobt) feature. However, in our case it is already disabled:

# xfs_info /
meta-data=/dev/nvme0n1p1         isize=512    agcount=201, agsize=130943 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=26213883, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
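
For completeness, how the finobt feature from that ticket can be inspected, and how one would disable it when building a new filesystem (the device name below is only an example):

xfs_info / | grep finobt            # finobt=0 here, so that particular workaround doesn't apply to us
# mkfs.xfs -m finobt=0 /dev/xvdf1   # disabling it at mkfs time, if it were enabled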

@cartermckinnon
Member

This doesn't appear to be specifically related to the AMI, so I'm going to close it. Feel free to re-open or contact AWS support if you are still experiencing this issue.
