
Disk Corruption #51

Closed
MMartyn opened this issue Sep 19, 2018 · 12 comments

Comments

MMartyn commented Sep 19, 2018

Not sure if this is the proper place to open an issue, since I'm unclear where the problem is actually happening, but we are experiencing disk corruption on our worker nodes. We have tried AMI versions v20, v23, and v24, and the root volumes showed corruption within anywhere from a couple of hours to four days.

$ df -hT /dev/xvda1
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/xvda1     xfs    20G  -64T   65T    - /
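For anyone who wants to catch this early, a rough detection check (just a sketch, assuming GNU df and that the corruption always surfaces as a negative Used value like the one above):

# Flag any filesystem whose reported "Used" bytes have gone negative
df -B1 --output=source,used,target | awk 'NR > 1 && $2 < 0 {print "negative usage:", $1, "mounted at", $3}'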

What ends up happening is that we start to see errors like this in the logs:

kubelet: W0919 14:54:08.962476    5823 image_gc_manager.go:285] available 70376773459968 is larger than capacity 21462233088

Then eventually the node runs out of disk space, fails instance status checks, and it gets replaced by the auto scaling group.
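For reference, the numbers in that kubelet warning line up with the bogus df output above (rough conversion, assuming both values are in bytes):

echo "70376773459968 / 2^40" | bc -l   # ~64 TiB "available", matching the nonsensical 65T Avail / -64T Used
echo "21462233088 / 2^30" | bc -l      # ~20 GiB capacity, matching the real 20G root volume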

Any help would be appreciated.

Thanks.

deiwin commented Sep 19, 2018

Same experience here. Haven't seen this on v22 so far, however.

@turbodawg

> Same experience here. Haven't seen this on v22 so far, however.

@deiwin How long have you been running v22 without encountering this issue?

deiwin commented Sep 20, 2018

A bit over a month. An illustration:
[screenshot: screen shot 2018-09-20 at 08 33 42]

deiwin commented Sep 20, 2018

Although v22 isn't all roses either: we've seen a somewhat related issue of Docker freezing up (node NotReady with "PLEG is not healthy", docker ps not responding, a Docker/node restart fixes it), with an accompanying overlayfs kernel panic akin to this one. Another illustration:
[screenshot: screen shot 2018-09-20 at 09 08 05]

We're currently experimenting with a different Docker version (18.06.1-ce) on v22 to see if we can somehow get a stable worker.
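For context, this is roughly what we do when a node gets into that state (a sketch; the node name and exact log wording will vary):

kubectl describe node "$NODE" | grep -i pleg               # confirm the "PLEG is not healthy" condition
sudo journalctl -k --since "1 hour ago" | grep -i overlay  # look for the accompanying overlayfs kernel trace
sudo systemctl restart docker                              # restarting Docker (or rebooting the node) clears it for us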

@brycecarman

What is running on the affected worker nodes? Does this only happen on heavily loaded nodes, or would it also happen on an idle node? Which instance types? Are there any messages in dmesg?
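For anyone gathering that information on an affected node, a quick sketch of the relevant commands:

curl -s http://169.254.169.254/latest/meta-data/instance-type   # instance type
uptime                                                          # rough load
dmesg -T | tail -n 100                                          # recent kernel messages
df -hT /                                                        # current root filesystem numbers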

MMartyn commented Oct 2, 2018

@brycecarman It does seem to happen more frequently on nodes that have more traffic and more churn in deployment images. Instance types are r4.xlarge and r4.2xlarge. Nothing of note in dmesg.

@micahhausler
Member

I'm wondering if it could be related to xfs, but ideally I'd like to reproduce this first. Do you have any reliable reproduction for this?

MMartyn commented Oct 10, 2018

I don't have a reliable way to test, which makes it hard to confirm both the problem and a fix. However, I converted the root fs to ext4 to see if I would still have issues, and my nodes are much more stable, so I believe it is xfs related. Regarding that link, I did check d_type and it was set correctly as far as I could tell.
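For reference, a couple of ways to check the d_type prerequisite mentioned in that link (sketch; exact output fields vary by version):

xfs_info / | grep ftype                    # ftype=1 means the filesystem records d_type, as overlay2 requires
docker info 2>/dev/null | grep -i d_type   # Docker reports "Supports d_type: true|false" for its storage driver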

@micahhausler
Member

@deiwin are you seeing any difference with ext4? Do you have a repro for this?

deiwin commented Oct 11, 2018

We also believe it has something to do with xfs, mostly because of this mail thread, which describes a very similar issue.

I haven't tried ext4 and don't have a repro, although it does fail reliably in our setup. We're currently using v22 of the AMI + Docker 18.06.1-ce, which has been a stable combination thus far. However, I have provided information about specific instances and occurrences to AWS support. They told me that the EKS team has passed the issue on to the EC2 team. Waiting to see where that gets us.

For what it's worth, according to this, a patch should be available in newer kernel versions (4.19) and should fix the effect, although the cause isn't known.
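A quick way to check what the nodes are actually running against that 4.19 target (sketch):

kubectl get nodes -o wide   # the KERNEL-VERSION column shows each node's kernel
uname -r                    # or check directly on a node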

kinghajj pushed a commit to AdvMicrogrid/amazon-eks-ami that referenced this issue on Oct 18, 2018:
> Because of issue awslabs#51, we may want to use a distro that defaults to ext4 instead of xfs, since the former tends to be more stable.

@lachlancooper

We have seen a similar disk corruption issue occur roughly once every two weeks in a cluster of six m5.xlarge nodes. The nodes are lightly loaded: ~5% CPU usage, ~50% memory usage, ~50 pods. We haven't been able to track down any causal pattern or unusual system logs; the corruption appears to occur at random.

This doesn't immediately impact service, only monitoring, so while we lack a repro we have obtained a snapshot of an affected volume. We use 100G root volumes, which after corruption are consistently reported as -16T Used and 17T Avail:

# df -hT /dev/nvme0n1p1
Filesystem      Type  Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  xfs   100G  -16T   17T    - /

Attaching the affected volume to a separate instance and running an xfs_repair fixes the corruption, with the relevant output being:

        - scan filesystem freespace and inode maps...
sb_fdblocks 4304471112, counted 9503816

Comparing those two numbers in binary:

sb_fdblocks 4304471112 = 100000000100100010000010001001000
counted        9503816 = 000000000100100010000010001001000

This looks like a bit-flip in the superblock free data block count, just as described in the above mail thread.
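A rough check that the difference really is a single flipped bit:

echo $(( 4304471112 - 9503816 ))   # 4294967296
echo $(( 1 << 32 ))                # also 4294967296, i.e. bit 32 of sb_fdblocks was flipped on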

The origin of that thread is https://phabricator.wikimedia.org/T199198, in which the authors appear to have narrowed down the issue to the free inode btree (finobt) feature. However, in our case it is already disabled:

# xfs_info /
meta-data=/dev/nvme0n1p1         isize=512    agcount=201, agsize=130943 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=26213883, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
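
For completeness, how the finobt feature from that ticket can be inspected, and how one would disable it when building a new filesystem (the device name below is only an example):

xfs_info / | grep finobt            # finobt=0 here, so that particular workaround doesn't apply to us
# mkfs.xfs -m finobt=0 /dev/xvdf1   # disabling it at mkfs time, if it were enabled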

@cartermckinnon
Member

This doesn't appear to be specifically related to the AMI, so I'm going to close it. Feel free to re-open or contact AWS support if you are still experiencing this issue.
