
[Release-1.27] - K3s etcd snapshot reconcile consumes excessive memory when a large number of snapshots are present #10562

Closed
brandond opened this issue Jul 24, 2024 · 1 comment

brandond commented Jul 24, 2024

Backport fix for

@brandond brandond changed the title [Release-1.27] - Multiple simultaneous snapshots result in silent failure and/or corruption of at least one snapshot [Release-1.27] - K3s etcd snapshot reconcile consumes excessive memory when a large number of snapshots are present Jul 24, 2024
@brandond brandond assigned brandond and unassigned aganesh-suse Jul 24, 2024
@caroline-suse-rancher caroline-suse-rancher added this to the v1.27.17+k3s1 milestone Jul 29, 2024
@endawkins endawkins assigned fmoral2 and aganesh-suse and unassigned fmoral2 Aug 13, 2024
@aganesh-suse

Validated on release-1.27 branch with commit 112a185

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Setup Size: 4 GB memory, 2 vCPU, 30 GB disk.

Cluster Configuration:

HA: 3 servers / 1 agent

Config.yaml:

token: xxxx
cluster-init: true
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
node-label:
- k3s-upgrade=server

etcd-snapshot-retention: 255
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: <access_key>
etcd-s3-secret-key: <secret_key>
etcd-s3-bucket: <bucket>
etcd-s3-folder: <folder>
etcd-s3-region: <region>
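
With etcd-s3 enabled in the config.yaml above, it is worth confirming that snapshots actually land in the bucket before starting the memory test. A minimal sketch, assuming the server loads this config.yaml and the AWS CLI has credentials for the same bucket (<bucket> and <folder> are the placeholders from the config); the first command lists snapshots known to the server, the second counts the objects in the bucket directly:

$ sudo k3s etcd-snapshot ls
$ aws s3 ls s3://<bucket>/<folder>/ | wc -l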

Testing Steps

  1. Copy config.yaml:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s
  2. Install k3s:
curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='112a185c346faafc8e2e5fe68fdb52e33aafabbd' sh -s - server
  3. Verify cluster status:
kubectl get nodes -o wide
kubectl get pods -A
  4. Apply the k3s etcd snapshot extra metadata:
kubectl apply -f https://gist.githubusercontent.com/aganesh-suse/52c3d6c3d7fe70141fa3a49431ac0032/raw/20039a159ab0f5fce1930f5ec12f6afc2b034784/k3s-etcd-snapshot-extra-metadata.yaml
  5. Monitor the memory usage of k3s.service while taking snapshots every minute, for up to 255 snapshots, using the loop and helper functions below (a combined usage sketch follows them):
# Take ON_DEMAND_SNAPSHOT_COUNT on-demand snapshots, pausing briefly between each
for (( i = 0; i < ON_DEMAND_SNAPSHOT_COUNT; i++ )); do
    sudo k3s etcd-snapshot save
    sleep 5
done

# Sample the k3s lines from top once per second, appending to a log file
write_mem_usage_k3s_to_file () {
    while true; do
        top -b -n 1 | grep k3s | tee -a top-output.log
        sleep 1
    done
}

# Live-plot the k3s server's resident set size (RSS) in MiB
ttyplot_k3s_memory () {
    K3S_PID=$(pgrep -o k3s)   # oldest k3s process, so grep itself is never matched
    while :; do grep -oP '^VmRSS:\s+\K\d+' "/proc/$K3S_PID/status" \
    | numfmt --from-unit Ki --to-unit Mi; sleep 1; done | ttyplot -u Mi
}
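
Putting the loop and the sampler together, one way to drive a run end to end and derive the Max/Avg figures reported below is sketched here. The awk field positions are an assumption about the default top batch layout (RES in column 6, reported in KiB; %MEM in column 10), so treat this as a sketch rather than the exact harness used:

ON_DEMAND_SNAPSHOT_COUNT=255

write_mem_usage_k3s_to_file &   # start the background sampler
MON_PID=$!

for (( i = 0; i < ON_DEMAND_SNAPSHOT_COUNT; i++ )); do
    sudo k3s etcd-snapshot save
    sleep 5
done

kill "$MON_PID"

# Summarize the log: peak and average RSS in MB, plus peak %MEM
awk '{ if ($6 > max) max = $6; sum += $6; n++; if ($10 > pm) pm = $10 }
     END { printf "Max %d MB  Avg %d MB  Peak %%MEM %.1f\n",
                  max / 1024, sum / n / 1024, pm }' top-output.log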

P.S.: We run out of disk space before running out of memory at around 280 snapshots, so the snapshot count is capped at 255 for testing purposes.
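
Since disk fills up before memory does, disk headroom is worth watching alongside RSS during the run. A minimal sketch, assuming the default k3s data directory and snapshot location:

$ df -h /var/lib/rancher
$ sudo du -sh /var/lib/rancher/k3s/server/db/snapshots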

Replication Results:

  • k3s version used for replication:
$ k3s -v
k3s version v1.27.16+k3s1 (a45cef49)
go version go1.22.5

Validation Results:

  • k3s version used for validation:
$ k3s -v
k3s version v1.27.16+k3s-112a185c (112a185c)
go version go1.22.5

Memory Usage Comparison Results

Comparison of %MEM, maximum memory (MB), and average memory (MB) between the released version (v1.27.16) and the release-1.27 commit (112a185c) for snapshot counts from 100 to 255:

| Snapshots | v1.27.16 %MEM | v1.27.16 Max (MB) | v1.27.16 Avg (MB) | 112a185c %MEM | 112a185c Max (MB) | 112a185c Avg (MB) |
|-----------|---------------|-------------------|-------------------|---------------|-------------------|-------------------|
| 100       | 47            | 2049              | 1842              | 41            | 1655              | 1555              |
| 120       | 53            | 2073              | 1980              | 48            | 1914              | 1770              |
| 150       | 62            | 2451              | 2336              | 56            | 2175              | 2040              |
| 170       | 63            | 2655              | 2475              | 59            | 2342              | 2248              |
| 200       | 70            | 2737              | 2416              | 52            | 2080              | 1927              |
| 230       | 76            | 3045              | 2855              | 56            | 2412              | 2170              |
| 250       | 83            | 3240              | 3022              | 60            | 2726              | 2455              |
| 255       | 83            | 3240              | 3120              | 70            | 2846              | 2547              |

Observations

Up to about 120 snapshots, the new commit uses roughly 2% less memory than the released version. On average, the gap widens to about 5% up to 200 snapshots, and to 10% or more beyond 200 snapshots (conservatively). For example, at 255 snapshots the average memory drops from 3120 MB to 2547 MB, a reduction of roughly 18%.
