
[Release-1.27] - K3s etcd snapshot reconcile consumes excessive memory when a large number of snapshots are present #10562

Closed
brandond opened this issue Jul 24, 2024 · 1 comment

brandond commented Jul 24, 2024

Backport fix for

@brandond brandond changed the title [Release-1.27] - Multiple simultaneous snapshots result in silent failure and/or corruption of at least one snapshot [Release-1.27] - K3s etcd snapshot reconcile consumes excessive memory when a large number of snapshots are present Jul 24, 2024
@brandond brandond assigned brandond and unassigned aganesh-suse Jul 24, 2024
@caroline-suse-rancher caroline-suse-rancher added this to the v1.27.17+k3s1 milestone Jul 29, 2024
@endawkins endawkins assigned fmoral2 and aganesh-suse and unassigned fmoral2 Aug 13, 2024
@aganesh-suse

Validated on release-1.27 branch with commit 112a185

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Setup Size: 4 GB memory, 2 vCPU, 30 GB disk.

Cluster Configuration:

HA: 3 servers / 1 agent

Config.yaml:

token: xxxx
cluster-init: true
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
node-label:
- k3s-upgrade=server

etcd-snapshot-retention: 255
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: <access_key>
etcd-s3-secret-key: <secret_key>
etcd-s3-bucket: <bucket>
etcd-s3-folder: <folder>
etcd-s3-region: <region>
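
With etcd-s3 enabled in the config.yaml above, it is worth confirming that snapshots actually land in the bucket before starting the memory test. A minimal sketch, assuming the server loads this config.yaml and the AWS CLI has credentials for the same bucket (<bucket> and <folder> are the placeholders from the config); the first command lists snapshots known to the server, the second counts the objects in the bucket directly:

$ sudo k3s etcd-snapshot ls
$ aws s3 ls s3://<bucket>/<folder>/ | wc -l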

Testing Steps

  1. Copy config.yaml:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s
  2. Install k3s:
curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='112a185c346faafc8e2e5fe68fdb52e33aafabbd' sh -s - server
  3. Verify cluster status:
kubectl get nodes -o wide
kubectl get pods -A
  4. Apply the k3s etcd snapshot extra metadata:
kubectl apply -f https://gist.githubusercontent.com/aganesh-suse/52c3d6c3d7fe70141fa3a49431ac0032/raw/20039a159ab0f5fce1930f5ec12f6afc2b034784/k3s-etcd-snapshot-extra-metadata.yaml
  5. Monitor the memory usage of k3s.service while taking snapshots every minute, for up to 255 snapshots, using the loop and helper functions below (a combined usage sketch follows them):
# Take ON_DEMAND_SNAPSHOT_COUNT on-demand snapshots, pausing briefly between each
for (( i = 0; i < ON_DEMAND_SNAPSHOT_COUNT; i++ )); do
    sudo k3s etcd-snapshot save
    sleep 5
done

# Sample the k3s lines from top once per second, appending to a log file
write_mem_usage_k3s_to_file () {
    while true; do
        top -b -n 1 | grep k3s | tee -a top-output.log
        sleep 1
    done
}

# Live-plot the k3s server's resident set size (RSS) in MiB
ttyplot_k3s_memory () {
    K3S_PID=$(pgrep -o k3s)   # oldest k3s process, so grep itself is never matched
    while :; do grep -oP '^VmRSS:\s+\K\d+' "/proc/$K3S_PID/status" \
    | numfmt --from-unit Ki --to-unit Mi; sleep 1; done | ttyplot -u Mi
}
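
Putting the loop and the sampler together, one way to drive a run end to end and derive the Max/Avg figures reported below is sketched here. The awk field positions are an assumption about the default top batch layout (RES in column 6, reported in KiB; %MEM in column 10), so treat this as a sketch rather than the exact harness used:

ON_DEMAND_SNAPSHOT_COUNT=255

write_mem_usage_k3s_to_file &   # start the background sampler
MON_PID=$!

for (( i = 0; i < ON_DEMAND_SNAPSHOT_COUNT; i++ )); do
    sudo k3s etcd-snapshot save
    sleep 5
done

kill "$MON_PID"

# Summarize the log: peak and average RSS in MB, plus peak %MEM
awk '{ if ($6 > max) max = $6; sum += $6; n++; if ($10 > pm) pm = $10 }
     END { printf "Max %d MB  Avg %d MB  Peak %%MEM %.1f\n",
                  max / 1024, sum / n / 1024, pm }' top-output.log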

P.S.: We run out of disk space before running out of memory at around 280 snapshots, so the snapshot count is capped at 255 for testing purposes.
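
Since disk fills up before memory does, disk headroom is worth watching alongside RSS during the run. A minimal sketch, assuming the default k3s data directory and snapshot location:

$ df -h /var/lib/rancher
$ sudo du -sh /var/lib/rancher/k3s/server/db/snapshots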

Replication Results:

  • k3s version used for replication:
$ k3s -v
k3s version v1.27.16+k3s1 (a45cef49)
go version go1.22.5

Validation Results:

  • k3s version used for validation:
$ k3s -v
k3s version v1.27.16+k3s-112a185c (112a185c)
go version go1.22.5

Memory Usage Comparison Results

Comparison of %MEM, maximum memory (MB), and average memory (MB) between the released version (v1.27.16) and the release-1.27 commit (112a185c) for snapshot counts from 100 to 255:

| Snapshots | v1.27.16 %MEM | v1.27.16 Max (MB) | v1.27.16 Avg (MB) | 112a185c %MEM | 112a185c Max (MB) | 112a185c Avg (MB) |
|-----------|---------------|-------------------|-------------------|---------------|-------------------|-------------------|
| 100       | 47            | 2049              | 1842              | 41            | 1655              | 1555              |
| 120       | 53            | 2073              | 1980              | 48            | 1914              | 1770              |
| 150       | 62            | 2451              | 2336              | 56            | 2175              | 2040              |
| 170       | 63            | 2655              | 2475              | 59            | 2342              | 2248              |
| 200       | 70            | 2737              | 2416              | 52            | 2080              | 1927              |
| 230       | 76            | 3045              | 2855              | 56            | 2412              | 2170              |
| 250       | 83            | 3240              | 3022              | 60            | 2726              | 2455              |
| 255       | 83            | 3240              | 3120              | 70            | 2846              | 2547              |

Observations

Up to about 120 snapshots, the new commit uses roughly 2% less memory than the released version. On average, the gap widens to about 5% up to 200 snapshots, and to 10% or more beyond 200 snapshots (conservatively). For example, at 255 snapshots the average memory drops from 3120 MB to 2547 MB, a reduction of roughly 18%.
