Investigate effective storage limits for completed runs #3175

imjasonh · 2020-09-08T15:37:38Z

We know that storing details of completed Runs eventually results in too many stored resources/bytes and unresponsive behavior from etcd and the K8s API server. We don't have a good idea exactly how many resources/bytes it takes to start causing problems.

We should explore this on a standard GKE cluster and document (even if it's just in this issue) our findings about what symptoms we observed, roughly how many resources it took to see them, etc.

If anybody else has experienced this on their own clusters and could contribute data, even anecdata, that would be helpful.

Related #454
cc @wlynch

wlynch · 2020-09-08T16:55:23Z

tektoncd/experimental#479 is also related for a cronjob to clean up these completed resources.

psschwei · 2020-10-08T22:56:25Z

Anecdotally, I started seeing some minor sluggishness with about 1600 completed taskruns (this was doing an artificial test on minikube, where I just spammed it with new taskruns of the "hello world" task from the tutorial).

Going to spend some time over the next couple of days doing some more rigorous testing and documenting the results in this issue.
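
For anyone who wants to reproduce this kind of load test, here is a minimal sketch of generating a batch of TaskRun manifests. It assumes the tutorial Task is installed under the name `hello` (a hypothetical name, adjust to whatever your Task is called) and that the cluster serves the `tekton.dev/v1beta1` API. The helper names `make_taskrun` and `make_batch` are made up for illustration.

```python
import json


def make_taskrun(task_name: str = "hello", namespace: str = "default") -> dict:
    """Build one TaskRun manifest that references an existing Task.

    `task_name` is an assumption: it must match a Task already installed
    in the namespace. `generateName` lets the API server pick a unique
    suffix, so the same manifest can be submitted many times.
    """
    return {
        "apiVersion": "tekton.dev/v1beta1",
        "kind": "TaskRun",
        "metadata": {
            "generateName": f"{task_name}-run-",
            "namespace": namespace,
        },
        "spec": {"taskRef": {"name": task_name}},
    }


def make_batch(count: int, task_name: str = "hello", namespace: str = "default") -> str:
    """Return a JSON-encoded v1 List holding `count` TaskRun manifests."""
    if count < 0:
        raise ValueError("count must be non-negative")
    items = [make_taskrun(task_name, namespace) for _ in range(count)]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items})
```

Printing `make_batch(100)` and piping the result into `kubectl create -f -` should submit 100 runs. Use `create` rather than `apply`, since `generateName` is not supported by `apply`.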

psschwei · 2020-10-08T23:42:05Z

/assign @psschwei

tekton-robot · 2021-01-07T14:16:07Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot · 2021-02-06T14:19:08Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot · 2021-05-07T13:40:44Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot · 2021-05-07T13:40:45Z

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

imjasonh · 2021-07-28T16:29:38Z

/reopen
/lifecycle frozen

Excessive etcd storage is still an issue when results and pruning aren't configured.

We should run some tests to get a rough idea of how many resources can be reliably stored in etcd at different resource levels, both as documented guidance for operators and as a sales pitch for enabling results and/or pruning to avoid these issues.

While we're doing this, we should collect some symptoms of an overloaded cluster: what behavior does the cluster exhibit under excessive etcd load, and what error messages can people google to find our docs?
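
As a starting point for that guidance, here is a rough back-of-the-envelope sketch of an upper bound on stored objects. It assumes etcd's default backend quota of 2 GiB (a managed control plane may be configured differently) and takes the average stored size per Run as an input you would have to measure yourself. `max_objects` and `headroom` are hypothetical names for illustration only.

```python
# etcd's default backend quota (--quota-backend-bytes) is 2 GiB.
ETCD_DEFAULT_QUOTA_BYTES = 2 * 1024**3


def max_objects(
    avg_object_bytes: int,
    quota_bytes: int = ETCD_DEFAULT_QUOTA_BYTES,
    headroom: float = 0.5,
) -> int:
    """Estimate how many objects fit in a fraction of the etcd quota.

    avg_object_bytes: measured average stored size of one Run (assumption:
        you supply this from your own cluster).
    headroom: fraction of the quota budgeted for Runs; the rest is left for
        other resources and for revisions not yet compacted.
    """
    if avg_object_bytes <= 0:
        raise ValueError("avg_object_bytes must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    return int(quota_bytes * headroom) // avg_object_bytes
```

This is only a ceiling on raw bytes. The sluggishness reported earlier in this thread showed up at about 1600 completed TaskRuns, so symptoms such as slow list calls likely appear well before the quota is reached.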

bobcatfish added the kind/design label Sep 15, 2020

tekton-robot assigned psschwei Oct 8, 2020

afrittoli added the area/performance label Oct 9, 2020

tekton-robot added the lifecycle/stale label Jan 7, 2021

tekton-robot added lifecycle/rotten and removed lifecycle/stale labels Feb 6, 2021

psschwei removed their assignment Feb 11, 2021

tekton-robot closed this as completed May 7, 2021

imjasonh reopened this Jul 28, 2021

tekton-robot added lifecycle/frozen and removed lifecycle/rotten labels Jul 28, 2021
