Investigate effective storage limits for completed runs #3175

Open
imjasonh opened this issue Sep 8, 2020 · 8 comments
Labels
area/performance: Issues or PRs that are related to performance aspects.
kind/design: Categorizes issue or PR as related to design.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@imjasonh
Member

imjasonh commented Sep 8, 2020

We know that storing details of completed Runs eventually results in too many stored resources/bytes and unresponsive behavior from etcd and the K8s API server. We don't have a good idea of exactly how many resources/bytes it takes to start causing problems.

We should explore this on a standard GKE cluster and document (even if it's just in this issue) our findings about what symptoms we observed, roughly how many resources it took to see them, etc.

If anybody else has experienced this on their own clusters and could contribute data, even anecdata, that would be helpful.

Related #454
cc @wlynch

@wlynch
Member

wlynch commented Sep 8, 2020

tektoncd/experimental#479 is also related: it concerns a cronjob to clean up these completed resources.

@bobcatfish added the kind/design label Sep 15, 2020
@psschwei
Contributor

psschwei commented Oct 8, 2020

Anecdotally, I started seeing some minor sluggishness with about 1600 completed taskruns (this was doing an artificial test on minikube, where I just spammed it with new taskruns of the "hello world" task from the tutorial).

Going to spend some time over the next couple of days doing some more rigorous testing and documenting the results in this issue.
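For anyone wanting to repeat this kind of artificial test, here is a minimal sketch of generating a batch of TaskRun manifests to submit in one go. It is not the script used above. The Task name `hello`, the `load-test-` name prefix, the `load-test` label, and the `tekton.dev/v1beta1` API version are all assumptions; adjust them to whatever is installed on your cluster.

```python
import json


def make_taskrun(index: int, task_name: str = "hello", namespace: str = "default") -> dict:
    """Build one TaskRun manifest as a plain dict.

    The task name, name prefix, and label are hypothetical placeholders.
    """
    return {
        "apiVersion": "tekton.dev/v1beta1",  # assumed API version
        "kind": "TaskRun",
        "metadata": {
            "name": f"load-test-{index:05d}",
            "namespace": namespace,
            # Label makes it easy to select and delete the test runs later.
            "labels": {"load-test": "storage-limits"},
        },
        "spec": {"taskRef": {"name": task_name}},
    }


def make_batch(count: int, task_name: str = "hello", namespace: str = "default") -> dict:
    """Wrap `count` TaskRun manifests in a single v1 List object."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [make_taskrun(i, task_name, namespace) for i in range(count)],
    }


if __name__ == "__main__":
    # Emit JSON on stdout so it can be piped to kubectl.
    print(json.dumps(make_batch(100)))
```

Saved as, say, `gen_taskruns.py`, the output could be piped to `kubectl create -f -`, and the runs removed afterwards with `kubectl delete taskrun -l load-test=storage-limits`.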

@psschwei
Contributor

psschwei commented Oct 8, 2020

/assign @psschwei

@afrittoli added the area/performance label Oct 9, 2020
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/stale label Jan 7, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 6, 2021
@psschwei removed their assignment Feb 11, 2021
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.


@imjasonh
Member Author

/reopen
/lifecycle frozen

Excessive etcd storage is still an issue when results and pruning aren't configured.

We should run some tests to get a rough idea of how many resources can be reliably stored in etcd at different resource levels, both as documented guidance to operators and as a sales pitch for enabling results and/or pruning to avoid these issues.
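As a starting point before any real measurements, a back-of-the-envelope estimate can be sketched as below. This is only an illustration, not a finding: it assumes etcd's default backend quota of 2 GiB (managed clusters may be configured differently), uses the compact JSON size of an object as a stand-in for its stored size, and ignores revision history kept until compaction, so real limits will likely be lower. The `headroom` parameter and function names are hypothetical.

```python
import json

# etcd's default backend quota is 2 GiB; your cluster may use another value.
ETCD_DEFAULT_QUOTA_BYTES = 2 * 1024**3


def serialized_size(obj: dict) -> int:
    """Approximate stored size of an object as its compact JSON byte length."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def max_objects(
    avg_object_bytes: int,
    quota_bytes: int = ETCD_DEFAULT_QUOTA_BYTES,
    headroom: float = 0.5,
) -> int:
    """Estimate how many objects fit in `headroom` of the quota.

    `headroom` is the fraction of the quota budgeted for completed runs;
    the rest is left for everything else stored in the cluster.
    """
    if avg_object_bytes <= 0:
        raise ValueError("avg_object_bytes must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    return int(quota_bytes * headroom) // avg_object_bytes
```

For example, `max_objects(serialized_size(some_taskrun))`, where `some_taskrun` is a completed TaskRun fetched as JSON, gives a rough upper bound to compare against whatever the tests actually observe.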

While we're doing this, we should collect some symptoms of an overloaded cluster (what behavior does the cluster exhibit under excessive etcd load, and what error messages can people google to find our docs?).

@imjasonh reopened this Jul 28, 2021
@tekton-robot added the lifecycle/frozen label and removed the lifecycle/rotten label Jul 28, 2021