UI becomes unresponsive when there are a large number of completed/running taskruns #2324
Thanks for reporting this @rouke-broersma. It's a known problem when there are a large number of resources on the cluster and has been discussed many times in the past, both in GitHub issues and on Slack. We're currently facing some limitations of the Kubernetes API, although we do have some plans to address this. See #1978 (comment) for some information on how we currently manage this in our own dogfooding cluster by removing older runs, as well as plans for integration with Tekton Results which we hope will allow us to address some of the concerns about pagination, large numbers of resources, etc.
Sorry, I looked through the open issues and could not find anything.
No problem at all. It took me a while to find them and I already knew what I was looking for 😅
On my end it seems like the issue is not necessarily that we have to wait a long time for the response from the k8s API, but rather that rendering all the items is the problem. Could the pagination not be performed client side instead, so that at least the items are not all rendered?
We tried that in the past but it didn't make a difference in terms of perceived performance. It may be worth investigating again though, since we have made a large number of other performance-related and client architectural changes in the meantime. Can you share some more details about the number/size of the resources you're working with, the total response size for the list, or some examples? This could be helpful when we try to reproduce the issue and test any potential improvements. If you could include request/response time and time to load in the UI that would be great. It will likely be towards the end of next week before I can dedicate much time to this.
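The client-side pagination idea discussed here boils down to fetching the full list once but only rendering one page of rows at a time. A minimal sketch (function names are illustrative, not the Dashboard's actual API):

```javascript
// Slice a full, already-fetched list down to a single page of items,
// so only that page's rows are ever rendered in the DOM.
function getPage(items, pageNumber, pageSize = 50) {
  const start = (pageNumber - 1) * pageSize;
  return items.slice(start, start + pageSize);
}

// Total number of pages for the pager control (at least 1, even when empty).
function totalPages(items, pageSize = 50) {
  return Math.max(1, Math.ceil(items.length / pageSize));
}
```

With ~2000 TaskRuns and a page size of 50, the browser renders 50 rows instead of 2000, which is exactly the kind of DOM reduction the workaround script below achieves by brute force.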
The size of the taskrun response is 3MB according to Google Chrome and takes 12 seconds to load. This contains about 2000 items, I think. I don't think it causes many problems when there are no taskruns running, but when taskruns are running the tab becomes unresponsive and Chrome tries to kill it. We have at most about 20 taskruns running at the same time, but 10 at the same time is more common. My coworker has tried to work around this with a client-side script that throws away all table rows other than the last 100, and he claims he does not have any tab freezing issues.
Interesting. Do your Tasks have many steps? Are they short-lived or longer running? Would it be possible to share your colleague's script or provide more details about what it does / how often it runs? This would help narrow down the types of change that could be most beneficial for your use case, and provide a baseline for comparison of any performance improvements we might make. |
The taskruns have about 5-10 steps and run between 1 and 30 minutes depending on the type of taskrun (split by namespace). The majority of taskruns are of a shorter duration (a couple of minutes at most) and only once in a while do we have a taskrun that runs for up to 30 minutes. @maartengo could you provide the details about your modifications?
The script can be found here: https://pastebin.com/Mi2HNwT9 In short it:
I haven't tested what the performance would be if there was some actual pagination instead of removing the elements. |
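A hedged sketch of the kind of row-trimming workaround described above: periodically drop every table row except the last N, so the browser never has to keep thousands of DOM nodes alive. The selector, interval, and row limit here are assumptions for illustration, not taken from the actual pastebin script.

```javascript
const MAX_ROWS = 100;

// Pure helper: given an ordered list of rows, return the ones to discard
// (everything except the last maxRows).
function rowsToRemove(rows, maxRows = MAX_ROWS) {
  return rows.length > maxRows ? rows.slice(0, rows.length - maxRows) : [];
}

// Browser-only glue: trim the runs table every few seconds.
// Guarded so the helper above can also run outside a browser.
if (typeof document !== 'undefined') {
  setInterval(() => {
    // Hypothetical selector; a real script would target the Dashboard's list table.
    const rows = Array.from(document.querySelectorAll('table tbody tr'));
    rowsToRemove(rows).forEach((row) => row.remove());
  }, 5000);
}
```

This trades away access to older rows in exchange for responsiveness, which matches the observation that the freezes disappeared once only ~100 rows remained.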
Thanks, this should be very helpful in trying to reproduce the issue and testing any potential improvements. |
Initial testing with #2327 against our dogfooding cluster shaves ~0.5s off a 3s load time for 1300 TaskRuns (<1MB), so at least it's heading in the right direction… I'll need to increase the number of resources and have some in progress to get a proper feel for what impact this might have for your use case, but at least it's not slower 😄 I'll see if I can publish a test release later today containing the change; otherwise feel free to pull my branch and build it locally. If this works out I'll need to clean up the change a bit, and we'll likely apply it to all pages (or at least TaskRuns + PipelineRuns to start with).
Published test release |
Good news! Although the page load time is still ~5 seconds, the page is actually responsive afterwards! This is a huge improvement over the regular freezes we used to have. |
@AlanGreene Is there a way to move forward with this change? It would be very useful to us :) |
I haven't had time to work on this recently but will pick it up again soon. The existing PR needs a bit of cleanup and some tests, then we should be able to apply the same changes to PipelineRuns. I'll update the PR by middle of next week 🤞 |
I've updated the PR with a quick first pass at adding pagination to all list pages using a slightly different approach. I'll finish cleaning it up early next week and make sure all tests are passing before marking it ready for review. |
Client-side pagination is now available on all list pages in the latest nightly release, e.g. https://storage.googleapis.com/tekton-releases-nightly/dashboard/previous/v20220426-14cce744d6/tekton-dashboard-release.yaml Thanks again @rouke-broersma @maartengo for reporting the issue and helping to validate the change. This will be included in the next Dashboard release, v0.26.0 due May 5 - 10. |
That's awesome thank you! |
Describe the bug
The browser crashes due to the number of TaskRuns loaded at the same time
Expected behaviour
The browser does not crash simply from opening the page
Steps to reproduce the bug
Have ~1500-2000 completed and/or running TaskRuns; this likely also happens for other resource types
Environment details
- Kubernetes: AKS 1.22.6
- Install method: Helm chart
- Cloud provider: AKS
- Dashboard version: v0.24.1
- Pipelines version: v0.33.2
- Dashboard namespace: tekton-pipelines
- Pipelines namespace: tekton-pipelines
Additional Info
Some pagination would probably be helpful