
UI becomes unresponsive when there are a large number of completed/running taskruns #2324

Closed
rouke-broersma opened this issue Mar 11, 2022 · 18 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@rouke-broersma

Describe the bug

The browser crashes due to the number of TaskRuns loaded at the same time

Expected behaviour

The browser does not crash simply from opening the page

Steps to reproduce the bug

Have ~1500-2000 completed and/or running TaskRuns; this probably also happens for other resource types

Environment details

  • Kubernetes Platform:
    Kubernetes
  • Kubernetes or OpenShift version:
    AKS 1.22.6
  • Install mode (if on OpenShift):
    Helm chart
  • Cloud-provider/provisioner:
    AKS
  • Versions:
    • Tekton Dashboard:
      v0.24.1
    • Tekton Pipelines:
      v0.33.2
  • Install namespaces:
    • Tekton Dashboard:
      tekton-pipelines
    • Tekton Pipelines:
      tekton-pipelines

Additional Info

Some pagination would probably be helpful

@rouke-broersma added the kind/bug label Mar 11, 2022
@AlanGreene
Member

Thanks for reporting this @rouke-broersma. It's a known problem when there are a large number of resources on the cluster and has been discussed many times in the past both in GitHub issues and on Slack. We're currently facing some limitations of the Kubernetes API although we do have some plans to address this.

See #1978 (comment) for some information on how we currently manage this in our own dogfooding cluster by removing older runs as well as plans for integration with Tekton Results which we hope will allow us to address some of the concerns about pagination, large numbers of resources, etc.

@rouke-broersma
Author

Sorry, I looked through the open issues and could not find anything.

@AlanGreene
Member

No problem at all. It took me a while to find them and I already knew what I was looking for 😅

@rouke-broersma
Author

On my end it seems the issue is not so much the time it takes to get the response from the k8s API, but rather the rendering of all the items at once. Could the pagination be performed client side instead, so that at least the items are not all rendered?
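
Something along these lines is what I have in mind (just a rough sketch in framework-agnostic TypeScript, not the Dashboard's actual code; the `TaskRun` shape, `PAGE_SIZE` and the table selector are placeholders): keep the full list in memory, but only hand one page of it to the rendering layer, so the browser never has more than a page's worth of rows in the DOM.

```ts
// Sketch of client-side pagination: the full list stays in memory,
// but only one page of rows is ever added to the DOM.
// `TaskRun`, `PAGE_SIZE` and the selector below are illustrative only.
interface TaskRun {
  metadata: { name: string; namespace: string };
}

const PAGE_SIZE = 50;

function getPage(runs: TaskRun[], page: number): TaskRun[] {
  const start = page * PAGE_SIZE;
  return runs.slice(start, start + PAGE_SIZE);
}

function renderPage(runs: TaskRun[], page: number): void {
  const tbody = document.querySelector('table tbody');
  if (!tbody) return;
  tbody.innerHTML = ''; // drop the previous page's rows
  for (const run of getPage(runs, page)) {
    const row = document.createElement('tr');
    const cell = document.createElement('td');
    cell.textContent = `${run.metadata.namespace}/${run.metadata.name}`;
    row.appendChild(cell);
    tbody.appendChild(row);
  }
}
```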

@AlanGreene
Member

AlanGreene commented Mar 11, 2022

We tried that in the past and it didn't make a difference in terms of perceived performance, but it may be worth investigating again since we've made a number of other performance-related and client architecture changes in the meantime.

Can you share some more details about the number/size of the resources you're working with, the total response size for the list, or some examples? This could be helpful when we try to reproduce the issue and test any potential improvements.

If you could include request/response time and time to load in the UI that would be great.

It will likely be towards the end of next week before I can dedicate much time to this.

@rouke-broersma
Author

The TaskRun response is 3MB according to Google Chrome and takes 12 seconds to load. It contains about 2000 items, I think. It doesn't cause many problems when no TaskRuns are running, but when TaskRuns are running the tab becomes unresponsive and Chrome tries to kill it. We have at most about 20 TaskRuns running at the same time, though 10 is more common.

My coworker has tried to work around this with a client-side script that throws away all table rows other than the last 100, and he says he no longer has any tab-freezing issues.

@AlanGreene
Member

Interesting. Do your Tasks have many steps? Are they short-lived or longer running?

Would it be possible to share your colleague's script or provide more details about what it does / how often it runs? This would help narrow down the types of change that could be most beneficial for your use case, and provide a baseline for comparison of any performance improvements we might make.

@rouke-broersma
Author

The TaskRuns have about 5-10 steps and, depending on the type of TaskRun (split by namespace), run for between 1 and 30 minutes. The majority are of a shorter duration (a couple of minutes at most) and only once in a while do we have a TaskRun that runs for up to 30 minutes.

@maartengo could you provide the details about your modifications?

@maartengo

The script can be found here: https://pastebin.com/Mi2HNwT9
We currently have 2382 elements shown on the TaskRuns page, with at most 7 steps per run.

In short it:

  • Removes all table rows after the first 100
  • Hides any table row after the first 30; this reduces the number of errors
  • Repeats the above every 1-5 seconds, depending on the content of the page

I haven't tested what the performance would be if there was some actual pagination instead of removing the elements.
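
For anyone who prefers not to follow the pastebin link, the core of it looks roughly like this (a simplified sketch rather than the exact script; the row selector and the fixed interval are stand-ins):

```ts
// Periodically trim the TaskRuns table so only a limited number of rows
// stay in the DOM. Simplified sketch of the approach; the real script
// adjusts its interval (1-5s) based on the page content.
const KEEP_IN_DOM = 100;  // rows kept in the document
const KEEP_VISIBLE = 30;  // rows actually displayed

function trimRows(): void {
  const rows = document.querySelectorAll<HTMLTableRowElement>('table tbody tr');
  rows.forEach((row, index) => {
    if (index >= KEEP_IN_DOM) {
      row.remove();               // drop rows beyond the first 100
    } else if (index >= KEEP_VISIBLE) {
      row.style.display = 'none'; // hide rows beyond the first 30
    }
  });
}

setInterval(trimRows, 3000);
```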

@AlanGreene
Member

Thanks, this should be very helpful in trying to reproduce the issue and testing any potential improvements.

@AlanGreene
Member

AlanGreene commented Mar 18, 2022

Initial testing with #2327 against our dogfooding cluster shaves ~0.5s off a 3s load time for 1300 TaskRuns (<1MB), so at least it's heading in the right direction… I'll need to increase the number of resources and have some in progress to get a proper feel for what impact this might have for your use case, but at least it's not slower 😄

I'll see if I can publish a test release later today containing the change, otherwise feel free to pull my branch and build it locally. If this works out I'll need to clean up the change a bit and we'll likely apply it to all pages (or at least TaskRuns + PipelineRuns to start with).

@AlanGreene
Member

Published test release pagination-test-20220318.

@maartengo

Good news! Although the page load time is still ~5 seconds, the page is actually responsive afterwards! This is a huge improvement over the regular freezes we used to have.

@rouke-broersma
Author

@AlanGreene Is there a way to move forward with this change? It would be very useful to us :)

@AlanGreene
Member

I haven't had time to work on this recently but will pick it up again soon. The existing PR needs a bit of cleanup and some tests, then we should be able to apply the same changes to PipelineRuns. I'll update the PR by middle of next week 🤞

@AlanGreene self-assigned this Apr 22, 2022
@AlanGreene
Member

AlanGreene commented Apr 22, 2022

I've updated the PR with a quick first pass at adding pagination to all list pages using a slightly different approach. I'll finish cleaning it up early next week and make sure all tests are passing before marking it ready for review.

@AlanGreene
Member

Client-side pagination is now available on all list pages in the latest nightly release, e.g. https://storage.googleapis.com/tekton-releases-nightly/dashboard/previous/v20220426-14cce744d6/tekton-dashboard-release.yaml

Thanks again @rouke-broersma @maartengo for reporting the issue and helping to validate the change.

This will be included in the next Dashboard release, v0.26.0 due May 5 - 10.

@rouke-broersma
Author

That's awesome, thank you!
