-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
3.5 ListWorkflows causes server to hang when there are lots of archived workflows #12025
Comments
Thanks for this issue. This is a known issue if you have a lot of archived workflows. It's caused by the pagination method that first loads all live workflows and archived workflows and then performs pagination. cc @sunyeongchoi who worked on this in #11761. |
Hello. I will test the issue as soon as possible and think about a solution. |
For posterity, this was actually discussed yesterday at the Contributors Meeting. (I was involved as I made substantial refactors to the UI for 3.5 -- #11891 in particular -- in case those were the cause, but the UI is actually unrelated in this case, and the refactor actually decreased the number of network requests. |
@sjhewitt thank you for filing this. I'm having the same issue. 30+ second page loads. I'm gathering data now and will post here once obtained. @sunyeongchoi thanks for taking a look. Can confirm this isn't related to the "past month" default on the UI being removed from the search box like @agilgur5 states. That is only client side. |
There are some optimizations we can do, e.g. #12030. However, this issue is more specific to argo-server/backend. The root cause is that the pagination method we implemented for |
@suryaoruganti every visit. |
@jmeridth Ok! Thanks for your answer. |
@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.). I propose that we add a dropdown to allow users to select whether to display:
Additional requirements:
Motivations for this proposal:
Any thoughts? |
Just reverted back to |
I am reverting related change #12068 for now since otherwise the UI is not usable when there are many workflows. In the meantime, we can continue discussing proposal #12025 (comment) here. We can probably release a patch version next week #11997 (comment). |
If you're referring to #11840, it is mentioned above that that it is unrelated to this issue, as that filter is entirely client side (as the k8s API server has no native way of doing date filtering, though that logic also pre-dates me and that PR) |
What are some of those edge cases? We can still over/under fetch so long as it does not overload the server. Did the previous Archived Workflows page not have pagination? If so, I would think it would have been similarly susceptible to this, just not as frequently hit since it was a separate page.
I feel like separate pages is a better UX than a drop-down. If the APIs are identical, some careful refactoring could make them share a UI implementation. |
I'm similarly curious... something like:
|
Hello. I kept thinking of a way to merge However, there are two problems that continue to hinder solving this problem. before this issue occurred, Logic removed duplicates by searching 20 However, in this case, some This is because kubernetes' own pagination does not allow us to decide which data to start the next page with. This problem can be solved if Kubernetes' own pagination logic can be used as is. (Kubernetes code analysis is required) But if that's not possible, I don't know if there's any other better way. In conclusion, I thought that in order to know what data to start with on the next page, it would be impossible for us to control this unless we knew exactly the Do you have any other good suggestions? |
ahh, I see - I didn't have much knowledge of the k8s api, so didn't realize it doesn't really support filtering/ordering/pagination. The 3 options I see are:
|
There is an idea that suddenly came to mind while writing a comment. I think this problem can be solved by knowing which data to start with on the next page. In that case, I thought I could solve the problem by using a function that performs pagination based on the Existing: Use the Suggestion: If 20 pieces of data are needed on one page, fetch 20 However, even with this method, all I think it will be more efficient than the previous method. I'll listen to your opinions and if you think it's a good method, I'll try it and share it with you. |
it is not only breaking and making unusable UI on v3.5.0, it is also crashing badly with OOM on 3200mb guaranteed deployed pod, with a lot of archived workflows (postgresql) and few at etcd (6 hours TTL on worfkflows with working GC). The issue is at the main view where all workflows are listed. Also probably on this pagination, it would be useful to change the defaults on time ranges to show. Currently, it is one month, but probably it would be better to have a default on 1 or 2 days, to free that argo-server list workflows. This, together with the flags "show archived" that you are commenting, would help a lot. |
@sunyeongchoi That might be a good optimization for the third option I listed in #12025 (comment). It could help a lot when there are a lot more archived workflows vs live workflows. Although we still need to fetch the entire list of live workflows on all pages.
@Guillermogsjc Thanks for the suggestion. However, we still need to make the same list query in the backend and then filter in the front-end. |
With the unified workflow list API for both live + archived workflows we have to do the following:
Number 1 is a challenge because performing LIST all workflows for every request will be taxing on K8s API server as users use the UI. Listing all workflows for the purposes of only showing 20 is extremely inefficient. To get around this, I propose a technique we use in Argo CD: Since LIST all workflows is the requirement and also the problem, we can use an informer cache on Argo API server to avoid the additional LIST calls to k8s and have all workflows ready to be returned by API server from in-memory informer cache. When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller. Note that this does balloon the memory requirements of argo-server because of the informer cache. But I feel this is manageable with archive policy and gc. And the benefits of a unified live+archive API, as well as reducing load on K8s API server, outweigh the extra memory requirements of argo-server. If we are worried about expensive processing / sorting of the results of List() calls we make against the informer cache, we could consider maintaining our own ordered workflow list (by creationTimestamp) that is automatically updated with UpdateFunc, AddFunc, DeleteFunc registered to the informer. But I consider this an optimization. |
a daily partition label on etcd for the workflow objects would be a mad idea? `
Partitioning indexes to allow performant queries on database objects, is often the way to go to avoid monolithic queries that are devastating, as the one you comment on against k8s API/etcd, that does not support by creation timestamp sort and limit. By having a daily partition label you would be able to build the interested partitions to filter on the Argo server against k8s API. Anyway, the problem needs to be on that monolithic query against backend database, if it is bringing everything instead of filtering, the load can be enormous into index page. In our case at least ( following the documentation recommended good practice of short TTLs for workflow objects), where we have tons and tons of archived workflows on postgresql and only last 6 hours workflows from etcd, it is clear that the described full query against database backend is the killer. |
Thank you so much for so many people suggest good ideas. First, I will start with optimizing Archived Workflows first.
After that I will investigate the informer cache :)
|
@sunyeongchoi Great! Let me know if you need any help. |
@ryancurrah What bulk review? I don’t see any blocking review comments from @agilgur5 in the PR. |
Typically the reviewer's approval is required before merging a pull request. I'm a bit confused about why this one was merged without Anton's okay, since he did the review. Would you mind elaborating on that decision? |
Please correct me if I missed anything:
If you see any additional issues, feel free to follow up by commenting here or submitting a new issue. |
I was the main analyzer here in the issue, but I haven't had time to give #12736 an in-depth review unfortunately. Getting the approach right here was always my top priority though and I think we ironed the heck out of it. We otherwise don't have firm rules on the process outside of the branch protection requirements (we don't have many Approvers either, see also the Sustainability Effort). So I think Terry made the right call there (and also gave some more time than strictly necessary for others to take a look too, which I'm pretty sure was intentional). Would be great if I could have taken a deeper look too, but unfortunately we have bottlenecks and this is the top P1, so we all have to use our best available judgment often. |
Thank you both for helping me gain a clearer understanding of the current review process. |
…Fixes argoproj#12025 (argoproj#12736)" This reverts commit f1ab5aa. Signed-off-by: Alan Clucas <alan@clucas.org>
This fix has been reverted in #13018, so re-opening this until a new PR is merged. |
just upgraded. |
We did have a user find some bugs due to the change but haven't been able to root cause it yet: #13140 I'm not sure if that's related to your issue, you'd have to check and would need more details.
my test instances don't have a ton of load in general, so no 😅 Pipekit also couldn't repro the issue above |
I'm not sure it is the same. haven't seen the same crash behaviour as 3.5, it's just slow to respond if I have workflows in multiple namespaces, when using the UI, it is slow to respond when changing the namespace. |
We are running 3.5.7 with a few 1000 archived workflows. The UI is still loading awfully slow, compared to the 3.4 branch. The page loads in 20-30 seconds. How can I verify that this fix is working as intended? What other things can I tinker with to get faster loading times. |
I'd try 3.5.8 (for everyone here as well), since it includes a pretty important memory corruption fix #13166. Prior to that you could get crashes, which could slow things down substantially, but hard to tell without logs |
@agilgur5 , I switched to 3.5.8 just now. I still see lag when switching namespaces in argo workflows UI. |
I'm experiencing the same. In the logs I see 2 queries being executed that are considered slow. The first count is scanning thousands of rows.
Should this be a new issue? The server does not hang, but it's continuously timing out and almost unusable. |
Thanks for providing the slow query log! Yes can you make a new issue? That would be easier to track, get upvotes, etc especially since the main part of this issue is resolved and remaining pieces are already in separate issues. Can link from here and vice-versa. |
Wait a minute, the
That lack of
What's the second one? I'd still recommend a new issue either way though |
Seems to predate 3.5 and be from 3.4.0-rc1 from #9118. If I had to guess, the Archived Workflows page would've also had this slow query in 3.4 then, it just got more impactful with the merge in 3.5 |
Let's follow-up on any slow queries in #13295. In general, for any lingering issues in 3.5.8+, please open a new issue with specifics |
Pre-requisites
:latest
What happened/what you expected to happen?
We had >200,000 rows in the workflow archive table, and when trying to view the new combined workflow/archived workflow list page in the UI, the server times out
scanning the code, it looks like the
LoadWorkflows
code loads all rows from the archive table, combines them with the k8s results and then applies sorting and limiting.as a workaround, we've reduced the archive ttl from 14 days to 1 day, and the endpoint now responds before timing out, but is still pretty slow.
Version
v3.5.0
--- edits below by agilgur5 to add updates since this is a (very) popular issue ---
Updates
The text was updated successfully, but these errors were encountered: