3.5 ListWorkflows causes server to hang when there are lots of archived workflows #12025

sjhewitt · 2023-10-17T23:22:16Z

Pre-requisites

I have double-checked my configuration
I can confirm the issue exists when I tested with :latest
I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We had >200,000 rows in the workflow archive table, and when trying to view the new combined workflow/archived workflow list page in the UI, the server times out.

Scanning the code, it looks like the LoadWorkflows code loads all rows from the archive table, combines them with the k8s results, and then applies sorting and limiting.

As a workaround, we've reduced the archive TTL from 14 days to 1 day, and the endpoint now responds before timing out, but it is still pretty slow.

Version

v3.5.0

--- edits below by agilgur5 to add updates since this is a (very) popular issue ---

Updates

- Broadly caused by feat: Unified workflows list UI and API #11121, although some specific regressions were introduced later, see below
- Most of the performance regression part of this issue should have been solved by fix: Revert #11761 to avoid argo-server performance issue #12068, which was released in v3.5.1
  - That did re-instate a different bug, however: 3.5 Pagination may not work correctly for archived workflows #11715
    - This bug was resolved by feat: add sqlite-based memory store for live workflows. Fixes #12025 #13021 / feat: add sqlite-based memory store for live workflows. Fixes #12025 #12736, which was released in v3.5.7
- Another performance regression was fixed in fix: don't load entire archived workflow into memory in list APIs #12912, which was released in v3.5.6
- Performance was also improved by feat: add sqlite-based memory store for live workflows. Fixes #12025 #13021 / feat: add sqlite-based memory store for live workflows. Fixes #12025 #12736, which was released in v3.5.7
- There are still some remaining regressions due to the Archived + Live merge in 3.5, which should be fixed in later patches now that Live Workflows can be filtered via the SQLite DB:

terrytangyuan · 2023-10-18T00:38:14Z

Thanks for this issue. This is a known issue if you have a lot of archived workflows. It's caused by the pagination method that first loads all live workflows and archived workflows and then performs pagination. cc @sunyeongchoi who worked on this in #11761.

sunyeongchoi · 2023-10-18T00:50:42Z

Hello. I will test the issue as soon as possible and think about a solution.
Thank you.

agilgur5 · 2023-10-18T16:16:47Z

For posterity, this was actually discussed yesterday at the Contributors Meeting.
@jmeridth had been looking into it as it is blocking his team from upgrading as well and had eventually traced it to this PR discussion: #11761 (comment).

(I was involved as I made substantial refactors to the UI for 3.5 -- #11891 in particular -- in case those were the cause, but the UI is actually unrelated in this case, and the refactor actually decreased the number of network requests.
Also #11840 removed a default date filter, but that was entirely client-side anyway, so did not impact any networking.)

jmeridth · 2023-10-18T16:34:00Z

@sjhewitt thank you for filing this. I'm having the same issue. 30+ second page loads. I'm gathering data now and will post here once obtained.

@sunyeongchoi thanks for taking a look.

Can confirm this isn't related to the "past month" default on the UI being removed from the search box like @agilgur5 states. That is only client side.

terrytangyuan · 2023-10-18T19:14:17Z

There are some optimizations we can do, e.g. #12030.

However, this issue is more specific to argo-server/backend. The root cause is that the pagination method we implemented for ListWorkflows() requires retrieval of the entire list of workflows at once.

sunyeongchoi · 2023-10-22T06:11:42Z

@sjhewitt @jmeridth
Hello. I have a question about the circumstances under which this bug occurred.
Does this issue occur as soon as you first enter the Workflows list page? Or does it occur when moving between pages?

jmeridth · 2023-10-22T06:19:53Z

@sunyeongchoi every visit.

sunyeongchoi · 2023-10-22T06:23:03Z

@jmeridth Ok! Thanks for your answer.

terrytangyuan · 2023-10-23T02:43:31Z

@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

I propose that we add a dropdown to allow users to select whether to display:

Only live workflows;
Only archived workflows;
Both live and archived workflows (with notice/warning that this is only suitable when the number of workflows is not too large);

Additional requirements:

This dropdown box is only available if there are both live and archived workflows. The UI should be smart enough to figure this out.
We should also consider the ability to disable the third option as admin to avoid bringing down cluster performance.
The "archived" column should only be displayed when appropriate. For example, it's evident that a workflow is archived if the user is only viewing archived workflows.

Motivations for this proposal:

Some users only care about one of these types of workflows;
Since there are performance issues that we cannot get around in order to view both types of workflows, we should only provide this option with caution;
Keep using the original pagination implementation for live workflows or archived workflows where the logic is much more precise while keeping the front-end codebase simple;
The first two options are almost identical to previous versions, but the UI should be less buggy since they now share most of the implementation; the third option is an addition to previous versions.

Any thoughts?

rwong2888 · 2023-10-23T17:02:18Z

Just reverted back to v3.5.0-rc1 due to this. Perhaps we can add the default started time back?

terrytangyuan · 2023-10-23T17:23:34Z

I am reverting related change #12068 for now since otherwise the UI is not usable when there are many workflows. In the meantime, we can continue discussing proposal #12025 (comment) here. We can probably release a patch version next week #11997 (comment).

agilgur5 · 2023-10-23T23:12:47Z

> Just reverted back to v3.5.0-rc1 due to this. Perhaps we can add the default started time back?

If you're referring to #11840, it is mentioned above that it is unrelated to this issue, as that filter is entirely client side (as the k8s API server has no native way of doing date filtering, though that logic also pre-dates me and that PR).

agilgur5 · 2023-10-23T23:26:43Z

> @sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

What are some of those edge cases? We can still over/under fetch so long as it does not overload the server.
For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).

Did the previous Archived Workflows page not have pagination? If so, I would think it would have been similarly susceptible to this, just not as frequently hit since it was a separate page.

> 4. The first two options are almost identical to previous versions, but the UI should be less buggy since they now share most of the implementation; the third option is an addition to previous versions.

I feel like separate pages is a better UX than a drop-down. If the APIs are identical, some careful refactoring could make them share a UI implementation.

sjhewitt · 2023-10-23T23:43:38Z

> > @sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).
>
> What are some of those edge cases? We can still over/under fetch so long as it does not overload the server. For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).

I'm similarly curious...
I wonder if it would be possible to use a cursor that encodes 2 offsets - one for the k8s api and one for the db, then fetches limit rows from both sources with the given offset, merges the results together and applies the limit to that combined list.

something like:

orderBy = ...
filters = ...
limit = 20
cursor = 0, 0
k8sResults = fetchK8s(cursor[0], limit, filters, orderBy)
dbResults = fetchDB(cursor[1], limit, filters, orderBy)
results = mergeResults(k8sResults, dbResults).slice(0, limit)

newK8sOffset = getLastK8sResult(results)
newDBOffset = getLastDbResult(results)
newCursor = (newK8sOffset, newDBOffset)
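
A minimal Go sketch of the two-offset cursor idea above. It assumes that both sources can return rows already sorted newest-first starting at an arbitrary offset, which is an assumption of this sketch and not something confirmed for the k8s API. The `Workflow`, `Cursor`, `window`, and `nextPage` names are hypothetical, the two slices stand in for the real backends, and deduplication is deliberately left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Workflow is a hypothetical, trimmed-down record used only for this sketch.
type Workflow struct {
	UID     string
	Created int64 // creation time as a Unix timestamp
	Live    bool  // true if the row came from the k8s source
}

// Cursor holds one offset per source, as proposed in the comment above.
type Cursor struct {
	K8sOffset int
	DBOffset  int
}

// window returns up to limit items of src starting at offset.
// src is assumed to be sorted newest-first already.
func window(src []Workflow, offset, limit int) []Workflow {
	if offset >= len(src) {
		return nil
	}
	end := offset + limit
	if end > len(src) {
		end = len(src)
	}
	return src[offset:end]
}

// nextPage fetches limit rows from each source, merges them newest-first,
// keeps the first limit rows, and advances each offset by the number of rows
// actually consumed from that source.
func nextPage(k8s, db []Workflow, cur Cursor, limit int) ([]Workflow, Cursor) {
	merged := append([]Workflow{}, window(k8s, cur.K8sOffset, limit)...)
	merged = append(merged, window(db, cur.DBOffset, limit)...)
	sort.SliceStable(merged, func(i, j int) bool {
		return merged[i].Created > merged[j].Created
	})
	if len(merged) > limit {
		merged = merged[:limit]
	}
	for _, w := range merged {
		if w.Live {
			cur.K8sOffset++
		} else {
			cur.DBOffset++
		}
	}
	return merged, cur
}

func main() {
	k8s := []Workflow{{"a", 50, true}, {"b", 30, true}}
	db := []Workflow{{"c", 40, false}, {"d", 20, false}, {"e", 10, false}}
	page, cur := nextPage(k8s, db, Cursor{}, 2)
	fmt.Println(page, cur)
	page, cur = nextPage(k8s, db, cur, 2)
	fmt.Println(page, cur)
}
```

Rows that were fetched but cut off by the limit are not counted in the new cursor, so they are fetched again on the next page. That over-fetch is bounded by the page size per source.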

sunyeongchoi · 2023-10-24T00:06:36Z

@agilgur5 @sjhewitt

Hello. I kept thinking of a way to merge Archived and Workflows into one page instead of splitting them into two pages.

However, two problems continue to get in the way of solving this.
First, Archived Workflows and Workflows may overlap with each other, so deduplication is necessary.
Second, we have no control over the pagination of Kubernetes resources (Workflows).

Before this issue occurred, the logic fetched 20 Archived Workflows and 20 Workflows and removed duplicates whenever a page needed 20 items.

However, in this case, some Workflows data may be missing when viewing the next page.

This is because Kubernetes' own pagination does not allow us to decide which data to start the next page with.

This problem can be solved if Kubernetes' own pagination logic can be used as is. (Kubernetes code analysis is required)

But if that's not possible, I don't know if there's any other better way.

In conclusion, to know which data the next page should start with, I think we would need to either understand Kubernetes' continue logic (for pagination) exactly and implement it the same way, or retrieve the entire Workflows data. Otherwise it is impossible for us to control this.

Do you have any other good suggestions?

sjhewitt · 2023-10-24T01:01:32Z

ahh, I see - I didn't have much knowledge of the k8s api, so didn't realize it doesn't really support filtering/ordering/pagination.

The 3 options I see are:

separate the k8s backed api/ui from the db backed api/ui (reverting to previous behaviour)
fetch the whole k8s data and merge it with a subset of the db data
persist workflows in the db from the moment they are submitted, updating their state as they are scheduled/change state. then (if the db archive is enabled) make the UI solely reliant on querying from the db. In this case, the data could be augmented with data from the k8s API if the workflows are still running...

sunyeongchoi · 2023-10-24T01:21:55Z

There is an idea that suddenly came to mind while writing a comment.

I think this problem can be solved by knowing which data to start with on the next page.

In that case, I thought I could solve the problem by using a function that performs pagination based on the resourceVersion that I implemented previously.

Existing: Use the cursorPaginationByResourceVersion function after merging all Archived Workflows and all Workflows data.

Suggestion: If 20 pieces of data are needed on one page, fetch 20 Archived Workflows, fetch 20 Workflows, and then merge them. Use the cursorPaginationByResourceVersion function when searching Workflows data for the next page.

However, even with this method, all Workflows will be fetched on every page (but not all Archived Workflows).

I think it will be more efficient than the previous method.
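
A rough Go sketch of the merge step in this suggestion: all live Workflows plus one page-sized batch of Archived Workflows are deduplicated by UID, sorted newest-first, and cut to the page size. The `item` type and `mergeDedup` function are hypothetical, and letting the live copy win over the archived copy is an assumption of this sketch. It does not use or reproduce cursorPaginationByResourceVersion.

```go
package main

import (
	"fmt"
	"sort"
)

// item is a hypothetical, minimal stand-in for a workflow list entry.
type item struct {
	UID     string
	Created int64  // creation time as a Unix timestamp
	Source  string // "live" or "archived", for illustration only
}

// mergeDedup combines live and archived entries, keeps the first entry seen
// for each UID (live entries are visited first), sorts newest-first, and
// returns at most limit entries.
func mergeDedup(live, archived []item, limit int) []item {
	seen := make(map[string]bool, len(live)+len(archived))
	out := make([]item, 0, len(live)+len(archived))
	for _, src := range [][]item{live, archived} {
		for _, w := range src {
			if seen[w.UID] {
				continue // already have this workflow from an earlier source
			}
			seen[w.UID] = true
			out = append(out, w)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Created > out[j].Created
	})
	if len(out) > limit {
		out = out[:limit]
	}
	return out
}

func main() {
	live := []item{{"x", 30, "live"}, {"y", 10, "live"}}
	archived := []item{{"x", 30, "archived"}, {"z", 20, "archived"}}
	fmt.Println(mergeDedup(live, archived, 2))
}
```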

I'll listen to your opinions and if you think it's a good method, I'll try it and share it with you.

Guillermogsjc · 2023-10-24T09:07:14Z

It is not only making the UI unusable on v3.5.0, it is also crashing badly with OOM on a pod deployed with 3200 MB guaranteed, with a lot of archived workflows (PostgreSQL) and few in etcd (6-hour TTL on workflows with working GC).

The issue is at the main view where all workflows are listed.

Also probably on this pagination, it would be useful to change the defaults on time ranges to show. Currently, it is one month, but probably it would be better to have a default on 1 or 2 days, to free that argo-server list workflows. This, together with the flags "show archived" that you are commenting, would help a lot.

terrytangyuan · 2023-10-24T13:42:26Z

> Suggestion: If 20 pieces of data are needed on one page, fetch 20 Archived Workflows, fetch 20 Workflows, and then merge them. Use the cursorPaginationByResourceVersion function when searching Workflows data for the next page.

@sunyeongchoi That might be a good optimization for the third option I listed in #12025 (comment). It could help a lot when there are a lot more archived workflows vs live workflows. Although we still need to fetch the entire list of live workflows on all pages.

> Also probably on this pagination, it would be useful to change the defaults on time ranges to show. Currently, it is one month, but probably it would be better to have a default on 1 or 2 days, to free that argo-server list workflows. This, together with the flags "show archived" that you are commenting, would help a lot.

@Guillermogsjc Thanks for the suggestion. However, we still need to make the same list query in the backend and then filter in the front-end.

jessesuen · 2023-10-24T19:26:14Z

With the unified workflow list API for both live + archived workflows we have to do the following:

1. We need all workflows in the cluster because the Kubernetes API server does not support sorting by creation timestamp, but argo-server does.
2. We should only query X workflows from the archive, where X is the requested page size. The underlying database does support filtering and sorting, so this is efficient. The fact that we query everything from archive is nonsensical.

Number 1 is a challenge because performing LIST all workflows for every request will be taxing on K8s API server as users use the UI. Listing all workflows for the purposes of only showing 20 is extremely inefficient. To get around this, I propose a technique we use in Argo CD:

Since LIST all workflows is the requirement and also the problem, we can use an informer cache on Argo API server to avoid the additional LIST calls to k8s and have all workflows ready to be returned by API server from in-memory informer cache. When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.

Note that this does balloon the memory requirements of argo-server because of the informer cache. But I feel this is manageable with archive policy and gc. And the benefits of a unified live+archive API, as well as reducing load on K8s API server, outweigh the extra memory requirements of argo-server.

If we are worried about expensive processing / sorting of the results of List() calls we make against the informer cache, we could consider maintaining our own ordered workflow list (by creationTimestamp) that is automatically updated with UpdateFunc, AddFunc, DeleteFunc registered to the informer. But I consider this an optimization.
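
A minimal, standard-library-only Go sketch of the "own ordered workflow list" optimization described above. The `orderedCache` type and its `Add`, `Update`, `Delete`, and `Page` methods are hypothetical; in a real server they would be called from the informer's AddFunc, UpdateFunc, and DeleteFunc handlers, which are not wired up here.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// wf is a hypothetical, minimal stand-in for a Workflow object.
type wf struct {
	UID     string
	Created int64 // creationTimestamp as Unix seconds
}

// orderedCache keeps live workflows sorted newest-first so that list
// requests can slice a page without re-sorting.
type orderedCache struct {
	mu    sync.RWMutex
	items []wf
}

// before reports whether a sorts ahead of b (newest first, ties by UID).
func before(a, b wf) bool {
	if a.Created != b.Created {
		return a.Created > b.Created
	}
	return a.UID < b.UID
}

// insert places w at its sorted position. Callers must hold the write lock.
func (c *orderedCache) insert(w wf) {
	i := sort.Search(len(c.items), func(i int) bool { return !before(c.items[i], w) })
	c.items = append(c.items, wf{})
	copy(c.items[i+1:], c.items[i:])
	c.items[i] = w
}

// remove drops the entry with the given UID. Callers must hold the write lock.
func (c *orderedCache) remove(uid string) {
	for i, w := range c.items {
		if w.UID == uid {
			c.items = append(c.items[:i], c.items[i+1:]...)
			return
		}
	}
}

func (c *orderedCache) Add(w wf) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.insert(w)
}

func (c *orderedCache) Update(w wf) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.remove(w.UID)
	c.insert(w)
}

func (c *orderedCache) Delete(uid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.remove(uid)
}

// Page returns a copy of up to limit entries starting at offset.
func (c *orderedCache) Page(offset, limit int) []wf {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if offset < 0 || limit <= 0 || offset >= len(c.items) {
		return nil
	}
	end := offset + limit
	if end > len(c.items) {
		end = len(c.items)
	}
	return append([]wf(nil), c.items[offset:end]...)
}

func main() {
	c := &orderedCache{}
	c.Add(wf{"a", 10})
	c.Add(wf{"b", 30})
	c.Add(wf{"c", 20})
	fmt.Println(c.Page(0, 2))
}
```

The linear scan in `remove` keeps the sketch short; a map from UID to entry would be the obvious next step if deletes became a bottleneck.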

Guillermogsjc · 2023-10-24T19:46:17Z

Would a daily partition label on etcd for the workflow objects be a mad idea?

workflows.argoproj.io/daily_index: {{workflow.creationTimestamp.%Y-%m-%d}}

Partitioning indexes to allow performant queries on database objects is often the way to go to avoid devastating monolithic queries, like the one you describe against the k8s API/etcd, which does not support sorting by creation timestamp with a limit.

By having a daily partition label you would be able to build the interested partitions to filter on the Argo server against k8s API.

Anyway, the problem must be that monolithic query against the backend database: if it fetches everything instead of filtering, the load on the index page can be enormous.

In our case at least (following the documentation's recommended good practice of short TTLs for workflow objects), where we have tons and tons of archived workflows in PostgreSQL and only the last 6 hours of workflows in etcd, it is clear that the described full query against the database backend is the killer.

sunyeongchoi · 2023-10-25T14:20:53Z

Thank you so much to everyone for suggesting good ideas.

I will start by optimizing Archived Workflows first.

> we should only query X workflows from the archive, where X is the requested page size. The underlying database does support filtering and sorting, so this is efficient. The fact that we query everything from archive is nonsensical.

After that I will investigate the informer cache :)

> When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.

terrytangyuan · 2023-10-25T14:42:04Z

@sunyeongchoi Great! Let me know if you need any help.

terrytangyuan · 2024-04-29T03:48:46Z

@ryancurrah What bulk review? I don’t see any blocking review comments from @agilgur5 in the PR.

ryancurrah · 2024-04-29T04:03:31Z

Typically the reviewer's approval is required before merging a pull request. I'm a bit confused about why this one was merged without Anton's okay, since he did the review. Would you mind elaborating on that decision?

terrytangyuan · 2024-04-29T04:53:35Z

Please correct me if I missed anything:

All comments from @agilgur5 in feat: add sqlite-based memory store for live workflows. Fixes #12025 #12736 were addressed and marked as resolved by @agilgur5.
There were no unresolved comments. No requested changes from reviewers.
The PR was approved by 3 approvers. We actually only need one approval to merge a PR.

If you see any additional issues, feel free to follow up by commenting here or submitting a new issue.

agilgur5 · 2024-04-29T19:37:31Z

I was the main analyzer here in the issue, but I haven't had time to give #12736 an in-depth review unfortunately. Getting the approach right here was always my top priority though and I think we ironed the heck out of it.
I didn't see any glaring issues either after the CGO removal, otherwise I would've left a blocking comment.
If other Approver+ folks gave it a thorough look-over, I think it's had sufficient due diligence. If there are follow-up issues or follow-up code review comments, we can follow up on them (see also #12912 (comment) as an example of that). I wouldn't anticipate any substantial changes though; that's why I was trying to iron out the approach after all.

We otherwise don't have firm rules on the process outside of the branch protection requirements (we don't have many Approvers either, see also the Sustainability Effort).

So I think Terry made the right call there (and also gave some more time than strictly necessary for others to take a look too, which I'm pretty sure was intentional). Would be great if I could have taken a deeper look too, but unfortunately we have bottlenecks and this is the top P1, so we all have to use our best available judgment often.

ryancurrah · 2024-04-29T22:50:41Z

Thank you both for helping me gain a clearer understanding of the current review process.

Joibel · 2024-05-08T08:14:44Z

This fix has been reverted in #13018, so re-opening this until a new PR is merged.

agilgur5 · 2024-05-27T07:15:44Z

For reference, v3.5.7 has been released with a backport of #13027

rwong2888 · 2024-06-04T19:11:28Z

just upgraded.
seems laggy when moving from different namespaces.
anyone else getting the same?
are you able to replicate @agilgur5 ?

agilgur5 · 2024-06-04T19:19:38Z

We did have a user find some bugs due to the change but haven't been able to root cause it yet: #13140

I'm not sure if that's related to your issue, you'd have to check and would need more details.

> are you able to replicate @agilgur5 ?

my test instances don't have a ton of load in general, so no 😅

Pipekit also couldn't repro the issue above

rwong2888 · 2024-06-04T19:34:04Z

I'm not sure it is the same. I haven't seen the same crash behaviour as 3.5; it's just that when I have workflows in multiple namespaces, the UI is slow to respond when changing the namespace.

theintz · 2024-06-19T13:54:04Z

We are running 3.5.7 with a few thousand archived workflows. The UI is still loading awfully slowly compared to the 3.4 branch. The page loads in 20-30 seconds. How can I verify that this fix is working as intended? What other things can I tinker with to get faster loading times?

agilgur5 · 2024-06-19T14:27:05Z

I'd try 3.5.8 (for everyone here as well), since it includes a pretty important memory corruption fix #13166. Prior to that you could get crashes, which could slow things down substantially, but hard to tell without logs

rwong2888 · 2024-06-20T16:56:19Z

@agilgur5 , I switched to 3.5.8 just now. I still see lag when switching namespaces in argo workflows UI.

driv · 2024-06-28T13:59:50Z

I'm experiencing the same.

In the logs I see 2 queries being executed that are considered slow. The first count is scanning thousands of rows.

2024/06/28 14:01:18     Session ID:     00001                                                                                                                                                                                                                                         
    Query:          SELECT count(*) as total FROM `argo_archived_workflows` WHERE ((`clustername` = ? AND `instanceid` = ?) AND `namespace` = ? AND not exists (select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/controller-instanceid') AND exists (select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/phase' and value in ('Error', 'Failed', 'Pending', 'Running')))
    Arguments:      []interface {}{"default", "", "workflows"}                                                                                                                                                                                                                             
    Stack:                                                                                                                                                                                                                                                                            
        fmt.(*pp).handleMethods@/usr/local/go/src/fmt/print.go:673                                                                                                                                                                                                                    
        fmt.(*pp).printArg@/usr/local/go/src/fmt/print.go:756                                                                                                                                                                                                                         
        fmt.(*pp).doPrint@/usr/local/go/src/fmt/print.go:1211                                                                                                                                                                                                                         
        fmt.Append@/usr/local/go/src/fmt/print.go:289                                                                                                                                                                                                                                 
        log.(*Logger).Print.func1@/usr/local/go/src/log/log.go:261                                                                                                                                                                                                                    
        log.(*Logger).output@/usr/local/go/src/log/log.go:238                                                                                                                                                                                                                         
        log.(*Logger).Print@/usr/local/go/src/log/log.go:260                                                                                                                                                                                                                          
        github.com/argoproj/argo-workflows/v3/persist/sqldb.(*workflowArchive).CountWorkflows@/go/src/github.com/argoproj/argo-workflows/persist/sqldb/workflow_archive.go:241                                                                                                        
        github.com/argoproj/argo-workflows/v3/server/workflow.(*workflowServer).ListWorkflows@/go/src/github.com/argoproj/argo-workflows/server/workflow/workflow_server.go:193                                                                                                       
        github.com/argoproj/argo-workflows/v3/pkg/apiclient/workflow._WorkflowService_ListWorkflows_Handler.func1@/go/src/github.com/argoproj/argo-workflows/pkg/apiclient/workflow/workflow.pb.go:1826                                                                               
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.RatelimitUnaryServerInterceptor.func5@/go/src/github.com/argoproj/argo-workflows/util/grpc/interceptor.go:65                                                                               
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25                                                                                     
        github.com/argoproj/argo-workflows/v3/server/auth.(*gatekeeper).UnaryServerInterceptor.func1@/go/src/github.com/argoproj/argo-workflows/server/auth/gatekeeper.go:98                                                                                                          
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25                                                                                     
        github.com/argoproj/argo-workflows/v3/util/grpc.glob..func1@/go/src/github.com/argoproj/argo-workflows/util/grpc/interceptor.go:45                                                                                                                                            
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25                                                                                     
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.PanicLoggerUnaryServerInterceptor.func4@/go/src/github.com/argoproj/argo-workflows/util/grpc/interceptor.go:26                                                                             
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25                                                                                     
        github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/logrus/server_interceptors.go:31                                                                             
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25                                                                                     
        github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).UnaryServerInterceptor.func3@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107                                                                                         
    Error:          upper: slow query                                                                                                                                                                                                                                                 
    Time taken:     16.07711s                                                                                                                                                                                                                                                         
    Context:        context.Background

Should this be a new issue? The server does not hang, but it's continuously timing out and almost unusable.

agilgur5 · 2024-06-29T03:26:17Z

Thanks for providing the slow query log! ~~At a glance I don't see anything too off~~ EDIT: nvm the lack of limit does seem sus, see below comment (also I am surprised by the subqueries). Maybe we're missing a namespace index or another index? 🤔 Although that shouldn't have changed between 3.4 and 3.5

Yes can you make a new issue? That would be easier to track, get upvotes, etc especially since the main part of this issue is resolved and remaining pieces are already in separate issues. Can link from here and vice-versa.

agilgur5 · 2024-06-29T03:33:13Z

> The first count is scanning thousands of rows.
>
> SELECT count(*)

> At a glance I don't see anything too off

Wait a minute, the count(*) has no limit, that's probably the issue in that specific case? 🤔 And it might be more efficient to not use * in this case as well (since status can be huge).
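
For illustration only, one common way to bound a count is to put the LIMIT inside a subquery, since count(*) itself returns a single row and an outer LIMIT would not reduce the scan. The Go sketch below just builds such a query string. The `cappedCountQuery` helper is hypothetical, and this is not confirmed to be the change made in Argo Workflows.

```go
package main

import (
	"fmt"
	"strings"
)

// cappedCountQuery wraps a filter in a subquery with LIMIT so the database
// can stop scanning once maxRows matching rows have been found.
// table and where are trusted, hard-coded fragments in this sketch; values
// supplied by users should still be passed as bound parameters ("?").
func cappedCountQuery(table, where string, maxRows int) string {
	var b strings.Builder
	b.WriteString("SELECT count(*) AS total FROM (SELECT 1 FROM ")
	b.WriteString(table)
	if where != "" {
		b.WriteString(" WHERE ")
		b.WriteString(where)
	}
	fmt.Fprintf(&b, " LIMIT %d) AS capped", maxRows)
	return b.String()
}

func main() {
	fmt.Println(cappedCountQuery("argo_archived_workflows", "namespace = ?", 1001))
}
```

With a cap of 1001, a result of 1001 can be shown as "more than 1000" without counting every matching row.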

> Although that shouldn't have changed between 3.4 and 3.5

That lack of limit might be a 3.5 change too, although I haven't checked the history yet

> In the logs I see 2 queries being executed that are considered slow.

What's the second one? I'd still recommend a new issue either way though

agilgur5 · 2024-06-29T03:47:23Z

> That lack of limit might be a 3.5 change too, although I haven't checked the history yet

Seems to predate 3.5 and be from 3.4.0-rc1 from #9118. If I had to guess, the Archived Workflows page would've also had this slow query in 3.4 then, it just got more impactful with the merge in 3.5

agilgur5 · 2024-07-03T15:31:53Z

Let's follow-up on any slow queries in #13295.

In general, for any lingering issues in 3.5.8+, please open a new issue with specifics

sjhewitt added the type/bug label Oct 17, 2023

agilgur5 added the type/regression, P1, and area/api labels Oct 18, 2023

terrytangyuan mentioned this issue Oct 18, 2023

Optimize the content of the list of archived workflows sent to front-end #12030

Closed

terrytangyuan mentioned this issue Oct 23, 2023

Release v3.5 patch releases discussion #11997

Open

terrytangyuan mentioned this issue Oct 23, 2023

fix: Revert #11761 to avoid argo-server performance issue #12068

Merged

agilgur5 changed the title ~~ListWorkflows causes server to hang when there are lots of archived workflows~~ 3.5 ListWorkflows causes server to hang when there are lots of archived workflows Apr 29, 2024

Joibel added a commit to Joibel/argo-workflows that referenced this issue May 7, 2024

chore: Revert "feat: add sqlite-based memory store for live workflows. …

584d51e

…Fixes argoproj#12025 (argoproj#12736)" This reverts commit f1ab5aa. Signed-off-by: Alan Clucas <alan@clucas.org>

Joibel reopened this May 8, 2024

jiachengxu mentioned this issue May 8, 2024

feat: add sqlite-based memory store for live workflows. Fixes #12025 #13021

Merged

terrytangyuan closed this as completed in #13021 May 11, 2024

terrytangyuan pushed a commit that referenced this issue May 11, 2024

feat: add sqlite-based memory store for live workflows. Fixes #12025 (#…

0a096e6

…13021)

This was referenced Jun 6, 2024

UI: 3.5 missing name + name prefix filters for archived workflows #12161

Closed

UI: "Finished before" field only affects visible Workflows #13151

Open

UI: Filter workflows by archived/unarchived #13171

Open

argoproj locked as resolved and limited conversation to collaborators Sep 20, 2024
