Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.5 ListWorkflows causes server to hang when there are lots of archived workflows #12025

Closed
2 of 3 tasks
sjhewitt opened this issue Oct 17, 2023 · 123 comments · Fixed by #12736, #12912 or #13021
Closed
2 of 3 tasks

3.5 ListWorkflows causes server to hang when there are lots of archived workflows #12025

sjhewitt opened this issue Oct 17, 2023 · 123 comments · Fixed by #12736, #12912 or #13021
Assignees
Labels
area/api Argo Server API area/workflow-archive P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority type/bug type/regression Regression from previous behavior (a specific type of bug)

Comments

@sjhewitt
Copy link

sjhewitt commented Oct 17, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We had >200,000 rows in the workflow archive table, and when trying to view the new combined workflow/archived workflow list page in the UI, the server times out

scanning the code, it looks like the LoadWorkflows code loads all rows from the archive table, combines them with the k8s results and then applies sorting and limiting.

as a workaround, we've reduced the archive ttl from 14 days to 1 day, and the endpoint now responds before timing out, but is still pretty slow.

Version

v3.5.0

--- edits below by agilgur5 to add updates since this is a (very) popular issue ---

Updates

@terrytangyuan
Copy link
Member

Thanks for this issue. This is a known issue if you have a lot of archived workflows. It's caused by the pagination method that first loads all live workflows and archived workflows and then performs pagination. cc @sunyeongchoi who worked on this in #11761.

@sunyeongchoi
Copy link
Member

Hello. I will test the issue as soon as possible and think about a solution.
thank you.

@agilgur5 agilgur5 added type/regression Regression from previous behavior (a specific type of bug) P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority area/api Argo Server API labels Oct 18, 2023
@agilgur5
Copy link
Contributor

agilgur5 commented Oct 18, 2023

For posterity, this was actually discussed yesterday at the Contributors Meeting.
@jmeridth had been looking into it as it is blocking his team from upgrading as well and had eventually traced it to this PR discussion: #11761 (comment).

(I was involved as I made substantial refactors to the UI for 3.5 -- #11891 in particular -- in case those were the cause, but the UI is actually unrelated in this case, and the refactor actually decreased the number of network requests.
Also #11840 removed a default date filter, but that was entirely client-side anyway, so did not impact any networking.)

@jmeridth
Copy link
Member

jmeridth commented Oct 18, 2023

@sjhewitt thank you for filing this. I'm having the same issue. 30+ second page loads. I'm gathering data now and will post here once obtained.

@sunyeongchoi thanks for taking a look.

Can confirm this isn't related to the "past month" default on the UI being removed from the search box like @agilgur5 states. That is only client side.

@terrytangyuan
Copy link
Member

terrytangyuan commented Oct 18, 2023

There are some optimizations we can do, e.g. #12030.

However, this issue is more specific to argo-server/backend. The root cause is that the pagination method we implemented for ListWorkflows() requires retrieval of the entire list of workflows at once.

@sunyeongchoi
Copy link
Member

@sjhewitt @jmeridth
Hello. I have a question about the circumstances under which this bug occurred.
Does this issue occur as soon as you first enter the Workflows list page? Or does when moving the page?

@jmeridth
Copy link
Member

@suryaoruganti every visit.

@sunyeongchoi
Copy link
Member

@jmeridth Ok! Thanks for your answer.

@terrytangyuan
Copy link
Member

terrytangyuan commented Oct 23, 2023

@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

I propose that we add a dropdown to allow users to select whether to display:

  1. Only live workflows;
  2. Only archived workflows;
  3. Both live and archived workflows (with notice/warning that this is only suitable when the number of workflows is not too large);

Additional requirements:

  1. This dropdown box is only available if there are both archived workflows. UI should be smart enough to figure this out.
  2. We should also consider the ability to disable the third option as admin to avoid bringing down cluster performance.
  3. The "archived" column should only be displayed when appropriate. For example, it's evident that a workflow is archived if the user is only viewing archived workflows.

Motivations for this proposal:

  1. Some users only care about one of these types of workflows;
  2. Since there are performance issues that we cannot get around in order to view both types of workflows, we should only provide this option with caution;
  3. Keep using the original pagination implementation for live workflows or archived workflows where the logic is much more precise while keeping the front-end codebase simple;
  4. The first two options are almost identical to previous versions, but the UI should be less buggy since they now share most of the implementation; the third option is an addition to previous versions.

Any thoughts?

@rwong2888
Copy link
Contributor

rwong2888 commented Oct 23, 2023

Just reverted back to v3.5.0-rc1 due to this. Perhaps we can add the default started time back?

@terrytangyuan
Copy link
Member

terrytangyuan commented Oct 23, 2023

I am reverting related change #12068 for now since otherwise the UI is not usable when there are many workflows. In the meantime, we can continue discussing proposal #12025 (comment) here. We can probably release a patch version next week #11997 (comment).

@agilgur5
Copy link
Contributor

agilgur5 commented Oct 23, 2023

Just reverted back to v3.5.0-rc1 due to this. Perhaps we can add the default started time back?

If you're referring to #11840, it is mentioned above that that it is unrelated to this issue, as that filter is entirely client side (as the k8s API server has no native way of doing date filtering, though that logic also pre-dates me and that PR)

@agilgur5
Copy link
Contributor

@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

What are some of those edge cases? We can still over/under fetch so long as it does not overload the server.
For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).

Did the previous Archived Workflows page not have pagination? If so, I would think it would have been similarly susceptible to this, just not as frequently hit since it was a separate page.

4. The first two options are almost identical to previous versions, but the UI should be less buggy since they now share most of the implementation; the third option is an addition to previous versions.

I feel like separate pages is a better UX than a drop-down. If the APIs are identical, some careful refactoring could make them share a UI implementation.

@sjhewitt
Copy link
Author

sjhewitt commented Oct 23, 2023

@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

What are some of those edge cases? We can still over/under fetch so long as it does not overload the server. For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).

I'm similarly curious...
I wonder if it would be possible to use a cursor that encodes 2 offsets - one for the k8s api and one for the db, then fetches limit rows from both sources with the given offset, merges the results together and applies the limit to that combined list.

something like:

orderBy = ...
filters = ...
limit = 20
cursor = 0, 0
k8sResults = fetchK8s(cursor[0], limit, filters, orderBy)
dbResults = fetchDB(cursor[1], limit, filters, orderBy)
results = mergeResults(k8sResults, dbResults).slice(0, limit)

newK8sOffset = getLastK8sResult(results)
newDBOffset = getLastDbResult(results)
newCursor = (newK8sOffset, newDBOffset)

@sunyeongchoi
Copy link
Member

sunyeongchoi commented Oct 24, 2023

@agilgur5 @sjhewitt

Hello. I kept thinking of a way to merge Archived and Workflows into one page instead of splitting them into two pages.

However, there are two problems that continue to hinder solving this problem.
First, Archived and Workflows may overlap with each other, so deduplication is necessary.
Second, we have no control over the pagination of Kubernetes resources(Workflows).

before this issue occurred, Logic removed duplicates by searching 20 Archived Workflows and 20 Workflows each when 20 pieces of data were needed on one page.

However, in this case, some Workflows data may be missing when viewing the next page.

This is because kubernetes' own pagination does not allow us to decide which data to start the next page with.

This problem can be solved if Kubernetes' own pagination logic can be used as is. (Kubernetes code analysis is required)

But if that's not possible, I don't know if there's any other better way.

In conclusion, I thought that in order to know what data to start with on the next page, it would be impossible for us to control this unless we knew exactly the continue logic(for pagination) of Kubernetes and could implement it in the same way, unless we retrieved the entire Workflows data.

Do you have any other good suggestions?

@sjhewitt
Copy link
Author

ahh, I see - I didn't have much knowledge of the k8s api, so didn't realize it doesn't really support filtering/ordering/pagination.

The 3 options I see are:

  • separate the k8s backed api/ui from the db backed api/ui (reverting to previous behaviour)
  • fetch the whole k8s data and merge it with a subset of the db data
  • persist workflows in the db from the moment they are submitted, updating their state as they are scheduled/change state. then (if the db archive is enabled) make the UI solely reliant on querying from the db. In this case, the data could be augmented with data from the k8s API if the workflows are still running...

@sunyeongchoi
Copy link
Member

sunyeongchoi commented Oct 24, 2023

There is an idea that suddenly came to mind while writing a comment.

I think this problem can be solved by knowing which data to start with on the next page.

In that case, I thought I could solve the problem by using a function that performs pagination based on the resourceVersion that I implemented previously.

Existing: Use the cursorPaginationByResourceVersion function after merging all Archived Workflows and all Workflows data.

Suggestion: If 20 pieces of data are needed on one page, fetch 20 Archived Workflows, fetch 20 Workflows, and then merge them. Use the cursorPaginationByResourceVersion function when searching Workflows data for the next page.

However, even with this method, all Workflows will be fetched on every page. (But not fetch all Archived Workflows)

I think it will be more efficient than the previous method.

I'll listen to your opinions and if you think it's a good method, I'll try it and share it with you.

@Guillermogsjc
Copy link

Guillermogsjc commented Oct 24, 2023

it is not only breaking and making unusable UI on v3.5.0, it is also crashing badly with OOM on 3200mb guaranteed deployed pod, with a lot of archived workflows (postgresql) and few at etcd (6 hours TTL on worfkflows with working GC).

The issue is at the main view where all workflows are listed.

Also probably on this pagination, it would be useful to change the defaults on time ranges to show. Currently, it is one month, but probably it would be better to have a default on 1 or 2 days, to free that argo-server list workflows. This, together with the flags "show archived" that you are commenting, would help a lot.

@terrytangyuan
Copy link
Member

terrytangyuan commented Oct 24, 2023

Suggestion: If 20 pieces of data are needed on one page, fetch 20 Archived Workflows, fetch 20 Workflows, and then merge them. Use the cursorPaginationByResourceVersion function when searching Workflows data for the next page.

@sunyeongchoi That might be a good optimization for the third option I listed in #12025 (comment). It could help a lot when there are a lot more archived workflows vs live workflows. Although we still need to fetch the entire list of live workflows on all pages.

Also probably on this pagination, it would be useful to change the defaults on time ranges to show. Currently, it is one month, but probably it would be better to have a default on 1 or 2 days, to free that argo-server list workflows. This, together with the flags "show archived" that you are commenting, would help a lot.

@Guillermogsjc Thanks for the suggestion. However, we still need to make the same list query in the backend and then filter in the front-end.

@jessesuen
Copy link
Member

jessesuen commented Oct 24, 2023

With the unified workflow list API for both live + archived workflows we have to do the following:

  1. we need all workflows in the cluster because kubernetes API server does not support sorting by creation timestamp, but argo-server does.
  2. we should only query X workflows from the archive, where X is the requested page size. The underlying database does support filtering and sorting, so this is efficient. The fact that we query everything from archive is nonsensical.

Number 1 is a challenge because performing LIST all workflows for every request will be taxing on K8s API server as users use the UI. Listing all workflows for the purposes of only showing 20 is extremely inefficient. To get around this, I propose a technique we use in Argo CD:

Since LIST all workflows is the requirement and also the problem, we can use an informer cache on Argo API server to avoid the additional LIST calls to k8s and have all workflows ready to be returned by API server from in-memory informer cache. When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.

Note that this does balloon the memory requirements of argo-server because of the informer cache. But I feel this is manageable with archive policy and gc. And the benefits of a unified live+archive API, as well as reducing load on K8s API server, outweigh the extra memory requirements of argo-server.

If we are worried about expensive processing / sorting of the results of List() calls we make against the informer cache, we could consider maintaining our own ordered workflow list (by creationTimestamp) that is automatically updated with UpdateFunc, AddFunc, DeleteFunc registered to the informer. But I consider this an optimization.

@Guillermogsjc
Copy link

Guillermogsjc commented Oct 24, 2023

a daily partition label on etcd for the workflow objects would be a mad idea? `

workflows.argoproj.io/daily_index: {{workflow.creationTimestamp.%Y-%m-%d}}

Partitioning indexes to allow performant queries on database objects, is often the way to go to avoid monolithic queries that are devastating, as the one you comment on against k8s API/etcd, that does not support by creation timestamp sort and limit.

By having a daily partition label you would be able to build the interested partitions to filter on the Argo server against k8s API.

Anyway, the problem needs to be on that monolithic query against backend database, if it is bringing everything instead of filtering, the load can be enormous into index page.

In our case at least ( following the documentation recommended good practice of short TTLs for workflow objects), where we have tons and tons of archived workflows on postgresql and only last 6 hours workflows from etcd, it is clear that the described full query against database backend is the killer.

@sunyeongchoi
Copy link
Member

Thank you so much for so many people suggest good ideas.

First, I will start with optimizing Archived Workflows first.

we should only query X workflows from the archive, where X is the requested page size. The underlying database does support filtering and sorting, so this is efficient. The fact that we query everything from archive is nonsensical.

After that I will investigate the informer cache :)

When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.

@terrytangyuan
Copy link
Member

@sunyeongchoi Great! Let me know if you need any help.

@terrytangyuan
Copy link
Member

terrytangyuan commented Apr 29, 2024

@ryancurrah What bulk review? I don’t see any blocking review comments from @agilgur5 in the PR.

@ryancurrah
Copy link
Contributor

Typically the reviewer's approval is required before merging a pull request. I'm a bit confused about why this one was merged without Anton's okay, since he did the review. Would you mind elaborating on that decision?

@terrytangyuan
Copy link
Member

Please correct me if I missed anything:

  1. All comments from @agilgur5 in feat: add sqlite-based memory store for live workflows. Fixes #12025 #12736 were addressed and marked as resolved by @agilgur5.
  2. There were no unresolved comments. No requested changes from reviewers.
  3. The PR was approved by 3 approvers. We actually only need one approval to merge a PR.

If you see any additional issues, feel free to follow up by commenting here or submitting a new issue.

@agilgur5 agilgur5 changed the title ListWorkflows causes server to hang when there are lots of archived workflows 3.5 ListWorkflows causes server to hang when there are lots of archived workflows Apr 29, 2024
@agilgur5
Copy link
Contributor

agilgur5 commented Apr 29, 2024

I was the main analyzer here in the issue, but I haven't had time to give #12736 an in-depth review unfortunately. Getting the approach right here was always my top priority though and I think we ironed the heck out of it.
I didn't see any glaring issues either after the CGO removal, otherwise I would've left a blocking comment.
If other Approver+ folks gave it a thorough look-over, I think it's had sufficient due diligence. If there are follow-up issues or follow-up code review comments, we can follow up on them (see also #12912 (comment) as an example of that). I wouldn't anticipate any substantial changes though; that's why I was trying to iron out the approach after all.

We otherwise don't have firm rules on the process outside of the branch protection requirements (we don't have many Approvers either, see also the Sustainability Effort).

So I think Terry made the right call there (and also gave some more time than strictly necessary for others to take a look too, which I'm pretty sure was intentional). Would be great if I could have taken a deeper look too, but unfortunately we have bottlenecks and this is the top P1, so we all have to use our best available judgment often.

@ryancurrah
Copy link
Contributor

Thank you both for helping me gain a clearer understanding of the current review process.

Joibel added a commit to Joibel/argo-workflows that referenced this issue May 7, 2024
…Fixes argoproj#12025 (argoproj#12736)"

This reverts commit f1ab5aa.

Signed-off-by: Alan Clucas <alan@clucas.org>
@Joibel
Copy link
Member

Joibel commented May 8, 2024

This fix has been reverted in #13018, so re-opening this until a new PR is merged.

@Joibel Joibel reopened this May 8, 2024
agilgur5 pushed a commit that referenced this issue May 27, 2024
…12736)

Signed-off-by: Jiacheng Xu <xjcmaxwellcjx@gmail.com>
Co-authored-by: Anton Gilgur <agilgur5@gmail.com>
(cherry picked from commit f1ab5aa)
agilgur5 pushed a commit that referenced this issue May 27, 2024
…13021)

Signed-off-by: Jiacheng Xu <xjcmaxwellcjx@gmail.com>
Co-authored-by: Anton Gilgur <agilgur5@gmail.com>
(cherry picked from commit 0a096e6)
@agilgur5
Copy link
Contributor

For reference, v3.5.7 has been released with a backport of #13027

@rwong2888
Copy link
Contributor

just upgraded.
seems laggy when moving from different namespaces.
anyone else getting the same?
are you able to replicate @agilgur5 ?

@agilgur5
Copy link
Contributor

agilgur5 commented Jun 4, 2024

We did have a user find some bugs due to the change but haven't been able to root cause it yet: #13140

I'm not sure if that's related to your issue, you'd have to check and would need more details.

are you able to replicate @agilgur5 ?

my test instances don't have a ton of load in general, so no 😅

Pipekit also couldn't repro the issue above

@rwong2888
Copy link
Contributor

I'm not sure it is the same. haven't seen the same crash behaviour as 3.5, it's just slow to respond if I have workflows in multiple namespaces, when using the UI, it is slow to respond when changing the namespace.

@theintz
Copy link

theintz commented Jun 19, 2024

We are running 3.5.7 with a few 1000 archived workflows. The UI is still loading awfully slow, compared to the 3.4 branch. The page loads in 20-30 seconds. How can I verify that this fix is working as intended? What other things can I tinker with to get faster loading times.

@agilgur5
Copy link
Contributor

I'd try 3.5.8 (for everyone here as well), since it includes a pretty important memory corruption fix #13166. Prior to that you could get crashes, which could slow things down substantially, but hard to tell without logs

@rwong2888
Copy link
Contributor

@agilgur5 , I switched to 3.5.8 just now. I still see lag when switching namespaces in argo workflows UI.

@driv
Copy link

driv commented Jun 28, 2024

I'm experiencing the same.

In the logs I see 2 queries being executed that are considered slow. The first count is scanning thousands of rows.

2024/06/28 14:01:18     Session ID:     00001                                                                                                                                                                                                                                         
    Query:          SELECT count(*) as total FROM `argo_archived_workflows` WHERE ((`clustername` = ? AND `instanceid` = ?) AND `namespace` = ? AND not exists (select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_ar
chived_workflows.uid and name = 'workflows.argoproj.io/controller-instanceid') AND exists (select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/phase' and val
ue in ('Error', 'Failed', 'Pending', 'Running')))                                                                                                                                                                                                                                     
    Arguments:      []interface {}{"default", "", "workflows"}                                                                                                                                                                                                                             
    Stack:                                                                                                                                                                                                                                                                            
        fmt.(*pp).handleMethods@/usr/local/go/src/fmt/print.go:673                                                                                                                                                                                                                    
        fmt.(*pp).printArg@/usr/local/go/src/fmt/print.go:756                                                                                                                                                                                                                         
        fmt.(*pp).doPrint@/usr/local/go/src/fmt/print.go:1211                                                                                                                                                                                                                         
        fmt.Append@/usr/local/go/src/fmt/print.go:289                                                                                                                                                                                                                                 
        log.(*Logger).Print.func1@/usr/local/go/src/log/log.go:261                                                                                                                                                                                                                    
        log.(*Logger).output@/usr/local/go/src/log/log.go:238                                                                                                                                                                                                                         
        log.(*Logger).Print@/usr/local/go/src/log/log.go:260                                                                                                                                                                                                                          
        github.com/argoproj/argo-workflows/v3/persist/sqldb.(*workflowArchive).CountWorkflows@/go/src/github.com/argoproj/argo-workflows/persist/sqldb/workflow_archive.go:241                                                                                                        
        github.com/argoproj/argo-workflows/v3/server/workflow.(*workflowServer).ListWorkflows@/go/src/github.com/argoproj/argo-workflows/server/workflow/workflow_server.go:193                                                                                                       
        github.com/argoproj/argo-workflows/v3/pkg/apiclient/workflow._WorkflowService_ListWorkflows_Handler.func1@/go/src/github.com/argoproj/argo-workflows/pkg/apiclient/workflow/workflow.pb.go:1826                                                                               
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.RatelimitUnaryServerInterceptor.func5@/go/src/github.com/argoproj/argo-workflows/util/grpc/interceptor.go:65                                                                               
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25                                                                                     
        github.com/argoproj/argo-workflows/v3/server/auth.(*gatekeeper).UnaryServerInterceptor.func1@/go/src/github.com/argoproj/argo-workflows/server/auth/gatekeeper.go:98                                                                                                          
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25                                                                                     
        github.com/argoproj/argo-workflows/v3/util/grpc.glob..func1@/go/src/github.com/argoproj/argo-workflows/util/grpc/interceptor.go:45                                                                                                                                            
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25                                                                                     
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.PanicLoggerUnaryServerInterceptor.func4@/go/src/github.com/argoproj/argo-workflows/util/grpc/interceptor.go:26                                                                             
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25                                                                                     
        github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/logrus/server_interceptors.go:31                                                                             
        github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25                                                                                     
        github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).UnaryServerInterceptor.func3@/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107                                                                                         
    Error:          upper: slow query                                                                                                                                                                                                                                                 
    Time taken:     16.07711s                                                                                                                                                                                                                                                         
    Context:        context.Background                                                                     

Should this be a new issue? The server does not hang, but it's continuously timing out and almost unusable.

@agilgur5
Copy link
Contributor

agilgur5 commented Jun 29, 2024

Thanks for providing the slow query log! At a glance I don't see anything too off EDIT: nvm the lack of limit does seem sus, see below comment (also I am surprised by the subqueries). Maybe we're missing a namespace index or another index? 🤔 Although that shouldn't have changed between 3.4 and 3.5

Yes can you make a new issue? That would be easier to track, get upvotes, etc especially since the main part of this issue is resolved and remaining pieces are already in separate issues. Can link from here and vice-versa.

@agilgur5
Copy link
Contributor

agilgur5 commented Jun 29, 2024

The first count is scanning thousands of rows.

SELECT count(*)

At a glance I don't see anything too off

Wait a minute, the count(*) has no limit, that's probably the issue in that specific case? 🤔 And it might be more efficient to not use * in this case as well (since status can be huge).

Although that shouldn't have changed between 3.4 and 3.5

That lack of limit might be a 3.5 change too, although I haven't checked the history yet

In the logs I see 2 queries being executed that are considered slow.

What's the second one? I'd still recommend a new issue either way though

@agilgur5
Copy link
Contributor

That lack of limit might be a 3.5 change too, although I haven't checked the history yet

Seems to predate 3.5 and be from 3.4.0-rc1 from #9118. If I had to guess, the Archived Workflows page would've also had this slow query in 3.4 then, it just got more impactful with the merge in 3.5

@agilgur5
Copy link
Contributor

agilgur5 commented Jul 3, 2024

Let's follow-up on any slow queries in #13295.

In general, for any lingering issues in 3.5.8+, please open a new issue with specifics

@argoproj argoproj locked as resolved and limited conversation to collaborators Sep 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.