3.5: Improve Archived API/DB read performance #13295
I tried to trace it, and it seems the query below is the culprit. According to Query Performance Insight (Azure Postgres Flexible Server), this query's mean execution time is 9.304 s, which explains the overall API time of 12-13 seconds:

```sql
select
    name,
    namespace,
    uid,
    phase,
    startedat,
    finishedat,
    coalesce((workflow::json)->'metadata'->>'labels', '{}') as labels,
    coalesce((workflow::json)->'metadata'->>'annotations', '{}') as annotations,
    coalesce((workflow::json)->'status'->>'progress', '') as progress,
    coalesce((workflow::json)->'metadata'->>'creationTimestamp', '') as creationtimestamp,
    (workflow::json)->'spec'->>'suspend' as suspend,
    coalesce((workflow::json)->'status'->>'message', '') as message,
    coalesce((workflow::json)->'status'->>'estimatedDuration', '0') as estimatedduration,
    coalesce((workflow::json)->'status'->>'resourcesDuration', '{}') as resourcesduration
from
    "argo_archived_workflows"
where
    (("clustername" = $1
        and "namespace" = $2
        and "instanceid" = $3)
    and "namespace" = $4
    and not exists (
        select 1
        from argo_archived_workflows_labels
        where clustername = argo_archived_workflows.clustername
            and uid = argo_archived_workflows.uid
            and name = 'workflows.argoproj.io/controller-instanceid'))
order by
    "startedat" desc
limit 50;
```
These JSON extraction functions are the culprit:

```sql
coalesce((workflow::json)->'metadata'->>'labels', '{}') as labels,
coalesce((workflow::json)->'metadata'->>'annotations', '{}') as annotations,
coalesce((workflow::json)->'status'->>'progress', '') as progress,
coalesce((workflow::json)->'metadata'->>'creationTimestamp', '') as creationtimestamp,
(workflow::json)->'spec'->>'suspend' as suspend,
coalesce((workflow::json)->'status'->>'message', '') as message,
coalesce((workflow::json)->'status'->>'estimatedDuration', '0') as estimatedduration,
coalesce((workflow::json)->'status'->>'resourcesDuration', '{}') as resourcesduration
```
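A hedged way to confirm this (not from the thread; the filter values are placeholders, and only two of the JSON fields are shown for brevity) is to compare EXPLAIN ANALYZE output for the query with and without the `workflow::json` expressions:

```sql
-- Baseline: same filter, sort, and limit as the slow query, including JSON extraction.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, namespace, uid, phase, startedat, finishedat,
       coalesce((workflow::json)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::json)->'status'->>'message', '')    AS message
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND namespace = 'argo'
  AND instanceid = ''
ORDER BY startedat DESC
LIMIT 50;

-- Comparison: identical filter, sort, and limit, but only scalar columns selected.
-- If parsing the JSON payload is the bottleneck, this variant should be dramatically faster.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND namespace = 'argo'
  AND instanceid = ''
ORDER BY startedat DESC
LIMIT 50;
```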
I've been testing the queries from the logs. The execution plans are different: on MySQL 8 it executes more than two orders of magnitude faster. I'm attaching the queries' EXPLAIN output; MySQL 8 is able to use a range scan.
Hmm, I am using Postgres Flexible Server, so I'm not sure if it's comparable.

Table DDL:

```sql
CREATE TABLE argo_archived_workflows (
uid varchar(128) NOT NULL,
"name" varchar(256) NOT NULL,
phase varchar(25) NOT NULL,
"namespace" varchar(256) NOT NULL,
workflow json NOT NULL,
startedat timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
finishedat timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
clustername varchar(64) NOT NULL,
instanceid varchar(64) NOT NULL,
labels text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'labels') stored,
annotations text GENERATED ALWAYS as ( (workflow::json)->'metadata'->>'annotations') stored,
progress text GENERATED ALWAYS AS ((workflow::json)->'status'->>'progress') stored,
creationtimestamp text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'creationTimestamp') stored,
suspend text GENERATED ALWAYS AS ((workflow::json)->'spec'->>'suspend') stored,
message text GENERATED ALWAYS AS ((workflow::json)->'status'->>'message') stored,
estimatedduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'estimatedDuration') stored,
resourcesduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'resourcesDuration') stored,
CONSTRAINT argo_archived_workflows_pkey PRIMARY KEY (clustername, uid)
);
CREATE INDEX argo_archived_workflows_i1 ON argo_archived_workflows USING btree (clustername, instanceid, namespace);
CREATE INDEX argo_archived_workflows_i2 ON argo_archived_workflows USING btree (clustername, instanceid, finishedat);
CREATE INDEX argo_archived_workflows_i3 ON argo_archived_workflows USING btree (clustername, instanceid, name);
CREATE INDEX argo_archived_workflows_i4 ON argo_archived_workflows USING btree (startedat);
```

Revised query:

```sql
select
name,
namespace,
uid,
phase,
startedat,
finishedat,
coalesce(labels,'{}') as labels,
coalesce(annotations,'{}') as annotations,
coalesce(progress,'') as progress,
coalesce(creationtimestamp,'') as creationtimestamp,
suspend,
coalesce(message,'') as message,
coalesce(estimatedduration,'0') as estimatedduration,
coalesce(resourcesduration,'{}') as resourcesduration
from
argo_archived_workflows
where
(("clustername" = $1
and "namespace" = $2
and "instanceid" = $3)
and "namespace" = $4
and not exists (
select
1
from
argo_archived_workflows_labels
where
clustername = 'default'
and uid = argo_archived_workflows.uid
and name = 'workflows.argoproj.io/controller-instanceid'))
order by startedat desc
limit 45;
```
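For anyone wanting to try the same generated-columns approach on an already-populated table without recreating it, a minimal sketch follows (assuming PostgreSQL 12+, where STORED generated columns are available; this is not part of the official Argo schema, and only a few of the columns are shown):

```sql
-- Sketch: retrofit stored generated columns onto the existing table.
-- Each ADD COLUMN forces a table rewrite, so on a large archive this is
-- best done in a maintenance window.
ALTER TABLE argo_archived_workflows
    ADD COLUMN labels      text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'labels') STORED,
    ADD COLUMN annotations text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'annotations') STORED,
    ADD COLUMN progress    text GENERATED ALWAYS AS ((workflow::json)->'status'->>'progress') STORED;
```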
Given that the query has a LIMIT 50 and those fields are not used in filtering or sorting, I doubt they are responsible for the long query duration. Can you share an EXPLAIN?
In the beginning I thought that too, but trying different strategies to switch to a different explain plan did not help at all. Then I commented out those JSON function columns in the SELECT, and suddenly the query was super fast. Even though they are not part of sorting and filtering, these JSON payloads are very large (if the workflow is large; at least that's the case in my situation).
Yea they can be very large, though I'm surprised it takes that long, since it's not part of the filter. I'm wondering if the engine is extracting the JSON multiple times, once for each field?
We don't currently filter on those, so it's a little unnecessary as individual columns, but otherwise that makes sense to me if it's compatible with MySQL as well. I'm wondering if we can just use a materialized view instead for similar effect?
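As a rough illustration of that idea, a materialized view could pre-extract the fields the list API reads. The view name here is hypothetical and not part of the project; the main trade-off is that it only reflects data as of the last refresh:

```sql
-- Hypothetical materialized view that pre-extracts the list fields.
CREATE MATERIALIZED VIEW argo_archived_workflows_list AS
SELECT clustername,
       uid,
       name,
       namespace,
       phase,
       startedat,
       finishedat,
       coalesce((workflow::json)->'metadata'->>'labels', '{}')      AS labels,
       coalesce((workflow::json)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::json)->'status'->>'progress', '')        AS progress
FROM argo_archived_workflows;

-- A unique index is required to refresh CONCURRENTLY (readers are not blocked),
-- but newly archived workflows only appear after the next refresh.
CREATE UNIQUE INDEX ON argo_archived_workflows_list (clustername, uid);
REFRESH MATERIALIZED VIEW CONCURRENTLY argo_archived_workflows_list;
```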
We are experiencing the same problem using PostgreSQL. There are nearly 1 million archived workflows in the database, which are quite large, making the query extremely expensive to run. As a result, the UI becomes unusable due to its slow response time. I have always wondered why the controller does not create a JSONB column to store the workflow information.
Would it be possible to change the workflow column type to JSONB?
I think this is a good point: currently, for Postgres, we use json but not jsonb. In the archived workflow case (heavy read), I think it makes sense to use JSONB, since:
https://www.postgresql.org/docs/current/datatype-json.html
@agilgur5 To your point, per the quote above ("The json data type stores an exact copy of the input text, which processing functions must reparse on each execution"), I think the current implementation is most likely reparsing the JSON for each field.
We don't need indexes on it currently as the JSON is not used in filtering.
Otherwise, it does sound like JSONB is the better choice for our read queries. This is only supported for Postgres though, and we'd have to add the min version of Postgres required for JSONB to the docs. I imagine lack of MySQL support and legacy code is the reason but 🤷 Is it straightforward to write a migration to JSONB? We'd also still have to solve this for MySQL, so that alone wouldn't be enough. A materialized view or generated columns are probably needed, and compressed nodes (#13313) might help too.
That would explain why it may have gotten slower. That is odd to me that it's reparsed (as the engine should be able to store it in memory and then reuse it for the query), but materialized views / generated columns should solve that relatively cleanly, assuming that writes are not heavily slowed down (e.g. if it reparses each field during derivation/generation too) |
I would say yes. My only concern is how long it might take to convert the workflow column for a large number of rows.

```sql
ALTER TABLE argo_archived_workflows
ALTER COLUMN workflow
SET DATA TYPE JSONB
USING workflow::JSONB;
```
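If a single in-place ALTER is too disruptive on a large table (it takes an exclusive lock and rewrites every row), a hedged alternative sketch is to backfill a separate JSONB column in batches and swap afterwards. All names and the batch size below are illustrative, not from the project:

```sql
-- Sketch of a lower-lock migration: add a nullable JSONB column, backfill it
-- in chunks, then swap. Repeat the UPDATE until it affects 0 rows.
ALTER TABLE argo_archived_workflows ADD COLUMN workflow_jsonb jsonb;

UPDATE argo_archived_workflows
SET workflow_jsonb = workflow::jsonb
WHERE (clustername, uid) IN (
    SELECT clustername, uid
    FROM argo_archived_workflows
    WHERE workflow_jsonb IS NULL
    LIMIT 10000
);

-- Final swap in a brief maintenance window; anything depending on the old
-- column (views, generated columns, application queries) must be handled first.
-- ALTER TABLE argo_archived_workflows DROP COLUMN workflow;
-- ALTER TABLE argo_archived_workflows RENAME COLUMN workflow_jsonb TO workflow;
```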
Does changing to type JSONB solve the performance issue completely? Has someone validated it?
Ok thanks for clarifying. I imagine that without a limit any query is going to be slow. JSONB should still be faster than the JSON variant regardless though.
Is there a separate query for JSONB specifically? I actually don't know
Yea, the generated-columns approach originally suggested by the OP (and then by me above) is more or less equivalent to that and was significantly faster (an order of magnitude or two) per the testing in the OP's comment.
The current implementation appears to cause Postgres to always unmarshal the workflow JSON payload for all the records in the table. By adopting a subquery approach, we are able to optimise the query from a runtime of 11495.734 ms to 44.713 ms. The data size is about 417481 rows in argo_archived_workflows and 1794624 rows in argo_archived_workflows_labels. This change is backward compatible and has been tested on our production env (using PostgreSQL). Related issue: argoproj#13295. Query change example:

Previous:

```sql
SELECT name, namespace, UID, phase, startedat, finishedat,
  coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
  coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
  coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
  coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
  (workflow::JSON)->'spec'->>'suspend' AS suspend,
  coalesce((workflow::JSON)->'status'->>'message', '') AS message,
  coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
  coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE (("clustername" = 'default' AND "instanceid" = '')
  AND "namespace" = 'argo-map'
  AND EXISTS (SELECT 1 FROM argo_archived_workflows_labels
              WHERE clustername = argo_archived_workflows.clustername
                AND UID = argo_archived_workflows.uid
                AND name = 'workflows.argoproj.io/phase'
                AND value = 'Succeeded')
  AND EXISTS (SELECT 1 FROM argo_archived_workflows_labels
              WHERE clustername = argo_archived_workflows.clustername
                AND UID = argo_archived_workflows.uid
                AND name = 'workflows.argoproj.io/workflow-template'
                AND value = 'mapping1-pipeline-template-with-nfs'))
ORDER BY "startedat" DESC
LIMIT 1;
```

Now:

```sql
SELECT name, namespace, UID, phase, startedat, finishedat,
  coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
  coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
  coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
  coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
  (workflow::JSON)->'spec'->>'suspend' AS suspend,
  coalesce((workflow::JSON)->'status'->>'message', '') AS message,
  coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
  coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE "clustername" = 'default'
  AND UID IN (SELECT UID FROM "argo_archived_workflows"
              WHERE (("clustername" = 'default' AND "instanceid" = '')
                AND "namespace" = 'argo-map'
                AND EXISTS (SELECT 1 FROM argo_archived_workflows_labels
                            WHERE clustername = argo_archived_workflows.clustername
                              AND UID = argo_archived_workflows.uid
                              AND name = 'workflows.argoproj.io/phase'
                              AND value = 'Succeeded')
                AND EXISTS (SELECT 1 FROM argo_archived_workflows_labels
                            WHERE clustername = argo_archived_workflows.clustername
                              AND UID = argo_archived_workflows.uid
                              AND name = 'workflows.argoproj.io/workflow-template'
                              AND value = 'mapping1-pipeline-template-with-nfs'))
              ORDER BY "startedat" DESC
              LIMIT 1);
```
So I've been thinking about this after discussing it in the Aug 6th Contributor Meeting with @sarabala1979. All of the options we discussed to support this had trade-offs, so there's no optimal solution. I was thinking about the behavior of the Archived Workflows API prior to #11121, which takes me back to my own comment, #11121 (comment). The fix to that comment was #12912, which added the […]. We could remove those fields entirely from the […]. But perhaps #13566 has the root-cause fix, if all the errors are worked out, and we won't have to think about any more options.
Not sure if I fully understand this comment. We search archived workflows by labels frequently because we archive workflows almost immediately. Would that mean we could not do that anymore?
I did forget a piece in my initial comment, which was that labels actually have their own table already, so it would only be other forms of information in the JSON blob that would not be filterable, like annotations, […]
For other folks here who haven't been following the PR, I pushed up […]. Please try it out and comment here with your query-time improvement results (before/after) and DB type and version (e.g. MySQL vXXX)! Number of rows (e.g. 45000) and resources given (e.g. 2 CPU, 4GB memory) would also be helpful for proportional estimation.
From @Danny5487401 in #13563 (comment)
It sounds like the […]
@agilgur5, before I test, will there be any breaking changes (on the DB side)? Will I be able to roll back?
No breaking changes; that's what makes it a viable choice for a patch. It's just a change to the list query itself. You can roll back without issue, but of course I would try it in staging first as a regular precaution. If you have DB access, you can also test by running the query yourself: it's 1. above for MySQL, and #13566's opening comment has the Postgres variant.
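For reference, a minimal sketch of timing the list query from psql (Postgres only; the filter values are placeholders to be replaced with your own, and you can paste in whichever query variant you want to compare):

```sql
-- In psql, \timing prints client-side elapsed time for each statement.
\timing on

-- Server-side plan and timing for the query under test.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND namespace = 'argo'
  AND instanceid = ''
ORDER BY startedat DESC
LIMIT 50;
```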
@agilgur5 the image name is […]; do I also need to replace the controller?
No, you only need to replace the Server, since this query happens on the Server.
I can confirm this change reduces query times. Loading time in the UI when filtering by labels was reduced by 94.44%. Query time before: 18 seconds […]
Added a new issue for further optimizations to this query in #13601 |
@agilgur5, sorry for the delay from my side in testing, but even with […] I am not seeing any improvement in API performance.
It's already released in 3.5.11.
But you'd really have to detail why you're not seeing any improvement in API performance when other people are. Did you run the image for your Server properly? What kinds of queries are you using against the API and seeing in the DB? etc. Per #13601 (comment), unpaginated queries can still be slow (indexing doesn't help that, of course).
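For context, a hedged illustration of what a paginated listing looks like at the SQL level (illustrative values): the LIMIT/OFFSET bounds the number of rows returned, which (with the subquery fix above) also bounds how much JSON has to be extracted:

```sql
-- Illustrative paginated listing: page 2, 50 rows per page.
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
ORDER BY startedat DESC
LIMIT 50 OFFSET 50;
```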
@agilgur5, yes, I applied the image ([…]
edited by agilgur5:
Pre-requisites

- I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

What happened / what did you expect to happen?
Currently, when a large number of workflows are running, the Workflows view of the Argo Workflows UI takes up to 12 seconds to load. I can't find any way to optimise it further. CPU and memory usage of the Server pod are normal. I am using Postgres to archive the workflows, and the CPU and memory usage of the database are normal as well; I don't see spikes anywhere. So it looks like it's not resource contention but something else. What can be done to improve the load time?
Version
3.5.0, 3.5.6, 3.5.8
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
It's the UI load time (Workflows view).