[Fleet] ActionId for Tag removal completes but the tag is still in the list of tags #144161
@pjbertels Could you give some more info about the order of events here? Which tags were added/removed?
We add the tag test_tag_d54827c0-508d-11ed-8ccc-39cdadc9db68 (test_tag_) to all the agents. When the actionId completes we verify that the tag appears in the list of tags. We then remove the tag, wait for that actionId to complete, and check the list of tags to verify that it has been removed.
An example from a 50K run on 8.5.0-bdb8ff4d: https://apm-ci.elastic.co/job/perf/job/observability-perf-mbp/job/main/191/consoleText
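For readers unfamiliar with the flow being tested, a minimal TypeScript sketch of that scenario is below. The endpoint paths, request fields, credentials, agent selection query, and the `COMPLETE` status value are assumptions for illustration only, not the actual perf-test harness:

```typescript
// Illustrative only: all endpoint paths and body fields below are assumed, not taken from the real test code.
const KIBANA = 'http://localhost:5601';
const HEADERS = {
  'Content-Type': 'application/json',
  'kbn-xsrf': 'true',
  Authorization: 'Basic ' + Buffer.from('elastic:changeme').toString('base64'),
};

async function addTagAndVerify(tag: string): Promise<void> {
  // Kick off the bulk "add tag" action for all agents (the agent selection query is a placeholder).
  const res = await fetch(`${KIBANA}/api/fleet/agents/bulk_update_agent_tags`, {
    method: 'POST',
    headers: HEADERS,
    body: JSON.stringify({ agents: 'fleet-agents.policy_id : *', tagsToAdd: [tag], tagsToRemove: [] }),
  });
  const { actionId } = (await res.json()) as { actionId: string };

  // Poll until the action reports as complete (simplified; a real test would add a timeout).
  let done = false;
  while (!done) {
    await new Promise((resolve) => setTimeout(resolve, 10_000));
    const statusRes = await fetch(`${KIBANA}/api/fleet/agents/action_status`, { headers: HEADERS });
    const { items } = (await statusRes.json()) as { items: Array<{ actionId: string; status: string }> };
    done = items.some((item) => item.actionId === actionId && item.status === 'COMPLETE');
  }

  // Once the actionId has completed, the tag should show up in the tags list.
  const tagsRes = await fetch(`${KIBANA}/api/fleet/agents/tags`, { headers: HEADERS });
  const { items: tags } = (await tagsRes.json()) as { items: string[] };
  console.log(tags.includes(tag) ? `tag ${tag} is present` : `tag ${tag} is MISSING`);
}
```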
@pjbertels I couldn't reproduce this so far. Could you share the admin link to this cluster and the password to log in to Kibana? I would like to check if there are any errors in the Kibana logs. One thing that could be happening is a problem with the ack count calculation: I found an issue locally where the ack count is not correct (it showed less than the real count) and submitted a PR for that.
@pjbertels Could you try to reproduce again with the latest snapshot that includes the linked fix? If my theory is correct, the issue shouldn't be happening again.
Will retest. Based on some checking... I think 8.5.1 is where we want to pick this up.
As discussed, this should rather be tested on 8.6.0.
Closing this for now as fixed. Will reopen if it occurs again.
I retested it with 8.6.0-8cf9e954; the issue is still reproducible.
I could reproduce this on a cloud instance with horde. I found one problem where the retry task keeps retrying the action even after the 3rd retry has failed (Kibana Task Manager retries the task after 5m if it is neither removed nor throwing an error).
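To make the failure mode concrete, here is a minimal TypeScript sketch of the intended behaviour, using a hypothetical `TaskStore` interface standing in for the scheduler (this is not the actual Fleet or Task Manager code):

```typescript
// Hypothetical scheduler interface; the real Kibana Task Manager API is not shown here.
interface TaskStore {
  remove(taskId: string): Promise<void>;
}

const MAX_RETRIES = 3;

// One retry attempt. The scheduler re-runs this task every ~5 minutes until it is removed,
// so the task has to remove itself both on success and once the retry budget is exhausted.
async function runRetryTask(
  store: TaskStore,
  taskId: string,
  retryCount: number,
  retryAction: () => Promise<void>
): Promise<void> {
  try {
    await retryAction();
    await store.remove(taskId); // success: stop future reschedules
  } catch (err) {
    if (retryCount >= MAX_RETRIES) {
      // Without this branch the task stays scheduled and keeps retrying indefinitely,
      // which is the behaviour described in this thread.
      console.warn(`Stopping after retry #${retryCount}`, err);
      await store.remove(taskId);
      return;
    }
    // Otherwise leave the task scheduled so it is retried on the next cycle.
  }
}
```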
## Summary

Related to #144161

Found that on a bulk update tags task failure, the task didn't stop after 3 retries (it should be over in less than a minute); the retries kept happening for 2 hours. This change removes the retry task once 3 retries are reached.

Also testing in a cloud deployment to see whether the tags error can be reproduced with this fix. I could reproduce the reported error locally, and it goes away with this fix.

To verify:
- Add at least 50k agents with the `create_agents` script in the kibana repo
- Open Kibana, select the 50k agents, and open Actions / Add tags
- Within a few seconds: add 2 new tags, and remove one of them
- Wait about 30s; the agents should reflect the changes
- Check the logs to see that the tasks are removed once the 3rd retry is reached or successful
- Check that there are no more running tasks. Any running task can be found in Kibana Console by running this query: `GET .kibana_task_manager/_search?q=task.taskType:"fleet:update_agent_tags:retry"` (a scripted version of this check follows below)

Locally simulated an error to test that the retry (and check) task is removed:

```
[2022-12-07T15:52:16.415+01:00][ERROR][plugins.fleet] Retry #3 of task fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b failed: failing task
[2022-12-07T15:52:16.416+01:00][WARN ][plugins.fleet] Stopping after 3rd retry. Error: failing task
[2022-12-07T15:52:16.416+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:848984ab-c11d-4ebe-8d1f-606143dd656b
[2022-12-07T15:52:16.416+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:848984ab-c11d-4ebe-8d1f-606143dd656b
```
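The leftover-task check from the verification steps can also be scripted. A minimal sketch using the Elasticsearch JS client, assuming a local cluster and placeholder credentials:

```typescript
import { Client } from '@elastic/elasticsearch';

// Placeholder connection details; point this at the cluster backing your Kibana instance.
const client = new Client({
  node: 'http://localhost:9200',
  auth: { username: 'elastic', password: 'changeme' },
});

async function listLeftoverRetryTasks(): Promise<void> {
  // Same query-string search as the Kibana Console check above.
  const result = await client.search({
    index: '.kibana_task_manager',
    q: 'task.taskType:"fleet:update_agent_tags:retry"',
    size: 100,
  });
  const hits = result.hits.hits;
  console.log(`${hits.length} retry task(s) still scheduled`);
  for (const hit of hits) {
    console.log(hit._id);
  }
}

listLeftoverRetryTasks().catch(console.error);
```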
Merged a bugfix, though I can still reproduce the issue when I add/remove multiple tags quickly; it happens less frequently now. Will test more to see if I can fix the remaining issue.
# Backport

This will backport the following commits from `main` to `8.6`:
- [[Fleet] cancel tasks when 3rd retry failed (#147190)](#147190)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Retested this against the most recent RC build of 8.6.0 (2207fc20). Adding or removing tags on 75k agents took a very long time (>2 hours), so I'll keep this ticket open. @juliaElastic
The fix is not included in this build (BC6); we have to wait for BC7 or check the latest SNAPSHOT build.
As discussed with @ablnk, the issue is still reproducible on the latest snapshot build. I think we need a different implementation here. Previously I changed the logic to abort on version conflict, because that fixed the concurrent update tags scenario. Do we want to go down this path? I think there are plans to change how the https://admin.found.no/deployments/828e5ad0562547d1bf29647c925b35fd/kibana
I defer to Josh on the long-term plan here, but I believe that a retry mechanism might solve most of our issues for now without having to change the entire logic we rely on today.
Just an FYI: I'm seeing this in the latest 8.6.0 BC (8.6.0-75d87829), on both add and remove with 5000 agents, and the issue is that I never get back actionIds on add or remove.
I am working on a fix here that solves the version conflict errors that Andrei reported a few days ago on 75k agents. @pjbertels can you share the admin link/kibana logs where you experienced this issue? It might be the same root cause as Andrei reported. EDIT: I could reproduce the issue with 5k agents; there can be conflicts, and the logic currently doesn't retry on fewer than 10k agents. I can change this to retry update tags even on a smaller agent count.
## Summary

Fixes #144161

As discussed [here](#144161 (comment)), the existing implementation of update tags doesn't work well with real agents, as there are many conflicts with checkin, even when trying to add/remove one tag. Refactored the logic to make retries more efficient:
- Instead of aborting the whole bulk action on conflicts, changed the conflict strategy to 'proceed'. This means, if an action of 50k agents has 1k conflicts, not all 50k are retried, only the 1k conflicts; this makes it less likely to conflict on retry.
- Because of this, on retry we have to know which agents don't yet have the tag added/removed. For this, added an additional filter to the `updateByQuery` request (see the sketch after this comment). The filter is only added if there is exactly one `tagsToAdd` or one `tagsToRemove`. This is the main use case from the UI, and handling other cases would complicate the logic more (each additional tag to add/remove would result in another OR query, which would match more agents, making conflicts more likely).
- Added this additional query on the initial request as well (not only retries) to save on unnecessary work, e.g. if the user tries to add a tag on 50k agents but 48k already have it, it is enough to update the remaining 2k agents.
- This improvement has the effect that 'Agent activity' shows the real updated agent count, not the total selected. I think this is not really a problem for update tags.
- Cleaned up some of the UI logic, because the conflicts are now fully handled on the backend.
- Locally I couldn't reproduce the conflict with agent checkins, even with 1k horde agents. I'll try to test in cloud with more real agents.

To verify:
- Enroll 50k agents (I used 50k with the create_agents script, and 1k with horde). Enroll 50k with horde if possible.
- Select all in the UI and try to add/remove one or more tags
- Expect the changes to propagate quickly (up to 1m). It might take a few refreshes to see the result on the agent list and tags list, because the UI polls the agents every 30s. It is expected that the tags list temporarily shows incorrect data because the action is async.

E.g. removed `test3` tag and added `add` tag quickly:
<img width="1776" alt="image" src="https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png">
<img width="422" alt="image" src="https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png">

The logs show how many `version_conflicts` there were, and that the number decreased with retries:

```
[2022-12-15T10:32:12.937+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:16.477+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet] {"took":9886,"timed_out":false,"total":52000,"updated":41143,"deleted":0,"batches":52,"version_conflicts":10857,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet] {"took":9518,"timed_out":false,"total":52000,"updated":25755,"deleted":0,"batches":52,"version_conflicts":26245,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet] Action failed: version conflict of 10857 agents
[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:27.462+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet] Action failed: version conflict of 26245 agents
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:31.480+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:31.485+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet] {"took":2347,"timed_out":false,"total":10857,"updated":9857,"deleted":0,"batches":11,"version_conflicts":1000,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:34.556+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 29e9da70-7194-4e52-8004-2c1b19f6dfd5, total agents: 52000
[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:34.560+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet] Retry #1 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de failed: version conflict of 1000 agents
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet] Retrying in task: fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
{"took":5509,"timed_out":false,"total":26245,"updated":26245,"deleted":0,"batches":27,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:42.722+01:00][INFO ][plugins.fleet] processed 26245 agents, took 5509ms
[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5
[2022-12-15T10:32:46.705+01:00][INFO ][plugins.fleet] Running bulk action retry task
[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2 of task fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Running action asynchronously, actionId: 90acd541-19ac-4738-b3d3-db32789233de, total agents: 52000
[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed bulk action retry task
[2022-12-15T10:32:46.711+01:00][INFO ][plugins.fleet] Scheduling task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet] {"took":379,"timed_out":false,"total":1000,"updated":1000,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] processed 1000 agents, took 379ms
[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing task fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de
```

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
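To make the conflict handling concrete, here is a minimal sketch of a filtered `updateByQuery` with `conflicts: 'proceed'`, in the spirit of the change described above. The index name, field layout, and painless script are illustrative assumptions, not the actual Fleet implementation:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // placeholder connection

// Adds a single tag to every agent document that does not carry it yet.
// With conflicts: 'proceed', documents that changed underneath us (e.g. due to agent
// checkins) are only counted in version_conflicts instead of failing the whole request,
// so a retry only has to touch the remaining documents.
async function addTagToAgents(tag: string): Promise<number> {
  const res = await client.updateByQuery({
    index: 'agents', // assumed index name, not the real Fleet agents index
    conflicts: 'proceed',
    refresh: true,
    query: {
      bool: {
        // Extra filter: skip agents that already have the tag, both on the first attempt
        // and on retries, so each pass shrinks the set that can still conflict.
        must_not: [{ term: { tags: tag } }],
      },
    },
    script: {
      lang: 'painless',
      source:
        'if (ctx._source.tags == null) { ctx._source.tags = [params.tag] } ' +
        'else if (!ctx._source.tags.contains(params.tag)) { ctx._source.tags.add(params.tag) }',
      params: { tag },
    },
  });
  console.log(`updated=${res.updated} version_conflicts=${res.version_conflicts}`);
  return res.version_conflicts ?? 0;
}
```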
Merged the latest fix and backported to 8.6. I think it will be available in the BC on Dec 27, as we missed today's build: https://github.com/elastic/dev/issues/2162
# Backport

This will backport the following commits from `main` to `8.6`:
- [[Fleet] refactored bulk update tags retry (#147594)](#147594)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Retested against the latest snapshot 8.7.0-19f3018; no changes observed. Will recheck when the next build is available. |
It looks like this morning's snapshot failed which would explain this. Let's wait until a new build is available. |
Retested in the latest 8.6.0 BC 0410b9b5. From the logs: |
Yes, I expected this to happen eventually. The logic retries 3 times on version conflict and updates the remaining agents on each pass. Some conflicts can still occur on the last retry. One thing we can do is increase the number of retries to, say, 5, though it is still not guaranteed that there will be no conflicts left after n retries. At least the improvement works, so most of the agents are updated successfully. |
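For illustration, a minimal sketch of the bounded-retry idea described above, in TypeScript. The names (`updateTagsOnce`, `MAX_RETRIES`) are hypothetical; this is not the real Fleet task implementation, which runs through Kibana's task manager.

```ts
// Hedged sketch of a bounded retry loop for bulk tag updates.
const MAX_RETRIES = 5; // raised from 3, as proposed in this thread

interface PassResult {
  conflictedAgentIds: string[]; // agents whose doc hit a version conflict
}

async function updateTagsWithRetry(
  updateTagsOnce: (agentIds: string[]) => Promise<PassResult>,
  agentIds: string[]
): Promise<string[]> {
  let remaining = agentIds;
  // Initial pass plus up to MAX_RETRIES retry passes; each pass only targets
  // the agents that still failed, so the remaining set shrinks every time.
  for (let pass = 0; pass <= MAX_RETRIES && remaining.length > 0; pass++) {
    const result = await updateTagsOnce(remaining);
    remaining = result.conflictedAgentIds;
  }
  // Whatever is left after the last pass stays un-updated, which is why more
  // retries reduce, but cannot fully rule out, residual conflicts.
  return remaining;
}
```

With the retry ceiling raised from 3 to 5, the remaining set gets more chances to shrink to zero, but as noted above there is still no hard guarantee.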
@juliaElastic to test that theory, I conducted another test on the QA environment, where the check-in time was increased to 30 minutes. I can confirm that helped for the case of adding/removing a single tag on 100k agents. When adding/removing multiple tags at once, I still see the problem that tags are not applied to some of the agents. However, it seems like increasing the retry attempts could really help: on each retry, the number of agents to which the tag was not applied is reduced, and with more retries the tag would perhaps be applied to all agents. |
## Summary

Increase the retry count to 5 to help retry on agent doc version conflicts. It looks like 3 retries are not enough for a 100k-agent tags update. #144161

This can be tested on an ECE high-memory instance with 100k horde agents.

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
## Summary

Increase the retry count to 5 to help retry on agent doc version conflicts. It looks like 3 retries are not enough for a 100k-agent tags update. elastic#144161

This can be tested on an ECE high-memory instance with 100k horde agents.

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

(cherry picked from commit a9ac5ae)
# Backport

This will backport the following commits from `main` to `8.6`:
- [increase bulk action retry to 5 (#148169)](#148169)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Found a UI bug, reported here. It causes bulk tag updates not to work with the default status filters. |
Closing this issue in order to keep only: #148233 |
@ablnk please test with the latest snapshot whether the increase to 5 retries helped. |
@juliaElastic will retest as soon as #148233 is fixed. |
image_tag:8.5.0-792499b4-SNAPSHOT
When doing a 50k run and adding and removing tags, we found that the actionId for the tag removal completes, but the tag is still in the list of tags, which causes the test to fail.
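For reference, a rough sketch of the verification flow against the Kibana Fleet HTTP API. The endpoint paths and payload shapes below are written from memory and may differ between versions — treat them as assumptions rather than a reference implementation; the actual perf test uses its own tooling.

```ts
// Rough reproduction sketch against the Kibana Fleet HTTP API.
// KIBANA_URL and API_KEY are placeholders; endpoints are assumed, not verified.
const KIBANA_URL = 'http://localhost:5601';
const API_KEY = '<fleet api key>';
const HEADERS = {
  'kbn-xsrf': 'true',
  'Content-Type': 'application/json',
  Authorization: `ApiKey ${API_KEY}`,
};

// Ask Fleet to remove the tag from every agent that currently carries it.
async function removeTagFromAll(tag: string): Promise<void> {
  await fetch(`${KIBANA_URL}/api/fleet/agents/bulk_update_agent_tags`, {
    method: 'POST',
    headers: HEADERS,
    body: JSON.stringify({ agents: `tags:"${tag}"`, tagsToRemove: [tag] }),
  });
}

// Count agents still carrying the tag; after the removal action completes,
// this should eventually drop to 0 if the change really reached every agent.
async function countAgentsWithTag(tag: string): Promise<number> {
  const res = await fetch(
    `${KIBANA_URL}/api/fleet/agents?kuery=${encodeURIComponent(`tags:"${tag}"`)}`,
    { headers: HEADERS }
  );
  const body = (await res.json()) as { total: number };
  return body.total;
}
```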