Fleet Rest API /agents/current_upgrades with very large numbers of agents #139404

Closed
pjbertels opened this issue Aug 24, 2022 · 9 comments
Labels
bug Fixes for quality problems that affect the customer experience Project:FleetScaling Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@pjbertels

Kibana version:
8.4

Elasticsearch version:

Server OS version:

Browser version:

Browser OS version:

Original install method (e.g. download page, yum, from source, etc.):

Describe the bug:
The way we check whether upgrades are complete (Fleet REST API /agents/current_upgrades) doesn't work well with large numbers of agents. The issue seems to be a combination of batches of 10,000 finishing and the API reporting that the upgrade is done as soon as the last upgrades in the batch are scheduled, instead of after they complete.

Steps to reproduce:
We have automation to reproduce this issue. Get in touch with us via #fleet-scaling

  1. Run an upgrade on a large number of agents and poll the /agents/current_upgrades to see when it says you are done.
  2. Use /agents_status to see if a bunch of agents are still updating after the /agents/current_upgrades response is empty.
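
The two steps above can be sketched as a small polling helper. This is a minimal sketch, not the actual automation: the function names and the injected callables (`fetch_current_upgrades`, `fetch_updating_count`) are hypothetical stand-ins for whatever HTTP calls the harness makes against /agents/current_upgrades and /agents_status.

```python
import time
from typing import Callable, List


def poll_until_empty(
    fetch_current_upgrades: Callable[[], List[dict]],
    interval_s: float = 10.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the current-upgrades response is empty; return the poll count."""
    for poll in range(1, max_polls + 1):
        if not fetch_current_upgrades():
            return poll
        sleep(interval_s)
    raise TimeoutError("upgrade still reported as in progress")


def upgrade_really_finished(
    fetch_current_upgrades: Callable[[], List[dict]],
    fetch_updating_count: Callable[[], int],
    **poll_kwargs,
) -> bool:
    """Cross-check the empty response against the number of agents still updating."""
    poll_until_empty(fetch_current_upgrades, **poll_kwargs)
    # Hypothetical: fetch_updating_count wraps the status call and returns
    # the "updating" counter. A non-zero value here reproduces the bug.
    return fetch_updating_count() == 0
```

If `upgrade_really_finished` returns False, the polled endpoint went empty while agents were still updating, which is the behavior reported here.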

Expected behavior:
Until all agents are upgraded, the REST call should report that the upgrade is in progress.

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

These are logs from the automation polling current upgrades, which demonstrate the issue. Unfortunately, this set of logs doesn't show /agents_status reporting ~6000 agents still updating; that was discovered after a number of runs with debug code added to find the root cause of the issue.

[14:35:39] WARNING -------------------------------------------------------------------------------------------- harness.py:133
WARNING label=harness_0_step_6_iteration_1 description=test step 6: Upgrade drones message=executing harness.py:134
WARNING -------------------------------------------------------------------------------------------- harness.py:135
INFO Rollout duration is 600 test_perf02.py:206
[14:35:40] INFO FleetAgentStatus(total=75000, inactive=0, online=73532, error=0, offline=1468, updating=0, other=0, events=0, doc_id=None, run_id=None, test_perf02.py:210
timestamp=None, kuery='local_metadata.elastic.agent.version : 8.2.0 and local_metadata.elastic.agent.upgradeable : true', cluster_name=None)
[14:37:15] INFO [FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=2660, version='8.2.1', test_perf02.py:215
startTime='2022-08-23T18:36:53.824Z')]
[14:37:26] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=3176, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:37:37] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=3654, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:37:47] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=4011, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:37:58] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=4375, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:38:08] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=4584, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:38:19] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=4604, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:38:29] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=4680, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:38:40] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=4731, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:38:50] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=4812, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:39:01] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=5033, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:39:11] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=5163, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:39:22] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=5301, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:39:32] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=5403, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:39:43] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=5448, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:39:53] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=5470, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:40:04] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=5494, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:40:15] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=5559, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:40:25] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=5662, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:40:36] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=5831, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:40:46] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=6106, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:40:57] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=6447, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:41:07] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=6835, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:41:18] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=7163, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:41:28] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=7645, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:41:39] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=8197, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:41:50] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=8879, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:42:00] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=9758, version='8.2.1', perf_lib.py:146
startTime='2022-08-23T18:36:53.824Z')
[14:42:11] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=11289, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:42:21] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=12868, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:42:32] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=14161, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:42:42] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=14904, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:42:53] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=15670, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:43:03] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=16881, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:43:14] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=18622, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:43:25] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=20518, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:43:35] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=22351, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:43:46] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=23510, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:43:56] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=24645, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:44:07] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=25650, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:44:17] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=26573, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:44:28] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=27542, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:44:38] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=28521, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:44:49] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=29752, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:45:00] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=31189, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:45:10] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=32334, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:45:21] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=33283, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:45:31] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=75000, nbAgentsAck=34050, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:45:42] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=65000, nbAgentsAck=34595, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:45:53] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=65000, nbAgentsAck=35164, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:46:03] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=55000, nbAgentsAck=35757, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:46:14] INFO Current upgrade FleetCurrentUpgrade(actionId='2fc0bb02-6b2c-4fed-83a3-8aa53188b2a6', complete=False, nbAgents=45000, nbAgentsAck=36384, perf_lib.py:146
version='8.2.1', startTime='2022-08-23T18:36:53.824Z')
[14:46:25] INFO Upgrade finished perf_lib.py:153

Any additional context:

@pjbertels pjbertels added bug Fixes for quality problems that affect the customer experience Project:FleetScaling labels Aug 24, 2022
@botelastic botelastic bot added the needs-team Issues missing a team label label Aug 24, 2022
@juliaElastic juliaElastic added the Team:Fleet Team label for Observability Data Collection Fleet team label Aug 25, 2022
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Aug 25, 2022
@juliaElastic
Contributor

The response in /current_upgrades returns 2 counts to report on the status of upgrades: nbAgents=75000, nbAgentsAck=34050.

  • nbAgents is calculated by querying the .fleet-actions index and summing the agent counts of the documents that belong to one actionId (one document is created per 10k batch)
  • nbAgentsAck is calculated by querying .fleet-actions-results for the same actionId; this index is written by Fleet Server when an action is completed (sent to Agent).

I think the issue comes from the fact that the upgrade expiration is set to 30m for upgrades that start Immediately (or 1h, 2h, depending on the window selected). So after 30m, the first batch starts to reach its expiration and is no longer returned by current_upgrades, as the API filters out documents where expiration < now.

This is usually not a problem because the upgrade completes within the window, but with large loads like this one the upgrade can outlast it.
I noticed this in my work in 8.5, and proposed increasing the Immediately window to 2h (or higher) in #138870
Also, the /current_upgrades API could be updated to return expired actions as well.
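
The counting and filtering described in this comment can be illustrated with a small simulation. It is a sketch under stated assumptions, not the Kibana implementation: `ActionDoc`, `current_upgrade`, and the staggered expiration times are hypothetical, chosen only to show how nbAgents shrinks batch by batch and the action then vanishes from the response even though acks are still outstanding.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ActionDoc:
    """Hypothetical stand-in for one .fleet-actions document (one per batch)."""
    action_id: str
    nb_agents: int
    expiration: datetime


def current_upgrade(docs: List[ActionDoc], nb_acks: int, now: datetime) -> Optional[dict]:
    """Sum agent counts over the batches of one action that have not expired."""
    # Mirrors the described filter: documents with expiration < now are dropped.
    live = [d for d in docs if d.expiration >= now]
    if not live:
        return None  # the action disappears from the response entirely
    return {
        "actionId": live[0].action_id,
        "nbAgents": sum(d.nb_agents for d in live),
        "nbAgentsAck": nb_acks,
    }


# Illustrative data only: three 10k batches expiring one minute apart.
start = datetime(2022, 8, 23, 18, 36, 53)
docs = [ActionDoc("a1", 10_000, start + timedelta(minutes=30 + i)) for i in range(3)]

before = current_upgrade(docs, 12_000, start + timedelta(minutes=29))
during = current_upgrade(docs, 12_000, start + timedelta(minutes=30, seconds=30))
after = current_upgrade(docs, 12_000, start + timedelta(minutes=33))
```

Here `before` reports all 30,000 agents, `during` reports 20,000 because the first batch has expired, and `after` is None although only 12,000 agents have acked. The logs above show the same pattern, with nbAgents dropping from 75000 to 45000 before the response goes empty.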

@jlind23
Contributor

jlind23 commented Aug 25, 2022

@joshdover @kpollich what would be your call here?

@joshdover
Contributor

As part of elastic/elastic-agent#778, we've been discussing having the agent ack expired actions as expired/failed in some way. If we do this, then I think we can have the /current_upgrades API not filter these agents out until they've ack'd as an expiration. We'd still want to consider the whole operation as completed once the last batch's expiration time has passed though.

Another thing to consider is that as part of elastic/elastic-agent#778 we will be acking failure attempts, so there will be potentially more than one ack per agent. I think the /current_upgrades API will need to be able to handle this.
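
One way to handle more than one ack per agent is to count distinct agents rather than result documents. The sketch below assumes a hypothetical result shape with an `agent_id` field and an optional `error` field; the actual schema of .fleet-actions-results is not shown in this thread, so treat both field names as assumptions.

```python
from typing import Iterable, Tuple


def summarize_acks(results: Iterable[dict]) -> Tuple[int, int]:
    """Return (acked, failed) as counts of distinct agents.

    An agent counts as acked if any of its results has no error, and as
    failed only if every result it reported carries an error.
    """
    ok = set()
    failed = set()
    for result in results:
        # Assumed fields: "agent_id" (required) and "error" (set on failure).
        target = failed if result.get("error") else ok
        target.add(result["agent_id"])
    failed -= ok  # a later success overrides earlier failed attempts
    return len(ok), len(failed)
```

With this approach an agent that fails once and then succeeds contributes one ack and no failure, so the ack count cannot exceed the number of agents.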

Anything else to add @michel-laterman?

@juliaElastic
Contributor

juliaElastic commented Sep 20, 2022

@joshdover @kpollich As part of the new Agent activity feature, there is a new endpoint introduced that doesn't filter out expired actions.
Should we remove the /current_upgrades endpoint?

@kpollich
Member

Should we remove the /current_upgrades endpoint?

Let's track down any external consumers of this API to be sure, but I'm in favor of removing this. We'll need to wait for the obs-robots folks to migrate off of the /current_upgrades API before we remove it, but we should take opportunities to remove endpoints before the API enters its GA period.

@juliaElastic
Contributor

juliaElastic commented Sep 21, 2022

@pjbertels could you move to the new /api/fleet/agents/action_status API instead of /current_upgrades?
Here is a sample response:

[
  {
    "actionId": "1da2db14-4f43-4f16-b971-3c12c0bdacb5",
    "nbAgentsActionCreated": 20,
    "nbAgentsAck": 20,
    "type": "UPGRADE",
    "nbAgentsActioned": 20,
    "status": "COMPLETE",
    "expiration": "2022-10-21T08:16:33.963Z",
    "creationTime": "2022-09-21T08:16:33.963Z",
    "nbAgentsFailed": 0,
    "completionTime": "2022-09-21T08:16:34.070Z"
  }
]
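
A client could check completion against a response of this shape roughly as follows. This is a sketch based only on the sample above: the helper names are hypothetical, "COMPLETE" is the only status value shown in this thread, and the arithmetic in `pending_agents` assumes that nbAgentsActioned, nbAgentsAck, and nbAgentsFailed relate the way their names suggest.

```python
from typing import List, Optional


def find_action(action_status: List[dict], action_id: str) -> Optional[dict]:
    """Return the entry for action_id, or None if the action is not listed."""
    return next((a for a in action_status if a["actionId"] == action_id), None)


def is_upgrade_complete(action_status: List[dict], action_id: str) -> bool:
    """True only when the action is listed and its status is COMPLETE."""
    action = find_action(action_status, action_id)
    return action is not None and action["status"] == "COMPLETE"


def pending_agents(action: dict) -> int:
    """Agents actioned but neither acked nor failed (assumed field semantics)."""
    remaining = (
        action["nbAgentsActioned"]
        - action["nbAgentsAck"]
        - action["nbAgentsFailed"]
    )
    return max(remaining, 0)
```

Unlike the polling loop against /current_upgrades, a missing action is treated here as "not complete" rather than "done", so an entry that drops out of the response cannot be mistaken for a finished upgrade.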

@jlind23
Contributor

jlind23 commented Sep 26, 2022

@jlind23 jlind23 closed this as completed Sep 26, 2022
@juliaElastic
Contributor

Created #141894 to clean up the /current_upgrades API when the usages have been removed.
