Huge lag in provisioning workers, workers become suspended but not deleted #322
This can be found in logs:
and
From the limited logs you provide, here is what I see. When this happens, please check whether the target on your EC2 Fleet/ASG was updated. If it was indeed updated correctly and you are not getting newer instances launched, it could mean that your chosen instance types are NOT currently available for Spot. I'm not sure what set of instance types you have selected, so expanding your instance type options could be something you should do. Another reason could be a user-data script attached to your instances that causes this huge lag before they become available. From the plugin's point of view, it is responsible for calculating the correct target capacity. If the target is not getting updated, then it probably requires additional debugging.

Side note: we recently launched minSpareInstances as part of 2.5.0. This will always keep the set amount of instances available, which avoids the provisioning time. Please see #321
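For intuition on what "calculating the correct target capacity" involves, here is a deliberately simplified, hypothetical sketch (plain Java, invented method and parameter names, not the plugin's actual code) of how a target could be derived from demand, the configured cluster bounds, and a minimum-spare setting like the one mentioned above:

```java
// Simplified illustration only - NOT the plugin's real implementation.
// Shows how a target capacity could combine current demand, cluster
// bounds, and a "minimum spare instances" setting.
public final class TargetCapacityExample {

    static int targetCapacity(int busyExecutors, int queuedBuilds,
                              int minSize, int maxSize, int minSpare) {
        // Demand = what is already running plus what is waiting in the queue.
        int demand = busyExecutors + queuedBuilds;
        // Keep some spare capacity on top of demand so new builds start immediately.
        int withSpare = demand + minSpare;
        // Clamp to the configured cluster bounds.
        return Math.max(minSize, Math.min(maxSize, withSpare));
    }

    public static void main(String[] args) {
        // One build waiting, nothing running, no spare configured:
        System.out.println(targetCapacity(0, 1, 0, 10, 0)); // 1
        // Same demand, but one spare instance always kept warm:
        System.out.println(targetCapacity(0, 1, 0, 10, 1)); // 2
    }
}
```

With a minimum spare of 1, an otherwise idle cluster already has a warm instance, which is what avoids the provisioning delay for the first queued build.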
@imuqtadir Thank you for the details. We're using spot nodes with EC2 Fleet plugin. What we're observing is having
We haven't bumped into this issue since we moved to EC2 Fleet.
@fdaca Great, if it happens the next time around I want you to check the target capacity on the EC2 side (along with the EC2 Fleet status on the plugin). I'll close out this issue for now, but feel free to reopen or create a new issue.
@imuqtadir I don't have GitHub permission to reopen the issue. But the issue is definitely still there (on 2 independent Jenkins instances we have). It seems like the plugin does not figure out that there is some demand for a given label.

Plugin version: 2.5.0

Behaviour is exactly the same as described above, while the plugin says the target is 0 for that whole time:

Attaching logs from the log recorder for this plugin:

On the AWS side, the ASG desired count is 0, and there haven't been any events in the Activity history for the last several hours. The last events are:

whereas the job that is hanging at the moment started ~4 hours after the last event in the Activity history.
I noticed it gets unstuck only when a new job arrives which needs the same label. So now I'm observing a situation where a different build started which uses the same label (while the previous build is still hanging), and as a result the "target" capacity on the plugin side has just increased to 1. Eventually both builds passed. So I wouldn't even say it "auto-resolves": if there isn't a new build, the hanging one stays stuck forever.
Is there anything I could do to help resolve this? Like provide some extra data?
@mwos-sl Thanks for the details. This seems like a transient issue and is difficult to reproduce, especially since the new build with the same label is able to increment the total capacity later. How often do you see this issue happening, and does it affect other labels or is it always specific to a single label? If you see a pattern that helps us reproduce the bug, it will be super helpful.
Seems like all the labels managed by the plugin are affected. |
I just saw a situation where 2 jobs were waiting for tens of minutes for a single label, yet the target capacity was set to 0 by the plugin, and there were no scaling events on the ASG side. So it seems:
is not always the case :( We even implemented a "workaround" job, which runs every hour to provision 1 machine for every label to unblock the queue, but the improvement is questionable.
So I think we've hit the same issue. Somehow Jenkins is under the impression that we have an executor with a label, but then it's not showing up on the UI. From my limited Java skills it seems that the issue comes from Jenkins, as the data that comes in here ( ec2-fleet-plugin/src/main/java/com/amazon/jenkins/ec2fleet/NoDelayProvisionStrategy.java Line 30 in 93a0317
I'm trying to see if I can find out where this data is coming from and why Jenkins thinks this is the case; I'm suspecting some internal caching that, once it expires, fixes the situation. @mwos-sl have you found a valid workaround? I'm tempted to try the straight ec2 plugin instead of this one if there are no issues there
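For readers unfamiliar with what data "comes in here": a provisioning strategy such as NoDelayProvisionStrategy is handed Jenkins' per-label load statistics and decides whether extra capacity is needed. The sketch below is a simplified, hypothetical illustration of that calculation (plain Java, invented names and numbers, not the plugin's source); it shows how a phantom available executor or a stale planned node makes the computed excess workload drop to zero, so nothing gets provisioned:

```java
// Hypothetical, simplified sketch of a "no delay" provisioning decision.
// The parameters mirror the kind of per-label data Jenkins exposes to a
// provisioning strategy; the values below are invented for illustration.
public final class ExcessWorkloadExample {

    static int excessWorkload(int queueLength, int availableExecutors,
                              int connectingExecutors, int plannedCapacity) {
        // Builds waiting in the queue minus capacity Jenkins believes it
        // already has (idle, connecting, or planned-but-not-yet-launched).
        return queueLength - availableExecutors - connectingExecutors - plannedCapacity;
    }

    public static void main(String[] args) {
        // Healthy case: one queued build, no capacity anywhere -> provision 1.
        System.out.println(excessWorkload(1, 0, 0, 0)); // 1

        // The situation described in this issue: Jenkins thinks an executor
        // (or a stale planned node) already covers the label, so the excess
        // workload is 0 and the plugin never raises the target capacity.
        System.out.println(excessWorkload(1, 1, 0, 0)); // 0
        System.out.println(excessWorkload(1, 0, 0, 1)); // 0
    }
}
```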
Nope, unfortunately not. The issue described here was causing serious lags on our Jenkins instances, so we rolled back to the ec2 plugin you mentioned. We don't have any issues with the ec2 plugin, except that it is very poor for spot instances, so we use on-demand only there. This plugin (ec2-fleet-plugin) is waaaay better for handling spots (possibility to declare multiple spot pools). @xocasdashdash Can you experiment with setting "no delay provision strategy" to false? Maybe the bug is there and all we need is to disable it? Unfortunately I'm not able to test it myself anymore, but I wish to go back to this plugin at some point :(
I'll try to check it out, but it seems to be a deeper issue in how Jenkins assigns jobs to labels; there's not much this plugin can do if that fails 😕
I am using 2.5.2 and it still happens. :( |
/cc @benipeled |
Ok, so indeed there is still some issue. It is better than last time, but it still happens.
After some investigation, it seems that
There was 1 job in the queue at the time, that's why. Some of the already closed issues might be related:
After several days of no problems we hit the issue again. I >think< the problem starts after a configuration reload.
Agree, it also appears to happen to us after a configuration reload. A reboot clears it.
Is there any planned fix for it? Or maybe another workaround?
@pdk27 any updates? Are there any chances the plugin will be maintained again? |
I wound up fixing this myself and using a custom version of the plugin. It looks like either Jenkins changed the purpose of some of the
This caused the check to find the actual number of nodes that were either already provisioned or in the process of being provisioned, and to find the available capacity without busy executors. We only use 1 executor per node, so this may not fly if you're using multiple executors (I haven't tested). This seems to keep our nodes provisioning and scaling down regularly, and I'm happy with it for our purposes now. I don't have branch push permissions on this repo, so I can't push my changes up for a pull request, unfortunately :(
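The exact patch isn't shown here, but the shape of the change described above could look roughly like the following hypothetical sketch (invented names; the single-executor caveat is noted in the comments), which derives available capacity from the agents that actually exist rather than from Jenkins' aggregate counters:

```java
// Rough illustration of the idea described above, not the actual patch:
// count capacity from nodes that really exist (connected or still
// launching) and subtract busy executors. Names are invented.
import java.util.List;

public final class AvailableCapacityExample {

    record Agent(boolean launching, int executors, int busyExecutors) {}

    static int availableCapacity(List<Agent> agents) {
        int available = 0;
        for (Agent a : agents) {
            if (a.launching()) {
                // A node that is still provisioning will contribute all of
                // its executors once it connects.
                available += a.executors();
            } else {
                // For connected nodes, only idle executors count. With one
                // executor per node this is simply 0 or 1 per agent.
                available += a.executors() - a.busyExecutors();
            }
        }
        return available;
    }

    public static void main(String[] args) {
        // One connected single-executor agent that is busy, one launching:
        System.out.println(availableCapacity(List.of(
                new Agent(false, 1, 1),
                new Agent(true, 1, 0)))); // 1
    }
}
```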
@mtn-boblloyd can't you open a PR from your fork? |
@mtn-boblloyd Thanks so much for your PR. I am also working on a fix for the currentDemand computation. I will get in touch about it soon. @mwos-sl Apologies for the delay. We have been short-staffed since the developers who maintained this plugin moved on. I have been looking into reproducing the scenarios detailed here and in various previous (related) issues in order to get the full context. I wasn't able to reproduce it intentionally, but I did come across the scenario below after running the plugin for long hours:
Some observations:
Explanation: This is the only explanation I could think of. Please share thoughts if any. The combination of things below seems to be the problem.
Fix that seemed to have worked for me: LinkageError was fixed in Jenkins 2.277.2. I upgraded to Jenkins 2.277.2 and I have not come across the above suspended-nodes situation since. @mwos-sl It looks like you are running a more recent version of Jenkins than 2.277.2? Have you seen other reasons why Jenkins might not be able to connect to the nodes? Is there a way for you to test with Jenkins 2.277.2? I am working on fixing the computations.
To clear these stuck pending launches, list and optionally delete the pending launches for all labels (a rough sketch follows below). Once you do this, scaling will again work for a while, until you accumulate enough "pending launches" to hit the problem again.
I have not been able to figure out what is causing these stuck pending launches, but clearly there is some code path where a PlannedNode is created but the Future for it is never completed. |
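The script referenced above ("list and optionally delete pending launches for all labels") isn't reproduced in this thread. As a rough, hedged sketch of how such an inspection could look: the code below reads NodeProvisioner's private pendingLaunches field via reflection, which is an assumption about Jenkins core internals (the field name and type may differ between versions, so verify against the core you run); in practice you would paste an equivalent Groovy version into the Script Console.

```java
// Hedged sketch: list (and optionally cancel) stuck planned nodes per label.
// Accessing the private "pendingLaunches" field is an assumption about
// Jenkins core internals - check the field name for your core version.
import hudson.model.Label;
import hudson.slaves.NodeProvisioner;
import java.lang.reflect.Field;
import java.util.Collection;
import jenkins.model.Jenkins;

public final class PendingLaunchesInspector {

    public static void listPendingLaunches(boolean cancelStuck) throws Exception {
        Field f = NodeProvisioner.class.getDeclaredField("pendingLaunches");
        f.setAccessible(true);

        for (Label label : Jenkins.get().getLabels()) {
            Collection<?> pending = (Collection<?>) f.get(label.nodeProvisioner);
            for (Object o : pending) {
                NodeProvisioner.PlannedNode planned = (NodeProvisioner.PlannedNode) o;
                System.out.println(label.getName() + " -> " + planned.displayName
                        + " done=" + planned.future.isDone());
                if (cancelStuck && !planned.future.isDone()) {
                    // Cancelling the future lets NodeProvisioner drop the
                    // stuck planned node on its next update cycle. This is a
                    // blunt workaround: it also discards nodes that are still
                    // legitimately launching, so list first and only cancel
                    // entries that have clearly been stuck for a long time.
                    planned.future.cancel(true);
                }
            }
        }
    }
}
```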
@ikaakkola That's correct! It makes sense to include planned nodes in available capacity to avoid over-provisioning. The plugin (specifically this class) controls when the planned nodes' futures are resolved: after Jenkins is able to connect to the node successfully, or when that attempt times out. What version of Jenkins are you running? Do you see any errors in the agent logs?
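As a rough illustration of that lifecycle (not the plugin's actual online-checker code; the polling interval, timeout handling, and helper names are invented), a cloud hands Jenkins a PlannedNode whose Future is completed once the agent connects, or completed exceptionally when the connection attempt times out:

```java
// Hypothetical sketch of the planned-node lifecycle described above.
// Interval, timeout, and method names are invented for illustration.
import hudson.model.Node;
import hudson.slaves.NodeProvisioner;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public final class PlannedNodeLifecycleExample {

    private static final ScheduledExecutorService POLLER =
            Executors.newSingleThreadScheduledExecutor();

    static NodeProvisioner.PlannedNode plan(String name, Node node,
                                            BooleanSupplier isOnline,
                                            long timeoutMillis) {
        CompletableFuture<Node> future = new CompletableFuture<>();
        long deadline = System.currentTimeMillis() + timeoutMillis;

        // Poll until the agent is online, then resolve the future so the
        // planned node stops counting as in-flight capacity. If the deadline
        // passes first, fail the future so Jenkins can forget the node.
        var task = POLLER.scheduleWithFixedDelay(() -> {
            if (future.isDone()) {
                return;
            }
            if (isOnline.getAsBoolean()) {
                future.complete(node);
            } else if (System.currentTimeMillis() > deadline) {
                future.completeExceptionally(
                        new IllegalStateException("agent did not connect in time"));
            }
        }, 0, 10, TimeUnit.SECONDS);

        // Stop polling once the future has been resolved either way.
        future.whenComplete((n, t) -> task.cancel(false));

        return new NodeProvisioner.PlannedNode(name, future, node.getNumExecutors());
    }
}
```

If that future is never completed (and never cancelled), the planned node keeps counting as available capacity forever, which matches the "stuck pending launches" symptom described above.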
@pdk27 we are currently on ancient versions, hence I did not add my findings here. We are waiting to update to the latest versions of Jenkins and the plugin, and if there still are 'pending launches' that never get completed (either done, cancelled, or failed with an exception), I'll dig deeper. I just wanted to share the workaround we currently use (clearing the pending launches manually every now and then), which doesn't need a Jenkins restart. (For the record, we are not seeing LinkageErrors.)
Opened a discussion for release 2.6.0, which includes some fixes and other changes, after which I don't see the lag in provisioning in my environment. Please share details relevant to the release in the discussion.
[fix] Fix maxTotalUses decrement logic; add logs in post job action to expose tasks terminated with problems jenkinsci#322; add and fix tests
* [fix] Terminate scheduled instances ONLY IF idle #363
* [fix] Leave maxTotalUses alone and track remainingUses correctly; add a flag to track termination of agents by plugin
* [fix] Fix lost state (instanceIdsToTerminate) on configuration change; fix maxTotalUses decrement logic; add logs in post job action to expose tasks terminated with problems #322; add and fix tests
* Add integration tests for configuration change leading to lost state and rebuilding lost state to terminate instances previously marked for termination
@wosiu I was finally able to reproduce this issue and here is what I think is happening:
Fixes:
… tracking of cloud objects
[fix] Remove plannedNodeScheduledFutures
[refactor] Added instanceId to FleetNode for clarity, added getDescriptor to return sub type
[refactor] Don't provision if Jenkins is quieting down and terminating
[refactor] Replace more occurrences of 'slave' with 'agent'
jenkinsci#360 jenkinsci#322
… tracking of cloud instance
[refactor] Added instanceId to FleetNode for clarity, added getDescriptor to return sub type
[refactor] Don't provision if Jenkins is quieting down and terminating
Fix jenkinsci#360 Fix jenkinsci#322
Issue Details
Sometimes there is a huge lag before workers are provisioned. Our developers started to observe this a few weeks ago.
So for example:
As you can see, the machine was provisioned almost 2 hours later. And I guess it started only because we manually changed the minimum cluster size for this label to 1 to kind of kick it.
I don't have the full log, but I was able to collect some logs from this period of time while the build was waiting for a node:
At 18:10 the status showed:

target: 0

so it looks like the plugin didn't realise that there is a build waiting for this node. Around 18:15 we changed the minimum cluster size for this label to 1, and a new node was started.
What is interesting is that we noticed it on 2 separate Jenkins instances we have. We don't recall a problem like this before.
We suspect it started happening after the migration from version 2.3.7 to 2.4.1, but we're not sure.
Also, we migrated to 2.4.1 around 21.12.2021, whereas we THINK the problem started quite some time after the upgrade and Jenkins restart. Not sure though.
To Reproduce
I don't know. It just happens from time to time, and as a result our jobs time out. It looks like the frequency of this problem increases from week to week.
Environment Details
Plugin Version?
2.4.1
Jenkins Version?
2.325
Spot Fleet or ASG?
ASG
Label based fleet?
No
Linux or Windows?
Linux
EC2Fleet Configuration as Code
Anything else unique about your setup?
No