Huge lag in provisioning workers, workers become suspended but not deleted #322
This can be found in logs:
and
From the limited logs you provide, here is what I see. When this happens, please check whether the target on your EC2 Fleet/ASG was updated. If it was indeed updated correctly and you are not getting newer instances launched, it could mean that your chosen instance types are NOT currently available for Spot. I'm not sure what set of instance types you have selected, so expanding your instance type options could be something you should do. Another reason could be a user-data script attached to your instances that causes this huge lag before they become available. From the plugin's point of view, it is responsible for calculating the correct target capacity. If the target is not getting updated, then it probably requires additional debugging.

Side note: we recently launched minSpareInstances as part of 2.5.0. This will always keep the set amount of instances available, which avoids the provisioning time. Please see #321
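For intuition on what "calculating the correct target capacity" involves, here is a deliberately simplified, hypothetical sketch (plain Java, invented method and parameter names, not the plugin's actual code) of how a target could be derived from demand, the configured cluster bounds, and a minimum-spare setting like the one mentioned above:

```java
// Simplified illustration only - NOT the plugin's real implementation.
// Shows how a target capacity could combine current demand, cluster
// bounds, and a "minimum spare instances" setting.
public final class TargetCapacityExample {

    static int targetCapacity(int busyExecutors, int queuedBuilds,
                              int minSize, int maxSize, int minSpare) {
        // Demand = what is already running plus what is waiting in the queue.
        int demand = busyExecutors + queuedBuilds;
        // Keep some spare capacity on top of demand so new builds start immediately.
        int withSpare = demand + minSpare;
        // Clamp to the configured cluster bounds.
        return Math.max(minSize, Math.min(maxSize, withSpare));
    }

    public static void main(String[] args) {
        // One build waiting, nothing running, no spare configured:
        System.out.println(targetCapacity(0, 1, 0, 10, 0)); // 1
        // Same demand, but one spare instance always kept warm:
        System.out.println(targetCapacity(0, 1, 0, 10, 1)); // 2
    }
}
```

With a minimum spare of 1, an otherwise idle cluster already has a warm instance, which is what avoids the provisioning delay for the first queued build.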
@imuqtadir Thank you for the details. We're using spot nodes with EC2 Fleet plugin. What we're observing is having
We haven't bumped into this issue since we moved to EC2 Fleet.
@fdaca Great, if it happens the next time around I want you to check the target capacity on the EC2 side (along with the EC2 Fleet status on the plugin). I'll close out this issue for now, but feel free to reopen or create a new issue.
@imuqtadir I don't have GitHub permission to reopen the issue. But the issue is definitely still there (on 2 independent Jenkins instances we have). It seems like the plugin does not figure out that there is some demand for a given label.

Plugin version: 2.5.0

Behaviour is exactly the same as described above, while the plugin says the target is 0 for that whole time:

Attaching logs from the log recorder for this plugin:

On the AWS side, the ASG desired count is 0, and there haven't been any events in the Activity history for the last several hours. The last events are:

whereas the job that is hanging at the moment started ~4 hours after the last event in the Activity history.
I noticed it gets unstuck only when a new job arrives which needs the same label. So now I'm observing a situation where a different build started which uses the same label (while the previous build is still hanging), and as a result the "target" capacity on the plugin side has just increased to 1. Eventually both builds passed. So I wouldn't even say it "auto-resolves": if there isn't a new build, the hanging one stays stuck forever.
Is there anything I could do to help resolve this? Like provide some extra data?
@mwos-sl Thanks for the details. This seems like a transient issue and is difficult to reproduce, especially since the new build with the same label is able to increment the total capacity later. How often do you see this issue happening, and does it affect other labels or is it always specific to a single label? If you see a pattern that helps us reproduce the bug, it will be super helpful.
Seems like all the labels managed by the plugin are affected. |
I just saw a situation where 2 jobs were waiting for tens of minutes for a single label, yet the target capacity was set to 0 by the plugin, and there were no scaling events on the ASG side. So it seems:
is not always the case :( We even implemented a "workaround" job, which runs every hour to provision 1 machine for every label to unblock the queue, but the improvement is questionable.
So I think we've hit the same issue. Somehow Jenkins is under the impression that we have an executor with a label, but then it's not showing up on the UI. From my limited Java skills it seems that the issue comes from Jenkins, as the data that comes in here ( ec2-fleet-plugin/src/main/java/com/amazon/jenkins/ec2fleet/NoDelayProvisionStrategy.java Line 30 in 93a0317
I'm trying to see if I can find out where this data is coming from and why Jenkins thinks this is the case; I'm suspecting some internal caching that, once it expires, fixes the situation. @mwos-sl have you found a valid workaround? I'm tempted to try the straight ec2 plugin instead of this one if there are no issues there
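For readers unfamiliar with what data "comes in here": a provisioning strategy such as NoDelayProvisionStrategy is handed Jenkins' per-label load statistics and decides whether extra capacity is needed. The sketch below is a simplified, hypothetical illustration of that calculation (plain Java, invented names and numbers, not the plugin's source); it shows how a phantom available executor or a stale planned node makes the computed excess workload drop to zero, so nothing gets provisioned:

```java
// Hypothetical, simplified sketch of a "no delay" provisioning decision.
// The parameters mirror the kind of per-label data Jenkins exposes to a
// provisioning strategy; the values below are invented for illustration.
public final class ExcessWorkloadExample {

    static int excessWorkload(int queueLength, int availableExecutors,
                              int connectingExecutors, int plannedCapacity) {
        // Builds waiting in the queue minus capacity Jenkins believes it
        // already has (idle, connecting, or planned-but-not-yet-launched).
        return queueLength - availableExecutors - connectingExecutors - plannedCapacity;
    }

    public static void main(String[] args) {
        // Healthy case: one queued build, no capacity anywhere -> provision 1.
        System.out.println(excessWorkload(1, 0, 0, 0)); // 1

        // The situation described in this issue: Jenkins thinks an executor
        // (or a stale planned node) already covers the label, so the excess
        // workload is 0 and the plugin never raises the target capacity.
        System.out.println(excessWorkload(1, 1, 0, 0)); // 0
        System.out.println(excessWorkload(1, 0, 0, 1)); // 0
    }
}
```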
Nope, unfortunately not. The issue described here was causing serious lags on our Jenkins instances, so we rolled back to the ec2 plugin you mentioned. We don't have any issues with the ec2 plugin, except that it is very poor for spot instances, so we use on-demand only there. This plugin (ec2-fleet-plugin) is waaaay better for handling spots (possibility to declare multiple spot pools). @xocasdashdash Can you experiment with setting "no delay provision strategy" to false? Maybe the bug is there and all we need is to disable it? Unfortunately I'm not able to test it myself anymore, but I wish to go back to this plugin at some point :(
I'll try to check it out, but it seems to be a deeper issue in how Jenkins assigns jobs to labels; there's not much this plugin can do if that fails 😕
I am using 2.5.2 and it still happens. :( |
/cc @benipeled |
Ok, so indeed there is still some issue. It is better than last time, but it still happens.
After some investigation, it seems that
There was 1 job in the queue at the time, that's why. Some of the already closed issues might be related:
After several days of no problems we hit the issue again. I >think< the problem starts after a configuration reload.
Agree, it also appears to happen to us after a configuration reload. A reboot clears it.
Is there any planned fix for it? Or maybe another workaround?
@pdk27 any updates? Are there any chances the plugin will be maintained again? |
I wound up fixing this myself and using a custom version of the plugin. It looks like either Jenkins changed the purpose of some of the
This caused the check to find the actual number of nodes that were either already provisioned or in the process of being provisioned, and to find the available capacity without busy executors. We only use 1 executor per node, so this may not fly if you're using multiple executors (I haven't tested). This seems to keep our nodes provisioning and scaling down regularly, and I'm happy with it for our purposes now. I don't have branch push permissions on this repo, so I can't push my changes up for a pull request, unfortunately :(
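The exact patch isn't shown here, but the shape of the change described above could look roughly like the following hypothetical sketch (invented names; the single-executor caveat is noted in the comments), which derives available capacity from the agents that actually exist rather than from Jenkins' aggregate counters:

```java
// Rough illustration of the idea described above, not the actual patch:
// count capacity from nodes that really exist (connected or still
// launching) and subtract busy executors. Names are invented.
import java.util.List;

public final class AvailableCapacityExample {

    record Agent(boolean launching, int executors, int busyExecutors) {}

    static int availableCapacity(List<Agent> agents) {
        int available = 0;
        for (Agent a : agents) {
            if (a.launching()) {
                // A node that is still provisioning will contribute all of
                // its executors once it connects.
                available += a.executors();
            } else {
                // For connected nodes, only idle executors count. With one
                // executor per node this is simply 0 or 1 per agent.
                available += a.executors() - a.busyExecutors();
            }
        }
        return available;
    }

    public static void main(String[] args) {
        // One connected single-executor agent that is busy, one launching:
        System.out.println(availableCapacity(List.of(
                new Agent(false, 1, 1),
                new Agent(true, 1, 0)))); // 1
    }
}
```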
@mtn-boblloyd can't you open a PR from your fork? |
@mtn-boblloyd Thanks so much for your PR. I am also working on a fix for the currentDemand computation. I will get in touch about it soon. @mwos-sl Apologies for the delay. We have been short-staffed since the developers who maintained this plugin moved on. I have been looking into reproducing the scenarios detailed here and in various previous (related) issues in order to get the full context. I wasn't able to reproduce it intentionally, but I did come across the scenario below after running the plugin for long hours:
Some observations:
Explanation: This is the only explanation I could think of. Please share thoughts if any. The combination of things below seems to be the problem.
Fix that seemed to have worked for me: LinkageError was fixed in Jenkins 2.277.2. I upgraded to Jenkins 2.277.2 and I have not come across the above suspended-nodes situation since. @mwos-sl It looks like you are running a more recent version of Jenkins than 2.277.2? Have you seen other reasons why Jenkins might not be able to connect to the nodes? Is there a way for you to test with Jenkins 2.277.2? I am working on fixing the computations.
To clear these stuck pending launches, list and optionally delete the pending launches for all labels (a rough sketch follows below). Once you do this, scaling will again work for a while, until you accumulate enough "pending launches" to hit the problem again.
I have not been able to figure out what is causing these stuck pending launches, but clearly there is some code path where a PlannedNode is created but the Future for it is never completed. |
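The script referenced above ("list and optionally delete pending launches for all labels") isn't reproduced in this thread. As a rough, hedged sketch of how such an inspection could look: the code below reads NodeProvisioner's private pendingLaunches field via reflection, which is an assumption about Jenkins core internals (the field name and type may differ between versions, so verify against the core you run); in practice you would paste an equivalent Groovy version into the Script Console.

```java
// Hedged sketch: list (and optionally cancel) stuck planned nodes per label.
// Accessing the private "pendingLaunches" field is an assumption about
// Jenkins core internals - check the field name for your core version.
import hudson.model.Label;
import hudson.slaves.NodeProvisioner;
import java.lang.reflect.Field;
import java.util.Collection;
import jenkins.model.Jenkins;

public final class PendingLaunchesInspector {

    public static void listPendingLaunches(boolean cancelStuck) throws Exception {
        Field f = NodeProvisioner.class.getDeclaredField("pendingLaunches");
        f.setAccessible(true);

        for (Label label : Jenkins.get().getLabels()) {
            Collection<?> pending = (Collection<?>) f.get(label.nodeProvisioner);
            for (Object o : pending) {
                NodeProvisioner.PlannedNode planned = (NodeProvisioner.PlannedNode) o;
                System.out.println(label.getName() + " -> " + planned.displayName
                        + " done=" + planned.future.isDone());
                if (cancelStuck && !planned.future.isDone()) {
                    // Cancelling the future lets NodeProvisioner drop the
                    // stuck planned node on its next update cycle. This is a
                    // blunt workaround: it also discards nodes that are still
                    // legitimately launching, so list first and only cancel
                    // entries that have clearly been stuck for a long time.
                    planned.future.cancel(true);
                }
            }
        }
    }
}
```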
@ikaakkola That's correct! It makes sense to include planned nodes in available capacity to avoid over-provisioning. The plugin (specifically this class) controls when the planned nodes' futures are resolved: after Jenkins is able to connect to the node successfully, or when that attempt times out. What version of Jenkins are you running? Do you see any errors in the agent logs?
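As a rough illustration of that lifecycle (not the plugin's actual online-checker code; the polling interval, timeout handling, and helper names are invented), a cloud hands Jenkins a PlannedNode whose Future is completed once the agent connects, or completed exceptionally when the connection attempt times out:

```java
// Hypothetical sketch of the planned-node lifecycle described above.
// Interval, timeout, and method names are invented for illustration.
import hudson.model.Node;
import hudson.slaves.NodeProvisioner;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public final class PlannedNodeLifecycleExample {

    private static final ScheduledExecutorService POLLER =
            Executors.newSingleThreadScheduledExecutor();

    static NodeProvisioner.PlannedNode plan(String name, Node node,
                                            BooleanSupplier isOnline,
                                            long timeoutMillis) {
        CompletableFuture<Node> future = new CompletableFuture<>();
        long deadline = System.currentTimeMillis() + timeoutMillis;

        // Poll until the agent is online, then resolve the future so the
        // planned node stops counting as in-flight capacity. If the deadline
        // passes first, fail the future so Jenkins can forget the node.
        var task = POLLER.scheduleWithFixedDelay(() -> {
            if (future.isDone()) {
                return;
            }
            if (isOnline.getAsBoolean()) {
                future.complete(node);
            } else if (System.currentTimeMillis() > deadline) {
                future.completeExceptionally(
                        new IllegalStateException("agent did not connect in time"));
            }
        }, 0, 10, TimeUnit.SECONDS);

        // Stop polling once the future has been resolved either way.
        future.whenComplete((n, t) -> task.cancel(false));

        return new NodeProvisioner.PlannedNode(name, future, node.getNumExecutors());
    }
}
```

If that future is never completed (and never cancelled), the planned node keeps counting as available capacity forever, which matches the "stuck pending launches" symptom described above.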
@pdk27 we are currently on ancient versions, hence I did not add my findings here. We are waiting to update to the latest versions of Jenkins and the plugin, and if there still are 'pending launches' that never get completed (either done, cancelled, or failed with an exception), I'll dig deeper. I just wanted to share the workaround we currently use (clearing the pending launches manually every now and then), which doesn't need a Jenkins restart. (For the record, we are not seeing LinkageErrors.)
Opened a discussion for release 2.6.0, which includes some fixes and other changes, after which I don't see the lag in provisioning in my environment. Please share details relevant to the release in the discussion.
[fix] Fix maxTotalUses decrement logic; add logs in post job action to expose tasks terminated with problems jenkinsci#322; add and fix tests
* [fix] Terminate scheduled instances ONLY IF idle #363
* [fix] Leave maxTotalUses alone and track remainingUses correctly; add a flag to track termination of agents by plugin
* [fix] Fix lost state (instanceIdsToTerminate) on configuration change; fix maxTotalUses decrement logic; add logs in post job action to expose tasks terminated with problems #322; add and fix tests
* Add integration tests for configuration change leading to lost state and rebuilding lost state to terminate instances previously marked for termination
@wosiu I was finally able to reproduce this issue and here is what I think is happening:
Fixes:
… tracking of cloud objects
[fix] Remove plannedNodeScheduledFutures
[refactor] Added instanceId to FleetNode for clarity, added getDescriptor to return sub type
[refactor] Don't provision if Jenkins is quieting down and terminating
[refactor] Replace more occurrences of 'slave' with 'agent'
jenkinsci#360 jenkinsci#322
… tracking of cloud instance
[refactor] Added instanceId to FleetNode for clarity, added getDescriptor to return sub type
[refactor] Don't provision if Jenkins is quieting down and terminating
Fix jenkinsci#360 Fix jenkinsci#322
Issue Details
Sometimes there is a huge lag before workers are provisioned. Our developers started to observe this a few weeks ago.
So for example:
As you can see, the machine was provisioned almost 2 hours later. And I guess it started only because we manually changed the minimum cluster size for this label to 1 to kind of kick it.
I don't have the full log, but I was able to collect some logs from this period of time while the build was waiting for a node:
At 18:10 the status showed:

target: 0

so it looks like the plugin didn't realise that there is a build waiting for this node. Around 18:15 we changed the minimum cluster size for this label to 1, and a new node was started.
What is interesting is that we noticed it on 2 separate Jenkins instances we have. We don't recall a problem like this before.
We suspect it started happening after the migration from version 2.3.7 to 2.4.1, but we're not sure.
Also, we migrated to 2.4.1 around 21.12.2021, whereas we THINK the problem started quite some time after the upgrade and Jenkins restart. Not sure though.
To Reproduce
I don't know. It just happens from time to time, and as a result our jobs time out. It looks like the frequency of this problem increases from week to week.
Environment Details
Plugin Version?
2.4.1
Jenkins Version?
2.325
Spot Fleet or ASG?
ASG
Label based fleet?
No
Linux or Windows?
Linux
EC2Fleet Configuration as Code
Anything else unique about your setup?
No