Remediate and assess CircleCI config #943

Closed
1 of 2 tasks
Jkrzy opened this issue Dec 23, 2019 · 9 comments

Comments

@Jkrzy
Contributor

Jkrzy commented Dec 23, 2019

It's been a while since we've revisited our CircleCI setup for releases, testing, and recycling of the production instance.

Let's take a look at config.yml to address two known issues and see if there are opportunities to simplify, modernize, or otherwise improve our usage of CircleCI.

Known issues

  • The recycle_production job is executing on branches other than master
  • The recycle_production job intermittently fails

We'll know we're done when

  • The above issues have been resolved
  • config.yml has been reviewed and updated if necessary
adunkman self-assigned this Dec 27, 2019
@adunkman
Member

I can take a look here — it looks like the deployment docker image 18fgsa/cloud-foundry-cli hasn’t been published for 2 years, which might be leading to our intermittent failures.

@Jkrzy
Contributor Author

Jkrzy commented Dec 30, 2019

This is the cloudfoundry plugin being used for the instance recycling: https://github.com/rogeruiz/cf-recycle-plugin

And here's what a failure (a timeout after ~5 hours) looks like in CircleCI: https://circleci.com/gh/18F/tock/5391

@adunkman
Member

adunkman commented Jan 3, 2020

The recycle_production job is executing on branches other than master

I can’t seem to find evidence of this in recent builds, and it appears to have been fixed in #897 (f27fe76).

@adunkman
Member

adunkman commented Jan 3, 2020

Confirmed the hunch — intermittent failures are due to conflicting recycle_production jobs.

Conflicting jobs happen often because the hjb/try-uswds branch (missing the commit from #897 above) and master both get triggered for scheduled runs at the same time.
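
To illustrate the mechanism: as far as I understand it, CircleCI evaluates a scheduled workflow against each branch's own copy of config.yml, so a branch whose config lacks a branch filter can still fire on the schedule. Below is a minimal sketch of a schedule restricted to master. The workflow name and cron expression are assumptions for illustration, and this is not necessarily the exact change made in #897.

```yaml
version: 2
workflows:
  version: 2
  # Hypothetical workflow name; the real one in config.yml may differ.
  scheduled_recycle:
    triggers:
      - schedule:
          # Assumed schedule (daily at 10:00 UTC); not taken from the Tock config.
          cron: "0 10 * * *"
          filters:
            branches:
              # Only the master branch's copy of this config triggers the run.
              only:
                - master
    jobs:
      - recycle_production
```

A branch that still carries an older config without the `only: master` filter would keep scheduling its own run, which matches the conflict described above.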

Short-term fix: merge master into hjb/try-uswds or delete it? @hbillings, is this branch still useful?

Long-term fix: prevent concurrently running jobs from conflicting with each other. I’ll do a bit of poking about here, but will timebox it.

@adunkman
Member

adunkman commented Jan 6, 2020

I’ve merged master into hjb/try-uswds to address the immediate issue, and my 2-hour timebox has expired. The best way I found to prevent concurrent restarts from conflicting with each other is to switch from a restart job to a deployment job and perform a rolling app deployment. The latest re-deploy would then win, which is acceptable:

If you push app before your previous push command for the same app has completed, your first push gets interrupted. Until the last deployment completes, there may be many versions of the app running at once. Eventually, the app runs the code from your most recent push.
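
As a rough sketch of what such a deployment job might look like, assuming cf CLI v7 or later (where `cf push` accepts `--strategy rolling`): the job name, Docker image, app name, and environment variable names below are all hypothetical placeholders, not taken from the Tock config.

```yaml
version: 2
jobs:
  # Hypothetical job name; it would replace the restart-based recycle job.
  deploy_production:
    docker:
      # Placeholder image; any image with cf CLI v7+ on the PATH would do.
      - image: example/cf-cli:v7
    steps:
      - checkout
      - run:
          name: Log in to Cloud Foundry
          # Variable names are assumptions; set them in the CircleCI project settings.
          command: |
            cf login -a "$CF_API" -u "$CF_USERNAME" -p "$CF_PASSWORD" \
              -o "$CF_ORG" -s "$CF_SPACE"
      - run:
          name: Rolling deploy
          # "tock" is an assumed app name. A later push interrupts an earlier one,
          # so the most recent deployment wins.
          command: cf push tock --strategy rolling
```

With this shape, two overlapping runs would no longer fight over a restart; the later push supersedes the earlier one.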

Going to consider this issue complete after tomorrow morning’s restart completes without error (assuming it does).

@adunkman
Member

adunkman commented Jan 7, 2020

🎉 No more conflicting job runs!
✖️ Recycle production is still having issues.

@adunkman
Member

adunkman commented Jan 8, 2020

I started upgrading the CF CLI to see if that fixes the issue, but cf-recycle-plugin doesn’t appear to be compatible with the latest CF CLI:

File is not a valid cf CLI plugin binary.

There is an equivalent and more recently maintained plugin, cf-rolling-restart, which works with the newer CLI as well as the older one. I’ll open a PR to switch over to this plugin to see if we have better results, and hold off upgrading the CF CLI for now in the spirit of only changing one thing at a time.

@tbaxter-18f
Contributor

@adunkman I should have marked this "done", not "to-do", right?

@adunkman
Member

adunkman commented Apr 9, 2020

I believe it is complete!

adunkman closed this as completed Apr 9, 2020