self update job resets team permissions #91
Comments
Yeah, we face a similar issue on our own Concourse, where we have GitHub auth on the main team. Main team auth is configured as part of the BOSH manifest when deploying Concourse, and we don't expose the flags through Control Tower. This means every deploy will apply the manifest and wipe custom main team auth. Auth on other teams shouldn't be impacted. Given the current implementation of Control Tower, this is expected behaviour. I made concourse-mgmt in an attempt to create tooling for managing Concourse teams from Concourse. Our mitigation for this problem is to run a variation of that pipeline every 10 minutes that ensures team auth is set properly.
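For anyone wanting a rough picture of the "re-apply team auth on a schedule" mitigation, here is a minimal Python sketch. It is not the concourse-mgmt pipeline itself, just an illustration that shells out to `fly set-team`. The target name `ci`, the file name `main-team.yml`, and both function names are hypothetical, and it assumes `fly` is on the PATH and already logged in to that target as a user allowed to modify the team.

```python
import subprocess
from typing import Callable, List


def build_set_team_command(target: str, team: str, config_path: str) -> List[str]:
    """Build the fly invocation that re-applies a team's auth config.

    target and config_path are placeholders chosen for this sketch.
    """
    return [
        "fly", "-t", target,
        "set-team",
        "--team-name", team,
        "--config", config_path,
        "--non-interactive",  # skip the confirmation prompt for unattended runs
    ]


def reapply_team_auth(
    target: str,
    team: str,
    config_path: str,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run fly set-team and report whether it exited successfully."""
    cmd = build_set_team_command(target, team, config_path)
    result = runner(cmd, check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical values; schedule this (cron, or a Concourse job on a
    # timer) to approximate the "every 10 minutes" mitigation.
    ok = reapply_team_auth("ci", "main", "main-team.yml")
    print("team auth re-applied" if ok else "fly set-team failed")
```

Because `set-team` replaces the team's auth with whatever the config file declares, running it repeatedly should be harmless as long as the file is the source of truth for the team.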
We are now seeing this happen (i.e. all pipelines disappear but resetting the auth brings them back) much more frequently. This is happening at least once a week, even though our self update job has only been run twice in the past 3 months. I don't really know how to diagnose or investigate this. Any suggestions would be gratefully received. I haven't attempted to move to a non-"main" team, as I understand we will lose all our history. Perhaps it's worth taking that hit if the "main" team is not usable for a default install of control-tower. |
The Control Tower instance we use at EngineerBetter has all the pipelines in the main team with GitHub auth configured. I'm not aware of auth getting wiped outside of upgrades. We do run a pipeline that re-applies the team config every 10 minutes, though, so it might be hiding the issue. In theory, if the GitHub auth config is getting stripped from the main team outside of Control Tower upgrades, then it's either BOSH recreating the web instance (the main team as defined in the manifest only has basic auth) or a bug in Concourse. I guess you could check whether your web instances are getting restarted.
We did that, but it has made things worse; the web machine is being killed and restarted fairly frequently now, making the web ui unusable. We are attempting to investigate to see if we can figure out why. Any suggestions gratefully received! |
Looks like OOM on the "web" machine is causing this restart loop. Any idea why this might happen or how to stop it happening again? We're not intending to do anything unusual with control-tower and were hoping to not need to peek inside the black box.
We colocate InfluxDB and Grafana on the web VM for the out-of-the-box metrics. I guess it's possible that Concourse is producing a high volume of metrics, which is using up too much memory. I've also seen a frequent refresh rate on the Grafana dashboard slow down the web instance. I would expect that scaling up the size of the web VM might resolve it.
Thanks. Increasing the instance size does seem to have helped so far. We'll keep an eye on it. I'll update here if we have anything further. It's not ideal that the web machine enters a restart loop when under memory pressure. Ideally it would just run slower. (Also, it's definitely not ideal that bosh auto restarting the web machine wipes the team permissions.) (We don't use influxdb or grafana. I think I asked on a different issue how to turn them off.) |
I added a flag last night that lets you opt out of deploying the colocated metrics stack. If you download the new release then you can deploy with |
Thanks, unfortunately I tried using this release to deploy with the new flag and hit the following error:
I found the same issue with the 0.18.0 and 0.18.1 releases, and had to go back to 0.17.30 to complete a successful deployment. |
Weird. That doesn't look like it should be related to anything in the new release(s). Sometimes the contents of the local Another possibility is that one of the bosh prerequisites has gotten broken somehow on your machine. |
Deleting that directory and updating all the prereqs fixed it, thanks! I noticed that although Grafana etc. are no longer running, which is great, there are still security group rules added in the -atc group for ports 3000, 8844 and 8443, all of which I believe are related to metrics. It would be nice if when using the |
I forgot about the firewall ports. I'll look into patching that out. I'm glad you managed to get it deployed 😄. Why the |
This does seem to have fixed things for us. Thanks for your help. (The original issue at the top of this thread remains, AFAIK) |
I cut 0.18.2 over the weekend to remove the metrics ports from the firewall when disabling metrics. FYI, ports 8844 and 8443 are CredHub and UAA respectively, so they are still required. The original issue is more of a feature request to configure GitHub auth on the main team. I'll leave the issue open until that gets looked at.
I just cut 0.19.0 which adds flags for configuring github auth on the main team at deploy time. These settings should persist through web recreations. A small note is that the Concourse release options I chose to use only support setting the |
I have GitHub auth federation set up and a "main" team in Concourse that looks like this:
Every time I run the "self update" job, it resets all the permissions and I have to log in as the root admin user and re-apply my team's permissions with the yaml file.
I don't know if this is related to GitHub auth federation.
I feel like this is a bug and the self update should not change permissions.