self update job resets team permissions #91
Open · RichardBradley opened this issue Mar 31, 2021 · 15 comments
Labels: enhancement (New feature or request)

@RichardBradley (Contributor)

I have GitHub auth federation set up and a "main" team in Concourse that looks like this:

roles:
- name: owner
  local:
    users: ["admin"]
- name: pipeline-operator
  github:
    teams: ["myorg:myteam"]

Every time I run the "self update" job, it resets all the permissions, and I have to log in as the root admin user and re-apply my team's permissions from the YAML file.
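
(For reference, the manual re-apply step looks roughly like this; the fly target name, URL, and credentials are placeholders, and team.yml is the roles config above:)

$ fly -t ct login --team-name main --concourse-url https://concourse.example.com -u admin -p '<admin-password>'
$ fly -t ct set-team --team-name main --config team.yml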

I don't know if this is related to GitHub auth federation.

I feel like this is a bug and the self update should not change permissions.

@crsimmons (Contributor)

Yeah, we face a similar issue on our own Concourse, where we have GitHub auth on the main team. Main team auth is configured as part of the BOSH manifest when deploying Concourse, and we don't expose the flags through Control Tower. This means every deploy will apply the manifest and wipe custom main team auth. Auth on other teams shouldn't be impacted. Given the current implementation of Control Tower, this is expected behaviour.

I made concourse-mgmt in an attempt to create tooling for managing Concourse teams from Concourse. Our mitigation for this problem is to run a variation of that pipeline every 10 minutes that ensures team auth is set properly.
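
A rough sketch of the kind of commands such a job runs on its 10-minute timer (the actual pipeline is in the repo above; the Concourse URL and admin credentials here are placeholder variables, and team.yml is the desired roles config):

# fetch fly from the target Concourse so the CLI version always matches the server
$ curl -sSfL "$CONCOURSE_URL/api/v1/cli?arch=amd64&platform=linux" -o fly && chmod +x fly
$ ./fly -t ct login --concourse-url "$CONCOURSE_URL" --team-name main -u "$ADMIN_USERNAME" -p "$ADMIN_PASSWORD"
$ ./fly -t ct set-team --non-interactive --team-name main --config team.yml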

@RichardBradley (Contributor, Author)

We are now seeing this happen (i.e. all pipelines disappear but resetting the auth brings them back) much more frequently. This is happening at least once a week, even though our self update job has only been run twice in the past 3 months.

I don't really know how to diagnose or investigate this. Any suggestions would be gratefully received.

I haven't attempted to move to a non-"main" team, as I understand we will lose all our history. Perhaps it's worth taking that hit if the "main" team is not usable for a default install of control-tower.

@crsimmons (Contributor)

The Control Tower instance we use at EngineerBetter has all the pipelines in the main team with GitHub auth configured. I'm not aware of auth getting wiped outside of upgrades. We do run a pipeline that re-applies the team config every 10 minutes, though, so it might be hiding the issue.

In theory, if the GitHub auth config is getting stripped from the main team outside of Control Tower upgrades, then either BOSH is recreating the web instance (the main team as defined in the manifest only has basic auth) or it's a bug in Concourse. I guess you could check whether your web instances are getting restarted.
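
(With the BOSH director targeted, something along these lines would show process restarts and VM recreations; the deployment name concourse is an assumption here:)

$ bosh -d concourse instances --ps   # per-process state on each instance
$ bosh -d concourse vms --vitals     # VM uptime, load, and memory usage
$ bosh -d concourse events           # director events, including recreates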

@RichardBradley (Contributor, Author)

> We do run a pipeline that re-applies the team config every 10 minutes, though, so it might be hiding the issue.

We did that, but it has made things worse; the web machine is now being killed and restarted fairly frequently, making the web UI unusable.

We are attempting to investigate to see if we can figure out why. Any suggestions gratefully received!

@RichardBradley (Contributor, Author) commented Mar 16, 2022

Looks like OOM on the "web" machine is causing this restart loop.
We will try deploying a larger server with control-tower deploy --web-size medium.

Any idea why this might happen or how to stop it happening again? We're not intending to do anything unusual with control-tower and were hoping to not need to peek inside the black box.

2022-03-16T15:45:22.085786+00:00 8593cba8-5f6e-4d7e-95b1-012eee77b396 kernel: [ 1294.206443] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=runc-bpm-uaa.scope,mems_allowed=0,global_oom,task_memcg=/,task=influxd,pid=11559,uid=1000
2022-03-16T15:45:22.085787+00:00 8593cba8-5f6e-4d7e-95b1-012eee77b396 kernel: [ 1294.206465] Out of memory: Killed process 11559 (influxd) total-vm:4456212kB, anon-rss:1019316kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:7120kB oom_score_adj:0
2022-03-16T15:45:22.207469+00:00 8593cba8-5f6e-4d7e-95b1-012eee77b396 kernel: [ 1294.361999] oom_reaper: reaped process 11559 (influxd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@crsimmons (Contributor)

We colocate InfluxDB and Grafana on the web VM for the out-of-the-box metrics. I guess it's possible that Concourse is producing a high volume of metrics, which is using up too much memory. I've also seen cases before where a frequent refresh rate on the Grafana dashboard slows down the web instance. I would expect that scaling up the web VM might resolve it.

@RichardBradley (Contributor, Author)

Thanks.

Increasing the instance size does seem to have helped so far. We'll keep an eye on it. I'll update here if we have anything further.

It's not ideal that the web machine enters a restart loop when under memory pressure. Ideally it would just run slower.

(Also, it's definitely not ideal that bosh auto restarting the web machine wipes the team permissions.)

(We don't use InfluxDB or Grafana. I think I asked on a different issue how to turn them off.)

@crsimmons (Contributor)

I added a flag last night that lets you opt out of deploying the colocated metrics stack. If you download the new release then you can deploy with --no-metrics to get rid of those extra processes.
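
(For example, assuming a deployment named ci in eu-west-1, and keeping the larger web VM from earlier:)

$ control-tower deploy --region eu-west-1 --web-size medium --no-metrics ci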

@beccar97
Thanks! Unfortunately, I tried using this release to deploy with the new flag and hit the following error:

Error getting CPI info:
  Executing external CPI command: '/home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/jobs/aws_cpi/bin/cpi':
    Running command: '/home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/jobs/aws_cpi/bin/cpi', stdout: '', stderr: 'bundler: failed to load command:/home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/bin/aws_cpi (/home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/bin/aws_cpi)
/home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/3.1.0/net/https.rb:23:in `require': cannot load such file -- openssl (LoadError)
Did you mean?  open3
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/3.1.0/net/https.rb:23:in `<top (required)>'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/seahorse/client/net_http/connection_pool.rb:5:in `require'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/seahorse/client/net_http/connection_pool.rb:5:in `<top (required)>'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/seahorse.rb:36:in `require_relative'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/seahorse.rb:36:in `<top (required)>'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/aws-sdk-core.rb:4:in `require'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/aws-sdk-core.rb:4:in `<top (required)>'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/lib/cloud/aws.rb:5:in `require'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/lib/cloud/aws.rb:5:in `<top (required)>'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/bin/aws_cpi:7:in `require'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/bin/aws_cpi:7:in `<top (required)>'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli/exec.rb:58:in `load'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli/exec.rb:58:in `kernel_load'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli/exec.rb:23:in `run'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli.rb:484:in `exec'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli.rb:31:in `dispatch'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli.rb:25:in `start'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/gems/3.1.0/gems/bundler-2.3.5/exe/bundle:48:in `block in <top (required)>'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/friendly_errors.rb:103:in `with_friendly_errors'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/gems/3.1.0/gems/bundler-2.3.5/exe/bundle:36:in `<top (required)>'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/bin/bundle:25:in `load'
        from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/bin/bundle:25:in `<main>'
':
      exit status 1

Exit code 

I found the same issue with the 0.18.0 and 0.18.1 releases, and had to go back to 0.17.30 to complete a successful deployment.

@crsimmons (Contributor)

Weird. That doesn't look like it should be related to anything in the new release(s). Sometimes the contents of the local ~/.bosh directory can get inexplicably broken. You could try deleting/renaming that directory and trying again.

Another possibility is that one of the BOSH prerequisites has gotten broken somehow on your machine.

@beccar97
Deleting that directory and updating all the prereqs fixed it, thanks! I noticed that although Grafana etc. are no longer running, which is great, there are still security group rules added in the -atc group for ports 3000, 8844 and 8443, all of which I believe are related to metrics. It would be nice if these rules weren't created when using the --no-metrics flag, since they are unneeded. Thanks for adding the flag; it's good to know we no longer have those processes using up space/memory unnecessarily :)

@crsimmons (Contributor)

I forgot about the firewall ports. I'll look into patching that out.

I'm glad you managed to get it deployed 😄. Why the ~/.bosh directory sometimes breaks is still a mystery to me even after all these years of working with BOSH...

@RichardBradley (Contributor, Author) commented Mar 23, 2022

> Increasing the instance size does seem to have helped so far. We'll keep an eye on it. I'll update here if we have anything further.

This does seem to have fixed things for us. Thanks for your help.

(The original issue at the top of this thread remains, AFAIK)

@crsimmons (Contributor)

I cut 0.18.2 over the weekend to remove the metrics ports from the firewall when metrics are disabled. FYI, ports 8844 and 8443 are CredHub and UAA respectively, so they are still required.

The original issue is more of a feature request to configure GitHub auth on the main team. I'll leave the issue open until that gets looked at.

crsimmons added the enhancement (New feature or request) label on Mar 28, 2022

@crsimmons (Contributor)

I just cut 0.19.0, which adds flags for configuring GitHub auth on the main team at deploy time. These settings should persist through web recreations.

A small note: the Concourse release options I chose to use only support setting the owner role on the main team. There is a more free-form option in the release where you can provide your own config, which would support configuring other roles, but I wasn't sure how to cleanly let users pass multiline strings to flags in Control Tower, so I left it out for now.
