Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Un-break templates when using vault stanza change_mode noop #11783

Merged
merged 1 commit into from
Jan 10, 2022

Conversation

grembo
Copy link
Contributor

@grembo grembo commented Jan 6, 2022

Templates in nomad jobs make use of the vault token defined in
the vault stanza when issuing credentials like client certificates.

When using change_mode "noop" in the vault stanza, consul-template
is not informed in case a vault token is re-issued (which can
happen from time to time for various reasons, as described
in https://www.nomadproject.io/docs/job-specification/vault).

As a result, consul-template will keep using the old vault token
to renew credentials and - once the token expired - stop renewing
credentials. The symptom of this problem is a vault_token
file that is newer than the issued credential (e.g., TLS certificate)
in a job's /secrets directory.

This change corrects this, so that h.updater.updatedVaultToken(token)
is called, which will inform stakeholders about the new
token and make sure, the new token is used by consul-template.

Example job template fragment:

vault {
    policies = ["nomad-job-policy"]
    change_mode = "noop"
}

template {
  data = <<-EOH
    {{ with secret "pki_int/issue/nomad-job"
    "common_name=myjob.service.consul" "ttl=90m"
    "alt_names=localhost" "ip_sans=127.0.0.1"}}
    {{ .Data.certificate }}
    {{ .Data.private_key }}
    {{ .Data.issuing_ca }}
    {{ end }}
  EOH
  destination = "${NOMAD_SECRETS_DIR}/myjob.crt"
  change_mode = "noop"
}

This fix does not alter the meaning of the three change modes of vault

  • "noop" - Take no action
  • "restart" - Restart the job
  • "signal" - send a signal to the task

as the switch statement following line 232 contains the necessary
logic.

It is assumed that "take no action" was never meant to mean "don't tell
consul-template about the new vault token".

Successfully tested in a staging cluster consisting of multiple
nomad client nodes.

Templates in nomad jobs make use of the vault token defined in
the vault stanza when issuing credentials like client certificates.

When using change_mode "noop" in the vault stanza, consul-template
is not informed in case a vault token is re-issued (which can
happen from time to time for various reasons, as described
in https://www.nomadproject.io/docs/job-specification/vault).

As a result, consul-template will keep using the old vault token
to renew credentials and - once the token expired - stop renewing
credentials. The symptom of this problem is a vault_token
file that is newer than the issued credential (e.g., TLS certificate)
in a job's /secrets directory.

This change corrects this, so that h.updater.updatedVaultToken(token)
is called, which will inform stakeholders about the new
token and make sure, the new token is used by consul-template.

Example job template fragment:

    vault {
        policies = ["nomad-job-policy"]
        change_mode = "noop"
    }

    template {
      data = <<-EOH
        {{ with secret "pki_int/issue/nomad-job"
        "common_name=myjob.service.consul" "ttl=90m"
        "alt_names=localhost" "ip_sans=127.0.0.1"}}
        {{ .Data.certificate }}
        {{ .Data.private_key }}
        {{ .Data.issuing_ca }}
        {{ end }}
      EOH
      destination = "${NOMAD_SECRETS_DIR}/myjob.crt"
      change_mode = "noop"
    }

This fix does not alter the meaning of the three change modes of vault

- "noop" - Take no action
- "restart" - Restart the job
- "signal" - send a signal to the task

as the switch statement following line 232 contains the necessary
logic.

It is assumed that "take no action" was never meant to mean "don't tell
consul-template about the new vault token".

Successfully tested in a staging cluster consisting of multiple
nomad client nodes.
@hashicorp-cla
Copy link

hashicorp-cla commented Jan 6, 2022

CLA assistant check
All committers have signed the CLA.

@grembo
Copy link
Contributor Author

grembo commented Jan 9, 2022

@DerekStrickland is there any additional information required at this point?

@DerekStrickland
Copy link
Contributor

DerekStrickland commented Jan 10, 2022

@grembo No additional information is needed at this time. We just need to review this internally and think through any implications or side-effects it might have. I also need to take a look at what tests failed and why.

We are very grateful that you took time to research this and submit a PR. I mean super grateful! Thanks so much for taking time to do this. I'll bring this up with the team and get you feedback as soon as I can.

@tgross tgross self-requested a review January 10, 2022 16:23
@tgross tgross self-assigned this Jan 10, 2022
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @grembo! In this scenario how does the workload actually get informed about the new secret? Is this just a case where the workload is starting a process from scratch regularly to pick up the change? (ex. a CGI script or something)

In any case, the behavior you're looking for here certainly seems to be implied by the docs and by this block in the loop:

if updatedToken {
	switch h.vaultStanza.ChangeMode {
	...
	case structs.VaultChangeModeNoop:
		fallthrough

Why have the fallthrough if it's never meant to be called with that value, right? But after the fallthrough we're calling the updatedVaultToken method which looks like this:

func (tr *TaskRunner) updatedVaultToken(token string) {
	// Update the task runner and environment
	tr.setVaultToken(token)
	// Trigger update hooks with the new Vault token
	tr.triggerUpdateHooks()
}

That last line to trigger the update hooks is worrisome, as it means we'd be potentially triggering task updates to Consul registered services and script checks as well. An example corner case here is if a client has restarted with new environment variables, the task environment for interpolation may be different, but this previously wouldn't normally trigger an update. With this change a service could be changed even for change_mode = "noop"

@grembo
Copy link
Contributor Author

grembo commented Jan 10, 2022

Hi @grembo!

Hi @tgross,

Thank you for responding so quickly.

In this scenario how does the workload actually get informed about the new secret? Is this just a case where the workload is starting a process from scratch regularly to pick up the change? (ex. a CGI script or something)

In this specific scenario the workload isn't interested in the vault token at all (ideally, it wouldn't even show up in /secrets), but only in the resulting TLS certificate. In order to issue and renew it, a vault stanza has to be present.

For the workload I tested with, I actually pick up the new certificate simply by watching the file (myjob.crt) as I can't use signals in a meaningful way and restarting isn't an option either. If I'm not mistaken, the same logic applies also for other change_modes in the template stanza (e.g., signal) though, as long as the vault stanza contains change_mode noop.

In any case, the behavior you're looking for here certainly seems to be implied by the docs and by this block in the loop:

if updatedToken {
	switch h.vaultStanza.ChangeMode {
	...
	case structs.VaultChangeModeNoop:
		fallthrough

Why have the fallthrough if it's never meant to be called with that value, right?

This is exactly what made me opt for removing the if statement instead of adding code in another place :)

But after the fallthrough we're calling the updatedVaultToken method which looks like this:

func (tr *TaskRunner) updatedVaultToken(token string) {
	// Update the task runner and environment
	tr.setVaultToken(token)
	// Trigger update hooks with the new Vault token
	tr.triggerUpdateHooks()
}

That last line to trigger the update hooks is worrisome, as it means we'd be potentially triggering task updates to Consul registered services and script checks as well. An example corner case here is if a client has restarted with new environment variables, the task environment for interpolation may be different, but this previously wouldn't normally trigger an update. With this change a service could be changed even for change_mode = "noop"

This is a valid point, but sounds like it is something users of vault change_mode=signal are already suffering from at the moment - setting it to "signal" as a user I would expect to exactly get the configured signal in case of token renewal and not also an updated environment in case I happened to restart nomad with an altered environment in the meantime (which is a pretty hard to debug/reproduce combination).

I can't tell if anybody out there depends on vault change_mode's exact current behavior (e.g., just create vault tokens in /secrets and pick those up and restarting nomad client with an altered environment in the process).

If this is a major concern, another change_mode could be introduced to address the situation (e.g., named "propagate") and documentation for change_mode noop could be adapted to discourage using it in combination with noop, maybe ultimately deprecating the combination, as it is inherently unreliable.

Update: All of this assuming that the call to tr.triggerUpdateHooks() is actually required to propagate the new token to consul-template - I didn't dig deep enough into the code yet to confirm this assumption.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for the thorough reply @grembo. I'm convinced.

LGTM 👍 We'll get this merged and ship it in the upcoming scheduled release.

@tgross tgross merged commit e9032c1 into hashicorp:main Jan 10, 2022
tgross added a commit that referenced this pull request Jan 10, 2022
@tgross tgross added this to the 1.2.4 milestone Jan 10, 2022
@grembo
Copy link
Contributor Author

grembo commented Jan 10, 2022

@tgross Perfect, thank you!

tgross added a commit that referenced this pull request Jan 10, 2022
tgross added a commit that referenced this pull request Jan 11, 2022
tgross added a commit that referenced this pull request Jan 11, 2022
lgfa29 pushed a commit that referenced this pull request Jan 17, 2022
lgfa29 pushed a commit that referenced this pull request Jan 17, 2022
Templates in nomad jobs make use of the vault token defined in
the vault stanza when issuing credentials like client certificates.

When using change_mode "noop" in the vault stanza, consul-template
is not informed in case a vault token is re-issued (which can
happen from time to time for various reasons, as described
in https://www.nomadproject.io/docs/job-specification/vault).

As a result, consul-template will keep using the old vault token
to renew credentials and - once the token expired - stop renewing
credentials. The symptom of this problem is a vault_token
file that is newer than the issued credential (e.g., TLS certificate)
in a job's /secrets directory.

This change corrects this, so that h.updater.updatedVaultToken(token)
is called, which will inform stakeholders about the new
token and make sure, the new token is used by consul-template.

Example job template fragment:

    vault {
        policies = ["nomad-job-policy"]
        change_mode = "noop"
    }

    template {
      data = <<-EOH
        {{ with secret "pki_int/issue/nomad-job"
        "common_name=myjob.service.consul" "ttl=90m"
        "alt_names=localhost" "ip_sans=127.0.0.1"}}
        {{ .Data.certificate }}
        {{ .Data.private_key }}
        {{ .Data.issuing_ca }}
        {{ end }}
      EOH
      destination = "${NOMAD_SECRETS_DIR}/myjob.crt"
      change_mode = "noop"
    }

This fix does not alter the meaning of the three change modes of vault

- "noop" - Take no action
- "restart" - Restart the job
- "signal" - send a signal to the task

as the switch statement following line 232 contains the necessary
logic.

It is assumed that "take no action" was never meant to mean "don't tell
consul-template about the new vault token".

Successfully tested in a staging cluster consisting of multiple
nomad client nodes.
lgfa29 pushed a commit that referenced this pull request Jan 18, 2022
Templates in nomad jobs make use of the vault token defined in
the vault stanza when issuing credentials like client certificates.

When using change_mode "noop" in the vault stanza, consul-template
is not informed in case a vault token is re-issued (which can
happen from time to time for various reasons, as described
in https://www.nomadproject.io/docs/job-specification/vault).

As a result, consul-template will keep using the old vault token
to renew credentials and - once the token expired - stop renewing
credentials. The symptom of this problem is a vault_token
file that is newer than the issued credential (e.g., TLS certificate)
in a job's /secrets directory.

This change corrects this, so that h.updater.updatedVaultToken(token)
is called, which will inform stakeholders about the new
token and make sure, the new token is used by consul-template.

Example job template fragment:

    vault {
        policies = ["nomad-job-policy"]
        change_mode = "noop"
    }

    template {
      data = <<-EOH
        {{ with secret "pki_int/issue/nomad-job"
        "common_name=myjob.service.consul" "ttl=90m"
        "alt_names=localhost" "ip_sans=127.0.0.1"}}
        {{ .Data.certificate }}
        {{ .Data.private_key }}
        {{ .Data.issuing_ca }}
        {{ end }}
      EOH
      destination = "${NOMAD_SECRETS_DIR}/myjob.crt"
      change_mode = "noop"
    }

This fix does not alter the meaning of the three change modes of vault

- "noop" - Take no action
- "restart" - Restart the job
- "signal" - send a signal to the task

as the switch statement following line 232 contains the necessary
logic.

It is assumed that "take no action" was never meant to mean "don't tell
consul-template about the new vault token".

Successfully tested in a staging cluster consisting of multiple
nomad client nodes.
lgfa29 pushed a commit that referenced this pull request Jan 18, 2022
@github-actions
Copy link

github-actions bot commented Nov 3, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
Development

Successfully merging this pull request may close these issues.

5 participants