Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

provider/google: Resolve the conversation around instance templates and startup scripts #10227

Closed
paddycarver opened this issue Nov 18, 2016 · 16 comments

Comments

@paddycarver
Copy link
Contributor

Pinging @JDiPierro, @cblecker, @mikemcrill, and @evandbrown, as they've all shown interest in this issue.

Essentially, the shortcoming is that instances support a metadata_startup_script property, but instance templates do not. This makes it less intuitive than it should be to turn an instance into an instance template.

We've got three PRs and an open issue on this already, so rather than having this conversation in all of them, I wanted to centralise stuff in a single issue.

After researching the history of this, it turns out that:

  • metadata_startup_script was born because we wanted instances to be recreated when the metadata.startup-script property changed, but because it's embedded in metadata, and we don't want instances to recreate when metadata changes, metadata_startup_script is necessary to force that.
  • metadata.startup-script was deprecated because having two places to set the variable was causing issues with our reads, so now an error is thrown if it is used.
  • Now that we have instance templates, we want to be able to use startup scripts on instance templates.
  • Instance templates are immutable, so updating their startup script involves creating a new instance template and destroying the old one.
  • Instance templates cannot be destroyed while they're in use by instance group managers.
  • Destroying an instance group manager destroys all the instances it's managing.
  • It's desirable that an instance should be recreated when its startup script changes, but it's not desirable that an instance group manager should be destroyed when its startup script changes.

So, after a lot of discussion yesterday, here's the proposed path forward:

  • I'll open a PR that adds the metadata_startup_script to instance templates.
  • Our documentation already shows that instance templates need lifecycle { create_before_destroy = true } to be used with instance group managers, which resolves the problem of destroying templates. The workflow is now: the new template is created, the instance group manager is updated to use that template, the old template is destroyed. This is all accomplished with create_before_destroy.
  • We'll deprecate the name attribute on instance templates, preferring the name_prefix, as create_before_destroy will have conflict issues if trying to use the same name (we can't have two templates with the same name at the same time). Because instance templates are used primarily for managed instance groups (according to the docs), the desire to use a static name for instance templates seems like an edge case.

The upshot of all this:

  • Creating managed instance groups will continue to work as documented, now with the added benefit of steering users away from giving their templates static names.
  • Updating a template attached to a managed instance group will work as expected: new instances will be created with that template, instances that already exist will not be recreated.
  • Instances (created by terraform, not an instance group manager) will continue to be immediately recreated when their startup script changes.
  • This behaviour matches our AWS provider, which is nice uniformity.
  • Minus the caveats of instances being/not being recreated and instance templates deprecating the name property, our instances and instance templates now behave the same.
  • Configurations that currently have a template with metadata.startup-script will receive an error to update to metadata_startup_script, just like instances.

Hopefully, this combines to form a set of behaviours that users expect. Note this means that we do not support instances updating their startup script without being recreated, but we were unable to come up with a use case for that, as Terraform has no concept of restarting. It also means we do not support updating the startup script for managed instance groups and immediately recreating them, though workarounds exist for that.

Does anyone have any issues with or objections to this proposed path forward?

References

@JDiPierro
Copy link
Contributor

JDiPierro commented Nov 18, 2016

It's desirable that an instance should be recreated when its startup script changes

My use case/issue actually prefers the opposite of this. I'd think the desired behavior would be up to the user. IMO metadata.startup-script should be favored since that's what's actually being set on the resource. It makes sense to me to keep configuration for a provider's resource as close to the provider's API as possible. A key on either the resource itself (that could get repetitive) or the provider could be used to determine if resources should be destroyed by a startup-script change.

Thanks for including me in the discussion :)

@paddycarver
Copy link
Contributor Author

Hey @JDiPierro, thanks for chiming in. I'd love to hear more about your use case. We couldn't come up with any reason anyone would want that behaviour, so I'd love to hear more about what you're trying to achieve.

@cblecker
Copy link
Contributor

I agree with @JDiPierro here. Startup scripts in the Google Cloud context are mutable, and changing them does not necessitate the immediate destruction of the resource. Startup scripts are not one-time provision scripts, but rather are run each and every time the instance is started/booted.

An example scenario:
An instance startup script could be used to check in with a configuration management system and ensure that the instance is up to date. Or an inventory system to ensure the instance's particulars are logged somewhere. The startup script can be stored in a file, and interpolated into Terraform. The startup script may have parameters that should be modified and changed via the API so that the next time the server is rebooted they take effect, but that doesn't mean the whole instance (including things like boot disks) need to be destroyed. The instance could simply be rebooted.

While instance_templates are immutable as a whole, running instance metadata (including startup scripts) are completely mutable. The action of separating this one piece of metadata and treating it differently than the rest also creates a disparity between Windows and *nix based systems, as Windows has it's own set of special metadata parameters.. one for first boot, and one for all boots.

More details on startup scripts is available here: https://cloud.google.com/compute/docs/startupscript

I would advocate for startup-script to be just another piece of mutable metadata in both the instance_template, and instance resource types. Should the user actually wish to destroy and recreate the resource, they can manually taint it.

A future enhancement to this (although perhaps pushing the limits of this particular discussion), would be a lifecycle option to manually specific that a parameter should ForceNew. Similar to how ignore_changes is a list of strings to specify which configuration parameters should be ignored, perhaps a force_destroy_on_change could be a list of strings that would force a destroy/create operation, rather than just a simple modify.

@paddycarver
Copy link
Contributor Author

Hey @cblecker,

Thanks for the notes, I really appreciate them. I have a few follow-up questions, if that's ok; I just want to understand what's going on so that informed decisions can be made. 😄

In all your example scenarios, I'm not 100% sure where restarting fits in. Is it common for machines to restart in your infrastructure? The only thing I can think of that would cause restarts is if you have the VM configured to restart when it shuts down--either from a migration or a hardware failure. I'm having trouble seeing why the machine would want to perform different actions when starting up as opposed to restarting. It also seems like a mismatch with terraform's goals to allow infrastructure's state to be defined by whether or not it has restarted, especially since (unless restarts are being manually triggered for some reason?) restarts are so unpredictable.

To give you a bit of context as to where we're coming from: in our discussions, startup scripts were considered part of the "identity" of an instance. To quote @sparkprime:

The idea is that if you use the metadata field startup-script to define the internal state of the instance, then you want changing that script to recreate the instance.

So that's the thinking behind my current stance. This is exactly the kind of feedback I was hoping for when I opened this issue, though, and I'm super grateful for any extra insight you can provide. I'm going to investigate what bringing back metadata.startup-script could mean and if it's possible, but in the meantime, I'd love any more information you (or anyone else!) can provide to help me understand why it's important.

@sparkprime
Copy link
Contributor

metadata_startup_script was not intended to replace metadata.startup-script but to augment it, as both "dynamically updating metadata" and "recreating instance upon startup-script change" are valid use-cases.

@sparkprime
Copy link
Contributor

There are other ways of determining internal instance state from terraform configs, but they require agents in the VM doing hanging gets on the metadata server. Simply recreating the instance on startup script change is much easier, although it takes a minute or so to recreate it (as opposed to it being instant by triggering an agent that makes the change within the VM).

@sparkprime
Copy link
Contributor

I would not advise modifying the template of an instance group, because all of the instances in an instance group are supposed to be based on the same template. In order to do a deployment strategy, you should be using more than one instance group. Turn up the new instance group with the new template, adjust the sizes and/or load balancer assignment to gradually roll out the change, and then eventually destroy the old one.

@paddycarver
Copy link
Contributor Author

Thanks for weighing in on this @sparkprime.

metadata_startup_script was not intended to replace metadata.startup-script but to augment it

Understood. @phinze had to remove it in GH-3507. I believe it was causing diff issues:

This seemed to be because GCE sets metadata.startup-script to a blank string on instance creation, and if a user specifies any metadata in their config this is seen as the desired full contents of metadata, so we get a diff trying to remove startup-script.

I'm currently investigating how to bring back metadata.startup-script without regressing back into those diff issues, as I think it would solve that issue nicely.

I would not advise modifying the template of an instance group, because all of the instances in an instance group are supposed to be based on the same template.

I don't know how advisable it is, but we seem to get into trouble when we start making it impossible to do things it's possible to do with the API. :) So I'd rather leave that functionality as it is, rather than opening a PR to yank it out. Though we do recommend the same strategy your recommend for deployments.

So, assuming I can get metadata.startup-script working as expected, does that resolve everyone's concerns about this proposed solution?

Thanks so much for all the feedback, it's incredibly helpful and I really appreciate it.

@cblecker
Copy link
Contributor

Yes, I'm comfortable with that path forward. That way both options are acceptable and accessible to users.

One additional suggestion to keep in mind that, barring any technical complications, would be to ensure that whatever schema and state changes that are done to google_compute_instance are done to google_compute_instance_template as well. Even through instance templates are immutable, and changing the startup-script in either place would trigger recreation of the template, it would be extremely powerful for a user to take an existing and working instance resource, and copy and paste it to a template with minimal to no changes.

@JDiPierro
Copy link
Contributor

Sounds like a great plan to me. Also I agree with @cblecker that being able to go from an instance definition to an instance template with minimal changes is extremely valuable.

@sparkprime
Copy link
Contributor

Agree with @cblecker it should be added to instance template too. You will never be able to simply copy/paste from instance to instance template but it makes sense to minimize the differences when it comes to "fake" fields.

@sparkprime
Copy link
Contributor

It might be interesting to generalize this. E.g. if I can say in my tf config that in this particular instance, changing metadata.startup-script should force new, then metadata_startup_script is not needed. Additionally, the fact that instance template is ForceNew in Terraform but not in the APIs could be addressed if it were possible to say in a specific config that this particular instance template is entirely ForceNew. Such customization would allow the user more control over roll-out strategies.

@paddycarver
Copy link
Contributor Author

One additional suggestion to keep in mind that, barring any technical complications, would be to ensure that whatever schema and state changes that are done to google_compute_instance are done to google_compute_instance_template as well. Even through instance templates are immutable, and changing the startup-script in either place would trigger recreation of the template, it would be extremely powerful for a user to take an existing and working instance resource, and copy and paste it to a template with minimal to no changes.

This is my goal. I've got some code passing most our tests right now that combines the logic for disks, instances, and instance templates, though I'll caution that I'm in exploratory mode and that just because I've got an implementation sketched out doesn't mean we'll use it or even that this is the right path forward. I just want to know what it means and what it looks like. :)

You will never be able to simply copy/paste from instance to instance template

I'd love to hear more about why not. So far the biggest differences I've seen is that instance templates are immutable where instances are not, and that instance templates are global, where instances are regional. So there are a lot of things one supports that the other doesn't, but it seems like trivial instances/instance templates could be copy/pasted back and forth without much issues. I'm going to create a test scenario to be sure, but more insight here would be appreciated.

It might be interesting to generalize this. E.g. if I can say in my tf config that in this particular instance, changing metadata.startup-script should force new, then metadata_startup_script is not needed. Additionally, the fact that instance template is ForceNew in Terraform but not in the APIs could be addressed if it were possible to say in a specific config that this particular instance template is entirely ForceNew. Such customization would allow the user more control over roll-out strategies.

That is an interesting path forward, and something we discussed a bit. At the moment, I'm trying to limit my scope a bit just because I'm new to the Terraform team, so I want to spend a bit more time working on it before I tackle any wider things that have an impact outside a single provider, so I'm aiming at a short-term fix. But that is certainly something I'd like to explore a bit more when I have the experience to gauge its merits and potential shortcomings a bit better.

@cblecker
Copy link
Contributor

I think this can be closed, as #10537 was merged :)

@paddycarver
Copy link
Contributor Author

Agreed. I was leaving it open to track something related that I stumbled upon in investigating this, but in hindsight, that seems silly. :) Closing this out, thanks everyone for your input and feedback!

@ghost
Copy link

ghost commented Apr 18, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 18, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants