Conversation
Could you update this PR with a little more context around what is going on here? Some questions:
- Is this all freshly written code, or are parts of it copied from elsewhere?
- Are we considering integrating with Terraform at the provider gRPC API level (as we have done previously)? I thought we were going to investigate integrating at the terraform CLI level?
cmd/provider/main.go (outdated):
```go
func main() {
	conf := &ProviderConfig{}
	// initialize a cobra command for nats-monitoring
```
I don't think this is for nats-monitoring. :)
Resolved via removal (please see the comment below on the removal of the provider wrapper implementation).
Initially, I believe it makes sense to continue without the shared gRPC server and only with provider binary mirroring for production, so that we can safely invoke terraform init in synchronous contexts and do things in the documented ways Terraform expects. The shared gRPC server is certainly a hack that relies on undocumented features used for testing providers and for running the Terraform CLI on Windows.
But it looks like we have an option if we do hit certain scalability issues in the long run. The idea is still to run the Terraform CLI behind the scenes, but instead of letting the Terraform commands fork provider gRPC servers when they are about to make gRPC calls to the provider, we have (for now) an option to make them share a provider gRPC server.
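To make the binary-mirroring flow concrete, here is a minimal sketch of forking terraform init from Go; the workspace handling, timeout, and flags are assumptions for illustration, not the PR's actual implementation.

```go
package example

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runTerraformInit sketches invoking `terraform init` synchronously in a
// per-resource workspace, relying on the mirrored provider binary so that
// no registry lookup is needed at init time.
func runTerraformInit(ctx context.Context, workspaceDir string) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute) // arbitrary timeout for the sketch
	defer cancel()

	cmd := exec.CommandContext(ctx, "terraform", "init", "-input=false", "-no-color")
	cmd.Dir = workspaceDir
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("terraform init failed: %w: %s", err, string(out))
	}
	return nil
}
```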
cmd/provider/main.go (outdated):
```go
	if err := cmd.Execute(); err != nil {
		panic(err)
	}
}
```
We typically use https://github.com/alecthomas/kong (or its predecessor kingpin) for all Crossplane CLI parsing. I'd prefer not to introduce cobra unless there is a good reason to do so.
Resolved via removal. I have removed the provider wrapper implementation, which is for running a Terraform provider plugin as a shared gRPC server. I will also add some more details to the PR description about our current strategy for wrapping the Terraform CLI.
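For reference, a minimal kong-based sketch of what the CLI parsing could look like; the flag names, env bindings, and program description below are hypothetical.

```go
package main

import "github.com/alecthomas/kong"

// The flags and env bindings below are illustrative only.
var cli struct {
	Debug           bool   `help:"Enable debug logging."`
	ProviderSource  string `help:"Terraform provider source." env:"XP_TERRAFORM_PROVIDER_SOURCE"`
	ProviderVersion string `help:"Terraform provider version." env:"XP_TERRAFORM_PROVIDER_VERSION"`
}

func main() {
	ctx := kong.Parse(&cli,
		kong.Name("provider"),
		kong.Description("A Terraform-based Crossplane provider."))
	// Start the controller manager here; any startup error is reported via kong.
	ctx.FatalIfErrorf(run())
}

// run stands in for the provider's actual startup logic.
func run() error { return nil }
```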
pkg/log/core.go (outdated):
```go
limitations under the License.
*/

package log
```
Can we stick with https://github.com/crossplane/crossplane-runtime/tree/master/pkg/logging instead?
Done. I have removed the log package that I copied from the wesaas repo.
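For completeness, a minimal sketch of wiring crossplane-runtime's logging package in place of a custom log package; the logger name and messages are placeholders.

```go
package main

import (
	"github.com/crossplane/crossplane-runtime/pkg/logging"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Wrap a zap-backed logr.Logger in crossplane-runtime's logging.Logger
	// and pass it down to controllers instead of a bespoke log package.
	zl := zap.New(zap.UseDevMode(true))
	log := logging.NewLogrLogger(zl.WithName("provider-terraform"))
	log.Debug("Logger initialized")
	log.Info("Starting provider")
}
```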
I don't love the idea of tracking our operations using filesystem locks, because then we have to manage two locks (including Terraform's own), but that's not blocking. I'll open a separate issue to discuss.
pkg/tfcli/builders.go (outdated):
```go
func providerFromEnv() *withProvider {
	p := &withProvider{}
	if p.source == "" {
		p.source = os.Getenv(envTFProviderSource)
```
Not a strong opinion at all, but I always find it risky for libraries to rely on environment variables for anything. I'd expect the importers of this repository to handle the user input from wherever they want and give it to exposed structs explicitly.
Environment variables carry some undesired traits similar to global variables, but they also provide some convenience when you have layers of abstraction: you can control the behavior at a lower layer without having to flow configuration through intermediate layers. Certain Terraform CLI behavior is controlled exclusively via environment variables, and similarly it was convenient to introduce these two environment variables (XP_TERRAFORM_PROVIDER_SOURCE and XP_TERRAFORM_PROVIDER_VERSION) that, together with some Terraform configuration via the TF_CLI_CONFIG_FILE environment variable, control the plugin binary mirroring behavior.
I would suggest keeping them initially for convenience and development purposes, but what's important here is, as we have previously discussed, that we need the source and version info of the Terraform provider against which we have generated code to be available at runtime for the tfcli library and ultimately for the Terraform CLI.
I agree, I'm not against environment variables. It's just that when a library uses them, it forcefully imposes that specific set of configuration points on its consumers (provider binaries in this case), leaving them no way to pre-process the variables or opt out.
Unfortunately I do have a strong opinion about environment variables. 🤓 I'm all for using them, but I strongly prefer them to all be "loaded" from the environment in one predictable place - typically in a program's main() function. Having os.Getenv calls scattered around the codebase makes it difficult to keep track of the various environment variables (and more broadly, configuration sources) that a program uses. For libraries I feel it is best to take configuration as inputs and leave operational concerns like configuration, logging, etc. to the caller.
Done: removed tfcli configuration via environment variables.
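A small sketch of the pattern being asked for: the environment is read once in main() and the values are handed to the library explicitly; the config struct and the commented-out builder call are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// providerConfig is a hypothetical stand-in for whatever explicit inputs the
// tfcli package ends up exposing.
type providerConfig struct {
	Source  string
	Version string
}

func main() {
	// Read the environment in exactly one predictable place...
	cfg := providerConfig{
		Source:  os.Getenv("XP_TERRAFORM_PROVIDER_SOURCE"),
		Version: os.Getenv("XP_TERRAFORM_PROVIDER_VERSION"),
	}
	// ...and pass the values on explicitly, e.g.
	// tfcli.NewClient(tfcli.WithProvider(cfg.Source, cfg.Version)) (hypothetical API).
	fmt.Printf("provider %s@%s\n", cfg.Source, cfg.Version)
}
```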
pkg/tfcli/client.go (outdated):
```go
// returns true if state file exists
// TODO(aru): differentiate not-found error
func (c *client) loadStateFromWorkspace(errNoState bool) (bool, error) {
```
It looks like all usages of loadStateFromWorkspace already return c.state.tfState immediately. What do you think about making the function signature func (c *client) loadStateFromWorkspace() ([]byte, error) here and returning os.ErrNotExist (or a custom error that callers can check) in case the state doesn't exist?
Done.
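A standalone sketch of the revised behavior, returning the raw state bytes plus a checkable sentinel error; the state file location and error value are assumptions for the example.

```go
package example

import (
	"errors"
	"os"
	"path/filepath"
)

// errNoState lets callers distinguish "no state yet" from real read failures.
var errNoState = errors.New("terraform state file does not exist")

// loadState reads the workspace's state file and surfaces a checkable error
// when it is missing, instead of taking a boolean flag parameter.
func loadState(wsPath string) ([]byte, error) {
	buff, err := os.ReadFile(filepath.Join(wsPath, "terraform.tfstate"))
	if errors.Is(err, os.ErrNotExist) {
		return nil, errNoState
	}
	if err != nil {
		return nil, err
	}
	return buff, nil
}
```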
```go
func (c *client) Close(_ context.Context) error {
	var result error
	if c.pInfo != nil {
		result = multierr.Combine(errors.Wrap(c.pInfo.Kill(), errDestroyProcess))
```
After pInfo is initialized, we complete the reconciliation and come back again. If the operation is still continuing at that point, isn't c.pInfo nil? How do we save which process to kill in the subsequent reconciliations?
tfcli.Client can be running a single process at a given time; pInfo refers to that process if set. The common controller calls Close after an MR is deleted successfully. We don't aim to keep track of all Terraform CLI process forks done in the context of an MR via multiple Clients initialized with the same handle corresponding to the same workspace; the workspace state prevents that scenario anyway.
Because we rely on Terraform state locks, however, we do need a stale-lock cleanup mechanism, as we have been discussing. We will address that in a separate PR.
I'm not sure I got the relation with locks. What I meant is that, from my reading, c.pInfo will be non-nil only if the operation was started in this reconciliation. In the next one, i.e. the in-progress state, we initialize a new client but don't call the process.New function because a process is already in progress. Then when we trigger a deletion we don't close right away; we check its progress in the next reconciliation and close it only once it's completed. But then c.pInfo is nil because that's a new reconciliation where c.pInfo wasn't initialized. So Close(ctx) may not be cleaning up anything, because at line 258 c.pInfo will be nil all the time.
For async operations, we do not hold references to forked processes across reconciliation boundaries. Currently, we do not need to do that because we are observing the workspace state. We will block on deletion if there is an ongoing command.
```go
		return nil, "", cliErrors.NewPipelineInProgressError(cliErrors.PipelineNotStarted)
	}

	if err := c.checkTFStateLock(); err != nil {
```
c.initConfiguration returns whether we're in progress or not by looking at .xp.lock, and here we check .terraform.tfstate.lock.info. Do we expect a case where these do not match and we're unable to report an error in lower parts of the stack? I'm wondering whether we actually gain any error-case information by checking the TF state lock file beyond what the terraform execution itself would return.
The Terraform state lock (.terraform.tfstate.lock.info) is managed by the Terraform CLI and tells us that there is a Terraform command running in the workspace. .xp.lock (the Crossplane lock) is managed by tfcli.Client and tells us that there is an active operation whose result has not yet been reported to the common controller.
At this line, we have already asserted that there is an ongoing operation of the expected type whose result has not yet been collected by the tfcli consumer (the common controller). And now we are checking whether the Terraform command has completed or not.
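As a sketch, the check at this point amounts to looking for Terraform's own lock file in the workspace; the helper name and error text below are illustrative.

```go
package example

import (
	"fmt"
	"os"
	"path/filepath"
)

// tfStateLockFile is the lock file the Terraform CLI holds while a command
// is operating on the state, as referenced in the discussion above.
const tfStateLockFile = ".terraform.tfstate.lock.info"

// checkTFStateLock reports an error while a Terraform command is still
// running in the workspace, and nil once the lock file is gone.
func checkTFStateLock(wsPath string) error {
	_, err := os.Stat(filepath.Join(wsPath, tfStateLockFile))
	if err == nil {
		return fmt.Errorf("terraform command still running in workspace %s", wsPath)
	}
	if os.IsNotExist(err) {
		return nil
	}
	return err
}
```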
There are also some linter errors that we would need to resolve before moving on. Please check the output of make reviewable.
```go
	result.stderr = stderr

	if autoStart {
		return result, result.cmd.Start()
```
Should we aim to (re)use pi.Run() in the autoStart case as well, to unify the way we run the process? Or even better, could we completely remove the autoStart option and always expect the caller to run pi.Run()?
autoStart is a convenience option to automatically fork the process without blocking the call to process.New for async pipelines; thus, process.New is always non-blocking. Run expects a context.Context and waits for the completion of the forked process. So the flow is either process.New -> pi.Start (either via the autoStart option or via an explicit call to process.Info.Start) -> poll until completion for async pipelines, or process.New -> a blocking call to process.Info.Run.
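The distinction maps onto the standard library's exec.Cmd semantics (Start/Wait vs Run); the following is a minimal os/exec analogue of the two flows, not the actual process package API.

```go
package example

import (
	"context"
	"os/exec"
)

// startAsync mirrors the async flow: fork the process and return immediately.
// Completion is observed later (in this design, by polling the workspace
// rather than by holding on to the process handle across reconciliations).
func startAsync(ctx context.Context, name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.CommandContext(ctx, name, args...)
	return cmd, cmd.Start()
}

// runBlocking mirrors the sync flow: fork the process and wait for it to exit.
func runBlocking(ctx context.Context, name string, args ...string) error {
	return exec.CommandContext(ctx, name, args...).Run()
}
```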
```go
	}()

	select {
	case <-ctx.Done():
```
Can we signal the running process to do a graceful stop in this case?
Interesting point. When we consider the cleanup logic in a separate PR, we may consider sending a SIGINT.
A way to kill the ongoing process would be nice for deletion, too. Right now, if a deletion call comes in and there is an ongoing apply operation, we just wait for it to complete instead of sending a kill signal to the apply process.
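A sketch of what a graceful stop could look like, assuming a handle to the forked exec.Cmd is available; the grace period is arbitrary, and Windows would need different signal handling.

```go
package example

import (
	"os/exec"
	"syscall"
	"time"
)

// gracefulStop asks the running terraform process to stop with SIGINT, which
// Terraform handles by aborting the operation, and falls back to killing it
// after a grace period.
func gracefulStop(cmd *exec.Cmd, grace time.Duration) error {
	if cmd.Process == nil {
		return nil // never started
	}
	if err := cmd.Process.Signal(syscall.SIGINT); err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		return err
	case <-time.After(grace):
		return cmd.Process.Kill()
	}
}
```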
```go
	return pi.Run(ctx)
}

func (pi *Info) Start() error {
```
Similarly, can we drop Start() and always use Run? For example, Run(context.Background()) if no context is available.
I would prefer to have separate calls for the blocking process.Info.Run and the non-blocking process.Info.Start.
pkg/tfcli/init.go (outdated):
```go
func (c *client) checkOperation() error {
	xpStatePath := filepath.Join(c.wsPath, fileStateLock)
	// Terraform state lock file does not seem to contain operation type
	buff, err := ioutil.ReadFile(xpStatePath)
```
We had a couple of offline discussions around some possible alternatives to relying on filesystem locks (Terraform's and now ours), but I would love to revisit them without blocking this PR.
Yes, agreed. As discussed previously, I think we can experiment with some alternatives to see how reliable they are, by testing whether we violate any assumptions made by the Terraform CLI, and how scalable those alternatives would be in terms of memory, concurrency, etc.
I haven't reviewed thoroughly, but a few comments primarily around style and idiomatic Go. Will we add tests to this PR before it is merged?
Thanks Hasan, fixed those issues.
Thanks @negz. We are planning to merge this PR without automated tests. I've manually tested CRUD operations on top of the common controller. I've opened #39 to track test implementation.
Thanks @muvaf for the review.
pkg/conversion/cli.go (outdated):
```go
	if tferrors.IsApplying(err) {
		// then resource was deleted while an apply operation was in-progress
		// we need to wait/consume this final Apply operation
		_, err = t.CreateOrUpdate(ctx, tr)
```
I'm wondering if there is a way for the underlying system to report on its own that it's done, so that we'd just check it, instead of making us call CreateOrUpdate just to ping it so that it looks at the filesystem and checks whether the operation is completed. I don't feel comfortable calling CreateOrUpdate inside the Delete function.
Done: added tfcli.Client.DiscardOperation and replaced the call to CreateOrUpdate with it.
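A rough sketch of the resulting deletion flow; the interface and sentinel error below are stand-ins for the real tfcli and tferrors types, whose signatures are not shown in this excerpt.

```go
package example

import (
	"context"
	"errors"
)

// errApplying stands in for whatever error tferrors.IsApplying detects.
var errApplying = errors.New("an apply operation is in progress")

// client captures only the two calls relevant to this thread; the method set
// is assumed for the example.
type client interface {
	Destroy(ctx context.Context) error
	DiscardOperation(ctx context.Context) error
}

// deleteResource discards a pending apply instead of consuming it via a
// CreateOrUpdate call from inside the deletion path.
func deleteResource(ctx context.Context, c client) error {
	err := c.Destroy(ctx)
	if errors.Is(err, errApplying) {
		return c.DiscardOperation(ctx)
	}
	return err
}
```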
```go
	if c.wsPath != "" {
		result = multierr.Combine(result, errors.Wrap(c.fs.RemoveAll(c.wsPath), errDestroyWorkspace))
	}
```
Suggested change:
```go
if c.wsPath != "" {
	if err := c.fs.RemoveAll(c.wsPath); err != nil {
		return errors.Wrap(err, errDestroyWorkspace)
	}
}
```
It makes it much easier to skim through the code quickly if we use the existing idiomatic way of error handling. In fact, I'd suggest not using multierr.Combine at all, and going for sequential if err := ... checks if we need to have a series of steps.
Here we are tearing down multiple resources and if any of them fails, we would still like to tear the rest down and report any failures to the caller.
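A small sketch of that pattern with go.uber.org/multierr, where every teardown step runs and all failures are reported together; the two steps here are stand-ins.

```go
package example

import (
	"go.uber.org/multierr"
)

// closeAll attempts every teardown step even if earlier ones fail, and
// returns the combined error (nil only if every step succeeded).
func closeAll(killProcess, removeWorkspace func() error) error {
	var result error
	if killProcess != nil {
		result = multierr.Append(result, killProcess())
	}
	if removeWorkspace != nil {
		result = multierr.Append(result, removeWorkspace())
	}
	return result
}
```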
Commits in this PR (each signed off by Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>):
- Terraform CLI async pipeline scheduler
- If there are no changes in the plan, do not try to parse it
- Do refactorings in tfcli package
- OperationType is used outside of error processing to track ongoing Terraform CLI pipelines, thus it makes sense to move it out of the tfcli/errors package
- Reintroduce tfcli/errors.{IsApplying,IsDestroying} functions
- Remove Client.closeOnError; use afero.Fs; add async operation timeouts; fix common controller deletion logic
LGTM! @ulucinar Thank you for your patience in bearing with the review process!
https://github.com/upbound/purple-squad/issues/626
https://github.com/upbound/purple-squad/issues/625