Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Async Terraform CLI library #3

Merged
merged 15 commits into from
Sep 8, 2021

Conversation

ulucinar
Copy link
Collaborator

@ulucinar ulucinar commented Aug 12, 2021

@ulucinar ulucinar requested review from turkenh and muvaf August 12, 2021 11:10
Copy link
Member

@negz negz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update this PR with a little more context around what is going on here? Some questions:

  • Is this all freshly written code, or are parts of it copied from elsewhere?
  • Are we considering integrating with Terraform at the provider gRPC API level (as we have done previously)? I thought we were going to investigate integrating at the terraform CLI level?


func main() {
conf := &ProviderConfig{}
// initialize a cobra command for nats-monitoring
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is for nats-monitoring. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved via removal (please see the comment below on the removal of the provider wrapper implementation).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I believe it makes sense to continue without the shared gRPC server and only with provider binary mirroring for production, so that we can safely invoke terraform inits in synchronous contexts and we are doing things in the documented ways Terraform expects us to do. Shared gRPC server is certainly a hack that relies on undocumented features used for testing providers and running the Terraform CLI on Windows.

But it looks like we have an option if we do hit certain scalability issues in the long run. The idea is still to run the Terraform CLI behind the scenes, but instead of letting the Terraform commands fork provider gRPC servers when they are about to make gRPC calls to the provider, we have (for now) an option to make them share a provider gRPC server.

if err := cmd.Execute(); err != nil {
panic(err)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically use https://github.com/alecthomas/kong (or its predecessor kingpin) for all Crossplane CLI parsing. I'd prefer not to introduce cobra unless there is a good reason to do so.

Copy link
Collaborator Author

@ulucinar ulucinar Aug 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved via removal. I have removed the provider wrapper implementation, which is for running a Terraform provider plugin as a shared gRPC server. I will also add some more details to the PR description about our current strategy for wrapping the Terraform CLI.

pkg/log/core.go Outdated
limitations under the License.
*/

package log
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I have removed the log package that I copied from wesaas repo.

@ulucinar ulucinar marked this pull request as draft August 13, 2021 09:11
@ulucinar ulucinar force-pushed the aru/provider-wrapper branch 3 times, most recently from 6ba8820 to bae691b Compare August 22, 2021 22:31
@ulucinar ulucinar marked this pull request as ready for review August 22, 2021 22:38
@ulucinar ulucinar force-pushed the aru/provider-wrapper branch 3 times, most recently from 89abf36 to d316670 Compare August 23, 2021 16:36
Copy link
Member

@muvaf muvaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the idea we're tracking our operations using the filesystem locks because then we have to manage two locks, including Terraform's own, but that's not blocking. I'll open a separate issue to discuss.

pkg/conversion/cli.go Outdated Show resolved Hide resolved
pkg/tfcli/client.go Outdated Show resolved Hide resolved
func providerFromEnv() *withProvider {
p := &withProvider{}
if p.source == "" {
p.source = os.Getenv(envTFProviderSource)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a strong opinion at all, but I always find it risky for libraries to rely on environment variables for anything. I'd expect the importers of this repository to handle the user input from wherever they want and give it to exposed structs explicitly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Environment variables carry some undesired traits similar to global variables but they also provide some convenience when you have layers of abstractions, so that you can control the behavior at a lower layer without having to flow configuration through intermediate layers. Terraform CLI's certain behavior is exclusively controlled via environment variables and similar to that it was convenient to introduce these two environment variables (XP_TERRAFORM_PROVIDER_SOURCE and XP_TERRAFORM_PROVIDER_VERSION) that, together with some Terraform configuration via the TF_CLI_CONFIG_FILE env. variable control the plugin binary mirroring behavior.

I would suggest to keep them initially for convenience & development purposes but what's important here is, as we have previously discussed, we need to have the source & version info of the Terraform provider against which we have generated code available at runtime for the tfcli library and ultimately for the Terraform CLI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I'm not against environment variables. It's just when a library uses them, it imposes that specific set of configuration points on its consumers (provider binaries in this case) forcefully, leaving no way for pre-processing the variables or denying such ability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I do have a strong opinion about environment variables. 🤓 I'm all for using them, but I strongly prefer them to all be "loaded" from the environment in one, predictable place - typically in a program's main() function. Having os.Getenv calls scattered around the codebase makes it difficult to keep track of the various environment variables (and more broadly, configuration sources) that a program uses. For libraries I feel it is best to take configuration as inputs and leave things operational like configuration, logging, etc to the caller.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: removed tfcli configuration via environment variables.

pkg/tfcli/builders.go Outdated Show resolved Hide resolved
pkg/tfcli/builders.go Outdated Show resolved Hide resolved
pkg/tfcli/refresh.go Outdated Show resolved Hide resolved

// returns true if state file exists
// TODO(aru): differentiate not-found error
func (c *client) loadStateFromWorkspace(errNoState bool) (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like all usages of loadStateFromWorkspace returns c.state.tfState immediately already. What do you think about making the function signature func (c *client) loadStateFromWorkspace() ([]byte, error) here and return os.ErrNotExist (or a custom one that they can check) error in case it doesn't exist?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

func (c *client) Close(_ context.Context) error {
var result error
if c.pInfo != nil {
result = multierr.Combine(errors.Wrap(c.pInfo.Kill(), errDestroyProcess))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After pInfo is initialized, we complete the reconciliation and come back again. If the operation is still continuing, at that point, isn't c.pInfo is nil? How do we save which process to kill in the subsequent reconciliations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tfcli.Client can be running a single process at a given time. pInfo is related to that process if set. Common controller calls Close after an MR is deleted successfully. We don't aim to keep track of all Terraform CLI process forks done in the context of an MR via multiple Clients initialized with the same handle corresponding to the same workspace. The workspace state prevents that scenario anyway.

Relying on Terraform state locks, however, we do need a stale lock cleanup mechanism as we have been discussing. We will address that in a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I got the relation with locks. What I meant is that, from my reading, c.pInfo will be non-nil only if the operation is started in this reconciliation. In the next one, i.e. in-progress state, we initialize a new client but don't call process.New function because a process is in progress. Then when we trigger a deletion we don't close right away, we check its progress in the next reconciliation and close it only if it's completed. But then c.pInfo is nil because that's a new reconciliation where c.pInfo wasn't initialized. So Close(ctx) may not be cleaning up anything because in line 258 p.Info will be nil all the time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For async operations, we do not hold references to forked processes across reconciliation boundaries. Currently, we do not need to that because we are observing the workspace state. We will block on deletion if there is an ongoing command.

pkg/tfcli/destroy.go Outdated Show resolved Hide resolved
return nil, "", cliErrors.NewPipelineInProgressError(cliErrors.PipelineNotStarted)
}

if err := c.checkTFStateLock(); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c.initConfiguration returns whether we're in progress or not looking at .xp.lock and here we check .terraform.tfstate.lock.info. Do we expect a case where these do not match and we're unable to report error in lower parts of the stack? I'm wondering if we actually gain an error case information by checking TF state file other than what terraform execution would return.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terraform state lock (.terraform.tfstate.lock.info) is managed by the Terraform CLI and tells us that there is a Terraform command running in the workspace. .xp.lock (Crossplane lock) is managed by tfcli.Client and tells us that there is an active operation whose result is not reported yet (to the common controller).

At this line, we have already asserted that there is already an ongoing operation, with the expected type, whose result has not yet been collected by the tfcli consumer (common controller). And now we are checking if the Terraform command has completed or not.

Copy link
Member

@turkenh turkenh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also some linter errors that we would need to resolve before moving on.

Please check the output of make reviewable.

result.stderr = stderr

if autoStart {
return result, result.cmd.Start()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we target to (re)use pi.Run() in case of autoStart as well? To unify the way we run the process.
Or even better, could we completely remote autoStart option and always expect the caller to run pi.Run()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

autoStart is a convenience option to automatically fork the process without blocking the call to process.New for async pipelines. Thus, process.New is always non-blocking.
Run expects a context.Context and waits for the completion of the forked process. So the flow is either process.New -> pi.Start (either with the autostart option or via an explicit call to process.Info.Start -> poll until completion for async pipelines, or process.New -> blocking call to process.Info.Run.

}()

select {
case <-ctx.Done():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we signal the running process to do a graceful stop in this case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point. When we are considering cleanup logic in a separate PR, we may consider sending a SIGINT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A way to kill the ongoing process would be nice for deletion, too. Right now, if deletion call comes in and there is an ongoing apply operation, we just wait for it to complete instead of sending a kill signal to apply process.

return pi.Run(ctx)
}

func (pi *Info) Start() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, can we drop Start()and use always use Run? Like Run(context.Background()) is no context available.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have separate calls for the blocking process.Info.Run and non-blocking process.Info.Start.

func (c *client) checkOperation() error {
xpStatePath := filepath.Join(c.wsPath, fileStateLock)
// Terraform state lock file does not seem to contain operation type
buff, err := ioutil.ReadFile(xpStatePath)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a couple of offline discussions around some possible alternatives to relying on filesystem locks (terraforms and now ours) but would love to revisit them without blocking this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed. As discussed previously, I think we can experiment with some alternatives to see how reliable they are by testing if we violate any assumptions by the Terraform CLI, and how scalable would those alternatives be in terms of memory, concurrency, etc.

Copy link
Member

@negz negz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed thoroughly, but a few comments primarily around style and idiomatic Go. Will we add tests to this PR before it is merged?

pkg/tfcli/apply.go Outdated Show resolved Hide resolved
func providerFromEnv() *withProvider {
p := &withProvider{}
if p.source == "" {
p.source = os.Getenv(envTFProviderSource)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I do have a strong opinion about environment variables. 🤓 I'm all for using them, but I strongly prefer them to all be "loaded" from the environment in one, predictable place - typically in a program's main() function. Having os.Getenv calls scattered around the codebase makes it difficult to keep track of the various environment variables (and more broadly, configuration sources) that a program uses. For libraries I feel it is best to take configuration as inputs and leave things operational like configuration, logging, etc to the caller.

pkg/tfcli/builders.go Outdated Show resolved Hide resolved
pkg/tfcli/client.go Show resolved Hide resolved
@ulucinar
Copy link
Collaborator Author

There are also some linter errors that we would need to resolve before moving on.

Please check the output of make reviewable.

Thanks Hasan, fixed those issues.

@ulucinar
Copy link
Collaborator Author

I haven't reviewed thoroughly, but a few comments primarily around style and idiomatic Go. Will we add tests to this PR before it is merged?

Thanks @negz. We are planning to merge this PR without automatic tests. I've manually tested CRUD operations on top of the common controller. I've opened #39 to track test implementation.

@ulucinar
Copy link
Collaborator Author

Thanks @muvaf for the review.

Copy link
Member

@muvaf muvaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only blocking comment is related to the CreateOrUpdate call under Delete.

As we discussed, we'll need to do another pass to apply existing Go patterns we use in other repos but we'd like to merge it now to unblock other work.

#39 and possibly #36 could be a good issues to introduce such changes.

pkg/conversion/cli.go Outdated Show resolved Hide resolved
pkg/conversion/cli.go Outdated Show resolved Hide resolved
if tferrors.IsApplying(err) {
// then resource was deleted while an apply operation was in-progress
// we need to wait/consume this final Apply operation
_, err = t.CreateOrUpdate(ctx, tr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if there is a way for the underlying system to report that it's done on its own and we'd just check it, instead of making us call CreateOrUpdate, just to ping it so that it looks at the filesystem and check if the operation is completed. I don't feel comfortable calling CreateOrUpdate inside Delete function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: Added tfcli.Client.DiscardOperation and replaced the call to CreateOrUpdate with that.

}()

select {
case <-ctx.Done():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A way to kill the ongoing process would be nice for deletion, too. Right now, if deletion call comes in and there is an ongoing apply operation, we just wait for it to complete instead of sending a kill signal to apply process.

Comment on lines +265 to +266
if c.wsPath != "" {
result = multierr.Combine(result, errors.Wrap(c.fs.RemoveAll(c.wsPath), errDestroyWorkspace))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if c.wsPath != "" {
result = multierr.Combine(result, errors.Wrap(c.fs.RemoveAll(c.wsPath), errDestroyWorkspace))
}
if c.wsPath != "" {
if err := c.fs.RemoveAll(c.wsPath); err != nil {
return errors.Wrap(err, errDestroyWorkspace)
}
}

It makes it much easier to skim through the code quickly if we use existing idiomatic way of error handling. In fact, I'd suggest not using multierr.Combine at all and go for subsequent if err := ..s if we need to ahve a series of steps.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are tearing down multiple resources and if any of them fails, we would still like to tear the rest down and report any failures to the caller.

- Terraform CLI async pipeline scheduler

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
- If there are no changes in the plan, do not try to parse it

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
- Do refractorings in tfcli package

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
- OperationType is used outside of error processing
  to track ongoing Terraform CLI pipelines. Thus,
  it makes sense to remove it outside of the
  tfcli/errors package.

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
- Reintroduce tfcli/errors.{IsApplying,IsDestroying} functions

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
- Remove Client.closeOnError
- Use afero.Fs
- Add async operation timeouts
- Fix common controller deletion logic

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
@ulucinar ulucinar force-pushed the aru/provider-wrapper branch from 76069f7 to 4121010 Compare September 7, 2021 15:54
Copy link
Member

@muvaf muvaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @ulucinar Thank you for your patience, bearing with the review process!

@ulucinar ulucinar merged commit edc26e5 into crossplane:main Sep 8, 2021
@ulucinar ulucinar deleted the aru/provider-wrapper branch September 8, 2021 08:03
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants