Conversation
Could you update this PR with a little more context around what is going on here? Some questions:
- Is this all freshly written code, or are parts of it copied from elsewhere?
- Are we considering integrating with Terraform at the provider gRPC API level (as we have done previously)? I thought we were going to investigate integrating at the terraform CLI level?
cmd/provider/main.go (outdated):
```go
func main() {
	conf := &ProviderConfig{}
	// initialize a cobra command for nats-monitoring
```
I don't think this is for nats-monitoring. :)
Resolved via removal (please see the comment below on the removal of the provider wrapper implementation).
Initially, I believe it makes sense to continue without the shared gRPC server and only with provider binary mirroring for production, so that we can safely invoke terraform init in synchronous contexts and do things in the documented ways Terraform expects. The shared gRPC server is certainly a hack that relies on undocumented features used for testing providers and for running the Terraform CLI on Windows.
But it looks like we have an option if we do hit certain scalability issues in the long run. The idea is still to run the Terraform CLI behind the scenes, but instead of letting the Terraform commands fork provider gRPC servers when they are about to make gRPC calls to the provider, we have (for now) an option to make them share a provider gRPC server.
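To make the binary-mirroring flow concrete, here is a minimal sketch of forking terraform init from Go; the workspace handling, timeout, and flags are assumptions for illustration, not the PR's actual implementation.

```go
package example

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runTerraformInit sketches invoking `terraform init` synchronously in a
// per-resource workspace, relying on the mirrored provider binary so that
// no registry lookup is needed at init time.
func runTerraformInit(ctx context.Context, workspaceDir string) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute) // arbitrary timeout for the sketch
	defer cancel()

	cmd := exec.CommandContext(ctx, "terraform", "init", "-input=false", "-no-color")
	cmd.Dir = workspaceDir
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("terraform init failed: %w: %s", err, string(out))
	}
	return nil
}
```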
cmd/provider/main.go (outdated):
```go
	if err := cmd.Execute(); err != nil {
		panic(err)
	}
}
```
We typically use https://github.com/alecthomas/kong (or its predecessor kingpin) for all Crossplane CLI parsing. I'd prefer not to introduce cobra unless there is a good reason to do so.
Resolved via removal. I have removed the provider wrapper implementation, which is for running a Terraform provider plugin as a shared gRPC server. I will also add some more details to the PR description about our current strategy for wrapping the Terraform CLI.
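For reference, a minimal kong-based sketch of what the CLI parsing could look like; the flag names, env bindings, and program description below are hypothetical.

```go
package main

import "github.com/alecthomas/kong"

// The flags and env bindings below are illustrative only.
var cli struct {
	Debug           bool   `help:"Enable debug logging."`
	ProviderSource  string `help:"Terraform provider source." env:"XP_TERRAFORM_PROVIDER_SOURCE"`
	ProviderVersion string `help:"Terraform provider version." env:"XP_TERRAFORM_PROVIDER_VERSION"`
}

func main() {
	ctx := kong.Parse(&cli,
		kong.Name("provider"),
		kong.Description("A Terraform-based Crossplane provider."))
	// Start the controller manager here; any startup error is reported via kong.
	ctx.FatalIfErrorf(run())
}

// run stands in for the provider's actual startup logic.
func run() error { return nil }
```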
pkg/log/core.go (outdated):
```go
limitations under the License.
*/

package log
```
Can we stick with https://github.com/crossplane/crossplane-runtime/tree/master/pkg/logging instead?
Done. I have removed the log package that I copied from the wesaas repo.
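For completeness, a minimal sketch of wiring crossplane-runtime's logging package in place of a custom log package; the logger name and messages are placeholders.

```go
package main

import (
	"github.com/crossplane/crossplane-runtime/pkg/logging"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Wrap a zap-backed logr.Logger in crossplane-runtime's logging.Logger
	// and pass it down to controllers instead of a bespoke log package.
	zl := zap.New(zap.UseDevMode(true))
	log := logging.NewLogrLogger(zl.WithName("provider-terraform"))
	log.Debug("Logger initialized")
	log.Info("Starting provider")
}
```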
I don't love the idea of tracking our operations using filesystem locks, because then we have to manage two locks (including Terraform's own), but that's not blocking. I'll open a separate issue to discuss.
pkg/tfcli/builders.go (outdated):
```go
func providerFromEnv() *withProvider {
	p := &withProvider{}
	if p.source == "" {
		p.source = os.Getenv(envTFProviderSource)
```
Not a strong opinion at all, but I always find it risky for libraries to rely on environment variables for anything. I'd expect the importers of this repository to handle the user input from wherever they want and give it to exposed structs explicitly.
Environment variables carry some undesired traits similar to global variables, but they also provide some convenience when you have layers of abstraction: you can control the behavior at a lower layer without having to flow configuration through intermediate layers. Certain Terraform CLI behavior is controlled exclusively via environment variables, and similarly it was convenient to introduce these two environment variables (XP_TERRAFORM_PROVIDER_SOURCE and XP_TERRAFORM_PROVIDER_VERSION) that, together with some Terraform configuration via the TF_CLI_CONFIG_FILE environment variable, control the plugin binary mirroring behavior.
I would suggest keeping them initially for convenience and development purposes, but what's important here is, as we have previously discussed, that we need the source and version info of the Terraform provider against which we have generated code to be available at runtime for the tfcli library and ultimately for the Terraform CLI.
I agree, I'm not against environment variables. It's just that when a library uses them, it forcefully imposes that specific set of configuration points on its consumers (provider binaries in this case), leaving them no way to pre-process the variables or opt out.
Unfortunately I do have a strong opinion about environment variables. 🤓 I'm all for using them, but I strongly prefer them to all be "loaded" from the environment in one predictable place - typically in a program's main() function. Having os.Getenv calls scattered around the codebase makes it difficult to keep track of the various environment variables (and more broadly, configuration sources) that a program uses. For libraries I feel it is best to take configuration as inputs and leave operational concerns like configuration, logging, etc. to the caller.
Done: removed tfcli configuration via environment variables.
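A small sketch of the pattern being asked for: the environment is read once in main() and the values are handed to the library explicitly; the config struct and the commented-out builder call are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// providerConfig is a hypothetical stand-in for whatever explicit inputs the
// tfcli package ends up exposing.
type providerConfig struct {
	Source  string
	Version string
}

func main() {
	// Read the environment in exactly one predictable place...
	cfg := providerConfig{
		Source:  os.Getenv("XP_TERRAFORM_PROVIDER_SOURCE"),
		Version: os.Getenv("XP_TERRAFORM_PROVIDER_VERSION"),
	}
	// ...and pass the values on explicitly, e.g.
	// tfcli.NewClient(tfcli.WithProvider(cfg.Source, cfg.Version)) (hypothetical API).
	fmt.Printf("provider %s@%s\n", cfg.Source, cfg.Version)
}
```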
pkg/tfcli/client.go (outdated):
```go
// returns true if state file exists
// TODO(aru): differentiate not-found error
func (c *client) loadStateFromWorkspace(errNoState bool) (bool, error) {
```
It looks like all usages of loadStateFromWorkspace already return c.state.tfState immediately. What do you think about making the function signature func (c *client) loadStateFromWorkspace() ([]byte, error) here and returning os.ErrNotExist (or a custom error that callers can check) in case the state doesn't exist?
Done.
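A standalone sketch of the revised behavior, returning the raw state bytes plus a checkable sentinel error; the state file location and error value are assumptions for the example.

```go
package example

import (
	"errors"
	"os"
	"path/filepath"
)

// errNoState lets callers distinguish "no state yet" from real read failures.
var errNoState = errors.New("terraform state file does not exist")

// loadState reads the workspace's state file and surfaces a checkable error
// when it is missing, instead of taking a boolean flag parameter.
func loadState(wsPath string) ([]byte, error) {
	buff, err := os.ReadFile(filepath.Join(wsPath, "terraform.tfstate"))
	if errors.Is(err, os.ErrNotExist) {
		return nil, errNoState
	}
	if err != nil {
		return nil, err
	}
	return buff, nil
}
```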
```go
func (c *client) Close(_ context.Context) error {
	var result error
	if c.pInfo != nil {
		result = multierr.Combine(errors.Wrap(c.pInfo.Kill(), errDestroyProcess))
```
After pInfo is initialized, we complete the reconciliation and come back again. If the operation is still continuing at that point, isn't c.pInfo nil? How do we save which process to kill in the subsequent reconciliations?
tfcli.Client can be running a single process at a given time; pInfo refers to that process if set. The common controller calls Close after an MR is deleted successfully. We don't aim to keep track of all Terraform CLI process forks done in the context of an MR via multiple Clients initialized with the same handle corresponding to the same workspace; the workspace state prevents that scenario anyway.
Because we rely on Terraform state locks, however, we do need a stale-lock cleanup mechanism, as we have been discussing. We will address that in a separate PR.
I'm not sure I got the relation with locks. What I meant is that, from my reading, c.pInfo will be non-nil only if the operation was started in this reconciliation. In the next one, i.e. the in-progress state, we initialize a new client but don't call the process.New function because a process is already in progress. Then when we trigger a deletion we don't close right away; we check its progress in the next reconciliation and close it only once it's completed. But then c.pInfo is nil because that's a new reconciliation where c.pInfo wasn't initialized. So Close(ctx) may not be cleaning up anything, because at line 258 c.pInfo will be nil all the time.
For async operations, we do not hold references to forked processes across reconciliation boundaries. Currently, we do not need to do that because we are observing the workspace state. We will block on deletion if there is an ongoing command.
```go
		return nil, "", cliErrors.NewPipelineInProgressError(cliErrors.PipelineNotStarted)
	}

	if err := c.checkTFStateLock(); err != nil {
```
c.initConfiguration returns whether we're in progress or not by looking at .xp.lock, and here we check .terraform.tfstate.lock.info. Do we expect a case where these do not match and we're unable to report an error in lower parts of the stack? I'm wondering whether we actually gain any error-case information by checking the TF state lock file beyond what the terraform execution itself would return.
The Terraform state lock (.terraform.tfstate.lock.info) is managed by the Terraform CLI and tells us that there is a Terraform command running in the workspace. .xp.lock (the Crossplane lock) is managed by tfcli.Client and tells us that there is an active operation whose result has not yet been reported to the common controller.
At this line, we have already asserted that there is an ongoing operation of the expected type whose result has not yet been collected by the tfcli consumer (the common controller). And now we are checking whether the Terraform command has completed or not.
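As a sketch, the check at this point amounts to looking for Terraform's own lock file in the workspace; the helper name and error text below are illustrative.

```go
package example

import (
	"fmt"
	"os"
	"path/filepath"
)

// tfStateLockFile is the lock file the Terraform CLI holds while a command
// is operating on the state, as referenced in the discussion above.
const tfStateLockFile = ".terraform.tfstate.lock.info"

// checkTFStateLock reports an error while a Terraform command is still
// running in the workspace, and nil once the lock file is gone.
func checkTFStateLock(wsPath string) error {
	_, err := os.Stat(filepath.Join(wsPath, tfStateLockFile))
	if err == nil {
		return fmt.Errorf("terraform command still running in workspace %s", wsPath)
	}
	if os.IsNotExist(err) {
		return nil
	}
	return err
}
```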
There are also some linter errors that we would need to resolve before moving on. Please check the output of make reviewable.
```go
	result.stderr = stderr

	if autoStart {
		return result, result.cmd.Start()
```
Should we aim to (re)use pi.Run() in the autoStart case as well, to unify the way we run the process? Or even better, could we completely remove the autoStart option and always expect the caller to run pi.Run()?
autoStart is a convenience option to automatically fork the process without blocking the call to process.New for async pipelines; thus, process.New is always non-blocking. Run expects a context.Context and waits for the completion of the forked process. So the flow is either process.New -> pi.Start (either via the autoStart option or via an explicit call to process.Info.Start) -> poll until completion for async pipelines, or process.New -> a blocking call to process.Info.Run.
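The distinction maps onto the standard library's exec.Cmd semantics (Start/Wait vs Run); the following is a minimal os/exec analogue of the two flows, not the actual process package API.

```go
package example

import (
	"context"
	"os/exec"
)

// startAsync mirrors the async flow: fork the process and return immediately.
// Completion is observed later (in this design, by polling the workspace
// rather than by holding on to the process handle across reconciliations).
func startAsync(ctx context.Context, name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.CommandContext(ctx, name, args...)
	return cmd, cmd.Start()
}

// runBlocking mirrors the sync flow: fork the process and wait for it to exit.
func runBlocking(ctx context.Context, name string, args ...string) error {
	return exec.CommandContext(ctx, name, args...).Run()
}
```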
```go
	}()

	select {
	case <-ctx.Done():
```
Can we signal the running process to do a graceful stop in this case?
Interesting point. When we consider the cleanup logic in a separate PR, we may consider sending a SIGINT.
A way to kill the ongoing process would be nice for deletion, too. Right now, if a deletion call comes in and there is an ongoing apply operation, we just wait for it to complete instead of sending a kill signal to the apply process.
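A sketch of what a graceful stop could look like, assuming a handle to the forked exec.Cmd is available; the grace period is arbitrary, and Windows would need different signal handling.

```go
package example

import (
	"os/exec"
	"syscall"
	"time"
)

// gracefulStop asks the running terraform process to stop with SIGINT, which
// Terraform handles by aborting the operation, and falls back to killing it
// after a grace period.
func gracefulStop(cmd *exec.Cmd, grace time.Duration) error {
	if cmd.Process == nil {
		return nil // never started
	}
	if err := cmd.Process.Signal(syscall.SIGINT); err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		return err
	case <-time.After(grace):
		return cmd.Process.Kill()
	}
}
```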
```go
	return pi.Run(ctx)
}

func (pi *Info) Start() error {
```
Similarly, can we drop Start() and always use Run? For example, Run(context.Background()) if no context is available.
I would prefer to have separate calls for the blocking process.Info.Run and the non-blocking process.Info.Start.
pkg/tfcli/init.go (outdated):
```go
func (c *client) checkOperation() error {
	xpStatePath := filepath.Join(c.wsPath, fileStateLock)
	// Terraform state lock file does not seem to contain operation type
	buff, err := ioutil.ReadFile(xpStatePath)
```
We had a couple of offline discussions around some possible alternatives to relying on filesystem locks (Terraform's and now ours), but I would love to revisit them without blocking this PR.
Yes, agreed. As discussed previously, I think we can experiment with some alternatives to see how reliable they are, by testing whether we violate any assumptions made by the Terraform CLI, and how scalable those alternatives would be in terms of memory, concurrency, etc.
I haven't reviewed thoroughly, but a few comments primarily around style and idiomatic Go. Will we add tests to this PR before it is merged?
Thanks Hasan, fixed those issues.
Thanks @negz. We are planning to merge this PR without automated tests. I've manually tested CRUD operations on top of the common controller. I've opened #39 to track test implementation.
Thanks @muvaf for the review.
pkg/conversion/cli.go (outdated):
```go
	if tferrors.IsApplying(err) {
		// then resource was deleted while an apply operation was in-progress
		// we need to wait/consume this final Apply operation
		_, err = t.CreateOrUpdate(ctx, tr)
```
I'm wondering if there is a way for the underlying system to report on its own that it's done, so that we'd just check it, instead of making us call CreateOrUpdate just to ping it so that it looks at the filesystem and checks whether the operation is completed. I don't feel comfortable calling CreateOrUpdate inside the Delete function.
Done: added tfcli.Client.DiscardOperation and replaced the call to CreateOrUpdate with it.
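A rough sketch of the resulting deletion flow; the interface and sentinel error below are stand-ins for the real tfcli and tferrors types, whose signatures are not shown in this excerpt.

```go
package example

import (
	"context"
	"errors"
)

// errApplying stands in for whatever error tferrors.IsApplying detects.
var errApplying = errors.New("an apply operation is in progress")

// client captures only the two calls relevant to this thread; the method set
// is assumed for the example.
type client interface {
	Destroy(ctx context.Context) error
	DiscardOperation(ctx context.Context) error
}

// deleteResource discards a pending apply instead of consuming it via a
// CreateOrUpdate call from inside the deletion path.
func deleteResource(ctx context.Context, c client) error {
	err := c.Destroy(ctx)
	if errors.Is(err, errApplying) {
		return c.DiscardOperation(ctx)
	}
	return err
}
```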
```go
	if c.wsPath != "" {
		result = multierr.Combine(result, errors.Wrap(c.fs.RemoveAll(c.wsPath), errDestroyWorkspace))
	}
```
Suggested change:
```go
if c.wsPath != "" {
	if err := c.fs.RemoveAll(c.wsPath); err != nil {
		return errors.Wrap(err, errDestroyWorkspace)
	}
}
```
It makes it much easier to skim through the code quickly if we use the existing idiomatic way of error handling. In fact, I'd suggest not using multierr.Combine at all, and going for sequential if err := ... checks if we need to have a series of steps.
Here we are tearing down multiple resources and if any of them fails, we would still like to tear the rest down and report any failures to the caller.
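A small sketch of that pattern with go.uber.org/multierr, where every teardown step runs and all failures are reported together; the two steps here are stand-ins.

```go
package example

import (
	"go.uber.org/multierr"
)

// closeAll attempts every teardown step even if earlier ones fail, and
// returns the combined error (nil only if every step succeeded).
func closeAll(killProcess, removeWorkspace func() error) error {
	var result error
	if killProcess != nil {
		result = multierr.Append(result, killProcess())
	}
	if removeWorkspace != nil {
		result = multierr.Append(result, removeWorkspace())
	}
	return result
}
```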
Commits in this PR (each signed off by Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>):
- Terraform CLI async pipeline scheduler
- If there are no changes in the plan, do not try to parse it
- Do refactorings in tfcli package
- OperationType is used outside of error processing to track ongoing Terraform CLI pipelines, thus it makes sense to move it out of the tfcli/errors package
- Reintroduce tfcli/errors.{IsApplying,IsDestroying} functions
- Remove Client.closeOnError; use afero.Fs; add async operation timeouts; fix common controller deletion logic
LGTM! @ulucinar Thank you for your patience in bearing with the review process!
https://github.com/upbound/purple-squad/issues/626
https://github.com/upbound/purple-squad/issues/625