parallel init failure when using plugin cache #25849

Closed
wyardley opened this issue Aug 13, 2020 · 11 comments

wyardley commented Aug 13, 2020

I'm seeing an issue when running validation in CI. We have a step that initializes all the states in parallel (we also parallelize in CircleCI using their built-in parallelism, which is what generates the contents of /tmp/tests-to-run).

Terraform Version

Terraform v0.13.0

Terraform Configuration Files

Most of the providers are configured something like this:

provider "google" {
  project = var.project
  version = "3.33.0"
}

provider "google-beta" {
  project = var.project
  version = "3.33.0"
}

Debug Output

Here's what I think is the relevant debug-level output; if you end up needing a trace, I can work out a way to get that to you.

2020/08/13 20:53:10 [INFO] Terraform version: 0.13.0  
2020/08/13 20:53:10 [INFO] Go runtime version: go1.14.2
2020/08/13 20:53:10 [INFO] CLI args: []string{"/bin/terraform", "validate"}
2020/08/13 20:53:10 [DEBUG] Attempting to open CLI config file: /root/.terraformrc
2020/08/13 20:53:10 [DEBUG] File doesn't exist, but doesn't need to. Ignoring.
2020/08/13 20:53:10 [DEBUG] ignoring non-existing provider search directory terraform.d/plugins
2020/08/13 20:53:10 [DEBUG] ignoring non-existing provider search directory /root/.terraform.d/plugins
2020/08/13 20:53:10 [DEBUG] will search for provider plugins in /root/.local/share/terraform/plugins
2020/08/13 20:53:10 [DEBUG] ignoring non-existing provider search directory /usr/local/share/terraform/plugins
2020/08/13 20:53:10 [DEBUG] ignoring non-existing provider search directory /usr/share/terraform/plugins
2020/08/13 20:53:10 [INFO] CLI command args: []string{"validate"}
2020/08/13 20:53:10 [WARN] Log levels other than TRACE are currently unreliable, and are supported only for backward compatibility.
  Use TF_LOG=TRACE to see Terraform's internal logs.
  ----
2020/08/13 20:53:11 [WARN] Failed to determine selected providers: failed to recall provider packages selected by earlier 'terraform init': some providers could not be installed:
- registry.terraform.io/hashicorp/google-beta: checksum mismatch for v3.34.0 package
2020/08/13 20:53:11 [DEBUG] checking for provisioner in "."
2020/08/13 20:53:11 [DEBUG] checking for provisioner in "/bin"
2020/08/13 20:53:11 [INFO] Failed to read plugin lock file .terraform/plugins/linux_amd64/lock.json: open .terraform/plugins/linux_amd64/lock.json: no such file or directory

Expected Behavior

Terraform should have initialized all the states.

Actual Behavior

Terraform gives the error below for one or two of the states:

Error: Could not load plugin


Plugin reinitialization required. Please run "terraform init".

Plugins are external binaries that Terraform uses to access and manipulate
resources. The configuration provided requires plugins which can't be located,
don't satisfy the version constraints, or are otherwise incompatible.

Terraform automatically discovers provider requirements from your
configuration, including providers used in child modules. To see the
requirements and constraints, run "terraform providers".

Failed to instantiate provider "registry.terraform.io/hashicorp/google-beta"
to obtain schema: unknown provider
"registry.terraform.io/hashicorp/google-beta"

Steps to Reproduce

  1. Enable the plugin cache (TF_PLUGIN_CACHE_DIR; see Additional Context below).
  2. Run something like the following to initialize many states in parallel:

xargs -i -n 1 -P 8 sh -c 'cd "{}" && terraform init -backend=false -input=false' < /tmp/tests-to-run

This works without the plugin cache, and works for most steps with it; however, there's typically at least one failure on every run with the plugin cache enabled. I'm thinking this could be some kind of race condition (or locking issue, see below). Removing the -P 4 / -P 8 argument from the xargs call (dropping the parallelism to 1) makes it work consistently.

Is there a way to make this work or do it safely? Or is this a bug? Or is it just not supported to run init in parallel this way when using the cache?
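
(Not something the pipeline above does, just a sketch: on a single executor you could keep the parallel layout but let only one init write to the plugin cache at a time by wrapping it in an exclusive lock with util-linux flock. The lock-file path below is arbitrary, and since the command only runs init, this effectively serializes the inits on that executor, much like dropping -P.)

xargs -i -n 1 -P 8 sh -c 'cd "{}" && flock /tmp/tf-init.lock terraform init -backend=false -input=false' < /tmp/tests-to-run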

Additional Context

This is in CircleCI, and the following env vars are set:

        CHECKPOINT_DISABLE: "true"
        TF_IN_AUTOMATION: "true"
        TF_PLUGIN_CACHE_DIR: "/tmp/.terraform.d/plugin-cache"

I haven't gone back to see whether this affects 0.12 as well.

@danieldreier (Contributor)

This is an interesting case! The purpose of the plugin cache is to avoid having to re-download providers with every run. It's not designed to deal with concurrency like this, and I'm not surprised it's failing.

The first workaround that comes to mind would be to run a different CI stage ahead of time that just does an init, to prime the cache with all the provider downloads, and then run all the others in parallel, maybe without an init.

I'll leave this open for now for an engineer to comment more authoritatively on whether this behavior is expected, but I think that this is a case of not being designed for concurrent use of the plugin cache. If you can't find a good workaround, we can relabel this as an enhancement request, but I don't think this constitutes a bug based on my initial assessment.
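
A rough sketch of that pre-warm suggestion (the warm-up directory and its config are hypothetical, and TF_PLUGIN_CACHE_DIR is assumed to be set as shown under Additional Context):

# Stage 1 (serial): one init against a config that declares the union of providers,
# so the plugin cache is populated before any parallel work starts.
(cd ci/provider-warmup && terraform init -backend=false -input=false)

# Stage 2 (parallel): per-state inits now mostly link out of the warm cache instead of
# racing to download into it. Any provider missing from the warm-up config would still
# be fetched concurrently, so this narrows the race rather than removing it entirely.
xargs -i -n 1 -P 8 sh -c 'cd "{}" && terraform init -backend=false -input=false' < /tmp/tests-to-run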

@wyardley (Author)

> The first workaround that comes to mind would be to run a different CI stage ahead of time that just does an init, to prime the cache with all the provider downloads, and then run all the others in parallel, maybe without an init.

I actually thought about that, but I guess the potential issue would be if a couple of states had additional providers that weren't set up / cached in that first init. In our case, the typical providers we use are fairly consistent between the states, so doing something like that might be within the realm of possibility.

Another thing I thought about was caching the provider cache directory, except that I think saving / restoring the cache would possibly take longer than simply downloading fresh every time.

Interestingly, if I log in via SSH and blow away the cache and the .terraform directories, I'm sometimes able to get a successful run, but not all the time.

I'd absolutely appreciate any other suggestions for quickly and safely initializing lots of states concurrently.

@wyardley (Author)

PS: initializing the states serially with the cache is almost, but not quite, as fast as doing them in parallel without it.

@danieldreier (Contributor)

Can you have a per-state provider cache? You said you're using CircleCI; I wonder if you can use their dependency caching on a per-state basis to avoid re-downloading providers and not deal with locking at all.

@wyardley (Author)

> Can you have a per-state provider cache? You said you're using CircleCI; I wonder if you can use their dependency caching on a per-state basis to avoid re-downloading providers and not deal with locking at all.

Yes, theoretically, but I wouldn't want to have to configure a separate cache for each state, and restoring that many different caches could be pretty slow.

But the other thing is that restoring / saving the cache can sometimes take longer than just re-downloading. A single cache would probably be feasible, but then the problem is deciding which of the configs to bust that cache on (normally I'd bust the cache based on the file where the provider version is declared, but in this case that could be many different files).

Anyway, I think I've got a few ideas that may help, but it would be great if you could leave this open a little longer in case someone else has an idea.
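
One sketch of a way around the cache-key problem just described (the file names are assumptions about where versions are pinned): derive a single key from every file that declares a provider version, so the cache busts whenever any of them changes.

# Concatenate all version-pinning files in a stable order and hash the result;
# the hash can then serve as the CircleCI cache key.
find . -name 'versions.tf' -o -name 'providers.tf' | sort | xargs cat | sha256sum | cut -d' ' -f1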

@wyardley (Author)

One other weird thing is that I think all the inits are exiting 0, but the failure I'm getting actually seems to come from the terraform validate step.

@danieldreier (Contributor)

Pinging @apparentlymart, who will probably have something interesting to say about this when he's back from time off.

@wyardley (Author)

Yeah... caching won't work because there's no equivalent of a lockfile that changes [in git] when one or more of the provider versions change. So you can save / restore the cache, but there's no great way to bust it that I can see so far.

That said, I was able to get it to go relatively quickly by increasing the CircleCI-level parallelism from 3 to 4 and having the xargs command skip it.

@mildwonkey (Contributor)

Unfortunately I'm not surprised this doesn't work; it's not designed to run concurrently. We can label this as an enhancement request.

We do have a providers mirror command (which I've just noticed is not properly linked to in our documentation - I'll get that fixed!) that can be used to pre-download providers. You can use that in conjunction with an explicit installation directory configuration.

It's also not concurrency-safe, but it might be a faster method of pre-populating a plugin cache.
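
A rough sketch of how those two pieces could fit together (the mirror path and the CLI config location are assumptions, not something from this thread):

# Serial pre-download stage: run from a directory whose configuration declares the
# required providers; provider packages land in the local mirror directory.
terraform providers mirror /tmp/terraform-mirror

# CLI configuration (e.g. ~/.terraformrc) telling terraform init to install only from
# that local mirror instead of reaching out to the registry:
provider_installation {
  filesystem_mirror {
    path    = "/tmp/terraform-mirror"
    include = ["registry.terraform.io/*/*"]
  }
  direct {
    exclude = ["registry.terraform.io/*/*"]
  }
}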

@wyardley (Author)

Thanks @mildwonkey!
Yeah, I'll check that out as well. I've played around with it recently for a different use case, so I'm somewhat familiar with the layout.

ghost commented Sep 14, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
