# Developer Best Practices

## Contents

- Authentication
- Self Links, Names, and IDs
- Default values
- Tests
- Fine-grained resources
- Validation
- Failed Concurrent Updates and Eventual Consistency
  - Mutexes
  - Polling
  - Error Retry Predicates
  - Adding Sleeps (last resort)
## Authentication

It's recommended that you configure your setup to run with "user credentials" or "application default credentials" by default, running as your GCP user account. You can do that by omitting any other form of credential and by setting the `GOOGLE_USE_DEFAULT_CREDENTIALS=TRUE` environment variable, so that you can run tests locally with that account.

This means that during development you'll discover which APIs do not allow ADCs, and will be able to document as much on the relevant resources. User project overrides will handle most APIs correctly, and you can run as a service account again by configuring impersonation: set `GOOGLE_IMPERSONATE_SERVICE_ACCOUNT={{email}}` to a service account on which you have granted yourself `roles/iam.serviceAccountTokenCreator`.

Note: this role must be granted on the service account itself; it is not inherited from `roles/owner` on the project. You can grant it with the following snippet:
variable "service_account_email" {
type = string
}
variable "service_account_project" {
type = string
}
variable "user_email" {
type = string
}
resource "google_service_account_iam_member" "add-impersonation-permissions" {
service_account_id = "projects/${var.service_account_project}/serviceAccounts/${var.service_account_email}"
member = "user:${var.user_email}"
role = "roles/iam.serviceAccountTokenCreator"
}
You can verify you've authenticated as your user credentials through gcloud by finding `Authenticating using DefaultClient...` in your provider logs with `TF_LOG=DEBUG`.
## Self Links, Names, and IDs

Resources in GCP often reference other resources. For example, creating a `subnetwork` requires specifying the `network` it belongs to. There are three ways that this reference may occur:
- **ID**: A resource's ID should be something that can be used to reference this resource from a different one. For example, a `subnetwork` should be able to reference a `network` by specifying `network = google_compute_network.foo.id`. In past versions of the provider, we treated the ID as an implementation detail and preferred our users not depend on it. Starting in version 3.0.0, we have reversed that stance and now prefer that the ID field be in a format such that other resources can reference it. This also corresponds to the Resource names AIP.
- **name**: Newer GCP APIs tend to use the full Resource name in the `name` field. To make it easier on our users, TPG resources that correspond to these APIs should accept a short name (the resource ID as defined in the AIP) in the `name` field if that field is settable. Examples and tests should not have resources referencing the `name` field of a different resource; they should use the `id` instead (unless a resource only accepts a reference to the short name, which is rare but does occur).
- **self_link**: Many older GCP APIs only accept and return a short name in their `name` field, and return the full resource path (the name as defined in the AIP) in the `self_link` field. Referencing these resources either by `id` or `self_link` is acceptable, and should be chosen in examples/tests based on consistency with the rest of the example or test. For new resources, referencing the `id` field is preferred. Some resources that do not have a `self_link` field in their GCP API have one in their Terraform representation, such as `google_kms_crypto_key`. This is legacy behavior and should be avoided for new resources.
## Default values

We have three options for how to handle fields that have default values that are returned by the API:
- **Set a default value in the schema**: This allows users to see in their `terraform plan` exactly what will be created, and thus is the preferred option when possible. As of now, lists, sets, and nested objects cannot use schema-side defaults (though primitive fields within nested objects can). Using this does mean that our resource code gets out-of-date when API-side defaults change, but it also means that we get to control when they change on the resource.
- **Set the field as Optional+Computed**: This is handy when a field, if unset, will have a value returned by the API, but we don't know what that value is. This option is also necessary for data types that can't use schema-side defaults.
- **Make the field required**: If setting a default for a field would mean that it is possible for a user to set an empty block, we should instead make the field required, even if it isn't required by the API.
Some resources have fields that, when set to their default value, are not returned by the API. If a field has a default that is set in the schema, but the API does not return it when set to that value, make sure it gets set in state by using the `default_if_empty` flattener for generated resources (and something similar for hand-written ones).

Setting a list of objects as Optional+Computed means that if the object is not set in a user's config, it will be seen as "Computed" and not show a diff. This means that for an Optional+Computed list of objects, it is impossible for a user to explicitly state that the list should be empty. Setting `schema.SchemaConfigModeAttr` on the list makes this possible, but is considered legacy behavior that should be avoided when possible.
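As a sketch, the three options above might look like the following in a resource schema; the field names here are hypothetical, and this assumes the v2 plugin SDK:

```go
package google

import "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"

// exampleDefaultsSchema shows the three options on hypothetical fields.
func exampleDefaultsSchema() map[string]*schema.Schema {
	return map[string]*schema.Schema{
		// Schema-side default: the value shows up in `terraform plan`.
		"ttl": {
			Type:     schema.TypeInt,
			Optional: true,
			Default:  3600,
		},
		// Optional+Computed: the API fills in a value we can't predict.
		"fingerprint": {
			Type:     schema.TypeString,
			Optional: true,
			Computed: true,
		},
		// Required even though the API would default it, to avoid an
		// effectively-empty block.
		"name": {
			Type:     schema.TypeString,
			Required: true,
		},
	}
}
```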
## Tests

All resources must have acceptance tests. One common pattern is as follows:
- `TestAcc[ResourceName]_basic`: This test should be the simplest possible test for this resource, and will generally set only required fields on the resource.
- `TestAcc[ResourceName]_full`: This test should set as many fields as possible on the resource.
- `TestAcc[ResourceName]_update`: This test will have a `TestStep` to create the resource and another to update it.
Every field that can be set on the resource should be set in at least one test, and every field that can be updated on the resource should be updated in at least one test. A single test may test setting/updating many fields; this approach is recommended to cut down on total test time. Because APIs have different behavior for setting empty values (whether it means to not update the field or to update it to empty), it is helpful to test that fields can be updated to their zero value.
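A minimal sketch of the update pattern, assuming the provider's conventional `testAccPreCheck`/`testAccProviders` helpers and a hypothetical resource with config helpers:

```go
package google

import (
	"testing"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// Hypothetical resource and config helpers, following the naming pattern above.
func TestAccExampleWidget_update(t *testing.T) {
	resource.Test(t, resource.TestCase{
		PreCheck:  func() { testAccPreCheck(t) },
		Providers: testAccProviders,
		Steps: []resource.TestStep{
			{
				// Create with only required fields set.
				Config: testAccExampleWidget_basic("tf-test-widget"),
			},
			{
				// Update as many fields as possible, including updates to zero values.
				Config: testAccExampleWidget_full("tf-test-widget"),
			},
		},
	})
}
```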
Resources that are generated by Magic Modules will have tests that match the examples in the documentation. Because of this, it is also useful to have tests that correspond to common usage patterns of the resource.
Nearly every resource in the provider supports import, and so we can check that a resource was successfully created using `ImportState` and `ImportStateVerify` checks. These tests work in the following manner:
- In the `Config` step, a resource is created and its representation is written to state.
- In the `ImportState` step, `terraform import` is run (using the resource's `id` attribute as the import id), and that result is written to a separate state.
- `ImportStateVerify` checks that the two states are exactly the same.
Import-style tests mean that we don't have to check attributes individually on the resource, nor do we need custom logic to call an API and determine whether a resource exists. This manner of testing is expected for all resources except in exceptional circumstances.
Not all attributes on a resource are returned from a resource's GET call, and thus can't be imported. For example, secrets are often returned as empty, and we sometimes add virtual fields to a resource that are not part of its API to control behavior. In these circumstances, the field can be ignored in the test using `ImportStateVerifyIgnore`. This is also useful for fields that have `DiffSuppressFunc`s that can't be worked around by providing a different value in the test.
Not all resources have an `id` that can be used as the import id. In these cases, it is best to attempt to make it possible to import the resource using its `id`, but if that doesn't make sense or makes for a bad UX, the import id can be specified in the test using `ImportStateId` or `ImportStateIdPrefix`. These fields are also useful for testing that a resource can be imported with multiple different formats.
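Putting the import pieces together, a sketch of an import-style test for a hypothetical resource might look like this:

```go
package google

import (
	"testing"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// Hypothetical resource: the import step re-reads the created resource and
// compares the imported state against the state written by the Config step.
func TestAccExampleWidget_import(t *testing.T) {
	resource.Test(t, resource.TestCase{
		PreCheck:  func() { testAccPreCheck(t) },
		Providers: testAccProviders,
		Steps: []resource.TestStep{
			{
				Config: testAccExampleWidget_basic("tf-test-widget"),
			},
			{
				ResourceName:      "google_example_widget.default",
				ImportState:       true,
				ImportStateVerify: true,
				// Fields not returned by the GET API can't be verified.
				ImportStateVerifyIgnore: []string{"secret_value"},
			},
		},
	})
}
```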
Computed fields will always match between the two states that are compared, so import tests don't check that the value is set to an expected value (or set, period). To test computed fields, use the `resource.TestCheckResourceAttr` family of checks.
Certain circumstances (panics, failed deletes, etc.) will leave behind dangling resources. These are bad: they often cost money to keep present, and they often lead to quota failures in future test runs. Sweepers run before and after every test suite run in CI, and are used to delete all resources of a given type from the test project. Sweepers can be added in the tests for a given resource by adding an `init()` function that contains a call to `resource.AddTestSweepers(...)`. Sweepers should be added when possible.
Sweepers are run using the `-sweep` and `-sweep-run` `TESTARGS` flags:

```sh
make testacc TEST=./google/sweeper TESTARGS='-sweep=us-central1 -sweep-run=<sweeper-name-here>'
```
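A minimal sketch of registering a sweeper for a hypothetical resource; the actual listing and deletion logic is elided:

```go
package google

import (
	"log"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

func init() {
	resource.AddTestSweepers("ExampleWidget", &resource.Sweeper{
		Name: "ExampleWidget",
		F:    testSweepExampleWidget,
	})
}

// testSweepExampleWidget deletes all test-created widgets in the given region.
func testSweepExampleWidget(region string) error {
	log.Printf("[INFO] sweeping ExampleWidget in %s", region)
	// List widgets in the test project via the API and delete anything that
	// looks test-created (e.g. a "tf-test" name prefix).
	return nil
}
```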
For the majority of our resources, the acceptance tests are sufficient for testing. However, some resources use utility functions or have more complicated business logic (such as `CustomizeDiff`s). These functions should be unit tested.
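For example, a unit test for a small helper can be table-driven; the helper here is a hypothetical stand-in for the kind of util worth testing:

```go
package google

import (
	"strings"
	"testing"
)

// shortNameFromSelfLink is a hypothetical util: it extracts the trailing
// segment of a self link.
func shortNameFromSelfLink(s string) string {
	parts := strings.Split(s, "/")
	return parts[len(parts)-1]
}

func TestShortNameFromSelfLink(t *testing.T) {
	cases := map[string]string{
		"projects/p/global/networks/my-net": "my-net",
		"my-net":                            "my-net",
	}
	for input, want := range cases {
		if got := shortNameFromSelfLink(input); got != want {
			t.Errorf("shortNameFromSelfLink(%q) = %q, want %q", input, got, want)
		}
	}
}
```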
## Fine-grained resources

Fine-grained resources are Terraform resources that are nested objects in other resources in their API representation. Generally, fine-grained resources are created because of user demand, though there are some patterns that make it likely that a resource should be fine-grained:
- `add[Field]`/`delete[Field]` methods on the parent resource
- Lists of objects that should not be authoritative (for example, having multiple `google_project_service` resources instead of a single `google_project_services` resource, because entries can be added to the list of enabled APIs out-of-band)
- Scenarios where a part of a resource might be modified by a different team than the team that created the resource
In general, whether or not to make a separate fine-grained resource for a nested object is subjective.
## Validation

There are two places where a user's config might be validated by the provider: plan-time and apply-time.
Plan-time validation is accomplished through various fields on the field schema, such as `ValidateFunc`, `[Min|Max]Items`, `ConflictsWith`, `AtLeastOneOf`, and `ExactlyOneOf`, as well as `CustomizeDiff` on the resource as a whole. These functions give our users confidence that their plans will succeed, and thus are generally good to have when possible.
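A sketch of these knobs on some hypothetical fields, assuming the v2 plugin SDK's `helper/validation` package:

```go
package google

import (
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/validation"
)

// exampleValidatedSchema shows common plan-time validation on hypothetical fields.
func exampleValidatedSchema() map[string]*schema.Schema {
	return map[string]*schema.Schema{
		// Enum validation; may need relaxing when the API adds values.
		"tier": {
			Type:         schema.TypeString,
			Optional:     true,
			ValidateFunc: validation.StringInSlice([]string{"BASIC", "STANDARD"}, false),
		},
		// List-length validation.
		"zones": {
			Type:     schema.TypeList,
			Optional: true,
			MaxItems: 8,
			Elem:     &schema.Schema{Type: schema.TypeString},
		},
		// Exactly one of the two references must be set.
		"ip_range": {
			Type:         schema.TypeString,
			Optional:     true,
			ExactlyOneOf: []string{"ip_range", "subnetwork"},
		},
		"subnetwork": {
			Type:         schema.TypeString,
			Optional:     true,
			ExactlyOneOf: []string{"ip_range", "subnetwork"},
		},
	}
}
```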
However, plan-time validation can sometimes bite our users, such as in cases where the API accepts new values that it didn't before (increasing the max length of a string, adding a new enum value, increasing the number of elements in a list, etc.). In these cases, users have to wait for new versions of the provider to be released that relax these validations, even though their configs should succeed. When adding validations, some judgment should be applied to decide whether that validation is likely to need to change in the future.
Apply-time validation happens as business logic inside of a resource's CRUD methods. Because a bad config value would likely cause a 400 level error from the API anyway, these rarely add value and should generally be avoided. They should only be added when a plan-time validation is somehow impossible and the API provides an error message that is confusing or nonexistent.
## Failed Concurrent Updates and Eventual Consistency

Terraform uses a dependency graph to avoid creating parent and child resources at the same time. However, sibling relationships and parent resources with transient states typically aren't covered by this. In addition, some resources are eventually consistent: creation or update succeeds, but the resource isn't ready, causing subsequent updates or dependent resource creation to fail.
This tends to appear for customers as some combination of the following:

- An initial apply fails, but a second apply immediately afterward succeeds
- 404 or 403 errors immediately after creation (and, less commonly, update) succeeds
- Errors with "operation in progress" messages involving the parent resource
### Mutexes

Mutexes should be used when a parent resource does not support concurrent edits, or when several sibling resources cannot be added at the same time.

We use a mutex that is created globally across the Terraform instance. The developer for a resource specifies a key format that is shared across connected resources to prevent them from being edited (created/updated/deleted) at the same time.
In MMv1, this is done by adding a `mutex` property to the Terraform resource override:

```yaml
NetworkEndpoint: !ruby/object:Overrides::Terraform::ResourceOverride
  mutex: 'networkEndpoint/{{project}}/{{zone}}/{{network_endpoint_group}}'
```
In Terraform (either generated or handwritten), this appears as a block in Create/Update/Delete before any API requests:

```go
lockName, err := replaceVars(d, config, "networkEndpoint/{{project}}/{{zone}}/{{network_endpoint_group}}")
if err != nil {
	return err
}
mutexKV.Lock(lockName)
defer mutexKV.Unlock(lockName)
```
Example resources that use mutexes:

- Container cluster and child node pool resources
- Networks and child routes/network peerings
- BigQuery datasets and accesses
### Polling

Polling should be used when the read API state does not match the expected state right after an API call succeeds. For example:

- An edit succeeded, but the resource is not ready, typically shown through a status like `"Ready:False"` or `"UPDATING"`
- Reading a resource returns a 404 right after creation returns, due to cached or eventually consistent reads
In Terraform, we add a call to the `PollingWaitTime` util after the initial API call succeeds. This takes as arguments:

- A `PollReadFunc`, a function that handles reading the resource state. This should essentially make the same API call as Read(), but skip the state setting.
- A `PollCheckResponseFunc`, a function to check the resource state (obtained via the `PollReadFunc`) and determine whether we should stop polling with success, stop polling with an error, or continue polling. A sketch of one is shown below.
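A minimal sketch of a custom check function, assuming the `PollResult` helpers in `common_polling.go` (`ErrorPollResult`, `PendingStatusPollResult`, `SuccessPollResult`); the resource name and the `status` field checked here are illustrative:

```go
package google

// PollCheckWidgetReady is a hypothetical PollCheckResponseFunc: it inspects
// the polled read response and decides whether to stop or continue polling.
func PollCheckWidgetReady(resp map[string]interface{}, respErr error) PollResult {
	if respErr != nil {
		// Stop polling and surface the error.
		return ErrorPollResult(respErr)
	}
	if status, ok := resp["status"].(string); ok && status == "UPDATING" {
		// The resource exists but isn't ready yet; continue polling.
		return PendingStatusPollResult(status)
	}
	// The resource is ready; stop polling with success.
	return SuccessPollResult()
}
```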
In MMv1, using a `Provider::Terraform::PollAsync` async handles adding the necessary code (autogenerating the `PollReadFunc` and adding the call to `PollingWaitTime` in the appropriate place). Most of the work is in defining a `PollCheckResponseFunc` to use. For example, for the PubSub topic:
```yaml
Topic: !ruby/object:Overrides::Terraform::ResourceOverride
  async: !ruby/object:Provider::Terraform::PollAsync
    check_response_func: PollCheckForExistence
    actions: ['create']
    operation: !ruby/object:Api::Async::Operation
      timeouts: !ruby/object:Api::Timeouts
        insert_minutes: 6
        update_minutes: 6
        delete_minutes: 4
```
This generates the appropriate code in Terraform and uses the common util `PollCheckForExistence` to check whether the response is a 404.
Polling utils belong in `common_polling.go`.

Example resources that use polling:

- Cloud Run (custom definitions in `cloudrun_polling.go`)
- PubSub
### Error Retry Predicates

A more generic approach to transient errors in Terraform is request retries, used both to handle and retry temporary errors and to wait out unpredictable states.

`retry_utils.go` contains the various retry handler utils. Typically, for eventually consistent resources, we create a custom error retry predicate function to determine whether a given error is retryable, and pass it as an additional retry predicate to `retryTimeDuration(...)` or `sendRequest(...)`.
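A minimal sketch of such a predicate, assuming the `(retryable bool, reason string)` shape used by the predicates in `error_retry_predicates.go`; the hypothetical widget 404 case is illustrative:

```go
package google

import (
	"google.golang.org/api/googleapi"
)

// isWidgetNotReadyError is a hypothetical predicate: retry 404s that occur
// because a freshly created widget isn't visible to reads yet.
func isWidgetNotReadyError(err error) (bool, string) {
	if gerr, ok := err.(*googleapi.Error); ok && gerr.Code == 404 {
		return true, "widget is not yet visible to reads, retrying"
	}
	return false, ""
}
```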
Error retry predicates can be added less manually via:

- `error_retry_predicates` in Terraform resource overrides for MMv1
- the default predicates for all requests (less likely for eventually consistent resources)
If the resource/API is mostly handwritten, you can also add it to the retry transport used in the API client, as defined in `config.go`.
Examples:

- A predicate added to the BigQuery client's `RetryTransport`
- A check against the PubSub topic parent project
### Adding Sleeps (last resort)

The final, non-ideal way of handling eventual consistency is by simply adding sleeps to specific resources. In some cases, we don't want to poll or can't actually poll for the appropriate state. Some examples:

- Attempting to create a child resource right after the project was created. This manifests as several types of errors, some of which cannot be retried, and we can't poll on the expected child-resource behavior from the parent (project) resource.
- An IAM binding hasn't propagated, causing 403s that are impossible to differentiate from real 403 permission errors. We don't want to wait for the whole create/update timeout on 403s, both because of quota and because we should return early if the user needs to add a permission.
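A minimal sketch, with a hypothetical resource and an illustrative duration, is just an unconditional wait in the CRUD method:

```go
package google

import (
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Hypothetical create method. IAM propagation exposes no status we can poll,
// so wait a fixed interval before the first read or dependent creates.
func resourceExampleWidgetCreate(d *schema.ResourceData, meta interface{}) error {
	// ... issue the create request and set the resource ID ...

	// Last resort: sleep to let the new resource/binding propagate.
	time.Sleep(30 * time.Second)

	// ... proceed to the usual read-after-create ...
	return nil
}
```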