Classifying normalization vs. drift #70
This may be related to #34 / #64, or an abstraction on top of them? Perhaps we could have `ReadModifiers` similar to `PlanModifiers`? We could also try to combine the two concepts into one, though I'm nervous about how the presence/absence of a config/plan (present in `PlanResourceChange`, absent in `ReadResource`) may impact the leakiness of that abstraction.
I guess my hypothesis here was that the most robust thing would be for the framework to offer something that makes consistency between `PlanResourceChange` and `ReadResource` the default. However, I do admit that it's hard to back that up with real data since we've never had an SDK which handled this consistently, and so we only have the current not-quite-right case and thus the grass naturally seems greener on the other side. There might be some hidden use-cases for intentionally making these divergent which are currently not visible because they "just work" without any special effort in the current SDK. If we were designing the old SDK's `DiffSuppressFunc` again today, perhaps it would look different.

With that said, I can at least try to frame the problem more concretely in old SDK terms, so we have something specific to compare with: for the old SDK design, I'd propose to have the `DiffSuppressFunc` result honored during `ReadResource` as well as during `PlanResourceChange`, so that both steps prefer the prior value when the two are equivalent.

I'm not sure how to generalize that to the broader concern of "plan customization" because there are various other use-cases for plan customization that aren't related to normalization/drift classification and which, as you point out, would make sense to only run during the planning step and not during the refreshing step. For example, the classic `ForceNew` (requires replacement) behavior is only meaningful when producing a plan.
I ended up thinking about this some more today because I encountered another situation where a normalization change ended up propagating downstream and generating a confusing result. I've felt bad about just showing up here with a problem statement and not making any particular suggestions on what to do about it, so I had a look again at the framework design so far (looks like it's coming together well!) and noticed some hooks that seem promising to me as parts of a potential solution to the problem I've raised here, and wanted to note them. Perhaps you all already had this in mind when you designed these things and I'm just telling you things you already know 😀 but can't hurt to write it out here anyway, for future reference.
```go
modifiedPlan, err := tftypes.Transform(plan, markComputedNilsAsUnknown(ctx, resourceSchema))
if err != nil {
	resp.Diagnostics = append(resp.Diagnostics, &tfprotov6.Diagnostic{
		Severity: tfprotov6.DiagnosticSeverityError,
		Summary:  "Error modifying plan",
		Detail:   "There was an unexpected error updating the plan. This is always a problem with the provider. Please report the following to the provider developer:\n\n" + err.Error(),
	})
	return resp, nil
}
```
This got me thinking about a possible additional transform step which takes an object representing the prior state along with the resource type schema. For each leaf attribute it visits, it could perhaps raise the two `tftypes` values (the value that was given to the transform and the corresponding value from the prior state) into `schema.Value`, call `Value.Equal` with them, and then use the result to decide between returning the prior value (if `Equal` returns `true`) or the new value (if it returns `false`).
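A minimal sketch of that kind of transform callback, in terms of `terraform-plugin-go`'s `tftypes` package; the `preservePrior` name and the caller-supplied `equivalent` comparison are hypothetical, standing in for whatever equivalence logic the framework or provider would supply:

```go
package example

import (
	"github.com/hashicorp/terraform-plugin-go/tftypes"
)

// preservePrior returns a tftypes.Transform callback that, for each value
// it visits, looks up the corresponding value in the prior state and keeps
// the prior value whenever the supplied "equivalent" comparison reports
// the two values as functionally equal.
func preservePrior(prior tftypes.Value, equivalent func(prior, proposed tftypes.Value) bool) func(*tftypes.AttributePath, tftypes.Value) (tftypes.Value, error) {
	return func(path *tftypes.AttributePath, proposed tftypes.Value) (tftypes.Value, error) {
		raw, _, err := tftypes.WalkAttributePath(prior, path)
		if err != nil {
			// No corresponding prior value (e.g. a newly added element),
			// so keep the proposed value.
			return proposed, nil
		}

		priorVal, ok := raw.(tftypes.Value)
		if !ok {
			return proposed, nil
		}

		if equivalent(priorVal, proposed) {
			// Normalization only: preserve the prior value so Terraform
			// Core sees no change.
			return priorVal, nil
		}

		// Materially different: report the new value as drift.
		return proposed, nil
	}
}
```

This could then be applied along the lines of `newState, err := tftypes.Transform(readResult, preservePrior(priorState, equivalent))`.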
To get the effect I was hoping for when I opened this, I think we'd need to then call this function at least in both `PlanResourceChange` and in `ReadResource`, so that both of them will always prefer to preserve an existing value rather than switching to a functionally-equivalent value that Terraform Core would not consider as equal.
I also wonder about doing it in `ApplyResourceChange` in order to keep the planned value if the final value is functionally equivalent to it, for consistency's sake, though that would have a different benefit than the one I opened this issue about: it would help avoid provider developers getting tripped up by the "Provider produced inconsistent result after apply" safety check, by automatically preserving exactly the value that was in the plan as long as the final result was functionally equivalent to it.
I don't have nearly enough context to consider any of this to be a proper design proposal... more just some initial ideas to hopefully better explain the sort of thing I was imagining when I opened this issue. I hope it's useful in some way! Please let me know if anything here seems surprising or weird; I'm still very early in learning my way around the framework so I wouldn't be surprised or offended to hear that I misunderstood something in this initial exploration of the problem.
With that said, if this looks like a promising direction and you'd be open to outside input then I'd be happy to try to do some more of the legwork to turn this into a more proper design proposal, though I would need to wait a little while to first work through some other tasks since our team is stretched a bit thin right now.
To satisfy my curiosity I made an attempt at implementing something like what I described in the previous comment, over in #178. For the moment I'm offering that mainly just as something to play with to see the effect of the design I was considering. If it seems promising then I'd be happy to produce a more complete design proposal before getting into the weeds of how exactly I implemented it, since I expect I didn't do it in the best way due to my unfamiliarity with this codebase.
(off-topic, but I feel like it's very well deserved) beautiful and detailed write-up 💯 kudos
I've found 4 places where normalization/equivalence vs. drift/meaningful changes affects the AWS provider. As for impact, we have approximately 11 open issues with a combined 400 upvotes that relate to the problem.

**1. Create/Config Update**

This is the situation that is handled with `DiffSuppressFunc`.

**2. Read/Set/Refresh - Avoiding Perpetual Diffs**

This situation is not well handled currently, but can be intercepted to check equivalence before calling `d.Set()` (see the sketch after this list).

**3. Read/Set - Actual Drift Detection**

This is also not handled well currently. Basically, if there is a legitimate out-of-band change, a practitioner wants to see the minimal diff. However, that's more complicated than equivalency, where we can ignore order. When the data store (e.g., AWS) gives us non-equivalent JSON data, its internal arrays can be in any random order. Here's a scenario: assume we're working with JSON data including an array of 1000 IAM entities with some permissions, where α is the value as ordered in configuration/state and β is the non-equivalent value as returned by AWS.
If α and β include arrays of 1000 IAM entities that only differ by 1 entity, we cannot ignore them as equivalent. Although we've kept α in the original config order by ignoring equivalent updates, since β is in entirely random order, we're going to report a much larger (and varying) diff than the 1-entity actual difference.

**4. Import**

Importing runs into similar equivalence problems; see the `ImportStateVerify` example below.
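A minimal sketch of the case-2 interception in legacy-SDK terms, assuming a hypothetical `setPolicyIfChanged` helper and using the `hashicorp/awspolicyequivalence` comparison library:

```go
package example

import (
	"fmt"

	awspolicy "github.com/hashicorp/awspolicyequivalence"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// setPolicyIfChanged intercepts a refreshed value before d.Set(): when the
// remote value is merely a reordered/reformatted equivalent of the value
// already in state, keep the prior value to avoid a perpetual diff.
func setPolicyIfChanged(d *schema.ResourceData, remotePolicy string) error {
	prior := d.Get("policy").(string)

	if equivalent, err := awspolicy.PoliciesAreEquivalent(prior, remotePolicy); err == nil && equivalent {
		// Normalization only: preserve the prior (config-ordered) value.
		remotePolicy = prior
	}

	if err := d.Set("policy", remotePolicy); err != nil {
		return fmt.Errorf("setting policy: %w", err)
	}

	return nil
}
```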
Hi @YakDriver! Thanks for sharing that extra context. I must admit I hadn't really considered the case of importing. I'm rusty on the legacy SDK and so I don't remember exactly how importing is handled there.

If you have some examples to hand of import-related problems in particular then I'd love to learn about them to fold them into my understanding of how that contributes, since my investigation so far has focused only on the read, plan, and apply steps.
@apparentlymart Here's an example that will not pass `ImportStateVerify`:

```hcl
resource "aws_kms_key" "test" {
  description             = "don_key"
  deletion_window_in_days = 7

  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Id": "kms-tf-1",
  "Statement": [
    {
      "Sid": "Enable IAM User Permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "kms:*",
      "Resource": "*"
    }
  ]
}
POLICY
}
```

However, if we switch to `jsonencode()`:

```hcl
resource "aws_kms_key" "test" {
  description             = "don-key"
  deletion_window_in_days = 7

  policy = jsonencode({
    Id = "kms-tf-1"
    Statement = [{
      Sid    = "Enable IAM User Permissions"
      Effect = "Allow"
      Principal = {
        AWS = "*"
      }
      Action   = "kms:*"
      Resource = "*"
    }]
    Version = "2012-10-17"
  })
}
```
Ahh, I went to look in the SDK repository to remind myself what `ImportStateVerify` does. I'm not really sure what the equivalent functionality to that is in the new framework right now, but I'll use this comment as a reminder to research that some more later, to see if something to address it might be in scope for the proposal I've been working towards (intermittently) here.
This new framework does not implement acceptance testing features. A separate, new acceptance testing framework is being planned instead. I'd ensure there is a covering issue back in https://github.com/hashicorp/terraform-plugin-sdk/issues to cover `ImportStateVerify` there.
I was focusing on understanding what role Terraform CLI plays in rendering that policy diff. It's a tricky case where we've given Terraform CLI no information about the expected data structure -- it just knows that it's a string, and can detect heuristically that the string contains JSON -- and so CLI doesn't really have any option but to be conservative and assume that the ordering of arrays matters, because for JSON alone (without any information about the higher-level language built on top of it) array order is defined as being significant.

Unless we want to switch to describing the expected shape of an IAM policy using Terraform's type system in the schema, I think that will end up being a rough edge that this issue alone probably won't be able to solve. There might be some more involved designs we could think about, such as some way to explicitly hint in schema that a string is supposed to contain JSON and that parts of it have arrays that represent sets, but I wouldn't want to delay solving the more straightforward cases in order to comprehensively address that particularly gnarly case.

With that said, I think there would be a path to addressing this within the constraints of what I was beginning to propose here, but it would be one that would require a partnership between general SDK logic and provider-specific logic. Specifically, when performing the equivalence check for the policy attribute, provider-specific logic could parse both JSON strings and compare the statement arrays with set semantics, preserving the prior value whenever they are equivalent.

With that said, I'm mentioning that only for completeness here and not necessarily recommending implementing something like that: it's a much more complex analysis than we've traditionally required here, and so I'd ultimately leave the decision to the AWS provider team about whether it's worth implementing something like that to address the immediate problems with policy diffs, given that we don't currently have any defined path to a framework-only or CLI-only solution for that particular awkward case.
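A minimal sketch of that kind of provider-specific comparison; the function names are hypothetical, and only the top-level `Statement` array is treated as a set here:

```go
package example

import (
	"encoding/json"
	"reflect"
	"sort"
)

// policiesEquivalent decodes both JSON documents and treats the top-level
// "Statement" array as an unordered set before comparing; whitespace and
// object property ordering are ignored by virtue of the decoding.
func policiesEquivalent(a, b string) (bool, error) {
	var docA, docB map[string]any

	if err := json.Unmarshal([]byte(a), &docA); err != nil {
		return false, err
	}
	if err := json.Unmarshal([]byte(b), &docB); err != nil {
		return false, err
	}

	sortStatements(docA)
	sortStatements(docB)

	return reflect.DeepEqual(docA, docB), nil
}

// sortStatements orders the "Statement" array by each element's canonical
// JSON encoding (encoding/json sorts object keys), so that set-equal
// arrays compare as deeply equal.
func sortStatements(doc map[string]any) {
	stmts, ok := doc["Statement"].([]any)
	if !ok {
		return
	}

	sort.Slice(stmts, func(i, j int) bool {
		bi, _ := json.Marshal(stmts[i])
		bj, _ := json.Marshal(stmts[j])
		return string(bi) < string(bj)
	})
}
```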
Thank you @apparentlymart and @bflad for the comments and thought on this!! It's helped a lot in thinking about it. For now, our approach will be to solve the equivalence problems on the provider side. If needed, I would not be opposed to a framework-level mechanism later.
We still need to set up a broader discussion on this topic internally, but to quickly document another potential proposal we could noodle on: adding a "semantic equals" interface to the type system for values, then using that new, optional interface to automatically normalize values after any provider-defined plan/state writing has occurred (similar to #178). This ensures that value types can still distinguish between fully equal values (e.g. syntax, byte-for-byte, etc.) and any semantically equal values (e.g. different representations). Provider developers could then opt into this framework behavior, if desired for their use cases. As a quick sketch:

```go
package attr

type ValueWithSemanticEquals interface {
	Value

	// SemanticEquals should return true if the given Value is
	// semantically equal to (e.g. another valid representation of)
	// the Value.
	SemanticEquals(Value) bool
}
```

Where the framework logic will then check for this prior to returning an RPC response, such as:
```go
package fwserver

func (s *Server) CreateResource(ctx context.Context, req *CreateResourceRequest, resp *CreateResourceResponse) {
	// ...

	// Example normalization logic that goes through the Values and,
	// where the value implements attr.ValueWithSemanticEquals and
	// SemanticEquals returns true, replaces the new value with the
	// prior value.
	normalize(ctx, req.PlannedState, resp.NewState)
}

func (s *Server) PlanResourceChange(ctx context.Context, req *PlanResourceChangeRequest, resp *PlanResourceChangeResponse) {
	// ...

	// Same normalization logic, preferring the prior state over the
	// proposed new state when the two are semantically equal.
	normalize(ctx, req.PriorState, resp.ProposedNewState)
}

func (s *Server) ReadResource(ctx context.Context, req *ReadResourceRequest, resp *ReadResourceResponse) {
	// ...

	// Same normalization logic, preferring the existing state over the
	// newly read state when the two are semantically equal.
	normalize(ctx, req.State, resp.NewState)
}

func (s *Server) UpdateResource(ctx context.Context, req *UpdateResourceRequest, resp *UpdateResourceResponse) {
	// ...

	// Same normalization logic, preferring the planned state over the
	// final new state when the two are semantically equal.
	normalize(ctx, req.PlannedState, resp.NewState)
}
```

If providers want to opt out of the behavior and begin returning these differences, they can change which schema type is being used (e.g. a "JSON string type" to `types.String` or another type), or adjust/remove the type's `SemanticEquals` implementation.
Reference: #70

This change set includes an initial implementation of type-based semantic equality functionality for the framework. Semantic equality functionality enables provider developers to prevent drift detection and certain cases of Terraform data consistency errors, by writing logic that defines when a prior value should be preserved when compared to a new value with inconsequential differences. The definition of inconsequential depends on the context of the value, but typically it refers to values having equivalent meaning with differing syntax. For example, the JSON specification defines whitespace as optional and object properties as unordered; remote systems or code libraries may normalize JSON-encoded strings into a differing byte ordering while remaining exactly equivalent in terms of the JSON specification.

Type-based refers to the logic being baked into the extensible custom type system, as opposed to schema-based, which would mean logic being repetitively coded across multiple attribute definitions. Being type-based means the provider developer can state these intentions by pointing to a custom type which bakes in this logic, rather than needing to remember all the details to duplicate. For example, proper JSON string handling in the prior terraform-plugin-sdk required implementing an attribute as a string type with an appropriate JSON validation reference and JSON normalization reference. With these changes, a bespoke and reusable JSON string type with automatic validation and equivalency handling can be created.

Developers can implement semantic equality logic by implementing a new type-specific interface, which has one method that receives the new value and should return whether the prior value should be preserved, or any diagnostics. The website documentation for custom types has been rewritten to remove details more pertinent to the original design of the framework and instead focus on how developers can either use existing custom types or create custom types based on the `basetypes` package `Typable` and `Valuable` interfaces.
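As an illustration of that interface in use, here's a minimal sketch of a custom JSON string value implementing the framework's `basetypes.StringValuableWithSemanticEquals`; the `Normalized` type name and the `jsontypes` package are hypothetical, and a complete implementation would also provide a matching custom type:

```go
package jsontypes

import (
	"context"
	"encoding/json"
	"reflect"

	"github.com/hashicorp/terraform-plugin-framework/diag"
	"github.com/hashicorp/terraform-plugin-framework/types/basetypes"
)

// Normalized is a sketch of a custom JSON string value. Embedding
// basetypes.StringValue provides the rest of the attr.Value behavior.
type Normalized struct {
	basetypes.StringValue
}

var _ basetypes.StringValuableWithSemanticEquals = Normalized{}

// StringSemanticEquals returns true when the prior and new values decode
// to the same JSON document, ignoring whitespace and object property order.
func (v Normalized) StringSemanticEquals(ctx context.Context, newValuable basetypes.StringValuable) (bool, diag.Diagnostics) {
	var diags diag.Diagnostics

	newValue, ok := newValuable.(Normalized)
	if !ok {
		diags.AddError(
			"Semantic Equality Check Error",
			"An unexpected value type was received while performing semantic equality checks.",
		)
		return false, diags
	}

	var priorDoc, newDoc any

	// If either value is not valid JSON, fall back to reporting the
	// values as not semantically equal.
	if err := json.Unmarshal([]byte(v.ValueString()), &priorDoc); err != nil {
		return false, diags
	}
	if err := json.Unmarshal([]byte(newValue.ValueString()), &newDoc); err != nil {
		return false, diags
	}

	// Comparing decoded documents ignores whitespace and property
	// ordering, but still treats JSON array order as significant.
	return reflect.DeepEqual(priorDoc, newDoc), diags
}
```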
(I originally wrote up a longer version of this which included a lot more background information and historical context, but then when I submitted it GitHub encountered an error and lost my content. Sadly I don't have time to reproduce what had been 45 minutes of research and writing and so I'm resorting to a more direct description of a problem in the hope that readers will be familiar enough with this problem to realize what I'm talking about without the background information. However, I'm happy to answer any clarifying questions if the context is unclear here.)
An important responsibility of a provider is to classify whether a difference between two values represents drift (a material change in meaning often -- but not always -- caused by edits made outside of Terraform) or normalization (a different way of writing down the same thing).
In the old SDK we had `DiffSuppressFunc` as the primary solution to this problem. On a per-attribute basis this function would take two values and return `true` if they are functionally equivalent or `false` if they are materially different.
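For concreteness, a minimal legacy-SDK attribute using `DiffSuppressFunc`; the attribute and its equivalence rule are hypothetical, chosen to match the lowercase-normalization example below:

```go
package example

import (
	"strings"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// thingySchema declares a "name" attribute whose DiffSuppressFunc
// classifies case-only differences as normalization rather than drift.
func thingySchema() map[string]*schema.Schema {
	return map[string]*schema.Schema{
		"name": {
			Type:     schema.TypeString,
			Required: true,
			DiffSuppressFunc: func(k, old, new string, d *schema.ResourceData) bool {
				// Returning true suppresses the diff (the values are
				// functionally equivalent).
				return strings.EqualFold(old, new)
			},
		},
	}
}
```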
Unfortunately, the implementation of `DiffSuppressFunc` was applied only during `PlanResourceChange` and not during `ReadResource`, which sadly only solves half of the problem and thus largely defeats the benefit of the mechanism.

We can see a concrete example of that in hashicorp/terraform#29181, where the provider seems to have classified a reordering of a JSON array as normalization rather than as drift, but because the SDK didn't check `DiffSuppressFunc` on the `Read` response the normalized value was misclassified as a material change.

Although the recent "Objects have changed outside of Terraform" message has made this sort of incorrect result more visible, note that it's still a potential problem regardless of whether Terraform explicitly reports it, because Terraform allows using the value of an argument of one resource to populate an argument of another resource.
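For example, with hypothetical `happycloud_thingy` and `happycloud_widget` resource types, imagine a configuration along these lines:

```hcl
resource "happycloud_thingy" "example" {
  name = "Hello"
}

resource "happycloud_widget" "example" {
  thingy_name = happycloud_thingy.example.name
}
```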
If this "HappyCloud" vendor has an API which normalizes thingy names to lowercase, then we'd likely want to classify case changes as normalization vs. drift. If
ReadResource
were to setname
tohello
rather thanHello
then a subsequent plan could propose the following, even thoughPlanResourceChange
forhappycloud_thingy
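An illustrative rendering of that plan, using the hypothetical resource types from the sketch above:

```
  # happycloud_widget.example will be updated in-place
  ~ resource "happycloud_widget" "example" {
      ~ thingy_name = "Hello" -> "hello"
    }
```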
In modern Terraform the "Objects have changed outside of Terraform" message would at least give a clue as to why this happened, but it'd be better if Terraform had just reported no changes needed at all, thus allowing the configuration to properly converge in only one plan/apply run.
As far as I can tell this new framework doesn't yet have a comparable feature to `DiffSuppressFunc` at all, and so my purpose in opening this issue is mainly just to capture this background information (albeit a cut-down version due to the work loss mentioned earlier) in the hope that it's useful for designing any feature that might meet a similar need.

I'd also assert, though others might disagree, that this normalization vs. drift problem is a crucial enough part of a provider's responsibility to get right that it would be warranted to have first-class framework support for it, such that it can work correctly by default and thus hopefully avoid the risk of individual resource implementations effectively repeating the old SDK's mistake of only solving half of the problem. 😖