
Classifying normalization vs. drift #70

Closed
apparentlymart opened this issue Jul 15, 2021 · 14 comments · Fixed by #737

@apparentlymart

apparentlymart commented Jul 15, 2021

(I originally wrote up a longer version of this which included a lot more background information and historical context, but then when I submitted it GitHub encountered an error and lost my content. Sadly I don't have time to reproduce what had been 45 minutes of research and writing and so I'm resorting to a more direct description of a problem in the hope that readers will be familiar enough with this problem to realize what I'm talking about without the background information. However, I'm happy to answer any clarifying questions if the context is unclear here.)

An important responsibility of a provider is to classify whether a difference between two values represents drift (a material change in meaning often -- but not always -- caused by edits made outside of Terraform) or normalization (a different way of writing down the same thing).

In the old SDK we had DiffSuppressFunc as the primary solution to this problem. On a per-attribute basis this function would take two values and return true if they are functionally equivalent or false if they are materially different.
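For reference, a minimal sketch of how that is typically wired up in a terraform-plugin-sdk/v2 schema, using the hypothetical HappyCloud resource discussed below and assuming a case-insensitive comparison rule:

package happycloud

import (
	"strings"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Hypothetical resource whose remote API lowercases "name" server-side.
func resourceHappycloudThingy() *schema.Resource {
	return &schema.Resource{
		Schema: map[string]*schema.Schema{
			"name": {
				Type:     schema.TypeString,
				Required: true,
				// Treat case-only differences as normalization rather than drift.
				DiffSuppressFunc: func(k, old, new string, d *schema.ResourceData) bool {
					return strings.EqualFold(old, new)
				},
			},
		},
	}
}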

Unfortunately, the implementation of DiffSuppressFunc was applied only during PlanResourceChange and not during ReadResource, which sadly only solves half of the problem and thus largely defeats the benefit of the mechanism.

We can see a concrete example of that in hashicorp/terraform#29181, where the provider seems to have classified a reordering of a JSON array as normalization rather than as drift, but because the SDK didn't check DiffSuppressFunc on the Read response the normalized value was misclassified as a material change.

Although the recent "Objects have changed outside of Terraform" message has made this sort of incorrect result more visible, note that it's still a potential problem regardless of whether Terraform explicitly reports it because Terraform allows using the value of an argument of one resource to populate an argument of another resource. For example:

resource "happycloud_thingy" "foo" {
  name = "Hello"
}

resource "monitorsauce_doodad" "bar" {
  title = "Statistics for ${happycloud_thing.foo.name}"
}

If this "HappyCloud" vendor has an API which normalizes thingy names to lowercase, then we'd likely want to classify case changes as normalization vs. drift. If ReadResource were to set name to hello rather than Hello then a subsequent plan could propose the following, even though PlanResourceChange for happycloud_thingy classified the name change as normalization:

  # monitorsauce_doodad.bar must be replaced
-/+ resource "monitorsauce_doodad" "bar" {
     ~ title = "Statistics for Hello" -> "Statistics for hello" (forces replacement)
    }

In modern Terraform the "Objects have changed outside of Terraform" message would at least give a clue as to why this happened, but it'd be better if Terraform had just reported no changes needed at all, thus allowing the configuration to properly converge in only one plan/apply run.

As far as I can tell this new framework doesn't yet have a comparable feature to DiffSuppressFunc at all, and so my purpose in opening this issue is mainly just to capture this background information (albeit a cut down version due to the work loss mentioned earlier) in the hope that it's useful for designing any feature that might meet a similar need.

I'd also assert, though others might disagree, that this normalization vs. drift problem is a crucial enough part of a provider's responsibility to get right that it would be warranted to have first-class framework support for it, such that it can work correctly by default and thus hopefully avoid the risk of individual resource implementations effectively repeating the old SDK's mistake of only solving half of the problem. 😖

@paddycarver
Contributor

This may be related to #34 / #64, or an abstraction on top of them? Perhaps we could have ReadModifiers similar to PlanModifiers? We could also try and combine the two concepts into one, though I'm nervous about how the presence/absence of a config/plan (present in PlanResourceChange, absent in ReadResource) may impact the leakiness of that abstraction.

@apparentlymart
Author

apparentlymart commented Jul 21, 2021

I guess my hypothesis here was that the most robust thing would be for the framework to offer something that makes consistency between ReadResource and PlanResourceChange the default behavior, because them having different rules for classifying these (or, in the extreme case, not having any normalization detection in the ReadResource case at all) seems more likely to be a mistake than an intentional decision.

However, I do admit that it's hard to back that up with real data since we've not ever had an SDK which handled this consistently, and so we only have the current not-quite-right case and thus the grass naturally seems greener on the other side. There might be some hidden use-cases for intentionally making these divergent which are currently not visible because they "just work" without any special effort in the current SDK.

If we were designing the old SDK's DiffSuppressFunc as a new feature today then I would've proposed it be run automatically by the SDK also during ReadResource in order to get the intended effect, but I know this new framework is aiming to make some different design tradeoffs so I expected that lifting that feature exactly as-is wouldn't be the ideal approach, and hence this rather vague issue just gesturing at a problem but without a concrete solution. 😕

With that said, I can at least try to frame the problem more concretely in old SDK terms, so we have something specific to compare with: for the old SDK design, I'd propose that the ReadResource part of the SDK first call the provider's Read function (with the provider calling d.Set to update everything to match the remote system), and then afterwards call DiffSuppressFunc on every attribute the Read function changed, comparing the original value (prior to Read) with the new value (set inside Read). If DiffSuppressFunc returns true, reset the value back to the original, because the detected change is classified as normalization rather than drift.
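As a rough sketch (not real SDK code; the helper name and the map-based attribute representation here are simplifications for illustration), the read-time behavior proposed above might look something like:

package sdkproposal

import (
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// suppressNormalizationAfterRead compares each attribute that Read changed
// against its prior value and, where the attribute's DiffSuppressFunc reports
// the two as equivalent, restores the prior value so the change is treated as
// normalization rather than drift.
func suppressNormalizationAfterRead(prior, updated map[string]string, schemas map[string]*schema.Schema, d *schema.ResourceData) {
	for key, newVal := range updated {
		oldVal, existed := prior[key]
		if !existed || oldVal == newVal {
			continue
		}
		attrSchema := schemas[key]
		if attrSchema == nil || attrSchema.DiffSuppressFunc == nil {
			continue
		}
		if attrSchema.DiffSuppressFunc(key, oldVal, newVal, d) {
			// Functionally equivalent per the provider's own rule: keep the prior value.
			updated[key] = oldVal
		}
	}
}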

I'm not sure how to generalize that to the broader concern of "plan customization" because there are various other use-cases for plan customization that aren't related to normalization/drift classification and which, as you point out, would make sense to only run during the planning step and not during the refreshing step. For example, the classic CustomizeDiff use-case of "if you change the content of s3_bucket_object then the version attribute should be unknown in the plan"; I can't find any analog for that rule in the ReadResource step, because we already know concretely what the new version number is in that case.

@apparentlymart
Author

apparentlymart commented Aug 19, 2021

I ended up thinking about this some more today because I encountered another situation where a normalization change ended up propagating downstream and generating a confusing result. I've felt bad about just showing up here with a problem statement and not making any particular suggestions on what to do about it, so I had a look again at the framework design so far (looks like it's coming together well!) and noticed some hooks that seem promising to me as parts of a potential solution to the problem I've raised here, and wanted to note them. Perhaps you all already had this in mind when you designed these things and I'm just telling you things you already know 😀 but can't hurt to write it out here anyway, for future reference.

attr.Value

The main thing I learned about today that I hadn't previously seen is attr.Value as a higher-level abstraction over the physical types in Terraform's type system. Adding a new concept in between the idea of an attribute and the physical type of that attribute seems like a great insight that I hadn't previously considered, and seems like it'll give a lot of opportunities for handling cross-cutting concerns in typical providers!

The Equal method that you've already defined for Value reads to me as, semantically if not behaviorally, the new-framework equivalent of DiffSuppressFunc in the old SDK, assuming I'm correctly understanding the intent that it should return true if two values are functionally equivalent per the definition of the specific type in question. For example, I'm thinking about a hypothetical Value implementation representing a case-insensitive string, where Equal would return true if both values have the same sequence of letters even if their case disagrees in a way that the remote API doesn't care to distinguish.
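To make that concrete, a minimal sketch of such a hypothetical value type (only the Equal method is shown; the type name and the omitted remainder of the attr.Value interface are illustrative):

package customtypes

import (
	"strings"

	"github.com/hashicorp/terraform-plugin-framework/attr"
)

// CaseInsensitiveString is a hypothetical value type whose Equal method
// treats values differing only in letter case as functionally equivalent.
type CaseInsensitiveString struct {
	Value string
}

func (s CaseInsensitiveString) Equal(other attr.Value) bool {
	o, ok := other.(CaseInsensitiveString)
	if !ok {
		return false
	}
	// "Hello" and "hello" compare as equal here.
	return strings.EqualFold(s.Value, o.Value)
}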

Potential automatic Read/Plan/Apply response tidying

The other thing I saw on my travels is that there seems to be some precedent in the PlanResourceChange implementation for some automatic behavior (that is, behavior implemented for all providers in the framework rather than in the provider code itself), currently involving fitting in all of the necessary unknown values into the attributes that won't be decided until the apply step:

modifiedPlan, err := tftypes.Transform(plan, markComputedNilsAsUnknown(ctx, resourceSchema))
if err != nil {
	resp.Diagnostics = append(resp.Diagnostics, &tfprotov6.Diagnostic{
		Severity: tfprotov6.DiagnosticSeverityError,
		Summary:  "Error modifying plan",
		Detail:   "There was an unexpected error updating the plan. This is always a problem with the provider. Please report the following to the provider developer:\n\n" + err.Error(),
	})
	return resp, nil
}

This got me thinking about a possible additional transform step which takes an object representing the prior state along with the resource type schema. For each leaf attribute it visits, it could perhaps raise the two tftypes values representing the value that was given to transform and the corresponding value from the prior state into schema.Value, call Value.Equal with them, and then use the result to decide between returning the prior value (if Equal returns true) or the new value (if it returns false).
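In other words, the per-attribute decision might boil down to something like this sketch (the helper name is invented; the real logic would live inside the transform walk described above):

package fwproposal

import (
	"github.com/hashicorp/terraform-plugin-framework/attr"
)

// preferPriorIfEquivalent returns the prior value when the type's Equal
// method classifies the difference as normalization, and otherwise returns
// the newly produced value so genuine drift is still reported.
func preferPriorIfEquivalent(prior, proposed attr.Value) attr.Value {
	if prior == nil || proposed == nil {
		return proposed
	}
	if prior.Equal(proposed) {
		// Functionally equivalent: keep the value Terraform already has.
		return prior
	}
	return proposed
}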

To get the effect I was hoping for when I opened this, I think we'd need to then call this function at least in both PlanResourceChange and in ReadResource, so that both of them will always prefer to preserve an existing value rather than switching to a functionally-equivalent value that Terraform Core would not consider as equal.

I also wonder about doing it in ApplyResourceChange in order to keep the planned value if the final value is functionally equivalent to it, for consistency's sake, though that would have a different benefit than the one I opened this issue about: it would help avoid provider developers getting tripped up by the "Provider produced inconsistent result after apply" safety check, by automatically preserving exactly the value that was in the plan as long as the final result was functionally equivalent to it.


I don't have nearly enough context to consider any of this to be a proper design proposal... more just some initial ideas to hopefully better explain the sort of thing I was imagining when I opened this issue. I hope it's useful in some way! Please let me know if anything here seems surprising or weird; I'm still very early in learning my way around the framework so I wouldn't be surprised or offended to hear that I misunderstood something in this initial exploration of the problem.

With that said, if this looks like a promising direction and you'd be open to outside input then I'd be happy to try to do some more of the legwork to turn this into a more proper design proposal, though I would need to wait a little while to first work through some other tasks since our team is stretched a bit thin right now.

@apparentlymart
Author

apparentlymart commented Sep 24, 2021

To satisfy my curiosity I made an attempt at implementing something like what I described in the previous comment, over in #178.

For the moment I'm offering that mainly just as something to play with to see the effect of the design I was considering. If it seems promising then I'd be happy to produce a more complete design proposal before getting into the weeds of how exactly I implemented that, since I expect I didn't do it in the best way due to my unfamiliarity with this codebase.

One new detail I did note while working on this is that Value.Equal is only definable for leaf attributes, and not for attributes that have other attributes nested inside them. In practice I think that means that this approach would work okay for primitive-typed attributes and collection-of-primitive-typed attributes (an unordered list of strings that allows duplicates, for example), but it would not solve for normalization of the overall shape of a data structure, such as the way that AWS EC2 splits and recombines security group rules. With that said, Terraform Core's own safety rules for how providers can modify the proposed value in PlanResourceChange would prevent that sort of structural normalization today anyway, I think.

@carlosjgp

(off-topic but I feel like it's very well deserved)
Hats off @apparentlymart

Beautiful and detailed write-up 💯 kudos

@YakDriver
Member

YakDriver commented Nov 30, 2021

I've found 4 places where the normalization/equivalence vs. drift/meaningful-change distinction affects the AWS provider. As for impact, we have approximately 11 open issues, with a combined 400 upvotes, that relate to the problem.

1. Create/Config Update

This is the situation that is handled with DiffSuppressFunc at present by something like verify.SuppressEquivalentPolicyDiffs().

2. Read/Set/Refresh - Avoiding Perpetual Diffs

This situation is not well handled currently but can be intercepted to check equivalence before d.Set().

3. Read/Set - Actual Drift Detection

This is also not handled well currently. Basically, if there is a legitimate out-of-band change, a practitioner wants to see the minimal diff. However, that's more complicated than equivalency, where we can ignore order. When the data store (e.g., AWS) gives us non-equivalent JSON data, its internal arrays can be in any random order.

Here's a scenario. Assume we're working with JSON data including an array of 1000 IAM entities with some permissions.

  • α represents the original data, including the order of all its arrays.
  • α', α'', α''', and so forth are equivalent to α but with different orders.
| Action | Config | State | Data Store |
| --- | --- | --- | --- |
| Create | α | α | α' |
| Refresh (ignore equivalent) | α | α | α'' |
| Refresh (ignore equivalent) | α | α | α''' |
| Refresh | α | α vs. β | β |

If α and β include arrays of 1000 IAM entities that differ by only 1 entity, we cannot treat them as equivalent. Although we've kept α in the original config order by ignoring equivalent updates, since β comes back in an entirely random order, we're going to report a much larger (and varying) diff than the 1-entity actual difference.

4. ImportStateVerify

We can write a custom check but, again, it would be helpful for this comparison to have an option for equivalence. The test here would be to compare the config against what we read from the data store (e.g., AWS). If the config uses HEREDOC and has random extra white space, ImportStateVerify will report "ImportStateVerify attributes not equivalent". The difference is shown below (top is actual, bottom is expected): semantically the JSON data is equivalent, but the strings differ.

@apparentlymart
Author

Hi @YakDriver! Thanks for sharing that extra context.

I must admit I hadn't really considered the case of importing. I'm rusty on the legacy SDK and so I don't remember exactly how ImportStateVerify fits in, and so maybe this is not a relevant observation, but I feel that our current model of import is a bit challenging with regard to normalization vs. drift because in that case we're constructing an entirely new state object from something essentially equivalent to a "read" call, and so we don't have anything to compare with in order to decide which value should "win".

If you have some examples to hand of import-related problems in particular then I'd love to learn about them to fold them into my understanding of how that contributes, since my investigation so far has focused only on the read, plan, and apply steps.

@YakDriver
Member

YakDriver commented Nov 30, 2021

@apparentlymart ImportStateVerify is definitely not our primary concern. See my comment above, where I've expanded on this. Also, I've found that by using jsonencode() and being careful, I can avoid ISV problems. It does involve rewriting test configurations to make sure they're seen as equivalent, even when they already were equivalent but weren't recognized as such. It's a burden I'd prefer the community not to have but, like I said, not top priority.

Here's an example that will not pass ImportStateVerify.

resource "aws_kms_key" "test" {
   description             = "don_key"
   deletion_window_in_days = 7

   policy = <<POLICY
 {
   "Version": "2012-10-17",
   "Id": "kms-tf-1",
   "Statement": [
     {
       "Sid": "Enable IAM User Permissions",
       "Effect": "Allow",
       "Principal": {
         "AWS": "*"
       },
       "Action": "kms:*",
       "Resource": "*"
     }
   ]
 }
 POLICY
}

However, if we switch to jsonencode(), ImportStateVerify will see the equivalent JSON as equivalent:

resource "aws_kms_key" "test" {
   description             = "don-key"
   deletion_window_in_days = 7

  policy = jsonencode({
     Id        = "kms-tf-1"
     Statement = [{
       Sid       = "Enable IAM User Permissions"
       Effect    = "Allow"
       Principal = {
         AWS = "*"
       }
       Action   = "kms:*"
       Resource = "*"
     }]
     Version = "2012-10-17"
   })
 }

@apparentlymart
Author

Ahh, I went to look in the SDK repository to remind myself what ImportStateVerify is and now remember that it's part of the test framework rather than part of the runtime code. Sorry for misunderstanding what you were talking about there! I can see where there is some conceptual overlap with the normalization vs. drift problem there though, because the test harness is trying to verify that the state objects generated during import are "equal" to some original state.

I'm not really sure what is the equivalent functionality to that in the new framework right now, but I'll use this comment as a reminder to research that some more later to see if something to address that might be in scope for the proposal I've been working towards (intermittently) here.

@bflad
Contributor

bflad commented Nov 30, 2021

This new framework does not implement acceptance testing features. A separate, new acceptance testing framework is being planned instead. I'd ensure there is a covering issue back in https://github.com/hashicorp/terraform-plugin-sdk/issues for ImportStateVerify and state equality in the meantime, if nothing else, to remind us to think about that during that design.

@apparentlymart
Author

I was focusing on understanding what role ImportStateVerify was playing in my comments above but also want to acknowledge that indeed complex data structures represented as JSON are a definite pain point for Terraform diffing in general (both for reporting drift and for showing what action is planned).

It's a tricky case where we've given Terraform CLI no information about the expected data structure -- it just knows that it's a string, and can detect heuristically that the string contains JSON -- and so CLI doesn't really have any option but to be conservative and assume that the ordering of arrays matters, because for JSON alone (without any information about the higher-level language built on top of it) array order is defined as being significant.

Unless we want to switch to describing the expected shape of an IAM policy using Terraform's type system in the schema, I think that will end up being a rough edge that this issue alone probably won't be able to solve. There might be some more involved designs we could think about, such as some way to explicitly hint in schema that a string is supposed to contain JSON and that parts of it have arrays that represent sets, but I wouldn't want to delay solving the more straightforward cases in order to comprehensively address that particularly gnarly case.

With that said, I think there would be a path to addressing this within the constraints of what I was beginning to propose here, but it would require a partnership between general SDK logic and provider-specific logic. Specifically, when performing the ReadResource operation for something containing an IAM policy, rather than just returning the remote policy JSON verbatim and letting the Equal function decide normalization vs. drift, the provider would need to deeply analyze the before and after policy JSON itself and have an algorithm to merge the new into the old in a way that preserves the relative ordering of any unchanged elements in "set-like" arrays; the UI-level diff renderer's LCS algorithm should then typically be able to show just the additions and removals.
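As a very rough sketch of the kind of merge described above (the names and the element-identity rule here are assumptions for illustration, not a recommendation for the actual provider implementation):

package jsonmerge

import (
	"encoding/json"
	"reflect"
)

// canonical decodes a JSON value so that two encodings of the same value
// (differing whitespace or object property order) compare as equal.
func canonical(raw json.RawMessage) interface{} {
	var v interface{}
	if err := json.Unmarshal(raw, &v); err != nil {
		return string(raw)
	}
	return v
}

// mergeSetLike merges freshly-read elements of a "set-like" JSON array into
// the prior array: elements that still exist keep their prior relative order,
// removed elements are dropped, and new elements are appended, so a
// line-oriented (LCS) diff shows only the additions and removals.
func mergeSetLike(prior, current []json.RawMessage) []json.RawMessage {
	remaining := make([]json.RawMessage, len(current))
	copy(remaining, current)

	var merged []json.RawMessage
	for _, p := range prior {
		for i, c := range remaining {
			if reflect.DeepEqual(canonical(p), canonical(c)) {
				merged = append(merged, p)
				remaining = append(remaining[:i], remaining[i+1:]...)
				break
			}
		}
	}
	return append(merged, remaining...)
}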

With that said, I'm mentioning that only for completeness here and not necessarily recommending implementing something like that: it's a much more complex analysis than we've traditionally required here, and so I'd ultimately leave the decision to the AWS provider team about whether it's worth implementing something like that to address the immediate problems with policy diffs, given that we don't currently have any defined path to a framework-only or CLI-only solution for that particular awkward case.

@YakDriver
Member

Thank you @apparentlymart and @bflad for the comments and thought on this!! It's helped a lot in thinking about it.

For now, our approach will be to solve the Read situation with a little helper in the provider. Since many of our outstanding issues convolute the messy real diffs and perpetual diffs, it's difficult to tell how much pain comes from each. After fixing this part, we can reassess what level of pain remains.

If needed, I would not be opposed to a TypeJSON to allow more specific handling. Depending on the results of the first round of fixes, we may want to test out the concept in the AWS provider and if it goes well, promote the idea up.

@bflad
Contributor

bflad commented Sep 27, 2022

We still need to set up a broader discussion on this topic internally, but to quickly document it: another potential proposal we could noodle on is adding a "semantic equals" interface to the type system for values, then using that new, optional interface to automatically normalize values after any provider-defined plan/state writing has occurred (similar to #178). This ensures that value types can still distinguish between fully equal values (e.g. syntax, byte-for-byte, etc.) and any semantically equal values (e.g. different representations). Provider developers could then opt into this framework behavior, if desired for their use cases.

As a quick sketch:

package attr

type ValueWithSemanticEquals interface {
  Value

  // SemanticEquals should return true if the given Value is
  // semantically equal to (e.g. another valid representation of)
  // the Value.
  SemanticEquals(Value) bool
}

Where the framework logic will then check for this prior to returning an RPC response, such as:

package fwserver

func (s *Server) CreateResource(ctx context.Context, req *CreateResourceRequest, resp *CreateResourceResponse) {
	// ...

	// Example normalization logic that goes through the Values and,
	// if the schema type implements attr.ValueWithSemanticEquals and
	// it returns true, replaces the new value with the old value.
	normalize(ctx, req.PlannedState, resp.NewState)
}

func (s *Server) PlanResourceChange(ctx context.Context, req *PlanResourceChangeRequest, resp *PlanResourceChangeResponse) {
	// ...

	// Example normalization logic that goes through the Values and,
	// if the schema type implements attr.ValueWithSemanticEquals and
	// it returns true, replaces the new value with the old value.
	normalize(ctx, req.PriorState, resp.ProposedNewState)
}

func (s *Server) ReadResource(ctx context.Context, req *ReadResourceRequest, resp *ReadResourceResponse) {
	// ...

	// Example normalization logic that goes through the Values and,
	// if the schema type implements attr.ValueWithSemanticEquals and
	// it returns true, replaces the new value with the old value.
	normalize(ctx, req.State, resp.NewState)
}

func (s *Server) UpdateResource(ctx context.Context, req *UpdateResourceRequest, resp *UpdateResourceResponse) {
	// ...

	// Example normalization logic that goes through the Values and,
	// if the schema type implements attr.ValueWithSemanticEquals and
	// it returns true, replaces the new value with the old value.
	normalize(ctx, req.PlannedState, resp.NewState)
}

If providers want to opt out of the behavior and begin returning these differences, they can change which schema type is being used (e.g. "JSON string type" to types.String or another type), or adjust/remove the type's SemanticEquals() method if it is under their control.

There's also nothing in the way of using UpgradeResourceState to transparently update the state regardless of the type implementation. I think this should be purposefully excluded from that RPC handling so provider developers can choose whether to preserve prior normalization or not.

This SemanticEquals() method would then be something that a theoretical "JSON string" type could implement to handle whitespace and unordered property differences, for example. I think this may strike a nice balance between "automatically doing the right thing" if you implement specialized types while still leaving it opt-in.
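For example, sticking with the interface sketched above (not necessarily the final shipped signature, which ended up returning diagnostics as well), a hypothetical JSON string value might implement it along these lines:

package customtypes

import (
	"encoding/json"
	"reflect"

	"github.com/hashicorp/terraform-plugin-framework/attr"
)

// JSONString is a hypothetical value type for strings containing JSON.
type JSONString struct {
	Value string
}

// SemanticEquals reports two values as equivalent when they decode to the
// same JSON document, ignoring whitespace and object property order.
// Array ordering is still treated as significant, per the JSON specification.
func (j JSONString) SemanticEquals(other attr.Value) bool {
	o, ok := other.(JSONString)
	if !ok {
		return false
	}
	var a, b interface{}
	if err := json.Unmarshal([]byte(j.Value), &a); err != nil {
		return false
	}
	if err := json.Unmarshal([]byte(o.Value), &b); err != nil {
		return false
	}
	return reflect.DeepEqual(a, b)
}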

@bflad bflad modified the milestones: v1.0.0, v1.1.0 Nov 9, 2022
@bflad bflad self-assigned this Jan 10, 2023
@bflad bflad modified the milestones: v1.1.0, v1.3.0 Jan 13, 2023
bflad added a commit that referenced this issue Mar 14, 2023
Reference: #679

This documentation may have stemmed from an early design goal to include semantic equality within types. The initial implementation did not include that functionality, which is being tracked with #70.
bflad added a commit that referenced this issue Mar 14, 2023
bflad added a commit that referenced this issue Apr 28, 2023
bflad added a commit that referenced this issue May 26, 2023
bflad added a commit that referenced this issue May 26, 2023
Reference: #70

This change set includes an initial implementation of type-based semantic equality functionality for the framework.

Semantic equality functionality enables provider developers to prevent drift detection and certain cases of Terraform data consistency errors, by writing logic that defines when a prior value should be preserved when compared to a new value with inconsequential differences. The definition of inconsequential depends on the context of the value, but typically it refers to when values have equivalent meaning with differing syntax. For example, the JSON specification defines whitespace as optional and object properties as unordered. Remote systems or code libraries may normalize JSON encoded strings into a differing byte ordering while remaining exactly equivalent in terms of the JSON specification.

Type-based refers to the logic being baked into the extensible custom type system. This design is preferable over being schema-based, which refers to logic being repetitively coded across multiple attribute definitions. Being type-based means the provider developer can state these intentions by nature of pointing to a custom type which bakes in this logic, rather than needing to remember all the details to duplicate. For example, proper JSON string handling in the prior terraform-plugin-sdk required implementing an attribute as a string type with an appropriate JSON validation reference and JSON normalization reference. With these changes, a bespoke and reusable JSON string type with automatic validation and equivalency handling can be created.

Developers can implement semantic equality logic by implementing a new type-specific interface, which has one method that receives the new value and should return whether the prior value should be preserved or any diagnostics.

The website documentation for custom types has been rewritten to remove details more pertinent to the original design of the framework and instead focus on how developers can either use existing custom types or create custom types based on the `basetypes` package `Typable` and `Valuable` interfaces.
bflad added a commit that referenced this issue May 31, 2023