This repository has been archived by the owner on Feb 23, 2024. It is now read-only.

Add new resource for Device Network Mode #240

Merged
merged 2 commits into from
Jul 21, 2020

Conversation

t0mk
Contributor

@t0mk t0mk commented Jul 12, 2020

This PR decouples the Network Mode handling from the device resource. We've had a lot of issues with reading, setting, and waiting for the network mode. If the API call fails, it's better if it's in a separate resource, because some devices have very long provisioning times and re-creating or fixing TF state becomes really ugly.

Relevant issues:

The new usage will then look like this:

resource "packet_vlan" "test" {
  description = "VLAN in New Jersey"
  facility    = "ewr1"
  project_id  = local.project_id
}

resource "packet_device" "test" {
  hostname         = "test"
  plan             = "m1.xlarge.x86"
  facilities       = ["ewr1"]
  operating_system = "ubuntu_16_04"
  billing_cycle    = "hourly"
  project_id       = local.project_id
}

// THIS IS NEW vvvvvvvvvvvvvvvvvvvvvvvvv
resource "packet_device_network_mode" "test" {
  device_id = packet_device.test.id
  mode = "hybrid"
}

resource "packet_port_vlan_attachment" "test" {
// LINK DEVICE THROUGH THE NEW RESOURCE vvvvvvvvvv
  device_id = packet_device_network_mode.test.id
  port_name = "eth1"
  vlan_vnid = packet_vlan.test.vxlan
}

Note that the packet_port_vlan_attachment must depend on packet_device_network_mode, not on packet_device, so that TF waits for the network mode change.

I also added an http_get_excludes param, which can be used to fool the Packet API cache and maybe work around the bug. If TF hangs on the state change, try adding project_lite to the excludes, and it might just avoid the cache hit:

resource "packet_device_network_mode" "test" {
  device_id = packet_device.test.id
  mode = "hybrid"
  http_get_excludes = ["project_lite"]
}

@t0mk
Contributor Author

t0mk commented Jul 13, 2020

The Acceptance tests succeed.

@c0dyhi11

Hey @t0mk,
This looks great. Are we going to keep the legacy way of doing things around for a short time, so that folks have time to migrate to the new way?
It would be nice if both methods worked, so that anyone downloading code and testing tomorrow doesn't get a ton of errors.

Also what is the significance of the string project_lite? Does that always need to be that exact string? Or is that the project name? Is this documented somewhere?

@t0mk
Contributor Author

t0mk commented Jul 13, 2020

@c0dyhi11, I was thinking about keeping the old way, but if we keep monitoring and changing the network mode in packet_device, there is no benefit to having the new resource. If the device resource still waits for the network_type layer3, and the API wrongly reports layer2-bonded from the cache, the provisioning of the vmware devices will still time out and fail, and the TF packet_device resources will not be recorded in the TF state, even though they are active in the API.

I.e., I think that if we don't remove the network_type handling code from packet_device, there will be no benefit to this PR.

The project_lite in HTTP GET excludes is just a workaround for the bug (wrongly reported network_state). If you add exclude=project_lite to HTTP GET /devices/<uuid>, the Packet API will (based on my experience) return the proper network_type, avoiding the unreasonable cache hit. This is the third time the bug has appeared, and since Thursday I haven't heard of anyone working on a fix. I added the attribute because it seemed useful to have a way to work around the bug.
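
For illustration, here is a minimal Go sketch of how a device GET URL with an exclude parameter could be built. The helper name deviceURL, the base URL passed in main, the placeholder device ID, and the comma-joined encoding of several excludes are all assumptions made for this example; none of them is taken from this PR or from packngo.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// deviceURL builds the URL for GET /devices/<uuid> and, when excludes are
// given, appends them as a single "exclude" query parameter.
// Assumption: multiple excludes are comma-joined; this thread only shows
// the single-value form exclude=project_lite.
func deviceURL(baseURL, deviceID string, excludes []string) string {
	u := baseURL + "/devices/" + url.PathEscape(deviceID)
	if len(excludes) == 0 {
		return u
	}
	q := url.Values{}
	q.Set("exclude", strings.Join(excludes, ","))
	return u + "?" + q.Encode()
}

func main() {
	// The base URL and the device ID below are placeholders for illustration.
	fmt.Println(deviceURL("https://api.packet.net", "00000000-0000-0000-0000-000000000000", []string{"project_lite"}))
}
```

Whether the extra parameter actually avoids the cache hit depends on the API's behavior, which, as noted above, is based on observation only.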

@c0dyhi11

Is there a way we can have a very helpful error message if someone tries to implement the old way?
Something like: The 'network_type' property of the 'packet_device' resource has been deprecated as of packet provider version 3.0. Please update your resource as documented here (Link to Docs) or pin your provider version to 2.10.1 or older.

Something along those lines?

@t0mk
Contributor Author

t0mk commented Jul 13, 2020

@c0dyhi11 Yes, see hashicorp@d25aab3 in this PR.

@c0dyhi11

Awesome. Let me prep the google-anthos repo with these changes and get a PR for that going. Then maybe we can merge these together.

@t0mk
Contributor Author

t0mk commented Jul 14, 2020

@c0dyhi11 OK. Thanks. Ping me when you need the provider release with this.

Also, you can fix the provider version in the google-anthos master branch to 2.10.1 and do a branch for provider 3.0.0. Considering how extensively you use the TF provider there, it's probably a good idea to fix the version anyway.

@displague
Member

displague commented Jul 14, 2020

I have concerns about the direction of this PR. (other than that, it looks good for what it does 😄 - see some notes below ("If we proceed in this direction:"))

Terraform represents the upstream cloud provider API and the expectation for Terraform users is for the TF provider to faithfully represent the upstream API except in places where that API is not stateful and must be reinterpreted in a stateful way.

In the Packet API, network_mode is a property of the device. It is stateful. Making changes to this API end-point affects the resource, as expected.

The problem being addressed here is that the Packet API is, at times, returning incorrect responses (suspected to be caching).

This is a Packet API problem and should be addressed there.

This PR has found a creative way to circumvent the caching problem, through cache-busting strategies introduced in this PR (by introducing excludes parameters in the API request). This strategy could be applied whenever the Device state is being fetched (fetch the device and then fetch the network_type with a cache-buster).

This client-side fixing strategy could be implemented in packngo to address equinixmetal-archive/packngo#133. By addressing this problem in packngo the Terraform provider would only need a packngo import bump.

We've had a lot of issues with reading, setting, and waiting for the network mode. If the API call fails, it's better if it's in a separate resource, because some devices have very long provisioning times and re-creating or fixing TF state becomes really ugly.

Help me understand the other problems that this approach is improving. As far as I know, the only attribute experiencing drift, due to perceived API bugs, is the network_type. If other attributes are experiencing drift they should be addressed separately. We shouldn't need to introduce a new packet.device_other_attribute resource each time an attribute bug exists upstream.

A packet.device resource update to change only the network_type should be as instantaneous as updating a separate network_type resource. In both cases an API fetch is being performed, state is being compared, and the diff should result in only the network_type being updated. How is this not the case? Is this one state change a long-running API call?

Does this cache-busting approach alone address the problem? Do any added benefits of this resource justify breaking the API, existing modules, and configurations?

As you pointed out, one of the benefits is that network_type changes can be controlled in Terraform and chained, without need to perturb the network_state from provisioning scripts (tainting the device resource):

device create
 -> provisioner
   -> network_type change
     -> secondary provisioner

In this scenario, it would seem that a post provision network_type change could be healed by Terraform and the secondary provisioner would fire again, which may also be desirable.

If we proceed in this direction:

  • is the composite 'device-id + network_type' primary key for the resource_packet_device_network_mode sufficient? I think so. When the network_type is changed externally the resource will be seen as deleted and will need to be created.
  • we must include tests for the new resource at packet/resource_packet_device_network_mode.go (as we must for any new resource or data resource).
  • the docs for the resource should include import directions
  • What benefit is there in renaming the device field from network_type to network_mode? If we are bumping to v3.0.0 we can just keep the field name the same (for use in read-only terraform configs there is no change needed) while making it a read-only computed field.

@t0mk
Contributor Author

t0mk commented Jul 15, 2020

Hi @displague , thanks for your comment.

Terraform represents the upstream cloud provider API and the expectation for Terraform users is for the TF provider to faithfully represent the upstream API except in places where that API is not stateful and must be reinterpreted in a stateful way.

This is not the case in general, see the *_attachment resources.

This client-side fixing strategy could be implemented in packngo to address equinixmetal-archive/packngo#133. By addressing this problem in packngo the Terraform provider would only need a packngo import bump.

It could be done in packngo, but it can't be configurable then. @c0dyhi11 noticed that in his case, the &exclude=project_lite won't avoid the caching bug. With this attr, I just tried to offer a way to play with the API cache.

Help me understand the other problems that this approach is improving. As far as I know, the only attribute experiencing drift, due to perceived API bugs, is the network_type. If other attributes are experiencing drift they should be addressed separately. We shouldn't need to introduce a new packet.device_other_attribute resource each time an attribute bug exists upstream.

A packet.device resource update to change only the network_type should be as instantaneous as updating a separate network_type resource. In both cases an API fetch is being performed, state is being compared, and the diff should result in only the network_type being updated. How is this not the case? Is this one state change a long-running API call?

The Network Mode of the device is not a common attribute. It can be read simply from the jq path ".network_ports[0].network_type" of a device resource JSON, but it can't be set via the device resource. Setting the network mode is a series of API operations, best seen in https://github.com/packethost/packngo/blob/master/ports.go#L321. There's bonding, disbonding, and port layer conversion. If those ops are done in the proper order, the device will most likely end up in the desired network mode. You can also ask the Packet frontend people about the network mode, as they do it the same way, but in JavaScript.
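
As a rough illustration of the read side only, the following Go sketch extracts the value at .network_ports[0].network_type from a device JSON document. The struct and function names, and the sample JSON in main, are hypothetical and trimmed to the fields needed here; a real device response contains many more fields.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// port and device model only the fields this sketch needs (hypothetical types).
type port struct {
	Name        string `json:"name"`
	NetworkType string `json:"network_type"`
}

type device struct {
	NetworkPorts []port `json:"network_ports"`
}

// networkType returns the network_type of the first entry in network_ports,
// mirroring the jq path .network_ports[0].network_type.
func networkType(deviceJSON []byte) (string, error) {
	var d device
	if err := json.Unmarshal(deviceJSON, &d); err != nil {
		return "", err
	}
	if len(d.NetworkPorts) == 0 {
		return "", errors.New("device has no network_ports")
	}
	return d.NetworkPorts[0].NetworkType, nil
}

func main() {
	// Sample payload invented for this example.
	sample := []byte(`{"network_ports":[{"name":"bond0","network_type":"layer3"}]}`)
	nt, err := networkType(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(nt)
}
```

Setting the mode has no equivalent one-liner, which is the point made above: it takes several port operations in the right order.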

I put the network mode change to a separate resource because

  • the operation is flaky, and if it fails, the whole TF resource create fails, and the device resource is not written to the TF state although it exists alright in the API. When the network mode change is in a separate resource, the device resource succeeds, and only the network mode change fails. It can be retried faster (no need to remove the existing device because TF doesn't have it in its state). IIUC these network mode conversion failures caused long delays in the anthos setup development.
  • it's slow. If the network mode conversion is decoupled, other things can be done with the device resource in parallel TF goroutines once it's active (volume attachment, floating IP attachment, running provisioning scripts).

As you pointed out, one of the benefits is that network_type changes can be controlled in Terraform and chained, without need to perturb the network_state from provisioning scripts (tainting the device resource):

device create
 -> provisioner
   -> network_type change
     -> secondary provisioner

In this scenario, it would seem that a post provision network_type change could be healed by Terraform and the secondary provisioner would fire again, which may also be desirable.

Sorry, I don't understand what you mean by this.

is the composite 'device-id + network_type' primary key for the resource_packet_device_network_mode sufficient? I think so. When the network_type is changed externally the resource will be seen as deleted and will need to be created.

Only the device ID is the "Id()" of the packet_device_network_mode. The "mode" is an updatable attribute, storing the state of network_mode of the device. I don't understand what you mean by "primary_key" in context of TF provider resources.

we must include tests for the new resource at packet/resource_packet_device_network_mode.go (as we must for any new resource or data resource).

Of course the new resource is tested, I always add tests when possible. See
https://github.com/terraform-providers/terraform-provider-packet/pull/240/files#diff-f9b2a8f74421332dde70a5ed4a1ca377

the docs for the resource should include import directions

I don't know what you mean by this.

What benefit is there in renaming the device field from network_type to network_mode? If we are bumping to v3.0.0 we can just keep the field name the same (for use in read-only terraform configs there is no change needed) while making it a read-only computed field.

I renamed it because it's referred to as "Mode" in the API, e.g.:

In Layer 3 mode, individual network interfaces are placed in an LACP bond where all management IPs are assigned.

... and I thought it would be good to be consistent in terminology for UX. I know it will break some setups, but the deprecation strings are informative, so I don't think people will get confused.

@displague
Member

displague commented Jul 15, 2020

Thanks for those insights, @t0mk! I'm really seeing the benefits of this approach based on your thorough response.

I'll try to clarify some of my earlier points. Largely, you've convinced me of the merits of this approach.

I'll follow up with my code-based feedback.


This is not the case in general, see the *_attachment resources.

Attachment resources (in any provider) are one of those cases where the general Terraform best practice of faithfully representing the cloud provider's API starts to bend. Providers have the option to represent resources like these IPs and Ports as maps or slices within the resource (device). In this case, the TF Device resource could align with the GET /device/{id} representation, which includes IPs and ports. Changes to the TF Device would need to be interpreted to trigger attachment API calls, in much the same way as device action API calls are used to apply power state attributes.

For some API properties, it certainly does make more sense to offer new resources, especially when that resource creates or has a dependency relationship with provisioning phases. I can see how this would be the case for volume_attachments and port_vlan_attachments. I'm not yet convinced that IP attachments fit these criteria, but I have lots to learn about that and I'm sure you've encountered these use-cases and can speak to the benefits.


The Network Mode of the device is not a common attribute. It can be read simply from jq path ".network_ports[0].network_type", of a device resource JSON, but it can't be set via the device resource. Setting the network mode is a series of API operations, best seen in https://github.com/packethost/packngo/blob/master/ports.go#L321.

This is something that I didn't understand. Thanks for these details and references!

the operation is flaky, and if it fails, the whole TF resource create fails, and the device resource is not written in the TF state although it exists alright in the API

Terraform providers can store partial state to overcome this

But there may be some reasons not to introduce this:
hashicorp/terraform-website#1152


In this scenario, it would seem that a post provision network_type change could be healed by Terraform and the secondary provisioner would fire again, which may also be desirable.
Sorry, I don't understand what you mean by this.

I was not at all clear there. I was thinking about how the google-anthos module changes the network type within a python script provisioner on the device, changing the state of the device after Terraform has provisioned it. This would create a state drift for Terraform, which would cause subsequent terraform apply operations to undo what the python script did to the network type.

Your new proposed resources offer a way to both heal and avoid that condition because network_mode (network_type) state changes and provisioners can be chained. A python script (provisioner) that changes network_type among other things could omit the network type changes, deferring that to Terraform. The python script, as a secondary provisioner with a depends_on relationship to the packet_device_network_mode resource, would only be invoked after the network state was ready.

This is a powerful addition following the precedent of packet_port_vlan_attachment (thanks for pointing that out!).


Only the device ID is the "Id()" of the packet_device_network_mode. The "mode" is an updatable attribute, storing the state of network_mode of the device. I don't understand what you mean by "primary_key" in context of TF provider resources.

My mistake, I meant Id() when I said primary_key (old habit).

It seems users will have to be careful about the order that the network_attachment is defined. I see that you covered that in the docs: https://github.com/terraform-providers/terraform-provider-packet/pull/240/files#diff-8dd49a56d3344614491b5395d96b9d0eR36 🚀


the new resource is tested.

I missed the test additions because I expected to find new acceptance tests, with a new resource. The tests you added seem sufficient, but it may help to split these tests into multiple resource.TestSteps (like this) to show that the network_type (mode) received on device creation was what we expected and was then changed because of the network_device_mode resource.


the docs for the resource should include import directions
I don't know what you mean by this.

## Import

Packet Device Network Modes can be imported using the device id, e.g.

```
terraform import packet_device_network_mode.mydevicenetmode {device-uuid}
```

I renamed it because it's referred to as "Mode" in the API, e.g.:

I'd argue that this is not represented consistently and I would defer to the name of the API field. The Web UX and other documentation refer to Servers rather than Devices, for example. In Terraform, generally, the provider's upstream API is the source of truth (or at least names).

https://www.terraform.io/docs/extend/best-practices/naming.html#naming

Most names in a Terraform provider will be drawn from the upstream API/SDK that the provider is using. The upstream API names will likely need to be modified for casing or changing between plural and singular to make the provider more consistent with the common Terraform practices below.

Required: true,
ValidateFunc: validation.StringInSlice([]string{"layer3", "layer2-bonded", "layer2-individual", "hybrid"}, false),
},
"http_get_excludes": {
Member

Do you see value in this attribute if the API caching issue were completely resolved?

Contributor Author

It's obviously a workaround for a current issue. Attributes and fields can be added and deprecated with every provider release (for which there's no fixed schedule).

I wanted to offer a fast, simple, and temporary workaround.

So no, I don't. And with this PR stalling, I see the value of this attribute decreasing every day. But then again, if not for the API caching issue, there's no value in any code, or walls of text in this PR.

Member

The API problem has been resolved, network_type should be stable now. 🎉

I think the new resource you are presenting here is still valuable and valid for the reasons that you've given (and defended ❤️).

A low-level, invisible toggle like http_get_excludes probably doesn't belong in this PR anymore.

Member

Thanks again for helping me understand these changes!

@t0mk
Contributor Author

t0mk commented Jul 16, 2020

I'd argue that this is not represented consistently and I would defer to the name of the API field. The Web UX and other documentation refer to Servers rather than Devices, for example. In Terraform, generally, the provider's upstream API is the source of truth (or at least names).

You suggest network_type then? I can change it.

I missed the test additions because I expected to find new acceptance tests, with a new resource. The tests you added seem sufficient, but it may help to split these tests into multiple resource.TestSteps (like this) to show that the network_type (mode) received on device creation was what we expected and was then changed because of the network_device_mode resource.

OK, I'll add the test steps! It's a good idea to keep the number of acceptance tests low, they're run at the same time (more or less) and they eat quite a lot of resources. All possible resources obviously must be tested, but should not be tested twice.

I will also add imports to the doc of the new resource.

If you ack all this, I will do the changes.

@t0mk
Contributor Author

t0mk commented Jul 17, 2020

Hi @displague, thanks for your feedback

I will

  • remove http_get_excludes
  • verify network state in acceptance tests.
  • add import to docs

I would like to still clarify these:

  • should I keep the packet_device param called network_type?
  • should I keep the new resource named "packet_device_network_mode"? Or rather packet_device_network_type?

@displague
Member

displague commented Jul 18, 2020

That sounds good @t0mk.

Let's stick with network_type in both cases.

Do you think users will run into issues with the computed network_type in "device" lagging a refresh cycle behind the packet_device_network_type?

Perhaps, if you think this would create a bad experience, the network_type notice should offer advice here, to not use the computed network_type when also using the new resource type.

@t0mk
Contributor Author

t0mk commented Jul 20, 2020

@displague

Let's stick with network_type in both cases.

OK

Do you think users will run into issues with the computed network_type in "device" lagging a refresh cycle behind the packet_device_network_type?

Yes, I think that using the packet_device_network_type will make the network_type attr of packet_device inherently incorrect, as the p_d_n_t resource will modify the network type, and the TF state of parent packet_device will not be updated. I ran into this problem when trying to do the 2-phase acceptance tests.

Perhaps, if you think this would create a bad experience, the network_type notice should offer advice here, to not use the computed network_type when also using the new resource type.

I will mark the network_type attr of packet_device as "Deprecated" (https://www.terraform.io/docs/extend/best-practices/deprecations.html#provider-attribute-removal), so it will show that it's now recommended to use the new resource.

},
Type: schema.TypeString,
Computed: true,
Deprecated: "You should handle Network Type with the new packet_device_network_type resource.",
Contributor Author

@displague Please suggest a better deprecation message if you'd like.

@t0mk
Contributor Author

t0mk commented Jul 20, 2020

@displague I made the changes we agreed on. I squashed some commits because there were a lot of changes that were later reverted.

Member

@displague displague left a comment

This looks great, @t0mk! I really appreciate the detail you put into the added acceptance tests around network type modifications.
