Openshift installer is not correctly tracking terraform resources #1000

Closed
brianredbeard opened this issue Jan 5, 2019 · 4 comments

@brianredbeard

Version

$ openshift-install version
v0.8.0

Platform (aws|libvirt|openstack):

AWS

What happened?

Background information

After performing an install that failed and then rebooting, I lost the cluster assets (they were stored in /tmp while working through troubleshooting with @abhinavdahiya). As such, I had to manually reap all resources related to the cluster. In the process of performing this manual cleanup, I missed the following three resources:

  • rb-master-role
  • rb-bootstrap-role
  • rb-worker-role
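
For reference, a quick way to check which of these roles still exist is sketched below (added for illustration, not part of the original report; it assumes AWS CLI credentials for the same account and the role names shown above):

# Print each leftover role that still exists; get-role fails for roles that are gone.
for role in rb-master-role rb-bootstrap-role rb-worker-role; do
  aws iam get-role --role-name "$role" --query 'Role.RoleName' --output text
done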

Installation failure

Because the roles already existed in IAM, the installation attempt failed with the following error:

[bharrington@leviathan OPENSHIFT.RwRC]$ ./openshift-install-linux-amd64 create cluster --dir=.
INFO Consuming "Install Config" from target directory 
INFO Creating cluster...                          
ERROR                                              
ERROR Error: Error applying plan:                  
ERROR                                              
ERROR 3 errors occurred:                           
ERROR 	* module.masters.aws_iam_role.master_role: 1 error occurred: 
ERROR 	* aws_iam_role.master_role: Error creating IAM Role rb-master-role: EntityAlreadyExists: Role with name rb-master-role already exists. 
ERROR 	status code: 409, request id: 3da0f0e0-107b-11e9-815b-3377a72ac1b6 
ERROR                                              
ERROR                                              
ERROR 	* module.bootstrap.aws_iam_role.bootstrap: 1 error occurred: 
ERROR 	* aws_iam_role.bootstrap: Error creating IAM Role rb-bootstrap-role: EntityAlreadyExists: Role with name rb-bootstrap-role already exists. 
ERROR 	status code: 409, request id: 3d9e7fd8-107b-11e9-815b-3377a72ac1b6 
ERROR                                              
ERROR                                              
ERROR 	* module.iam.aws_iam_role.worker_role: 1 error occurred: 
ERROR 	* aws_iam_role.worker_role: Error creating IAM Role rb-worker-role: EntityAlreadyExists: Role with name rb-worker-role already exists. 
ERROR 	status code: 409, request id: 3d9f6a21-107b-11e9-a107-2b53e6606676 
ERROR                                              
ERROR                                              
ERROR                                              
ERROR                                              
ERROR                                              
ERROR Terraform does not automatically rollback in the face of errors. 
ERROR Instead, your Terraform state file has been partially updated with 
ERROR any resources that successfully completed. Please address the error 
ERROR above and apply again to incrementally change your infrastructure. 
ERROR                                              
ERROR                                              
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform

Noting the following message:

"Terraform does not automatically rollback in the face of errors. Instead, your Terraform state file has been partially updated with any resources that successfully completed.

I then used the destroy cluster mechanism of the installer as follows:

1   [bharrington@leviathan OPENSHIFT.RwRC]$ ./openshift-install-linux-amd64 destroy cluster --dir=.
2   INFO Deleted NAT Gateway                           id=nat-05927ef6d073b9121
3   INFO Deleted NAT Gateway                           id=nat-05198d50a7ad0ea0d
4   INFO Deleted subnet                                id=subnet-0c28912fe5073cfb1
5   INFO Deleted NAT Gateway                           id=nat-0e2f41d8d9b4e3453
6   INFO Deleted load balancer                         name=rb-ext
7   INFO Removed role rb-master-role from instance profile rb-master-profile 
8   INFO Deleted load balancer                         name=rb-int
9   INFO Deleted subnet                                id=subnet-0101aba6dbd7f1803
10  INFO deleted profile rb-master-profile            
11  INFO Deleted target group                          name=rb-api-ext
12  INFO Deleted security group                        id=sg-021ce2083c435246a
13  INFO Deleted target group                          name=rb-api-int
14  INFO Deleted subnet                                id=subnet-0f993073b5407c002
15  INFO deleted role rb-master-role                  
16  INFO Deleted target group                          name=rb-services
17  INFO Removed role rb-worker-role from instance profile rb-worker-profile 
18  INFO deleted profile rb-worker-profile            
19  INFO Deleted record rb-api.os4.rvu.io. from r53 zone /hostedzone/Z1LPW7VY401D8I 
20  INFO deleted role rb-worker-role                  
21  INFO Deleted record rb-api.os4.rvu.io. from r53 zone /hostedzone/Z2MCRFQ54FOX8O 
22  INFO Deleted route53 zone                          id=/hostedzone/Z2MCRFQ54FOX8O
23  INFO Removed role rb-bootstrap-role from instance profile rb-bootstrap-profile 
24  INFO deleted profile rb-bootstrap-profile         
25  INFO deleted role rb-bootstrap-role               
26  INFO Emptied bucket                                name=terraform-20190104234830758200000001
27  INFO Deleted bucket                                name=terraform-20190104234830758200000001
28  INFO Deleted security group                        id=sg-0facb240e6af5009f
29  INFO Deleted VPC endpoint                          id=vpce-0d692d145faedf7ce
30  INFO Deleted route table                           id=rtb-026f3a544a749f08c
31  INFO Deleted route table                           id=rtb-073895ff9d3db914a
32  INFO Deleted route table                           id=rtb-02eaed5eba17b53c2
33  INFO Deleted route table                           id=rtb-0844bf4d66d7085e4
34  INFO Disassociated route table association         id=rtbassoc-07f68e182c77a22ff
35  INFO Disassociated route table association         id=rtbassoc-0f99d1bf0d204160a
36  INFO Disassociated route table association         id=rtbassoc-0a554c1525b4c8d49
37  INFO Deleted security group                        id=sg-038242c8afcb4a9c3
38  INFO Deleted security group                        id=sg-06a6940fa83e766e2
39  INFO Deleted security group                        id=sg-0c5ce8b88e7940f5c
40  INFO Detached Internet GW igw-0611cdbc2bfa65e38 from VPC vpc-0b68ab55648319ce2 
41  INFO Deleted internet gateway                      id=igw-0611cdbc2bfa65e38
42  INFO Deleted subnet                                id=subnet-02ceff55dec67667b
43  INFO Deleted subnet                                id=subnet-03f851d63ac7b7b4f
44  INFO Deleted subnet                                id=subnet-0d1fb0531e0fabc31
45  INFO Deleted Elastic IP                            ip=35.161.25.141
46  INFO Deleted VPC                                   id=vpc-0b68ab55648319ce2
47  INFO Deleted Elastic IP                            ip=52.33.9.142
48  INFO Deleted Elastic IP                            ip=52.40.180.51

As noted on lines 7, 17, & 18 the installer deleted those roles despite the fact that it failed due to their existence.

What you expected to happen?

I would expect the installer to delete only "resources that successfully completed", per its error message. Since it was not able to successfully create those resources, it should not have removed them when the cleanup was performed.

How to reproduce it (as minimally and precisely as possible)?

Create conflicting roles, perform an install, then destroy the cluster.
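
A rough sketch of those steps with the AWS CLI, based on the role names and commands already shown in this report ("rb" is the cluster name here; the trust policy is only a minimal placeholder so the role exists):

# 1. Pre-create a role whose name will collide with the one the installer generates.
aws iam create-role \
  --role-name rb-master-role \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

# 2. Install with the same cluster name; role creation should fail with EntityAlreadyExists.
./openshift-install-linux-amd64 create cluster --dir=.

# 3. Destroy the cluster, then check whether the pre-existing role survived.
./openshift-install-linux-amd64 destroy cluster --dir=.
aws iam get-role --role-name rb-master-role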

@brianredbeard
Author

In addition to the tracking problem mentioned above, attempting to re-run the install afterwards fails as follows:

[bharrington@leviathan OPENSHIFT.RwRC]$ ./openshift-install-linux-amd64 create cluster --dir=.
FATAL failed to fetch Cluster: failed to load asset "Cluster": "terraform.tfstate" already exists.  There may already be a running cluster 

@wking
Member

wking commented Jan 5, 2019

After performing an install that failed and then rebooting, I lost the cluster assets (they were stored in /tmp while working through troubleshooting with @abhinavdahiya). As such, I had to manually reap all resources related to the cluster.

Some discussion of improving this experience in #746.

FATAL failed to fetch Cluster: failed to load asset "Cluster": "terraform.tfstate" already exists.  There may already be a running cluster 

Discussion of this in #522.
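
For anyone who hits the same error, a minimal sketch of one possible workaround, assuming the failed cluster's resources have already been destroyed; this is only an illustration, not official guidance (the real discussion is in #522):

# The installer refuses to reuse an asset directory that still holds state from the
# failed run. Removing the stale Terraform state lets "create cluster" run again;
# other generated assets (e.g. metadata.json) may also need attention.
rm terraform.tfstate
./openshift-install-linux-amd64 create cluster --dir=.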

As noted on lines 7, 17, & 18 the installer deleted those roles despite the fact that it failed due to their existence.

This we can probably fix. I'll see about working something up.

wking added a commit to wking/openshift-installer that referenced this issue Jan 10, 2019
As reported by Brian Harrington, we're currently deleting these roles
based on cluster name, when we'd ideally be deleting them based on the
more-specific cluster ID [1].  Tagging the roles (new in 2018-11-16
[2]) is a step in that direction, although as of this commit we still
delete roles by name.  In coming work, I'll pivot to deleting these
based on their tags.

The tag property is documented in [3].  Unfortunately, instance
profiles are not tag-able [4].

[1]: openshift#1000
[2]: https://aws.amazon.com/blogs/security/add-tags-to-manage-your-aws-iam-users-and-roles/
[3]: https://www.terraform.io/docs/providers/aws/r/iam_role.html#tags
[4]: https://www.terraform.io/docs/providers/aws/r/iam_instance_profile.html
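
As a rough illustration of what tagging the roles looks like on the AWS side (a sketch, not taken from the commit; the tag key and value follow the kubernetes.io/cluster/<cluster-name> = owned convention discussed later in this thread, with the cluster name "rb" from this report):

# Inspect the tags on a role created by the installer; the cluster tag should show up
# here once the roles are tagged.
aws iam list-role-tags --role-name rb-master-role

# The manual equivalent of applying such a tag, for illustration only.
aws iam tag-role --role-name rb-master-role \
  --tags Key=kubernetes.io/cluster/rb,Value=owned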
@wking
Member

wking commented Feb 1, 2019

With #1039, role matching is now by tag (taking advantage of the tags from #1036), instead of by name. That helps with this, but we still effectively delete roles by cluster name, because our kubernetes.io/cluster tag only uses the cluster name. There are medium-term plans to address that #762, and recent work like #1169 is moving us in that direction. So I'm going to close this issue as a dup of #762, now that there are no more problems unique to this issue. But please comment if you feel I should reopen for some reason.

/close
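
To illustrate the tag-based matching described above (a sketch only, assuming the kubernetes.io/cluster/<cluster-name> = owned convention and the cluster name "rb" from this report; the installer's own destroy logic is the authoritative implementation):

# List every IAM role carrying this cluster's tag. Another cluster that happens to be
# named "rb" would carry the same tag, which is the remaining ambiguity tracked in #762.
for role in $(aws iam list-roles --query 'Roles[].RoleName' --output text); do
  if aws iam list-role-tags --role-name "$role" \
       --query 'Tags[?Key==`"kubernetes.io/cluster/rb"`].Value' --output text \
       | grep -q owned; then
    echo "tagged role: $role"
  fi
done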

@openshift-ci-robot
Contributor

@wking: Closing this issue.

In response to this:

With #1039, role matching is now by tag (taking advantage of the tags from #1036), instead of by name. That helps with this, but we still effectively delete roles by cluster name, because our kubernetes.io/cluster tag only uses the cluster name. There are medium-term plans to address that #762, and recent work like #1169 is moving us in that direction. So I'm going to close this issue as a dup of #762, now that there are no more problems unique to this issue. But please comment if you feel I should reopen for some reason.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
