This repository has been archived by the owner on Feb 5, 2020. It is now read-only.

CI failure umbrella issue #1054

Closed
1 of 6 tasks
s-urbaniak opened this issue Jun 12, 2017 · 28 comments

Comments

@s-urbaniak
Contributor

s-urbaniak commented Jun 12, 2017

The following failure modes have been seen lately:

#1052 (comment)
#1053 (comment)
#1053 (comment)


EDIT: added by Ed

Things we know we need to fix:

  • Jenkins dies
  • AWS limits
  • Terraform destroy issues with timeouts or dependencies
  • Some smoke tests are flaky
  • Smoke tests time out waiting for Tectonic components to come up
  • Creating EIPs for NAT Gateways (associating the EIPs with the NAT Gateways)
@Quentin-M
Contributor

Quentin-M commented Jun 13, 2017

Just got:

* module.etcd.aws_route53_record.etc_a_nodes[2]: index 2 out of range for list aws_instance.etcd_node.*.private_ip (max 2) in:

${aws_instance.etcd_node.*.private_ip[count.index]}

[sur]: addressed in #1246
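
For reference, the usual workaround for this splat-index failure (and plausibly the shape of the fix in #1246, though the exact diff isn't quoted here) is to wrap the list access in element(), which indexes modulo the list length and therefore tolerates a transient mismatch between count and the referenced list during a partial apply. A minimal sketch with illustrative names and variables:

resource "aws_route53_record" "etcd_a_nodes" {
  # element() wraps around instead of failing when count.index exceeds
  # the currently known length of the splat list.
  count   = "${var.etcd_count}"          # hypothetical variable
  zone_id = "${var.tectonic_zone_id}"    # hypothetical variable
  name    = "etcd-${count.index}"
  type    = "A"
  ttl     = "60"
  records = ["${element(aws_instance.etcd_node.*.private_ip, count.index)}"]
}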

@Quentin-M
Contributor

Quentin-M commented Jun 13, 2017

From https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-968/4/pipeline:

Error applying plan:

3 error(s) occurred:

* module.vpc.aws_security_group.console (destroy): 1 error(s) occurred:

* aws_security_group.console: DependencyViolation: resource sg-f25bb183 has a dependent object
	status code: 400, request id: 52d71468-4208-4725-818a-f9e7beb361ba
* module.vpc.aws_security_group.api (destroy): 1 error(s) occurred:

* aws_security_group.api: DependencyViolation: resource sg-155cb664 has a dependent object
	status code: 400, request id: 5799d0da-f069-44b3-bf71-9b535548eb90
* module.vpc.aws_internet_gateway.igw (destroy): 1 error(s) occurred:

* aws_internet_gateway.igw: Error waiting for internet gateway (igw-72198414) to detach: couldn't find resource (31 retries)

The IGW issue appeared again in https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1074/2/pipeline/, so #1017 is still happening.

[sur] addressed in #1242, #1017

Quentin-M added a commit that referenced this issue Jun 13, 2017
Jenkinsfile: retry destroy on AWS Smoke until #1017/#1054 are fixed
@Quentin-M
Contributor

The IGW issue has been mitigated for now by retrying the deletion (#1077). SPC engineers are working upstream to add the necessary timeout lifecycle.
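
For reference, the upstream work being pursued would let the resource carry its own deletion timeout, roughly like the sketch below (hypothetical: aws_internet_gateway did not accept such a block at the time, which is why the Jenkins-side retry in #1077 is the interim mitigation):

resource "aws_internet_gateway" "igw" {
  vpc_id = "${aws_vpc.cluster_vpc.id}"   # illustrative reference

  # Hypothetical once provider support lands: allow more time for the
  # gateway to detach from the VPC before the destroy is marked failed.
  timeouts {
    delete = "20m"
  }
}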

@ggreer
Contributor

ggreer commented Jun 14, 2017

https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1075/19/pipeline/41

Error applying plan:

1 error(s) occurred:

* module.etcd.aws_instance.etcd_node[1]: 1 error(s) occurred:

* aws_instance.etcd_node.1: Error waiting for instance (i-04ee9cf6a84e21d27) to become ready: Failed to reach target state. Reason: Server.InternalError: Internal error on launch

Seems to be the same as #894.

[sur] addressed in #1246

@ggreer
Contributor

ggreer commented Jun 23, 2017

https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1177/5/pipeline/39

Error applying plan:
1 error(s) occurred:
* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in:
${aws_eip.nat_eip.*.id[count.index]}

[sur] addressed in #1246

@Quentin-M
Contributor

Quentin-M commented Jun 23, 2017

^

Quentin Machu [10:31]
https://github.com/hashicorp/terraform/issues/13828
It is literally the reason we updated to 096-master. It was happening 95% of the time before.

[sur] addressed in #1246

@ggreer
Contributor

ggreer commented Jun 26, 2017

Still seems to be a common reason for tests failing: https://jenkins-tectonic-installer.prod.coreos.systems/blue/rest/organizations/jenkins/pipelines/tectonic-installer/branches/PR-1193/runs/3/nodes/39/steps/90/log/?start=0

Error applying plan:
1 error(s) occurred:
* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in:
${aws_eip.nat_eip.*.id[count.index]}

edit: kans
A little more context here: the EIPs were actually created successfully, in equal numbers, before the gateways. This looks like another graph problem.

[sur] addressed in #1246

@kans
Contributor

kans commented Jun 26, 2017

A timeout destroying the zone

aws_subnet.priv_subnet.3: Still destroying... (ID: subnet-3a67da16, 9m50s elapsed)
aws_subnet.priv_subnet.2: Still destroying... (ID: subnet-75393310, 9m50s elapsed)
Error applying plan:

5 error(s) occurred:

* aws_route53_zone.priv_zone (destroy): 1 error(s) occurred:
* aws_route53_zone.priv_zone: HostedZoneNotEmpty: The specified hosted zone contains non-required resource record sets  and so cannot be deleted.
	status code: 400, request id: 30e503f2-5881-11e7-b2bc-a15177c134b1
* aws_subnet.priv_subnet[0] (destroy): 1 error(s) occurred:
* aws_subnet.priv_subnet.0: Error deleting subnet: timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 10m0s)
* aws_subnet.priv_subnet[2] (destroy): 1 error(s) occurred:
* aws_subnet.priv_subnet.2: Error deleting subnet: timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 10m0s)
* aws_subnet.priv_subnet[1] (destroy): 1 error(s) occurred:
* aws_subnet.priv_subnet.1: Error deleting subnet: timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 10m0s)
* aws_subnet.priv_subnet[3] (destroy): 1 error(s) occurred:
* aws_subnet.priv_subnet.3: Error deleting subnet: timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 10m0s)

[sur] addressed in #1245
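
One knob that could help with the HostedZoneNotEmpty part of this failure (a sketch, not necessarily what #1245 did) is force_destroy on the private zone, which removes any remaining record sets before deleting the zone; the long subnet deletions are a separate, provider-level timeout concern. Illustrative names below:

resource "aws_route53_zone" "priv_zone" {
  name = "${var.base_domain}"   # illustrative

  # Delete leftover record sets on destroy instead of failing with
  # HostedZoneNotEmpty when the zone still holds non-required records.
  force_destroy = true

  # private-zone VPC association omitted for brevity
}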

@sym3tri sym3tri mentioned this issue Jun 28, 2017
@kans
Contributor

kans commented Jun 28, 2017

https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1216/1/pipeline/43/#step-131-log-341

aws_route_table_association.priv_subnet.0: Creation complete (ID: rtbassoc-1f142164)

aws_route_table_association.priv_subnet.1: Creation complete (ID: rtbassoc-1e142165)

aws_route_table_association.priv_subnet.3: Creation complete (ID: rtbassoc-062f1a7d)

aws_route_table_association.priv_subnet.2: Creation complete (ID: rtbassoc-1d142166)

Error applying plan:


1 error(s) occurred:


* aws_customer_gateway.customer_gateway: 1 error(s) occurred:


* aws_customer_gateway.customer_gateway: An existing customer gateway for IpAddress: 34.199.164.236, VpnType: ipsec.1, BGP ASN: 65000 has been found

[sur] filed #1455

@kans
Contributor

kans commented Jun 28, 2017

A timeout for the VPN gateway

aws_vpn_gateway.vpg: Still creating... (10m0s elapsed)

Error applying plan:
1 error(s) occurred:
* aws_vpn_gateway.vpg: 1 error(s) occurred:
* aws_vpn_gateway.vpg: Error waiting for VPN gateway (vgw-c5ae45ac) to attach: timeout while waiting for state to become 'attached' (last state: 'detached', timeout: 10m0s)

[sur] filed #1456

@s-urbaniak
Contributor Author

s-urbaniak commented Jun 30, 2017

https://jenkins-tectonic-installer.prod.coreos.systems/blue/rest/organizations/jenkins/pipelines/tectonic-installer/branches/PR-1247/runs/14/nodes/43/steps/146/log/?start=0

1 error(s) occurred:

* aws_instance.ovpn: 1 error(s) occurred:

* aws_instance.ovpn: Error launching source instance: InvalidAMIID.NotFound: The image id 'ami-d3e743b3' does not exist
	status code: 400, request id: 8cbdaa3c-1540-4cef-9364-4f476128b332

[sur] addressed in #1265
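
A common way to avoid a hard-coded AMI ID that may not exist (or no longer exist) in the target region is to resolve it at plan time with an aws_ami data source; a sketch with assumed filter values, not necessarily the approach taken in #1265:

# Look up the newest image matching a name pattern instead of pinning an
# AMI ID that may be missing or deregistered in the region being used.
data "aws_ami" "ovpn" {
  most_recent = true
  owners      = ["aws-marketplace"]   # assumption; adjust to the real image owner

  filter {
    name   = "name"
    values = ["*openvpn*"]            # assumption; adjust to the real image name
  }
}

resource "aws_instance" "ovpn" {
  ami           = "${data.aws_ami.ovpn.id}"
  instance_type = "t2.small"          # illustrative
  # remaining arguments unchanged
}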

@s-urbaniak
Contributor Author

s-urbaniak commented Jun 30, 2017

https://jenkins-tectonic-installer.prod.coreos.systems/blue/rest/organizations/jenkins/pipelines/tectonic-installer/branches/PR-1247/runs/21/nodes/43/steps/122/log/?start=0

+ bin/smoke -test.v -test.parallel=1 --cluster
=== RUN   Test
=== RUN   Test/Common
=== RUN   Test/Common/APIAvailable
=== RUN   Test/Cluster
=== RUN   Test/Cluster/AllNodesRunning
=== RUN   Test/Cluster/GetIdentityLogs
=== RUN   Test/Cluster/AllPodsRunning
Sending interrupt signal to process
Terminated
script returned exit code 143

[sur] addressed in #1283

@Quentin-M
Contributor

Exit code 143 means Jenkins killed the job because it exceeded the defined timeout (143 = 128 + 15, i.e. the process was terminated by SIGTERM).

@kans
Contributor

kans commented Jun 30, 2017

Exit code 143 also means we don't see any output from go test, so we can't tell the difference between a timeout and an error.

@s-urbaniak
Contributor Author

Addressed the timeout in #1054 (comment) in #1283 as a stop-gap solution while @mxinden finalizes research on the testing frameworks.

@s-urbaniak
Contributor Author

https://jenkins-tectonic-installer.prod.coreos.systems/blue/rest/organizations/jenkins/pipelines/tectonic-installer/branches/PR-1247/runs/30/nodes/39/steps/90/log/?start=0

    --- FAIL: Test/Cluster (899.35s)
        --- FAIL: Test/Cluster/AllNodesRunning (600.00s)
        	cluster_test.go:71: node ip-10-0-30-219.us-west-2.compute.internal not ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 1
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:71: node ip-10-0-30-219.us-west-2.compute.internal not ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 1
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:71: node ip-10-0-30-219.us-west-2.compute.internal not ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 1
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:71: node ip-10-0-30-219.us-west-2.compute.internal not ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 1
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:67: node ip-10-0-30-219.us-west-2.compute.internal ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 1
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:67: node ip-10-0-30-219.us-west-2.compute.internal ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 1
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:67: node ip-10-0-30-219.us-west-2.compute.internal ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 1
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:67: node ip-10-0-30-219.us-west-2.compute.internal ready
        	cluster_test.go:71: node ip-10-0-79-102.us-west-2.compute.internal not ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 2
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:67: node ip-10-0-30-219.us-west-2.compute.internal ready
        	cluster_test.go:71: node ip-10-0-37-13.us-west-2.compute.internal not ready
        	cluster_test.go:71: node ip-10-0-57-57.us-west-2.compute.internal not ready
        	cluster_test.go:71: node ip-10-0-79-102.us-west-2.compute.internal not ready
        	cluster_test.go:71: node ip-10-0-84-177.us-west-2.compute.internal not ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 5
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:67: node ip-10-0-30-219.us-west-2.compute.internal ready
        	cluster_test.go:71: node ip-10-0-37-13.us-west-2.compute.internal not ready
        	cluster_test.go:71: node ip-10-0-57-57.us-west-2.compute.internal not ready
        	cluster_test.go:71: node ip-10-0-65-15.us-west-2.compute.internal not ready
        	cluster_test.go:71: node ip-10-0-79-102.us-west-2.compute.internal not ready
        	cluster_test.go:71: node ip-10-0-84-177.us-west-2.compute.internal not ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 6
        	smoke_test.go:113: retrying in 10s
...
        	smoke_test.go:112: failed with error: expected 7 nodes, got 6
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:92: Failed to find 7 ready nodes in 10m0s.

@robszumski robszumski modified the milestones: Sprint 5, Sprint 4 Jul 10, 2017
@rithujohn191
Contributor

Need to split this into smaller issues.

@kans
Contributor

kans commented Jul 10, 2017

We see these often now:

Waiting for matchbox...

+ sleep 5


++ curl --silent --fail -k http://matchbox.example.com:8080

+ echo 'Waiting for matchbox...'

Waiting for matchbox...

+ sleep 5

++ curl --silent --fail -k http://matchbox.example.com:8080

+ echo 'Waiting for matchbox...'

https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1341/2/pipeline

@dghubble
Member

The devnet helper script has some utilities other than create.

sudo -S -E ./scripts/devnet status
sudo rkt list
systemctl status dev-matchbox
journalctl -u dev-matchbox

I see you guys are launching matchbox with rkt containers. FYI, there is a docker mode now as well, if you are interested in it for the future.

@kans
Contributor

kans commented Jul 17, 2017

We hit a failed matchbox unit. The unit is killed and fails upon restarting.

edit: from #1408

@dghubble
Member

dghubble commented Jul 17, 2017

Hm, we'd want to figure out what killed the unit. The devnet script is written to bring up the two pods with devnet create and bring them down with devnet destroy. I'd expect rkt rm to handle removing the old pod on restarts, but this isn't a systemd unit; it's a systemd-run transient unit, and that rkt rm isn't called on restart.

You may also try the docker mode and avoid setting up a complete rkt environment, if you update /etc/hosts entries to point to 172.17.0.X (docker0) instead of 172.18.0.X (metal0).

export CONTAINER_RUNTIME=docker
sudo -E ./scripts/devnet create
sudo -E ./scripts/devnet destroy

@Quentin-M
Contributor

Quentin-M commented Jul 17, 2017

Agreed that we'd like to switch to using the Docker runtime to avoid the CNI cleanup workarounds (no offense). We could then drop the rkt setup in the Jenkins script.

@dghubble
Member

While the docker setup is probably the way this project should go, note that it is an easier out-of-the-box experience that masks the same difficulty: the need for the containers to have known IP addresses. Docker handles this just by assigning IPs in the order in which containers are created (rather than explicitly, like rkt). I believe you can request specific IPs with a custom docker bridge, but then you have the same setup difficulty you had before. Just be mindful: IPs must be known because we're using containers to set up a virtual bridge that simulates a bare-metal environment, and docker will happen to give your containers IPs in creation order; if you don't clean up properly, they won't be what you expect.

@cpanato
Contributor

cpanato commented Jul 18, 2017

I don't know if this is the right place to post the issue I saw yesterday (17.07.2017). If not, please let me know and I will delete the post.

This issue happened on Azure:

module.vnet.azurerm_network_security_rule.worker_ingress_heapster: Still creating... (1m20s elapsed)

Error applying plan:

1 error(s) occurred:

* module.vnet.azurerm_network_security_rule.master_ingress_kubelet_secure_from_worker: 1 error(s) occurred:

* azurerm_network_security_rule.master_ingress_kubelet_secure_from_worker: network.SecurityRulesClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="InternalServerError" Message="An error occurred."

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
make: *** [apply] Error 1

[sur] filed #1457

@s-urbaniak
Contributor Author

s-urbaniak commented Jul 19, 2017

CI failure for AWS:

https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1293/7/pipeline

1 error(s) occurred:

* aws_route53_zone.tectonic-int: 1 error(s) occurred:

* aws_route53_zone.tectonic-int: timeout while waiting for state to become 'INSYNC' (last state: 'PENDING', timeout: 10m0s)

[sur] filed #1458

@cpanato
Contributor

cpanato commented Jul 20, 2017

CI failure on Azure:

https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1389/8/pipeline/56

Error applying plan:

1 error(s) occurred:

* module.vnet.azurerm_lb_probe.console-lb (destroy): 1 error(s) occurred:

* azurerm_lb_probe.console-lb: Error Creating/Updating LoadBalancer network.LoadBalancersClient#CreateOrUpdate: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="RetryableError" Message="A retryable error occurred." Details=[{"code":"ReferencedResourceNotProvisioned","message":"Cannot proceed with operation because resource /subscriptions/****/resourceGroups/tectonic-cluster-example-pr-1389-801234567890/providers/Microsoft.Network/networkInterfaces/example-pr-1389-801234567890-master-0/ipConfigurations/example-pr-1389-801234567890-MasterIPConfiguration used by resource /subscriptions/****/resourceGroups/tectonic-cluster-example-pr-1389-801234567890/providers/Microsoft.Network/loadBalancers/example-pr-1389-801234567890-api-lb is not in Succeeded state. Resource is in Deleting state and the last operation that updated/is updating the resource is DeleteNicOperation."}]

[sur]: filed #1459

@cpanato
Contributor

cpanato commented Jul 21, 2017

CI failure on Azure:

https://jenkins-tectonic-installer.prod.coreos.systems/blue/organizations/jenkins/tectonic-installer/detail/PR-1436/29/pipeline/57

Error applying plan:

2 error(s) occurred:

* module.vnet.azurerm_lb_probe.ssh-lb (destroy): 1 error(s) occurred:

* azurerm_lb_probe.ssh-lb: Error Creating/Updating LoadBalancer network.LoadBalancersClient#CreateOrUpdate: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="RetryableError" Message="A retryable error occurred." Details=[{"code":"ReferencedResourceNotProvisioned","message":"Cannot proceed with operation because resource /subscriptions/****/resourceGroups/tectonic-cluster-exper-pr-1436-29012345678901/providers/Microsoft.Network/networkInterfaces/exper-pr-1436-29012345678901-master-0/ipConfigurations/exper-pr-1436-29012345678901-MasterIPConfiguration used by resource /subscriptions/****/resourceGroups/tectonic-cluster-exper-pr-1436-29012345678901/providers/Microsoft.Network/loadBalancers/exper-pr-1436-29012345678901-api-lb is not in Succeeded state. Resource is in Deleting state and the last operation that updated/is updating the resource is DeleteNicOperation."}]
* module.vnet.azurerm_lb_probe.api-lb (destroy): 1 error(s) occurred:

* azurerm_lb_probe.api-lb: Error Creating/Updating LoadBalancer network.LoadBalancersClient#CreateOrUpdate: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="RetryableError" Message="A retryable error occurred." Details=[{"code":"ReferencedResourceNotProvisioned","message":"Cannot proceed with operation because resource /subscriptions/****/resourceGroups/tectonic-cluster-exper-pr-1436-29012345678901/providers/Microsoft.Network/networkInterfaces/exper-pr-1436-29012345678901-master-0/ipConfigurations/exper-pr-1436-29012345678901-MasterIPConfiguration used by resource /subscriptions/****/resourceGroups/tectonic-cluster-exper-pr-1436-29012345678901/providers/Microsoft.Network/loadBalancers/exper-pr-1436-29012345678901-api-lb is not in Succeeded state. Resource is in Deleting state and the last operation that updated/is updating the resource is DeleteNicOperation."}]

[sur]: filed #1459

@s-urbaniak
Contributor Author

I am closing this umbrella issue in favor of dedicated issues marked as kind/flake: https://github.com/coreos/tectonic-installer/labels/kind%2Fflake.

Please submit/comment on the existing issues or submit a new one using the kind/flake label.
