fix: routes deleted+added every route reconcilation interval with 1.16.0 #470

Closed
apricote opened this issue Jul 17, 2023 · 1 comment · Fixed by #472 or #464
apricote commented Jul 17, 2023

Multiple customers are currently experiencing a bug on v1.16.0 where HCCM deletes and then immediately re-adds routes once per route reconciliation interval. This breaks routing for them and causes avoidable strain on our backend.

The only PR that might have introduced this issue is #432.

We do not yet know how to reproduce this issue. If you are experiencing this issue, we would be glad to get some answers to these questions from you:

  • How did you deploy HCCM?
  • Do you have multiple instances/pods/replicas of HCCM running?
  • Do you have a single Kubernetes cluster or multiple?
  • Can you share your Deployment yaml for HCCM?
  • Can you share the log output for HCCM?

You can answer here, send your answers to my work email julian.toelle <at> hetzner-cloud.de, or open a new support ticket and reference this issue.

Example request log:

2023-07-17T14:12:05.157137109Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/add_route HTTP/2.0
2023-07-17T14:12:04.534248716Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/add_route HTTP/2.0
2023-07-17T14:12:03.924169308Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:12:03.828632293Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:12:03.50750702Z   hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:11:04.710422641Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/delete_route HTTP/2.0
2023-07-17T14:11:04.178799871Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/delete_route HTTP/2.0
2023-07-17T14:11:03.8442466Z    hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:11:03.77779143Z   hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:11:03.654467776Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:10:08.320918569Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/add_route HTTP/2.0
2023-07-17T14:10:07.570347718Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:10:07.322956545Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/add_route HTTP/2.0
2023-07-17T14:10:06.75039004Z   hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:10:06.683931317Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/delete_route HTTP/2.0
2023-07-17T14:10:06.065576808Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/delete_route HTTP/2.0
2023-07-17T14:10:05.659330707Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:09:04.809704743Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/add_route HTTP/2.0
2023-07-17T14:09:04.337724012Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/add_route HTTP/2.0
2023-07-17T14:09:03.905251024Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:09:03.836665674Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:09:03.672084401Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:08:04.320716314Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/delete_route HTTP/2.0
2023-07-17T14:08:03.967796614Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  POST /v1/networks/1234567/actions/delete_route HTTP/2.0
2023-07-17T14:08:03.805897484Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:08:03.675158189Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
2023-07-17T14:08:03.567790583Z  hcloud-cloud-controller/v1.16.0 hcloud-go/1.46.1  GET /v1/networks/1234567 HTTP/2.0
@apricote apricote added the bug Something isn't working label Jul 17, 2023
@apricote apricote self-assigned this Jul 17, 2023
apricote commented

Based on customer logs we believe we found the issue.

14:41:51.704311       1 route_controller.go:216] action for Node "node-1" with CIDR "10.244.7.0/24": "add"
14:41:51.704324       1 route_controller.go:216] action for Node "node-2" with CIDR "10.244.5.0/24": "add"
14:41:51.704375       1 route_controller.go:232] route should be deleted, spec: exist: false, action: "", Node "node-in-other-cluster-in-same-project-2", CIDR "10.244.7.0/24"
14:41:51.704398       1 route_controller.go:232] route should be deleted, spec: exist: false, action: "", Node "node-in-other-cluster-in-same-project-2", CIDR "10.244.5.0/24"

hccm thinks that the existing routes in the network actually belong to Nodes in another cluster (with a different network) in the same project.
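
The effect of such a mismatch can be sketched as follows. This is a simplified illustration and not the code in route_controller.go: the `Route` type and the `plan` function are hypothetical names, and the sketch assumes that a route only counts as matching when both the target node and the CIDR agree, which is what the log lines above suggest.

```go
package main

// Route is a hypothetical, simplified route record: the node it points to
// and the pod CIDR it covers.
type Route struct {
	TargetNode string
	CIDR       string
}

// plan compares the desired routes (derived from the cluster's Nodes) with the
// actual routes (derived from the cloud network) and returns what to add and
// what to delete. A route matches only if node AND CIDR are identical.
func plan(desired, actual []Route) (add, del []Route) {
	want := make(map[Route]bool, len(desired))
	for _, r := range desired {
		want[r] = true
	}
	have := make(map[Route]bool, len(actual))
	for _, r := range actual {
		have[r] = true
	}
	for _, r := range desired {
		if !have[r] {
			add = append(add, r) // no existing route for this node+CIDR
		}
	}
	for _, r := range actual {
		if !want[r] {
			del = append(del, r) // existing route maps to an unexpected node
		}
	}
	return add, del
}
```

If the existing route for a CIDR is attributed to a node from another cluster, the same CIDR shows up in both the add list and the delete list, and it does so again on every reconciliation because the attribution never changes.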

Looking at the code, it seems that when we map the Cloud Network routes to cloud-provider routes, we look up the node for each route based on its private IP target. This fails because our lookup map contains not only the servers from the network that hccm targets, but all servers in the project. If multiple servers have the same private IP in different networks, an arbitrary one ends up in the lookup map.
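
A minimal sketch of how such a collision can happen, assuming simplified types. `Server`, `PrivateNet` and `buildNaiveLookup` are hypothetical names for illustration and are not taken from the hccm code base.

```go
package main

// PrivateNet is a hypothetical record of one private network attachment.
type PrivateNet struct {
	NetworkID int64
	IP        string
}

// Server is a hypothetical, simplified server record.
type Server struct {
	Name        string
	PrivateNets []PrivateNet
}

// buildNaiveLookup indexes every private IP of every server in the project,
// ignoring which network the IP belongs to. When two servers share an IP,
// the one processed later silently overwrites the earlier entry.
func buildNaiveLookup(servers []Server) map[string]string {
	byIP := make(map[string]string)
	for _, s := range servers {
		for _, pn := range s.PrivateNets {
			byIP[pn.IP] = s.Name
		}
	}
	return byIP
}
```

In this sketch the winner is simply the last server in the slice. In a real deployment the order depends on how the server list is returned, so which server ends up in the map is not something the cluster operator controls.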

We have multiple fixes planned for this:

  • Only add servers that make sense to the cache (based on networks if enabled)
  • Only add private IPs to the lookup map if they are in the right private network
  • Log warnings if we overwrite a key in the lookup map
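
The second and third items could look roughly like the sketch below. It is an illustration under assumptions, not the actual fix: `Server`, `PrivateNet` and `buildScopedLookup` are hypothetical names, and warnings are collected in a slice here instead of being written to a logger.

```go
package main

// PrivateNet is a hypothetical record of one private network attachment.
type PrivateNet struct {
	NetworkID int64
	IP        string
}

// Server is a hypothetical, simplified server record.
type Server struct {
	Name        string
	PrivateNets []PrivateNet
}

// buildScopedLookup indexes only the private IPs that belong to the given
// network. It also reports a warning whenever an existing key would be
// overwritten by a different server.
func buildScopedLookup(servers []Server, networkID int64) (map[string]string, []string) {
	byIP := make(map[string]string)
	var warnings []string
	for _, s := range servers {
		for _, pn := range s.PrivateNets {
			if pn.NetworkID != networkID {
				continue // IP belongs to another network, ignore it
			}
			if prev, ok := byIP[pn.IP]; ok && prev != s.Name {
				warnings = append(warnings,
					"private IP "+pn.IP+" of "+prev+" overwritten by "+s.Name)
			}
			byIP[pn.IP] = s.Name
		}
	}
	return byIP, warnings
}
```

With the network filter in place, a server in another network that happens to use the same private IP no longer reaches the map, so no overwrite and no warning occurs for it.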

apricote added a commit that referenced this issue Jul 18, 2023
In case another server in the project has the same Private IP (in
another network) as one of our cluster nodes, hccm would happily
delete&recreate the route on every route reconciliation.

This fixes the bug by only adding the Private IPs in the correct network
to the AllServersCache ByPrivateIP lookup map.

Fixes #470

Co-authored-by: Jonas Lammler <ljonas@riseup.net>
@apricote apricote changed the title fix: routes deleted+added every minute with 1.16.0 fix: routes deleted+added every route reconcilation interval with 1.16.0 Jul 18, 2023
apricote added a commit that referenced this issue Jul 18, 2023
… IP (#472)
apricote pushed a commit that referenced this issue Jul 18, 2023
🤖 I have created a release *beep* *boop*
---


## [1.17.0](v1.16.0...v1.17.0) (2023-07-18)


### Features

* **helm:** allow to set labels and annotations for podMonitor ([#471](#471)) ([5dad655](5dad655))
* upgrade to hcloud-go v2 e4352ec ([5a066a1](5a066a1))


### Bug Fixes

* **helm-chart:** resource namespace and name ([#462](#462)) ([0c4eee6](0c4eee6))
* **routes:** deleting wrong routes when other server has same private IP ([#472](#472)) ([5461038](5461038)), closes [#470](#470)

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).