fix: initial assignment of alias IP in hcloud (Hetzner) #8493
Prep for Talos PR 8493 integration with commented-out config for enhanced networking. Refs: siderolabs/talos#8493
handler.logger.Error("error assigning Hetzner Cloud floating IP to server: floating IP is not found", zap.String("vip", handler.vip),
	zap.Int64("device_id", handler.deviceID), zap.Int64("network_id", handler.networkID))

return nil
so this is not an error anymore?
Yes, that's the only way the bootstrap worked.
I'm not sure if there is a better way.
Edit:
And yes, it's a misconfiguration when the Floating IP is not available, but I don't think every misconfiguration should lead to an error that prevents the bootstrap from completing, right? An error is still logged in this case.
How does it prevent the bootstrap?
If the Acquire fails, it should be retried (if not, it's a different bug), so once whatever thing on the Hetzner side is resolved, it should acquire successfully. If we return `nil` without actually attaching an IP, this leads to a misconfiguration: Talos thinks it's OK, while it's not actually working.
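To illustrate the point about retries, here is a minimal Go sketch. It is not the Talos operator code: `retryUntilAcquired`, `errNotFound`, and the attempt limit are hypothetical names made up for this example. It only shows that returning the error keeps the caller retrying until the resource appears, whereas an acquire function that returned `nil` on the first failure would end the loop with nothing attached.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel error for this sketch.
var errNotFound = errors.New("floating IP is not found")

// retryUntilAcquired is a hypothetical stand-in for the caller that retries a
// failed acquire. It stops at the first nil result or after maxAttempts calls.
func retryUntilAcquired(acquire func() error, maxAttempts int) (attempts int, err error) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		if err = acquire(); err == nil {
			return attempts, nil
		}
	}

	return maxAttempts, err
}

func main() {
	calls := 0

	// Simulated acquire: the resource only becomes available on the third call.
	acquire := func() error {
		calls++
		if calls < 3 {
			return errNotFound
		}

		return nil
	}

	attempts, err := retryUntilAcquired(acquire, 5)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

If `acquire` swallowed `errNotFound` and returned `nil`, the loop would report success after one attempt even though no IP was attached, which is the misconfiguration described above.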
Hi, I tried it again and I was wrong, sorry. Bootstrapping is successful.
But I still get the logspam on every control-plane-node, regardless of whether I assign an alias IP when creating the server or not:
user: warning: [2024-04-01T12:20:12.956235002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-01T12:20:14.026183002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-01T12:20:15.348476002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
So what I can say is that when the Alias VIP is "first" initialized in Talos, the `networkID` is 0 (i.e. not set). This means it skips the Alias IP block and goes into the Floating IP block, where of course it cannot find the floating IP because the one passed is an alias IP.
Once it is initialized, the `networkID` is >0 (i.e. set) in subsequent runs and everything works fine.
I don't know exactly where the `AliasIP` and `NetworkID` are set in the `vip *VIP`. It comes from the operator loop, but I don't know where it is started. Maybe that is where the problem is. Can you tell me where `vip *VIP` is set? Then I'll try to analyze it further.
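The branching described in this comment can be sketched as follows. This is a simplified illustration, not the code in `hcloud.go`: the function name `assignmentKind` and the string results are assumptions made for this example, and only the rule "network ID 0 means floating IP, network ID greater than 0 means alias IP" comes from the discussion.

```go
package main

import "fmt"

// assignmentKind is a hypothetical helper that mirrors the branching described
// in the comment: a network ID greater than 0 selects the alias IP path, and
// a network ID of 0 falls through to the floating IP path.
func assignmentKind(networkID int64) string {
	if networkID > 0 {
		return "alias"
	}

	return "floating"
}

func main() {
	// Spec generated before the private network was visible.
	fmt.Println(assignmentKind(0))

	// Spec generated after the private network was attached
	// (network ID taken from the logs in this thread).
	fmt.Println(assignmentKind(4075410))
}
```

With a network ID of 0, an alias IP such as 10.0.1.100 is looked up as a floating IP, which matches the "floating IP is not found" errors in the logs.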
the network ID should be set automatically from Hetzner metadata:
talos/internal/app/machined/pkg/controllers/network/operator/vip/hcloud.go
Lines 176 to 190 in e7d8041
spec.NetworkID = 0

for _, privnet := range server.PrivateNet {
	network, _, err := client.Network.GetByID(ctx, privnet.Network.ID)
	if err != nil {
		return fmt.Errorf("error getting network info: %w", err)
	}

	if network.IPRange.Contains(vip.AsSlice()) {
		spec.NetworkID = privnet.Network.ID

		break
	}
}
The operator spec is not re-created unless there's some change that would force it to be re-generated. Run `talosctl get operatorspecs -o yaml` to see what's there right now.
Ok, I tried it again:
the AliasIP is attached by me manually on server creation:
These are the operatorspecs:
❯ talosctl get operatorspecs -o yaml -n 10.0.1.101
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth0
    version: 2
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:23Z
    updated: 2024-04-03T21:17:26Z
spec:
    operator: dhcp4
    linkName: eth0
    requireUp: false
    dhcp4:
        routeMetric: 1024
    layer: platform
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:28Z
    updated: 2024-04-03T21:17:28Z
spec:
    operator: dhcp4
    linkName: eth1
    requireUp: true
    dhcp4:
        routeMetric: 1024
    layer: configuration
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: vip/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:28Z
    updated: 2024-04-03T21:17:28Z
spec:
    operator: vip
    linkName: eth1
    requireUp: true
    vip:
        ip: 10.0.1.100
        gratuitousARP: false
        hcloud:
            deviceID: 639543
            networkID: 0
            apiToken: xxxx
    layer: configuration
I'll have another look at the code.
I have just tested something else:
I have commented out the vip part in the machine config. The error messages then stopped. Then I commented it in again and the result is as follows:
user: warning: [2024-04-03T21:26:51.35453309Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:52.52451809Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:53.64788609Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:54.70894809Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:56.44824509Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:57.282386188Z]: [talos] apply config request: mode auto(no_reboot)
user: warning: [2024-04-03T21:26:57.288244188Z]: [talos] node IP skipped, please use .machine.kubelet.nodeIP to provide explicit subnet for the node IP {"component": "controller-runtime", "controller": "k8s.NodeIPController", "address": "10.0.1.101"}
user: warning: [2024-04-03T21:26:57.292704188Z]: [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 3972, container kubelet)
user: warning: [2024-04-03T21:26:57.294289188Z]: [talos] removed address 10.0.1.100/32 from "eth1" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
user: warning: [2024-04-03T21:26:57.319458188Z]: [talos] updated etcd peer URLs {"component": "controller-runtime", "controller": "etcd.AdvertisedPeerController", "new_peer_urls": ["https://10.0.1.100:2380", "https://10.0.1.101:2380"], "member_id": 15664038745852706440}
user: warning: [2024-04-03T21:26:57.327769188Z]: [talos] updated etcd peer URLs {"component": "controller-runtime", "controller": "etcd.AdvertisedPeerController", "new_peer_urls": ["https://10.0.1.101:2380"], "member_id": 15664038745852706440}
user: warning: [2024-04-03T21:26:57.409243188Z]: [talos] service[kubelet](Finished): Service finished successfully
user: warning: [2024-04-03T21:26:57.410459188Z]: [talos] service[kubelet](Starting): Starting service
user: warning: [2024-04-03T21:26:57.411070188Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
user: warning: [2024-04-03T21:26:57.413742188Z]: [talos] service[kubelet](Failed): Condition failed: context canceled
user: warning: [2024-04-03T21:26:57.414781188Z]: [talos] service[kubelet](Starting): Starting service
user: warning: [2024-04-03T21:26:57.415342188Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
user: warning: [2024-04-03T21:26:57.416269188Z]: [talos] service[kubelet](Preparing): Running pre state
user: warning: [2024-04-03T21:26:57.427521188Z]: [talos] service[kubelet](Preparing): Creating service runner
user: warning: [2024-04-03T21:26:57.523997188Z]: [talos] service[kubelet](Running): Started task kubelet (PID 5374) for container kubelet
user: warning: [2024-04-03T21:26:59.443244188Z]: [talos] service[kubelet](Running): Health check successful
user: warning: [2024-04-03T21:27:07.308966188Z]: [talos] controller failed {"component": "controller-runtime", "controller": "kubeaccess.EndpointController", "error": "error getting endpoints: Get \"https://127.0.0.1:7445/api/v1/namespaces/default/endpoints/talos\": net/http: TLS handshake timeout"}
user: warning: [2024-04-03T21:27:20.195789188Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "Get \"https://127.0.0.1:7445/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dcontrol-plane-1&resourceVersion=3950&timeout=8m48s&timeoutSeconds=528&watch=true\": http2: client connection lost"}
user: warning: [2024-04-03T21:27:42.453139188Z]: [talos] apply config request: mode auto(no_reboot)
user: warning: [2024-04-03T21:27:43.711860188Z]: [talos] cleared previous Hetzner Cloud IP alias {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "vip": "10.0.1.100", "device_id": 639543, "status": "success"}
user: warning: [2024-04-03T21:27:44.448118188Z]: [talos] assigned Hetzner Cloud alias IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "vip": "10.0.1.100", "device_id": 639543, "network_id": 4075410, "status": "success"}
user: warning: [2024-04-03T21:27:44.450763188Z]: [talos] enabled shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:27:44.453683188Z]: [talos] assigned address {"component": "controller-runtime", "controller": "network.AddressSpecController", "address": "10.0.1.100/32", "link": "eth1"}
Operatorspecs:
❯ talosctl get operatorspecs -o yaml -n 10.0.1.101
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth0
    version: 2
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:23Z
    updated: 2024-04-03T21:17:26Z
spec:
    operator: dhcp4
    linkName: eth0
    requireUp: false
    dhcp4:
        routeMetric: 1024
    layer: platform
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:28Z
    updated: 2024-04-03T21:17:28Z
spec:
    operator: dhcp4
    linkName: eth1
    requireUp: true
    dhcp4:
        routeMetric: 1024
    layer: configuration
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: vip/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:27:42Z
    updated: 2024-04-03T21:27:42Z
spec:
    operator: vip
    linkName: eth1
    requireUp: true
    vip:
        ip: 10.0.1.100
        gratuitousARP: false
        hcloud:
            deviceID: 639543
            networkID: 4075410
            apiToken: xxx
    layer: configuration
Working as expected.
We are not hcloud experts, so we won't be able to resolve it without your (or someone else's) help.
The issue is (I guess) failure to find NetworkID, but I have no idea why.
Yes, as soon as I find some time, I'll take a closer look at it again. I had hoped you might have an immediate idea of what the problem could be. Many thanks for the response!
Is there anything that can be done or that we can talk about to make this PR mergeable? Are there any concerns or similar?
I don't have much insight into how Hetzner works to comment more, maybe @sergelogvinov has better ideas?
Anyways will wait for next week before merging, will trust you on that one.
I noticed that while looking at siderolabs#8493, but I don't know if this problem actually happened in real life. If acquiring a VIP fails (which can only fail for Equinix/HCloud, not L2 ARP announce), we should not set the leader flag, as it would make the controller announce the IP, while it shouldn't do that. If this call fails, there's no matching call to de-announce on failure. The bug would show up as two nodes having same VIP assigned on the host. Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
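The ordering described in this commit message can be sketched as below. This is an illustration under assumptions, not the Talos implementation: the `acquirer` interface, the two handler types, and the `campaign` function are hypothetical names for this example. The only behavior taken from the message is that the leader flag is set after a successful acquire and never when the acquire fails.

```go
package main

import (
	"errors"
	"fmt"
)

// acquirer is a hypothetical interface for a platform-specific VIP handler
// whose acquire step can fail (as described for Equinix/HCloud).
type acquirer interface {
	Acquire() error
}

// failingHandler simulates a handler whose acquire call fails.
type failingHandler struct{}

func (failingHandler) Acquire() error { return errors.New("floating IP is not found") }

// okHandler simulates a handler whose acquire call succeeds.
type okHandler struct{}

func (okHandler) Acquire() error { return nil }

// campaign reports whether this node may announce the VIP. The leader flag is
// only set once Acquire has succeeded, so a failed acquire never leaves the
// node announcing an address it does not hold.
func campaign(h acquirer) (leader bool, err error) {
	if err := h.Acquire(); err != nil {
		return false, fmt.Errorf("error acquiring VIP: %w", err)
	}

	return true, nil
}

func main() {
	leader, err := campaign(failingHandler{})
	fmt.Println(leader, err)

	leader, err = campaign(okHandler{})
	fmt.Println(leader, err)
}
```

Setting the flag before the acquire call would need a matching de-announce on failure; checking the error first avoids that extra path.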
I took a few more hours to take a closer look at the problem. I definitely have a lead: The error is that by not setting the NetworkId, the operator thinks it is a As soon as I then restart a control plane (and it doesn't matter which one, whether set with AliasIP or not), the method So I suspect that an attempt is made to set the VIP or alias IP too early, causing the operator to enter an error loop and thus never call the Unfortunately, I don't have enough knowledge about the operator and how Talos works in general to be able to continue here. I am therefore dependent on help. The code in this PR apparently fixes the problem because it breaks the error loop. However, this is probably not a good fix. Hence the question: Should I open an issue to discuss further and record new insights?
If you have a diff with the logs you added, and the logs, I'd be interested to look into that. Probably adding lots of logs on almost every line would be helpful.
Ok, thank you very much! I have added these logs: 9304eb6 You can search for The logs: I hope this is what you need. If something is missing, please let me know.
Thanks, from my understanding the problem is here: 9304eb6#diff-d26059bbcbafd16b63a5bce677e196bed6c47b1da962d4b41656fd735631fbe3R181-R188 This is the log line:
This is the response from the HCloud API, and it seems to indicate there are no private networks?
So far it looks like it's a failure on the HCloud API side (?) or some inconsistency. The only way out I see is to supply
Yes, that's exactly what I mean. But it's there later. I think that at the time when I'll attach a new log in a few minutes. In it I will have restarted the CP once after the bootstrap. Then the assignment also works. In my theory it should work if Adding the
Ok, this time I didn't even have to restart. It just worked with the identical configuration. So it really seems to be something race condition related. Here is the log: The first call is correct this time:
@smira I would like to update the PR to introduce the possibility to provide the
Yes, for sure!
I have probably found a much better solution! I have looked at the whole logic again and actually the problem only exists if there is no private network attached. As soon as it is there, the assignment of the alias IP works perfectly. So why not throw an error as long as the private network is not yet there? There should always be one anyway, especially if you want to use an alias IP. The only problem now could be that you never have a private network attached. We could add a check whether it is a public or private VIP. But I think that case is very unlikely. I am attaching a log in which you can see the behavior implemented here in the PR.
until it is attached:
What do you think?
I'm not an expert there, but it looks like there were two kinds of IPs, and one required network ID, and another didn't? Certainly not making changes to the machine config sounds better. @sergelogvinov as you're the author of the original code (iirc), wdyt?
That's correct. This is exactly the reason why the This would also make the machine config more complex, because one may/should only specify a
So if we always wait for the private network from the Hetzner API, does it mean it would be a regression or not?
It's a regression if one does not have a private network and wants to use a public VIP. That would not be possible. I will try to find a way to solve the problem.
I have added a check for private IPs. Now the NetworkId is set to 0 if the VIP is a public/floating IP. If it is a private/alias IP, there must be a private network that has the IP in its range, otherwise an error is returned.
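A minimal Go sketch of the check described here, using only the standard library. It is not the merged code: `privateNet`, `resolveNetworkID`, `errNetworkNotReady`, and the `10.0.0.0/16` range are assumptions for illustration, while the VIP `10.0.1.100` and network ID `4075410` are taken from the logs in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// privateNet is a hypothetical, simplified stand-in for a private network
// entry reported by the Hetzner Cloud API for a server.
type privateNet struct {
	ID   int64
	CIDR string
}

// errNetworkNotReady is an assumed sentinel error for this sketch.
var errNetworkNotReady = errors.New("no attached private network contains the alias IP")

// resolveNetworkID returns 0 for a public (floating) VIP. For a private
// (alias) VIP it returns the ID of the attached network whose range contains
// the VIP, or an error if no such network is attached yet.
func resolveNetworkID(vip string, nets []privateNet) (int64, error) {
	addr, err := netip.ParseAddr(vip)
	if err != nil {
		return 0, fmt.Errorf("error parsing VIP: %w", err)
	}

	// A public VIP is treated as a floating IP, which needs no network ID.
	if !addr.IsPrivate() {
		return 0, nil
	}

	for _, n := range nets {
		prefix, err := netip.ParsePrefix(n.CIDR)
		if err != nil {
			return 0, fmt.Errorf("error parsing network range: %w", err)
		}

		if prefix.Contains(addr) {
			return n.ID, nil
		}
	}

	// Returning an error lets the caller retry until the network is attached.
	return 0, fmt.Errorf("vip %s: %w", vip, errNetworkNotReady)
}

func main() {
	// Private network not attached yet: the alias IP cannot be resolved.
	_, err := resolveNetworkID("10.0.1.100", nil)
	fmt.Println("before attach:", err)

	// Private network attached (the range is an assumption for this example).
	nets := []privateNet{{ID: 4075410, CIDR: "10.0.0.0/16"}}
	id, err := resolveNetworkID("10.0.1.100", nets)
	fmt.Println("after attach:", id, err)
}
```

The design choice is that a private VIP without a matching network is an error rather than a silent fallback to network ID 0, so the alias IP is never treated as a floating IP.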
thanks, this looks great to me!
I fixed some linting errors, but otherwise the code is exactly the same.
The assignment of private networks happens in the hetzner cloud after starting the server and therefore often after querying the network information when assigning VIPs. If an alias IP is to be set but no private network is yet available, an error message is now thrown, until the private network is assigned. Previously, no error message was thrown and the network ID was set to 0, which means that the VIP is regarded as a public floating IP in the further code and not as a private alias IP. Signed-off-by: Marcel Richter <mail@mrclrchtr.de> Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
/m
Thank you all very much!
I noticed that while looking at siderolabs#8493, but I don't know if this problem actually happened in real life. If acquiring a VIP fails (which can only fail for Equinix/HCloud, not L2 ARP announce), we should not set the leader flag, as it would make the controller announce the IP, while it shouldn't do that. If this call fails, there's no matching call to de-announce on failure. The bug would show up as two nodes having same VIP assigned on the host. Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com> (cherry picked from commit 5c0f74b)
Hi @mrclrchtr I am having the same issue.
What exactly do I have to do to get the VIP attached? config:
Edit: Talos version 1.7.5
AliasIP is working in Talos 1.7.5 Fixed in siderolabs/talos#8493
It's working in https://github.com/hcloud-talos/terraform-hcloud-talos/blob/main/talos_patch_control_plane.tf The only difference is dhcp:
Obvious question: Does the token work? 😜 Maybe another hint:
Pull Request
What? (description)
As described in #3599, assigning an alias IP in Hetzner Cloud at bootstrap of the cluster sometimes does not work.
The assignment of private networks happens in Hetzner Cloud after the server has started, and therefore often after the network information is queried when assigning VIPs.
If an alias IP is to be set but no private network is available yet, an error is now returned until the private network is assigned.
Previously, no error was returned and the network ID was set to 0, which means that the rest of the code treated the VIP as a public floating IP and not as a private alias IP.
In addition, some logs are added that make it much easier to find out what is happening.
Acceptance
Please use the following checklist:
make conformance -> GPG Identity failed
make fmt
make lint
make docs
make unit-tests