Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: initial assignment of alias IP in hcloud (Hetzner) #8493

Merged
merged 1 commit into from
Jun 7, 2024

Conversation

mrclrchtr
Copy link
Contributor

@mrclrchtr mrclrchtr commented Mar 25, 2024

Pull Request

What? (description)

As described in #3599, assigning an alias IP in Hetzner Cloud at bootstrap of the cluster does (sometimes) not work.

The assignment of private networks happens in the hetzner cloud after starting the server and therefore often after querying the network information when assigning VIPs.

If an alias IP is to be set but no private network is yet available, an error message is now thrown, until the private network is assigned.

Previously, no error message was thrown and the network ID was set to 0, which means that the VIP is regarded as a public floating IP in the further code and not as a private alias IP.

In addition, some logs are added that make it much easier to find out what is happening.

Acceptance

Please use the following checklist:

  • you linked an issue (if applicable)
  • you included tests (if applicable)
  • you ran conformance (make conformance) -> GPG Identity failed
  • you formatted your code (make fmt)
  • you linted your code (make lint)
  • you generated documentation (make docs)
  • you ran unit-tests (make unit-tests)

@mrclrchtr mrclrchtr changed the title fix: fix initial assignment of alias IP in hcloud fix: initial assignment of alias IP in hcloud Mar 25, 2024
@mrclrchtr mrclrchtr changed the title fix: initial assignment of alias IP in hcloud fix: initial assignment of alias IP in hcloud (Hetzner) Mar 25, 2024
mrclrchtr added a commit to hcloud-talos/terraform-hcloud-talos that referenced this pull request Mar 25, 2024
Prep for Talos PR 8493 integration with commented-out config for enhanced networking.

Refs: siderolabs/talos#8493
mrclrchtr added a commit to hcloud-talos/terraform-hcloud-talos that referenced this pull request Mar 25, 2024
Prep for Talos PR 8493 integration with commented-out config for enhanced networking.

Refs: siderolabs/talos#8493
mrclrchtr added a commit to hcloud-talos/terraform-hcloud-talos that referenced this pull request Mar 25, 2024
Prep for Talos PR 8493 integration with commented-out config for enhanced networking.

Refs: siderolabs/talos#8493
handler.logger.Error("error assigning Hetzner Cloud floating IP to server: floating IP is not found", zap.String("vip", handler.vip),
zap.Int64("device_id", handler.deviceID), zap.Int64("network_id", handler.networkID))

return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is not an error anymore?

Copy link
Contributor Author

@mrclrchtr mrclrchtr Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, That's the only way the bootstrap worked.

I'm not sure if there is a better way.

Edit:
And yes, it's a misconfiguration on a not available Floating IP, but I don't think that every misconfiguration leads to an error that prevents to complete the bootstrap, right? An error is still logged in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does it prevent the bootstrap?

If the Acquire fails, it should be retried (if not, it's a different bug), so once whatever thing on Hetzner side is resolved, it should acquire successfully. If we return nil without actually attaching an IP, this leads to a misconfiguration - Talos thinks it's ok, while it's not actually working.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I tried it again and I was wrong, sorry. Bootstrapping is successful.

But I still get the logspam on every control-plane-node, regardless of whether I assign an alias IP when creating the server or not:

user: warning: [2024-04-01T12:20:12.956235002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning
 \"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-01T12:20:14.026183002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning
 \"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-01T12:20:15.348476002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning
 \"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}

So what I can say is that when the Alias VIP is "first" initialized in Talos, the networkID is 0 (i.e. not set). This means it skips the Alias ​​IP block and goes into the Floating IP block, where of course it cannot find the floating IP because the one passed is an alias IP.

Once it is initialized, the networkID is >0 (i.e. set) in subsequent runs and everything is working fine.

I don't know exactly where the AliasIP and NetworkID is set in the vip *VIP. It comes from the operator loop, but I don't know where it is started. Maybe that is where the problem is. Can you tell me where vip *VIP is set? Then I'll try to analyze it further.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the network ID should be set automatically from Hetzner metadata:

spec.NetworkID = 0
for _, privnet := range server.PrivateNet {
network, _, err := client.Network.GetByID(ctx, privnet.Network.ID)
if err != nil {
return fmt.Errorf("error getting network info: %w", err)
}
if network.IPRange.Contains(vip.AsSlice()) {
spec.NetworkID = privnet.Network.ID
break
}
}

the operator spec is not re-created unless there's some change that would force it to be re-generated

talosctl get operatorspecs -o yaml to see what's there right now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I tried it again:

the AliasIP is attached by me manually on server creation:
image

These are the operatorspecs:

❯ talosctl get operatorspecs -o yaml -n 10.0.1.101
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth0
    version: 2
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:23Z
    updated: 2024-04-03T21:17:26Z
spec:
    operator: dhcp4
    linkName: eth0
    requireUp: false
    dhcp4:
        routeMetric: 1024
    layer: platform
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:28Z
    updated: 2024-04-03T21:17:28Z
spec:
    operator: dhcp4
    linkName: eth1
    requireUp: true
    dhcp4:
        routeMetric: 1024
    layer: configuration
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: vip/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:28Z
    updated: 2024-04-03T21:17:28Z
spec:
    operator: vip
    linkName: eth1
    requireUp: true
    vip:
        ip: 10.0.1.100
        gratuitousARP: false
        hcloud:
            deviceID: 639543
            networkID: 0
            apiToken: xxxx
    layer: configuration

I'll have another look at the code.

Copy link
Contributor Author

@mrclrchtr mrclrchtr Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just tested something else:
I have commented out the vip part in the machine config. The error messages then stopped. Then I commented it in again and the result is as follows:

user: warning: [2024-04-03T21:26:51.35453309Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
 "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-03T21:26:52.52451809Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
 "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-03T21:26:53.64788609Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
 "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-03T21:26:54.70894809Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
 "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-03T21:26:56.44824509Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
 "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:57.282386188Z]: [talos] apply config request: mode auto(no_reboot)
 user: warning: [2024-04-03T21:26:57.288244188Z]: [talos] node IP skipped, please use .machine.kubelet.nodeIP to provide explicit subnet for the node IP {"component": "controller-runtime", "controller": "k8s.NodeIPController", "address": "10.0.1.101"}
 user: warning: [2024-04-03T21:26:57.292704188Z]: [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 3972, container kubelet)
 user: warning: [2024-04-03T21:26:57.294289188Z]: [talos] removed address 10.0.1.100/32 from "eth1" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
 user: warning: [2024-04-03T21:26:57.319458188Z]: [talos] updated etcd peer URLs {"component": "controller-runtime", "controller": "etcd.AdvertisedPeerController", "new_peer_urls": ["https://10.0.1.100:2380", "https://10.0.1.101:2380"], "member_id":
 15664038745852706440}
 user: warning: [2024-04-03T21:26:57.327769188Z]: [talos] updated etcd peer URLs {"component": "controller-runtime", "controller": "etcd.AdvertisedPeerController", "new_peer_urls": ["https://10.0.1.101:2380"], "member_id": 15664038745852706440}
 user: warning: [2024-04-03T21:26:57.409243188Z]: [talos] service[kubelet](Finished): Service finished successfully
 user: warning: [2024-04-03T21:26:57.410459188Z]: [talos] service[kubelet](Starting): Starting service
 user: warning: [2024-04-03T21:26:57.411070188Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
 user: warning: [2024-04-03T21:26:57.413742188Z]: [talos] service[kubelet](Failed): Condition failed: context canceled
 user: warning: [2024-04-03T21:26:57.414781188Z]: [talos] service[kubelet](Starting): Starting service
 user: warning: [2024-04-03T21:26:57.415342188Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
 user: warning: [2024-04-03T21:26:57.416269188Z]: [talos] service[kubelet](Preparing): Running pre state
 user: warning: [2024-04-03T21:26:57.427521188Z]: [talos] service[kubelet](Preparing): Creating service runner
 user: warning: [2024-04-03T21:26:57.523997188Z]: [talos] service[kubelet](Running): Started task kubelet (PID 5374) for container kubelet
 user: warning: [2024-04-03T21:26:59.443244188Z]: [talos] service[kubelet](Running): Health check successful
 user: warning: [2024-04-03T21:27:07.308966188Z]: [talos] controller failed {"component": "controller-runtime", "controller": "kubeaccess.EndpointController", "error": "error getting endpoints: Get \"https://127.0.0.1:7445/api/v1/namespaces/default/endpoints/talos\":
 net/http: TLS handshake timeout"}
 user: warning: [2024-04-03T21:27:20.195789188Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "Get \"https://127.0.0.1:7445/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dcontrol-
 plane-1&resourceVersion=3950&timeout=8m48s&timeoutSeconds=528&watch=true\": http2: client connection lost"}
 user: warning: [2024-04-03T21:27:42.453139188Z]: [talos] apply config request: mode auto(no_reboot)
 user: warning: [2024-04-03T21:27:43.711860188Z]: [talos] cleared previous Hetzner Cloud IP alias {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "vip": "10.0.1.100", "device_id": 639543, "status": "success"}
 user: warning: [2024-04-03T21:27:44.448118188Z]: [talos] assigned Hetzner Cloud alias IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "vip": "10.0.1.100", "device_id": 639543, "network_id": 4075410,
 "status": "success"}
 user: warning: [2024-04-03T21:27:44.450763188Z]: [talos] enabled shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-03T21:27:44.453683188Z]: [talos] assigned address {"component": "controller-runtime", "controller": "network.AddressSpecController", "address": "10.0.1.100/32", "link": "eth1"}

Operatorspecs:

❯ talosctl get operatorspecs -o yaml -n 10.0.1.101
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth0
    version: 2
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:23Z
    updated: 2024-04-03T21:17:26Z
spec:
    operator: dhcp4
    linkName: eth0
    requireUp: false
    dhcp4:
        routeMetric: 1024
    layer: platform
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:28Z
    updated: 2024-04-03T21:17:28Z
spec:
    operator: dhcp4
    linkName: eth1
    requireUp: true
    dhcp4:
        routeMetric: 1024
    layer: configuration
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: vip/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:27:42Z
    updated: 2024-04-03T21:27:42Z
spec:
    operator: vip
    linkName: eth1
    requireUp: true
    vip:
        ip: 10.0.1.100
        gratuitousARP: false
        hcloud:
            deviceID: 639543
            networkID: 4075410
            apiToken: xxx
    layer: configuration

Working as expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not hcloud experts, so we won't be able to resolve it without your (or someone else) help.

The issue is (I guess) failure to find NetworkID, but I have no idea why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as soon as I find some time, I'll take a closer look at it again. It could have been that you have a direct idea of what the problem could be. Many thanks for the response!

@mrclrchtr
Copy link
Contributor Author

Is there anything that can be done or that we can talk about to make this pr mergeable? Are there any concerns or similar?

@frezbo
Copy link
Member

frezbo commented Mar 30, 2024

Is there anything that can be done or that we can talk about to make this pr mergeable? Are there any concerns or similar?

I don't have much insight into how Hetzner works to comment more, maybe @sergelogvinov has better ideas?

@frezbo
Copy link
Member

frezbo commented Mar 30, 2024

Is there anything that can be done or that we can talk about to make this pr mergeable? Are there any concerns or similar?

I don't have much insight into how Hetzner works to comment more, maybe @sergelogvinov has better ideas?

Anyways will wait for next week before merging, will trust you on that one.

@smira smira mentioned this pull request Apr 3, 2024
smira added a commit to smira/talos that referenced this pull request Apr 3, 2024
I noticed that while looking at siderolabs#8493, but I don't know if this problem
actually happened in real life.

If acquiring a VIP fails (which can only fail for Equinix/HCloud, not L2
ARP announce), we should not set the leader flag, as it would make the
controller announce the IP, while it shouldn't do that.

If this call fails, there's no matching call to de-announce on failure.

The bug would show up as two nodes having same VIP assigned on the host.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
smira added a commit to smira/talos that referenced this pull request Apr 3, 2024
I noticed that while looking at siderolabs#8493, but I don't know if this problem
actually happened in real life.

If acquiring a VIP fails (which can only fail for Equinix/HCloud, not L2
ARP announce), we should not set the leader flag, as it would make the
controller announce the IP, while it shouldn't do that.

If this call fails, there's no matching call to de-announce on failure.

The bug would show up as two nodes having same VIP assigned on the host.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
@mrclrchtr
Copy link
Contributor Author

I took a few more hours to take a closer look at the problem.

I definitely have a lead:
I added some logs in the GetNetworkAndDeviceIDs method. The strange thing is that this method does not seem to be called during the first bootstrap. I could not see the logs. So the NetworkID is not set and the error loops start in all CP's

The error is that by not setting the NetworkId, the operator thinks it is a Public VIP and not Alias IP. Of course, it cannot find this in the API call.

As soon as I then restart a control plane (and it doesn't matter which one, whether set with AliasIP or not), the method GetNetworkAndDeviceIDs is called, the NetworkID is set correctly and all error log loops in all control planes stop.

So I suspect that an attempt is made to set the VIP or alias IP too early, causing the operator to enter an error loop and thus never call the GetNetworkAndDeviceIDs method.

Unfortunately, I don't have enough knowledge about the operator and how Talos works in general to be able to continue here. I am therefore dependent on help.

The code in this PR apparently fixes the problem because it breaks the error loop. However, this is probably not a good fix. Hence the question: Should I open an issue to discuss further and record new insights?

@smira
Copy link
Member

smira commented May 30, 2024

I added some logs in the GetNetworkAndDeviceIDs method. The strange thing is that this method does not seem to be called during the first bootstrap. I could not see the logs. So the NetworkID is not set and the error loops start in all CP's

if you have a diff with the logs you added, and the logs, I'd be interested to look into that.

Probably adding lots of logs on almost every line would be helpful.

@mrclrchtr
Copy link
Contributor Author

mrclrchtr commented May 31, 2024

Ok, thank you very much!

I have added these logs: 9304eb6

You can search for ##################HCLOUD-DEBUG################## to find all added logs.

The logs:
controller-runtime.log
(I replaced the token)

I hope this is what you need. If something is missing, please let me know.

@smira
Copy link
Member

smira commented May 31, 2024

Thanks, from my understanding the problem is here: 9304eb6#diff-d26059bbcbafd16b63a5bce677e196bed6c47b1da962d4b41656fd735631fbe3R181-R188

This is the log line:

2024-05-31T12:34:44.331Z �[33mWARN�[0m ##################HCLOUD-DEBUG################## private networks {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "names": "[]"}

This is response from HCloud API, and it seems to indicate there are no private networks?

@smira
Copy link
Member

smira commented May 31, 2024

so far it looks like it's a failure on the HCloud API side (?) or some inconsistency. The only way out I see is to supply NetworkID via the machine configuration, so that detection code doesn't run.

@mrclrchtr
Copy link
Contributor Author

Yes, that's exactly what I mean. But it's there later. I think that at the time when GetNetworkAndDeviceIDs is performed (2 times in the log), maybe the private network is not attached yet? (unfortunately I am not very familiar with this area...)

I'll attach a new log in a few minutes. In it I will have restarted the CP once after the bootstrap. Then the assignment also works.

In my theory it should work if GetNetworkAndDeviceIDs is called later.

Adding the NetworkID manually should also solve the problem. It should already be known beforehand.

@mrclrchtr
Copy link
Contributor Author

mrclrchtr commented May 31, 2024

Ok, this time I didn't even have to restart. It just worked with the identical configuration. So it really seems to be something race condition related.

Here is the log:
controller-runtime.log

The first call is correct this time:

2024-05-31T13:14:49.218Z �[33mWARN�[0m ##################HCLOUD-DEBUG################## private networks {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "names": "[{0x40013e3180 10.0.1.101 [10.0.1.100] 86:00:00:8e:16:8c}]"}

@mrclrchtr
Copy link
Contributor Author

@smira I would like to update the PR to introduce the possibility to provide the NetworkID via the machine configuration. Would you accept the PR?

@smira
Copy link
Member

smira commented Jun 3, 2024

@smira I would like to update the PR to introduce the possibility to provide the NetworkID via the machine configuration. Would you accept the PR?

yes, for sure!

@mrclrchtr
Copy link
Contributor Author

I have probably found a much better solution! I have looked at the whole logic again and actually the problem only exists if there is no private network attached. As soon as it is there, the assignment of the alias IP works perfectly. So why not throw an error as long as the private network is not yet there? There should always be one anyway, especially if you want to use an alias IP.

The only problem now could be that you never have a private network attached. We could add a check whether it is a public or private vip. But I think that case is very unlikely.

I am attaching a log in which you can see the behavior implemented here in the PR.

dmesg.log

user: warning: [2024-06-03T17:45:26.28582841Z]: [talos] controller failed {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "error": "1 error occurred:\n\t* no private Hetzner Cloud networks assigned to server\n\n"}

until it is attached:

user: warning: [2024-06-03T17:45:45.33610241Z]: [talos] found private network for vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "10.0.1.100", "network_id": 4299690}

what do you think?

@smira
Copy link
Member

smira commented Jun 6, 2024

I have probably found a much better solution! I have looked at the whole logic again and actually the problem only exists if there is no private network attached. As soon as it is there, the assignment of the alias IP works perfectly. So why not throw an error as long as the private network is not yet there? There should always be one anyway, especially if you want to use an alias IP.

I'm not an expert there, but it looks like there were two kinds of IPs, and one required network ID, and another didn't?

Certainly not making changes to the machine config sounds better.

@sergelogvinov as you're the author of the original code (iirc), wdyt?

@mrclrchtr
Copy link
Contributor Author

mrclrchtr commented Jun 6, 2024

I'm not an expert there, but it looks like there were two kinds of IPs, and one required network ID, and another didn't?

Thats correct. This is exactly the reason why the NetworkId is set to 0 for the Public VIP.

This would also make the machine config more complex, because one may/should only specify a NetworkId for the private Alias IP.

@smira
Copy link
Member

smira commented Jun 6, 2024

Thats correct. This is exactly the reason why the NetworkId is set to 0 for the Public VIP.

so if we always wait for the private network from Hetzner API, does it mean it would be a regression or not?

@mrclrchtr
Copy link
Contributor Author

It's a deterioration if one does not have a private network and wants to use a public VIP. That would not be possible.

I will try to find a way to solve the problem.

@mrclrchtr
Copy link
Contributor Author

I have added a check for private IPs. Now the NetworkId is set to 0 if the VIP is a public/floating IP. If it is a private/alias IP, there must be a private network that has the IP in its range, otherwise an error message is thrown.

Copy link
Member

@smira smira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, this looks great to me!

@smira
Copy link
Member

smira commented Jun 7, 2024

I fixed some linting errors, but otherwise the code is exactly same.

The assignment of private networks happens in the hetzner cloud after
starting the server and therefore often after querying the network
information when assigning VIPs.

If an alias IP is to be set but no private network is yet available, an
error message is now thrown, until the private network is assigned.

Previously, no error message was thrown and the
network ID was set to 0, which means that the VIP
is regarded as a public floating IP in the further
code and not as a private alias IP.

Signed-off-by: Marcel Richter <mail@mrclrchtr.de>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
@smira
Copy link
Member

smira commented Jun 7, 2024

/m

@talos-bot talos-bot merged commit 5e66e11 into siderolabs:main Jun 7, 2024
47 checks passed
@mrclrchtr
Copy link
Contributor Author

Thank you all very much!

@frezbo frezbo mentioned this pull request Jun 19, 2024
smira added a commit to smira/talos that referenced this pull request Jul 22, 2024
I noticed that while looking at siderolabs#8493, but I don't know if this problem
actually happened in real life.

If acquiring a VIP fails (which can only fail for Equinix/HCloud, not L2
ARP announce), we should not set the leader flag, as it would make the
controller announce the IP, while it shouldn't do that.

If this call fails, there's no matching call to de-announce on failure.

The bug would show up as two nodes having same VIP assigned on the host.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 5c0f74b)
@jorikseldeslachts
Copy link

jorikseldeslachts commented Jul 29, 2024

Hi @mrclrchtr I am having the same issue.
Still getting the following log line and the alias ip is never attached to the server:

[talos] found private network for private vip (alias IP) {"component": "controller-runtime", "controller": "network.OperatorVIPConfigController", "vip": "172.16.1.100", "network_id": 4461855}

What exactly do I have to do to get the VIP attached?

config:

      - interface: eth1
        vip:
          ip: 172.16.1.100
          hcloud:
            apiToken: xxxx

Edit: Talos version 1.7.5

mrclrchtr added a commit to hcloud-talos/terraform-hcloud-talos that referenced this pull request Jul 29, 2024
AliasIP is working in Talos 1.7.5

Fixed in siderolabs/talos#8493
@mrclrchtr
Copy link
Contributor Author

mrclrchtr commented Jul 29, 2024

It's working in https://github.com/hcloud-talos/terraform-hcloud-talos/blob/main/talos_patch_control_plane.tf

The only difference is dhcp:

      - interface: eth1
        dhcp: true
        vip:
          ip: 172.16.1.100
          hcloud:
            apiToken: xxxx

Obvious question: Does the token work? 😜

Maybe another hint:
You don't have to create the alias IP beforehand. It is created automatically.

mrclrchtr added a commit to hcloud-talos/terraform-hcloud-talos that referenced this pull request Jul 29, 2024
AliasIP is working in Talos 1.7.5

Fixed in siderolabs/talos#8493
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Backported
Development

Successfully merging this pull request may close these issues.

5 participants