
Firewalld ns-wireguard service name conflict #6958

Closed

DavidePrincipi opened this issue Jun 24, 2024 · 3 comments
Assignees: nrauso
Labels: verified (All test cases were verified successfully)

DavidePrincipi (Member) commented Jun 24, 2024

After a failed join attempt, the node RL2 is left in an invalid state: it cannot rejoin the cluster or become the first node of a new cluster.

Steps to reproduce

  • Install NS8 on a host RL1 with an invalid domain suffix, e.g. dp.test
  • Initialize RL1 as leader node, with the invalid domain suffix
  • After cluster initialization, go to the Nodes page and change the leader FQDN to a valid one, e.g. rl1.dp.nethserver.net
  • Join the second node, RL2: the join-node action fails.

Expected behavior

I expect the join to work, or to be able to recover from the error by some means.

Actual behavior

  1. Despite the error message, the RL2 UI shows a link to the leader node, giving the impression that the join eventually succeeded.

  2. If I reload the page, RL2 shows the initial choice screen again: create cluster, join node, or restore from backup.

  3. If I choose create-cluster, the procedure configures RL2 as the leader of a new cluster, but a conflict on the ns-wireguard firewalld service occurs.

In the RL2 journal, the original join failure:

Jun 24 07:25:05 rl2 agent@cluster[31660]: task/cluster/0aead8ad-0578-4cdd-b255-73b9cf71df4f: join-node/30start_replication is starting
Jun 24 07:25:05 rl2 traefik[32066]: 80.17.99.73 - - [24/Jun/2024:07:25:05 +0000] "GET /cluster-admin/api/cluster/task/0aead8ad-0578-4cdd-b255-73b9cf71df4f/context HTTP/2.0" 200 309 "-" "-" 177 "ApiServer-https@file" "http://127.0.0.1:9311>
Jun 24 07:25:06 rl2 redis[31501]: 1:M 24 Jun 2024 07:25:06.060 * 1 changes in 5 seconds. Saving...
Jun 24 07:25:06 rl2 redis[31501]: 1:M 24 Jun 2024 07:25:06.061 * Background saving started by pid 31
Jun 24 07:25:06 rl2 redis[31501]: 31:C 24 Jun 2024 07:25:06.068 * DB saved on disk
Jun 24 07:25:06 rl2 redis[31501]: 31:C 24 Jun 2024 07:25:06.068 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
Jun 24 07:25:06 rl2 redis[31501]: 1:M 24 Jun 2024 07:25:06.161 * Background saving terminated with success
Jun 24 07:25:07 rl2 agent@cluster[31660]: sed -i -e '/^AGENT_ID=/c\AGENT_ID=node/2' -e '/^REDIS_USER=/c\REDIS_USER=node/2' /var/lib/nethserver/node/state/agent.env
Jun 24 07:25:07 rl2 agent@cluster[31660]: Traceback (most recent call last):
Jun 24 07:25:07 rl2 agent@cluster[31660]:   File "/var/lib/nethserver/cluster/actions/join-node/30start_replication", line 63, in <module>
Jun 24 07:25:07 rl2 agent@cluster[31660]:     cluster.vpn.initialize_wgconf(ip_address, peer={
Jun 24 07:25:07 rl2 agent@cluster[31660]:   File "/usr/local/agent/pypkg/cluster/vpn.py", line 36, in initialize_wgconf
Jun 24 07:25:07 rl2 agent@cluster[31660]:     peer_ep_address = socket.getaddrinfo(peer_hostname, peer_port, proto=socket.IPPROTO_UDP)[0][4][0]
Jun 24 07:25:07 rl2 agent@cluster[31660]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 24 07:25:07 rl2 agent@cluster[31660]:   File "/usr/lib64/python3.11/socket.py", line 962, in getaddrinfo
Jun 24 07:25:07 rl2 agent@cluster[31660]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Jun 24 07:25:07 rl2 agent@cluster[31660]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 24 07:25:07 rl2 agent@cluster[31660]: socket.gaierror: [Errno -2] Name or service not known
Jun 24 07:25:08 rl2 agent@cluster[31660]: task/cluster/0aead8ad-0578-4cdd-b255-73b9cf71df4f: action "join-node" status is "aborted" (1) at step 30start_replication
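
The traceback points at the endpoint resolution in cluster/vpn.py: the leader VPN endpoint stored in Redis still carries the old, unresolvable FQDN. A minimal sketch of the failing step, with an illustrative endpoint value and a hypothetical error guard (not the actual NS8 code):

import socket

def resolve_vpn_endpoint(endpoint):
    # Split the "host:port" value as initialize_wgconf() receives it.
    peer_hostname, peer_port = endpoint.rsplit(":", 1)
    try:
        # Same call as vpn.py line 36: raises socket.gaierror when the
        # leader FQDN (e.g. one under the invalid dp.test suffix) does
        # not resolve.
        return socket.getaddrinfo(peer_hostname, peer_port,
                                  proto=socket.IPPROTO_UDP)[0][4][0]
    except socket.gaierror as ex:
        # Hypothetical guard: turn the failure into a clear, recoverable
        # error instead of aborting join-node at 30start_replication.
        raise ValueError(f"cannot resolve VPN endpoint {endpoint}") from ex

# e.g. resolve_vpn_endpoint("rl1.dp.test:55820") raises ValueError here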

The create-cluster action on RL2 then fails with:

Error: NAME_CONFLICT: new_service(): 'ns-wireguard'
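
This conflict happens because the ns-wireguard firewalld service definition created during the earlier failed join is still present, so the next create-cluster run calls new_service() for a name that already exists. A minimal sketch of an idempotent guard, shelling out to firewall-cmd (illustrative, not the actual NS8 code):

import subprocess

def ensure_firewalld_service(name="ns-wireguard"):
    # List the service definitions firewalld already knows permanently.
    existing = subprocess.run(
        ["firewall-cmd", "--permanent", "--get-services"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if name not in existing:
        # Create the service only when it is missing, avoiding the
        # NAME_CONFLICT error from new_service().
        subprocess.run(
            ["firewall-cmd", "--permanent", f"--new-service={name}"],
            check=True,
        )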

Components

  • Core 2.8.4

See also

Discussion (PVT) https://mattermost.nethesis.it/nethesis/pl/rqr3abki53rr9ngsrxpeow835h


Thanks to @nrauso

DavidePrincipi (Member, Author) commented Jun 26, 2024

Test case 0

  • Install two nodes with the Core testing release 2.8.5-dev.2
  • Set an invalid domain in the leader node FQDN
  • Try to join a node: the invalid FQDN is shown in the join validation error
  • Proceed with test case 1 by changing the leader FQDN as described above in the bug report

Test case 1

Check that the join works after fixing the VPN endpoint with this command (assuming 1 is the NODE_ID of the leader):

redis-cli hset node/1/vpn endpoint rl1.dp.nethserver.net:55820

The bug is fixed if the worker node is still capable of joining the cluster after a failed attempt.
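
To double-check the fix before retrying the join, the endpoint can be read back from Redis and resolved. A small sketch assuming the redis-py package and an unauthenticated local connection (on a real NS8 node, credentials may be required):

import socket
import redis  # redis-py, assumed available on the node

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)
endpoint = r.hget("node/1/vpn", "endpoint")  # e.g. "rl1.dp.nethserver.net:55820"
host, port = endpoint.rsplit(":", 1)
# Raises socket.gaierror if the endpoint is still unresolvable.
socket.getaddrinfo(host, port, proto=socket.IPPROTO_UDP)
print(f"{endpoint} resolves; the join can be retried")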

DavidePrincipi moved this from 🏗 In progress to 👀 Testing in NethServer, Jun 26, 2024
DavidePrincipi added the testing label (Packages are available from testing repositories), Jun 27, 2024
nrauso self-assigned this, Jun 27, 2024
nrauso commented Jun 27, 2024

test case 0: VERIFIED

In the event of an invalid domain for the leader, the join attempts generate a clear error:

[screenshots: join01, join02]

test case 1: VERIFIED

Once the new, correct FQDN for the leader is set and the VPN endpoint is fixed in redis, the join works flawlessly.

nrauso added the verified label and removed the testing label, Jun 27, 2024

github-project-automation bot moved this from 👀 Testing to ✅ Done in NethServer, Jun 28, 2024