DNS Resolving issues with deploying workload to k3s #4486

Closed
ajvn opened this issue Nov 13, 2021 · 13 comments

ajvn commented Nov 13, 2021

Environmental Info:
K3s Version:

k3s version v1.22.3+k3s1 (61a2aab2)
go version go1.16.8

This also happens with 1.21.4, 1.21.5, and 1.21.6, across RCs; I haven't checked other versions.

Node(s) CPU architecture, OS, and Version:

Linux RPI4-2 5.11.0-1021-raspi #22-Ubuntu SMP PREEMPT Wed Oct 6 17:30:38 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
NAME="Ubuntu"
VERSION="21.04 (Hirsute Hippo)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 21.04"
VERSION_ID="21.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=hirsute
UBUNTU_CODENAME=hirsute

Cluster Configuration:

RPi 4s, 8 GB and 4 GB versions.

NAME     STATUS   ROLES                  AGE   VERSION
rpi4-0   Ready    control-plane,master   20d   v1.22.3+k3s1
rpi4-2   Ready    agent                  20d   v1.22.3+k3s1
rpi4-1   Ready    agent                  20d   v1.22.3+k3s1

Contents of /etc/hosts:

# Localhost block
127.0.0.1 localhost
192.168.0.150 rpi4-0.localhost.localdomain
192.168.0.151 rpi4-1.localhost.localdomain
192.168.0.152 rpi4-2.localhost.localdomain
192.168.0.200 rpi4-nfs.localhost.localdomain

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Describe the bug:

The problem happens while trying to deploy Pihole to the cluster: the nodes become unable to resolve the public hostnames of container image registries.
I've only tried the Pihole deployment, but I assume any other would fail with the same issue.

20s         Warning   Failed  pod/pihole-78d8dbbb75-5rltq      Failed to pull image "pihole/pihole:2021.10.1": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/pihole/pihole:2021.10.1": failed to resolve reference "docker.io/pihole/pihole:2021.10.1": failed to do request: Head "https://registry-1.docker.io/v2/pihole/pihole/manifests/2021.10.1": dial tcp: lookup registry-1.docker.io: Try again
11s         Warning   Failed  pod/svclb-pihole-dns-tcp-hkmhc   Failed to pull image "rancher/klipper-lb:v0.3.4": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/klipper-lb:v0.3.4": failed to resolve reference "docker.io/rancher/klipper-lb:v0.3.4": failed to do request: Head "https://registry-1.docker.io/v2/rancher/klipper-lb/manifests/v0.3.4": dial tcp: lookup registry-1.docker.io: Try again

Here's a more visual way to observe the issue across the nodes. This is what happens when I ping Google before and during the deployment; partway through, the replies stop showing a resolved hostname and fall back to the bare IP:

╰─➤  ping google.com
PING google.com (142.250.186.174) 56(84) bytes of data.
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=1 ttl=115 time=34.9 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=2 ttl=115 time=28.4 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=3 ttl=115 time=27.5 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=4 ttl=115 time=24.7 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=5 ttl=115 time=24.7 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=6 ttl=115 time=29.0 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=7 ttl=115 time=22.3 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=8 ttl=115 time=22.0 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=9 ttl=115 time=23.0 ms
64 bytes from fra24s08-in-f14.1e100.net (142.250.186.174): icmp_seq=10 ttl=115 time=22.6 ms
64 bytes from 142.250.186.174: icmp_seq=11 ttl=115 time=23.1 ms
64 bytes from 142.250.186.174: icmp_seq=12 ttl=115 time=22.2 ms
64 bytes from 142.250.186.174: icmp_seq=13 ttl=115 time=24.6 ms
64 bytes from 142.250.186.174: icmp_seq=14 ttl=115 time=22.0 ms
64 bytes from 142.250.186.174: icmp_seq=15 ttl=115 time=23.7 ms
64 bytes from 142.250.186.174: icmp_seq=16 ttl=115 time=22.2 ms
64 bytes from 142.250.186.174: icmp_seq=17 ttl=115 time=26.1 ms
64 bytes from 142.250.186.174: icmp_seq=18 ttl=115 time=24.6 ms
64 bytes from 142.250.186.174: icmp_seq=19 ttl=115 time=24.2 ms
64 bytes from 142.250.186.174: icmp_seq=20 ttl=115 time=25.2 ms
^C64 bytes from 142.250.186.174: icmp_seq=21 ttl=115 time=27.5 ms

--- google.com ping statistics ---
21 packets transmitted, 21 received, 0% packet loss, time 60193ms
rtt min/avg/max/mdev = 22.029/24.981/34.886/3.057 ms

After the deployment is removed, DNS starts working normally again. While the deployment is trying to come up, DNS is completely broken across the nodes.

Before deployment/after deployment removal:

telnet registry-1.docker.io 80
Trying 52.204.76.244...
Connected to registry-1.docker.io.
Escape character is '^]'.
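
For what it's worth, a quick way to confirm the breakage on every node at once would be an Ansible task along these lines (hypothetical, not part of my playbook; the hostname is only an example):

- name: Check that nodes can still resolve the registry
  ansible.builtin.command: getent hosts registry-1.docker.io
  register: lookup_result
  changed_when: false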

Steps To Reproduce:

I'm using Ansible to set up the cluster; the task files and configs are below, followed by a sketch of how they fit together:

  • Task for preparing master node:
---
- name: Install sqlite3 to enable K3S state backups
  apt:
    name: sqlite3
    state: present

- name: Create Rancher configuration directory
  ansible.builtin.file:
    path: /etc/rancher/k3s
    state: directory
    mode: '0755'

- name: Upload server configuration file
  ansible.builtin.copy:
    src: ../extras/server-config.yaml
    dest: /etc/rancher/k3s/config.yaml
    owner: root
    group: root
    mode: '0400'

- name: Ensure agent-token value is present in config file
  ansible.builtin.lineinfile:
    path: /etc/rancher/k3s/config.yaml
    line: 'agent-token: {{ agent_token }}'
  no_log: True

- name: Upload systemd service file
  ansible.builtin.copy:
    src: ../extras/k3s-server.service
    dest: /etc/systemd/system/k3s.service
    owner: root
    group: root
    mode: '0644'

- name: Setup systemd service
  ansible.builtin.systemd:
    name: k3s.service
    state: started
    enabled: yes
    daemon_reload: yes
  • Task for preparing agent nodes:
---
- name: Create Rancher configuration directory
  ansible.builtin.file:
    path: /etc/rancher/k3s
    state: directory
    mode: '0755'

- name: Upload agent configuration file
  ansible.builtin.copy:
    src: ../extras/agent-config.yaml
    dest: /etc/rancher/k3s/config.yaml
    owner: root
    group: root
    mode: '0400'

- name: Ensure agent-token value is present in config file
  ansible.builtin.lineinfile:
    path: /etc/rancher/k3s/config.yaml
    line: 'token: {{ agent_token }}'
  no_log: True

- name: Upload systemd service file
  ansible.builtin.copy:
    src: ../extras/k3s-agent.service
    dest: /etc/systemd/system/k3s.service
    owner: root
    group: root
    mode: '0644'

- name: Setup systemd service
  ansible.builtin.systemd:
    name: k3s.service
    state: started
    enabled: yes
    daemon_reload: yes
  • Task for preparing all of the RPIs:
---
- name: Upload hosts file
  ansible.builtin.copy:
    src: ../extras/hosts
    dest: /etc/hosts
    owner: root
    group: root
    mode: '0644'

- name: Download k3s binary
  get_url:
    url: '{{ k3s_download_url }}'
    dest: /usr/local/bin/k3s
    checksum: '{{ k3s_download_checksum }}'
    mode: '0744'

- name: Install nfs-common package
  apt:
    name: nfs-common
    state: present
  • Master config:
datastore-endpoint: "sqlite"
disable:
  - "local-storage"
write-kubeconfig-mode: "0600"
node-label:
  - "node=admin"
agent-token: "<token>" #This is being provided during the playbook run, from the vault.
  • Agent config:
server: "https://rpi4-0.localhost.localdomain:6443"
node-label:
  - "node=agent"
token: "<token>" #This is being provided during the playbook run, from the vault.
  • Master systemd service:
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStart=/usr/local/bin/k3s server --config /etc/rancher/k3s/config.yaml
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
  • Agent systemd service:
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
ExecStart=/usr/local/bin/k3s agent --config /etc/rancher/k3s/config.yaml
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
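
For context, the playbook that ties these task files together looks roughly like the sketch below; the group names and task-file paths are placeholders rather than my exact layout:

---
- hosts: all
  become: true
  tasks:
    - ansible.builtin.import_tasks: tasks/common.yaml  # hosts file, k3s binary, nfs-common

- hosts: master
  become: true
  tasks:
    - ansible.builtin.import_tasks: tasks/master.yaml  # server config + k3s.service

- hosts: agents
  become: true
  tasks:
    - ansible.builtin.import_tasks: tasks/agent.yaml   # agent config + k3s.service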

Expected behavior:

Being able to deploy the workload with working DNS.

Actual behavior:

Not being able to deploy the workload because DNS is broken.

Additional context / logs:

Some logs from master k3s:

Nov 13 07:28:05 RPI4-0 k3s[2133853]: time="2021-11-13T07:28:05Z" level=info msg="Handling backend connection request [rpi4-1]"
Nov 13 07:28:06 RPI4-0 k3s[2133853]: I1113 07:28:06.595401 2133853 kubelet_volumes.go:160] "Cleaned up orphaned pod volumes dir" podUID=0883c307-a89d-4177-bbf5-6c6eafc4afe9 path="/var/lib/kubelet/pods/0883c307-a89d-4177-bbf5-6c6eafc4afe9/volumes"
Nov 13 07:28:12 RPI4-0 k3s[2133853]: I1113 07:28:12.648979 2133853 job_controller.go:406] enqueueing job kube-system/helm-install-traefik
Nov 13 07:28:13 RPI4-0 k3s[2133853]: I1113 07:28:13.861804 2133853 job_controller.go:406] enqueueing job kube-system/helm-install-traefik-crd
Nov 13 07:28:16 RPI4-0 k3s[2133853]: I1113 07:28:16.460564 2133853 event.go:291] "Event occurred" object="kube-system/traefik-97b44b794-4bbcr" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Cancelling deletion of Pod kube-system/traefik-97b44b794-4bbcr"
Nov 13 07:28:16 RPI4-0 k3s[2133853]: I1113 07:28:16.460736 2133853 event.go:291] "Event occurred" object="kube-system/helm-install-traefik--1-wcff5" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Cancelling deletion of Pod kube-system/helm-install-traefik--1-wcff5"
Nov 13 07:28:16 RPI4-0 k3s[2133853]: I1113 07:28:16.460803 2133853 event.go:291] "Event occurred" object="kube-system/helm-install-traefik-crd--1-mrqwq" kind="Pod" apiVersion="" type="Normal" reason="TaintManagerEviction" message="Cancelling deletion of Pod kube-system/helm-install-traefik-crd--1-mrqwq"
Nov 13 07:28:18 RPI4-0 k3s[2133853]: E1113 07:28:18.745224 2133853 remote_image.go:114] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"docker.io/rancher/klipper-lb:v0.3.4\": failed to copy: httpReadSeeker: failed open: failed to do request: Get \"https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/62/625882d9991e41d98c4b3d51384d1c8dd99cc36246a81cbfbbdadb9b7828ff3f/data?verify=1636791499-P%2FHfvq3iYd335r0eMn%2FIMbzkn88%3D\": dial tcp: lookup production.cloudflare.docker.com: Try again" image="rancher/klipper-lb:v0.3.4"
Nov 13 07:28:18 RPI4-0 k3s[2133853]: E1113 07:28:18.745503 2133853 kuberuntime_image.go:51] "Failed to pull image" err="rpc error: code = Unknown desc = failed to pull and unpack image \"docker.io/rancher/klipper-lb:v0.3.4\": failed to copy: httpReadSeeker: failed open: failed to do request: Get \"https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/62/625882d9991e41d98c4b3d51384d1c8dd99cc36246a81cbfbbdadb9b7828ff3f/data?verify=1636791499-P%2FHfvq3iYd335r0eMn%2FIMbzkn88%3D\": dial tcp: lookup production.cloudflare.docker.com: Try again" image="rancher/klipper-lb:v0.3.4"
Nov 13 07:28:18 RPI4-0 k3s[2133853]: E1113 07:28:18.745926 2133853 kuberuntime_manager.go:898] container &Container{Name:lb-port-80,Image:rancher/klipper-lb:v0.3.4,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:lb-port-80,HostPort:80,ContainerPort:80,Protocol:TCP,HostIP:,},},Env:[]EnvVar{EnvVar{Name:SRC_PORT,Value:80,ValueFrom:nil,},EnvVar{Name:DEST_PROTO,Value:TCP,ValueFrom:nil,},EnvVar{Name:DEST_PORT,Value:80,ValueFrom:nil,},EnvVar{Name:DEST_IPS,Value:10.43.234.3,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[NET_ADMIN],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod svclb-traefik-7trbj_kube-system(88ce2f96-0d05-4ea1-8c08-caab28afc45d): ErrImagePull: rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/klipper-lb:v0.3.4": failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/62/625882d9991e41d98c4b3d51384d1c8dd99cc36246a81cbfbbdadb9b7828ff3f/data?verify=1636791499-P%2FHfvq3iYd335r0eMn%2FIMbzkn88%3D": dial tcp: lookup production.cloudflare.docker.com: Try again
Nov 13 07:28:18 RPI4-0 k3s[2133853]: E1113 07:28:18.748964 2133853 pod_workers.go:836] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"lb-port-80\" with ErrImagePull: \"rpc error: code = Unknown desc = failed to pull and unpack image \\\"docker.io/rancher/klipper-lb:v0.3.4\\\": failed to copy: httpReadSeeker: failed open: failed to do request: Get \\\"https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/62/625882d9991e41d98c4b3d51384d1c8dd99cc36246a81cbfbbdadb9b7828ff3f/data?verify=1636791499-P%2FHfvq3iYd335r0eMn%2FIMbzkn88%3D\\\": dial tcp: lookup production.cloudflare.docker.com: Try again\", failed to \"StartContainer\" for \"lb-port-443\" with ImagePullBackOff: \"Back-off pulling image \\\"rancher/klipper-lb:v0.3.4\\\"\"]" pod="kube-system/svclb-traefik-7trbj" podUID=88ce2f96-0d05-4ea1-8c08-caab28afc45d
Nov 13 07:28:18 RPI4-0 k3s[2133853]: E1113 07:28:18.925672 2133853 pod_workers.go:836] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"lb-port-80\" with ImagePullBackOff: \"Back-off pulling image \\\"rancher/klipper-lb:v0.3.4\\\"\", failed to \"StartContainer\" for \"lb-port-443\" with ImagePullBackOff: \"Back-off pulling image \\\"rancher/klipper-lb:v0.3.4\\\"\"]" pod="kube-system/svclb-traefik-7trbj" podUID=88ce2f96-0d05-4ea1-8c08-caab28afc45d
Nov 13 07:28:31 RPI4-0 k3s[2133853]: E1113 07:28:31.976407 2133853 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: an error on the server ("Internal Server Error: \"/apis/metrics.k8s.io/v1beta1?timeout=32s\": the server could not find the requested resource") has prevented the request from succeeding
Nov 13 07:28:32 RPI4-0 k3s[2133853]: W1113 07:28:32.043970 2133853 garbagecollector.go:703] failed to discover some groups: map[metrics.k8s.io/v1beta1:an error on the server ("Internal Server Error: \"/apis/metrics.k8s.io/v1beta1?timeout=32s\": the server could not find the requested resource") has prevented the request from succeeding]
Nov 13 07:28:39 RPI4-0 k3s[2133853]: I1113 07:28:39.187155 2133853 event.go:291] "Event occurred" object="kube-system/kube-dns" kind="Endpoints" apiVersion="v1" type="Warning" reason="FailedToUpdateEndpoint" message="Failed to update endpoint kube-system/kube-dns: Operation cannot be fulfilled on endpoints \"kube-dns\": the object has been modified; please apply your changes to the latest version and try again"
Nov 13 07:28:39 RPI4-0 k3s[2133853]: I1113 07:28:39.943575 2133853 event.go:291] "Event occurred" object="pihole/pihole" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"cluster.local/nfs-subdir-external-provisioner\" or manually created by system administrator"
Nov 13 07:28:39 RPI4-0 k3s[2133853]: I1113 07:28:39.945143 2133853 event.go:291] "Event occurred" object="pihole/pihole" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"cluster.local/nfs-subdir-external-provisioner\" or manually created by system administrator"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.170912 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-tcp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-tcp-xjfkp"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.246659 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-tcp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-tcp-vrpqm"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.264243 2133853 controller.go:611] quota admission added evaluator for: ingresses.networking.k8s.io
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.287714 2133853 event.go:291] "Event occurred" object="kube-system/metrics-server" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled down replica set metrics-server-86cbb8457f to 0"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.308915 2133853 event.go:291] "Event occurred" object="pihole/pihole" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set pihole-78d8dbbb75 to 1"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.327121 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-udp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-udp-g7wzl"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.327216 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-tcp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-tcp-hkmhc"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.336341 2133853 topology_manager.go:200] "Topology Admit Handler"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: W1113 07:28:40.390339 2133853 container.go:586] Failed to update stats for container "/kubepods/besteffort/pod719e5583-5ef1-4fa8-b943-d1aa325d941c": /sys/fs/cgroup/cpuset/kubepods/besteffort/pod719e5583-5ef1-4fa8-b943-d1aa325d941c/cpuset.mems found to be empty, continuing to push stats
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.408076 2133853 event.go:291] "Event occurred" object="kube-system/metrics-server-86cbb8457f" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulDelete" message="Deleted pod: metrics-server-86cbb8457f-94rsl"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.681743 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-udp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-udp-ttprg"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: E1113 07:28:40.745346 2133853 available_controller.go:524] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.17.169:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.17.169:443/apis/metrics.k8s.io/v1beta1": dial tcp 10.43.17.169:443: connect: connection refused
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.802135 2133853 event.go:291] "Event occurred" object="pihole/svclb-pihole-dns-udp" kind="DaemonSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: svclb-pihole-dns-udp-ljd2l"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.863705 2133853 topology_manager.go:200] "Topology Admit Handler"
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.902157 2133853 trace.go:205] Trace[1302298980]: "GuaranteedUpdate etcd3" type:*core.Pod (13-Nov-2021 07:28:40.378) (total time: 523ms):
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[1302298980]: ---"Transaction committed" 523ms (07:28:40.901)
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[1302298980]: [523.831973ms] [523.831973ms] END
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.903379 2133853 trace.go:205] Trace[698766046]: "Create" url:/api/v1/namespaces/pihole/pods/svclb-pihole-dns-udp-g7wzl/binding,user-agent:k3s/v1.22.3+k3s1 (linux/arm64) kubernetes/61a2aab/scheduler,audit-id:e4ab70b1-045e-4bcd-8a63-9af39c2fced8,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (13-Nov-2021 07:28:40.377) (total time: 526ms):
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[698766046]: ---"Object stored in database" 524ms (07:28:40.902)
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[698766046]: [526.101945ms] [526.101945ms] END
Nov 13 07:28:40 RPI4-0 k3s[2133853]: I1113 07:28:40.936346 2133853 trace.go:205] Trace[197358589]: "Create" url:/apis/events.k8s.io/v1/namespaces/pihole/events,user-agent:k3s/v1.22.3+k3s1 (linux/arm64) kubernetes/61a2aab/scheduler,audit-id:f8d49856-0dea-4ec7-aa8f-ec5938c8809f,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (13-Nov-2021 07:28:40.342) (total time: 593ms):
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[197358589]: ---"Object stored in database" 593ms (07:28:40.935)
Nov 13 07:28:40 RPI4-0 k3s[2133853]: Trace[197358589]: [593.793712ms] [593.793712ms] END
Nov 13 07:28:41 RPI4-0 k3s[2133853]: I1113 07:28:40.960095 2133853 trace.go:205] Trace[1547416]: "Create" url:/api/v1/namespaces/pihole/pods,user-agent:k3s/v1.22.3+k3s1 (linux/arm64) kubernetes/61a2aab/system:serviceaccount:kube-system:replicaset-controller,audit-id:ccab7655-e9c7-4384-8a5a-5e81bb653eaa,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (13-Nov-2021 07:28:40.383) (total time: 576ms):
Nov 13 07:28:41 RPI4-0 k3s[2133853]: Trace[1547416]: ---"Object stored in database" 563ms (07:28:40.947)
Nov 13 07:28:41 RPI4-0 k3s[2133853]: Trace[1547416]: [576.974211ms] [576.974211ms] END

Backporting

  • Needs backporting to older releases (if it's indeed a K3s issue, and not an issue with my setup)
@manuelbuil
Contributor

I have a few questions.

1 - Where are you trying to resolve hostnames: on the node, or inside a pod? Are both failing when you experience the problem?

2 - The problem happens when you deploy the pihole deployment. Does it happen with other deployments too?

3 - Can you share how you are installing the pihole deployment please?


ajvn commented Nov 15, 2021

1 - On the node(s).
2 - It happens with anything that tries to pull new images from the public repositories; you can see in the logs above that it also happens when pulling the rancher/klipper-lb:v0.3.4 image.
3 - Using this Helm chart: https://github.com/MoJo2600/pihole-kubernetes/tree/master/charts/pihole

@manuelbuil
Contributor

2 - It happens with anything that tries to pull new images from the public repositories; you can see in the logs above that it also happens when pulling the rancher/klipper-lb:v0.3.4 image.

I think I did not explain myself correctly, let me clarify :). You mentioned "After the deployment is removed, DNS starts working normally again." By deployment, I understand you mean the pihole deployment. My question is: what happens if you don't deploy pihole but deploy something else instead, is DNS broken too? Or does that only happen specifically when deploying pihole?


ajvn commented Nov 15, 2021

It used to be any deployment with 1.21.x, but I haven't tested anything besides pihole with 1.22.x. Let me do that after I'm done with work, and then I'll report back.

Thanks for taking the time to address this issue.

@dhermanns

Same problem here. It worked just fine yesterday. I tried older chart versions down to 2.5.1 with no success.
Could it be that we're just running into Docker pull quotas today?
Could it be just running into docker pull quotas today?

@manuelbuil
Contributor

Same problem here. It worked just fine yesterday. I tried older chart versions down to 2.5.1 with no success. Could it be that we're just running into Docker pull quotas today?

Also with pihole?

@dhermanns

Yes - but I've drilled down on it now. In my case it was a simple DNS resolution issue that I solved by fixing the nameserver in /etc/resolv.conf.
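
For anyone hitting the same thing, a rough sketch of that fix as an Ansible task, assuming systemd-resolved or DHCP is not rewriting the file, and using an example upstream server:

- name: Pin /etc/resolv.conf to an upstream nameserver
  ansible.builtin.copy:
    content: |
      nameserver 1.1.1.1
    dest: /etc/resolv.conf
    owner: root
    group: root
    mode: '0644'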


ajvn commented Nov 30, 2021

@manuelbuil Apologies for the delay, it's been a busy couple of weeks. I've just tried Gogs, and downloading images works properly, so it seems like this is Pihole related. Let me know if you want to investigate this further; if not, I'll close the issue and keep investigating on the Pihole side.

Thank you.

@manuelbuil
Contributor

If it's related to Pihole, I'd prefer to close the issue to avoid confusion :)


ajvn commented Nov 30, 2021

Will do. I'll post a solution here if I manage to figure it out.

ajvn closed this as completed Nov 30, 2021

ajvn commented Dec 21, 2021

Confirming that it was indeed not an issue with K3s, but rather with the Pihole installation; I managed to get it working a couple of days ago. If anyone is facing a similar issue, feel free to reach out, as this is probably not the right place for that kind of help and the solution is a bit involved.

@diogosilva30

@ajvn can you explain your solution? I'm struggling with the same issue. Sorry for commenting here, I reckon this is not the appropriate place, but I couldn't find any contact info on your GitHub profile.

diogosilva30 added a commit to diogosilva30/k3s.dsilva.dev that referenced this issue May 20, 2023
- When deploying pihole on port 53 of the Kubernetes cluster, the cluster would fail on any type of request that needed a DNS lookup. It turns out the VM's configured DNS was "127.0.0.53" (a local stub resolver), instead of an upstream DNS server like Cloudflare or Google.

Refs: k3s-io/k3s#4486 MoJo2600/pihole-kubernetes#88
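
One way to express that fix on the k3s side is sketched below; it assumes the resolv-conf agent option (which maps to the kubelet's --resolv-conf flag), and the file path and nameserver are placeholders:

# Addition to /etc/rancher/k3s/config.yaml on each node:
resolv-conf: /etc/rancher/k3s/resolv.conf
# ...where /etc/rancher/k3s/resolv.conf would contain a single upstream entry such as:
#   nameserver 1.1.1.1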

ajvn commented May 20, 2023

@diogosilva30 Hello, unfortunately I don't recall what the fix was, and my pihole Git history starts 2 days after the comment where I said I'd found the solution.

If you'd like, we can continue in the project you referenced in this issue: open an issue there and tag me, and maybe we can compare your setup to mine and reverse-engineer the differences.

P.S.
One thing from my values.yaml that seems relevant to this (using the MoJo2600 chart):

podDnsConfig:
  enabled: true
  policy: "ClusterFirstWithHostNet"
  nameservers:
    - 127.0.0.1
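    # assumption: 127.0.0.1 here makes the pod resolve via Pi-hole's own dnsmasq first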
    - 208.67.222.222 # OpenDNS public nameserver
