Rancher doesn't work on VPN / Corporate Proxy #995

Closed
Tracked by #1115
Pycnomerus opened this issue Nov 24, 2021 · 31 comments

@Pycnomerus

Rancher Desktop Version

0.6.1

Rancher Desktop K8s Version

1.21.6

What operating system are you using?

Windows

Operating System / Build Version

20H2

What CPU architecture are you using?

x64

Windows User Only

Corporate VPN; we also have a required proxy when using that VPN. From what I can gather, Rancher Desktop is not aware of the proxy at all.

Actual Behavior

When connected to a VPN, image pulling and DNS resolution in the rancher-desktop WSL distro stop working.

Once off the VPN, images from public repos can be resolved and pulled again.

Steps to Reproduce

  • Connect to a VPN which requires the use of a proxy for internet access.
  • Try to pull any image.

Result

time="2021-11-24T15:05:11Z" level=info msg="trying next host" error="failed to do request: Head "https://dev.artifactory.internal.ca/v2/gasp/certificates/manifests/latest\": dial tcp: lookup dev.artifactory.internal.ca on 172.18.240.1:53: no such host" host=dev.artifactory.internal.ca
time="2021-11-24T15:05:11Z" level=fatal msg="failed to resolve reference "dev.artifactory.internal.ca/gasp/certificates:latest": failed to do request: Head "https://dev.artifactory.internal.ca/v2/gasp/certificates/manifests/latest\": dial tcp: lookup dev.artifactory.internal.ca on 172.18.240.1:53: no such host"

Expected Behavior

Rancher Desktop should use the proxy when resolving and pulling images.

Images can then be retrieved.

Additional Information

No response

@Pycnomerus Pycnomerus added the kind/bug Something isn't working label Nov 24, 2021
@ericpromislow
Contributor

Answered by Andrew G on slack: "This is a known issue with Cisco anyconnect and is an issue with WSL itself"

@agschrei

@ericpromislow I am currently experiencing the same issue (also on Windows) with a corporate proxy that isn't Cisco AnyConnect. My best guess is that the WSL distro that is launched by Rancher Desktop needs to be made aware of the proxy, but checking the configuration files that get installed in AppData I haven't found any section where I would be able to inject that information.

@mirraxian

@jandubois
Member

Another link that may explain DNS issues with VPN: microsoft/WSL#1350 (comment)

@Seekatar

Seekatar commented Dec 2, 2021

On Win10/WSL2, I am having a similar issue. I have several containers that hit resources over the internet. Some require VPN (Cisco AnyConnect) access. The names for ones that require VPN cannot be resolved. The host name looks like this:

myapp-dev.nonprod.aws.division.company.com

If I go into WSL Ubuntu or rancher-desktop distros, the name can resolve.

I created an Alpine container to test connectivity, and it cannot resolve the name. If I add the name to the Alpine hosts file, it works, so network connectivity is OK; only name resolution fails.

These containers work fine when built and run from Docker Desktop, podman, or minikube.
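
The hosts-file check described above can be repeated with a small helper. This is only a sketch for telling a DNS failure apart from a routing failure: `add_hosts_entry` is a hypothetical name, and the IP address in the usage note is a placeholder, not a value from this thread.

```shell
#!/bin/sh
# Sketch only: add_hosts_entry is a hypothetical helper, not part of any tool.
# Usage: add_hosts_entry HOSTS_FILE IP NAME
add_hosts_entry() {
    hosts_file=$1
    ip=$2
    name=$3

    # Skip the append when the name is already mapped (comments are ignored),
    # so repeated runs do not add duplicate lines.
    if awk -v n="$name" '
        { sub(/#.*/, "") }
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }
    ' "$hosts_file" 2>/dev/null; then
        return 0
    fi

    printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
}
```

Inside the test container this might be called as `add_hosts_entry /etc/hosts 10.0.0.5 myapp-dev.nonprod.aws.division.company.com` (placeholder IP). If the application then connects, the network path is fine and only name resolution is failing.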

@gaktive gaktive mentioned this issue Dec 15, 2021
2 tasks
@cidus

cidus commented Dec 21, 2021

I also have issues in Win10/WSL2 with a corporate proxy:
[image]

We use a non-transparent proxy with Active Directory authentication, so wherever I need to hard-code a proxy, I use a CNTLM localhost proxy.
I export the http_proxy and https_proxy environment variables for Cygwin/MinGW apps, but RD does not seem to use them.
It seems to pick up the Windows proxy address correctly, but it cannot authenticate.

If I change the Windows system proxy to my localhost CNTLM (which does not require authentication), RD makes some progress: it downloads the K8s versions and tries to pull the images, but it fails.

For every distro in WSL I need to export the system http_proxy settings; maybe that's one way to resolve the issue?
However, that's not a default Windows variable.

Example log contents:

E1221 22:49:54.582373     268 kuberuntime_manager.go:818] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed pulling image \"rancher/mirrored-pause:3.1\": Error response from daemon: Get \"https://registry-1.docker.io/v2/\": context deadline exceeded" pod="kube-system/helm-install-traefik-crd--1-6ttnm"
E1221 22:49:54.582695     268 pod_workers.go:918] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"helm-install-traefik-crd--1-6ttnm_kube-system(b2dc3265-7cf5-48c2-ac48-aa550f8c5c89)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"helm-install-traefik-crd--1-6ttnm_kube-system(b2dc3265-7cf5-48c2-ac48-aa550f8c5c89)\\\": rpc error: code = Unknown desc = failed pulling image \\\"rancher/mirrored-pause:3.1\\\": Error response from daemon: Get \\\"https://registry-1.docker.io/v2/\\\": context deadline exceeded\"" pod="kube-system/helm-install-traefik-crd--1-6ttnm" podUID=b2dc3265-7cf5-48c2-ac48-aa550f8c5c89
E1221 22:49:54.665659     268 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed pulling image \"rancher/mirrored-pause:3.1\": Error response from daemon: Get \"https://registry-1.docker.io/v2/\": dial tcp 52.55.168.20:443: i/o timeout"
E1221 22:49:54.665933     268 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed pulling image \"rancher/mirrored-pause:3.1\": Error response from daemon: Get \"https://registry-1.docker.io/v2/\": dial tcp 52.55.168.20:443: i/o timeout" pod="kube-system/helm-install-traefik--1-2gxsv"
E1221 22:49:54.666180     268 kuberuntime_manager.go:818] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed pulling image \"rancher/mirrored-pause:3.1\": Error response from daemon: Get \"https://registry-1.docker.io/v2/\": dial tcp 52.55.168.20:443: i/o timeout" pod="kube-system/helm-install-traefik--1-2gxsv"
E1221 22:49:54.666481     268 pod_workers.go:918] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"helm-install-traefik--1-2gxsv_kube-system(dbdfd57a-806f-41ba-8a71-9d074c1c7e25)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"helm-install-traefik--1-2gxsv_kube-system(dbdfd57a-806f-41ba-8a71-9d074c1c7e25)\\\": rpc error: code = Unknown desc = failed pulling image \\\"rancher/mirrored-pause:3.1\\\": Error response from daemon: Get \\\"https://registry-1.docker.io/v2/\\\": dial tcp 52.55.168.20:443: i/o timeout\"" pod="kube-system/helm-install-traefik--1-2gxsv" podUID=dbdfd57a-806f-41ba-8a71-9d074c1c7e25
E1221 22:49:55.682253     268 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed pulling image \"rancher/mirrored-pause:3.1\": Error response from daemon: Get \"https://registry-1.docker.io/v2/\": dial tcp 52.55.168.20:443: i/o timeout"
E1221 22:49:55.682597     268 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed pulling image \"rancher/mirrored-pause:3.1\": Error response from daemon: Get \"https://registry-1.docker.io/v2/\": dial tcp 52.55.168.20:443: i/o timeout" pod="kube-system/metrics-server-9cf544f65-dwxjk"
E1221 22:49:55.683016     268 kuberuntime_manager.go:818] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed pulling image \"rancher/mirrored-pause:3.1\": Error response from daemon: Get \"https://registry-1.docker.io/v2/\": dial tcp 52.55.168.20:443: i/o timeout" pod="kube-system/metrics-server-9cf544f65-dwxjk"
E1221 22:49:55.683341     268 pod_workers.go:918] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"metrics-server-9cf544f65-dwxjk_kube-system(1be77d99-304a-4564-9306-3605b2c763a9)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"metrics-server-9cf544f65-dwxjk_kube-system(1be77d99-304a-4564-9306-3605b2c763a9)\\\": rpc error: code = Unknown desc = failed pulling image \\\"rancher/mirrored-pause:3.1\\\": Error response from daemon: Get \\\"https://registry-1.docker.io/v2/\\\": dial tcp 52.55.168.20:443: i/o timeout\"" pod="kube-system/metrics-server-9cf544f65-dwxjk" podUID=1be77d99-304a-4564-9306-3605b2c763a9
E1221 22:50:09.352650     268 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
W1221 22:50:09.838525     268 garbagecollector.go:703] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]

@ram3shpala

I have the same issue: when connected to the VPN, Rancher Desktop can't connect to my organization's image repo, and it also can't connect to public image repos. I tried the same with Docker Desktop on WSL2 while connected to the VPN, and it works fine.

@autrk

autrk commented Jan 3, 2022

A hint which may help:

The proxy variables must be set in the "rancher-desktop" distro. Since it is an Alpine distro without a non-root user, I set the proxy variables in the file /etc/profile. Additionally, I had to change /etc/resolv.conf.

@jandubois jandubois added this to the v1.0.0 milestone Jan 12, 2022
@gaktive gaktive modified the milestones: v1.0.0, v1.1.0 Jan 15, 2022
@louzadod

Hi @ckihm, I could not find any "rancher-desktop" distro in WSL after running the Rancher Desktop installer file "Rancher.Desktop.Setup.1.0.0-beta.1.exe".

Is it hidden?

Thanks in advance.

@autrk

autrk commented Jan 25, 2022

Hi @louzadod,
it is visible. You should be able to see it in the result of the following command: wsl -l -v

@cidus

cidus commented Jan 25, 2022

Hi @ckihm, thanks for the pointers, I've set the HTTP_PROXY / HTTPS_PROXY variables in /etc/profile of the rancher-desktop WSL distro but docker/k3s still cannot download images, any other clue? The proxy is correct, curl works.

@cidus

cidus commented Jan 26, 2022

Looking further into this, the alpine distro used as base for rancher-desktop uses openrc to manage the service calls, and openrc has no way of passing environment variables other than editing the init.d and/or conf.d files. However, RD replaces these files on each startup, so editing them is useless.

There's actually one piece of config that could help in /etc/rc.conf, and that's rc_env_allow in which you can define a list of variables to be passed to all services, but it has no effect. It may be ignored or the variables are set only for the wsl-helper process, which is what's actually being called in the init.d/docker script.

@mook-as
Contributor

mook-as commented Jan 26, 2022

Setting rc_env_allow should work, as long as you define the variables in /etc/environment (the file doesn't exist by default). However, we're not guaranteeing it will stay working in the future (it's an implementation detail rather than an intentionally designed feature); in the meantime, though, using it shouldn't cause any issues.

I've mentioned this before in #1267 (comment).
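
A minimal sketch of that setup, assuming a POSIX shell inside the rancher-desktop distro and plain `KEY=value` lines in /etc/environment (the exact format is an assumption). `write_proxy_env`, its `ROOT` prefix argument and the proxy URL in the usage note are hypothetical, and since this relies on an implementation detail it may stop working in later releases.

```shell
#!/bin/sh
# Sketch only: write_proxy_env is a hypothetical helper, not part of Rancher Desktop.
# Usage: write_proxy_env ROOT PROXY_URL [NO_PROXY]
#   ROOT is a path prefix: "" for the real filesystem, or a scratch
#   directory to preview the generated files without touching /etc.
write_proxy_env() {
    root=$1
    proxy=$2
    noproxy=${3:-localhost,127.0.0.1}

    mkdir -p "$root/etc"

    # Define the variables in /etc/environment (the file may not exist yet).
    env_file="$root/etc/environment"
    touch "$env_file"
    for pair in "http_proxy=$proxy" "https_proxy=$proxy" "no_proxy=$noproxy"; do
        name=${pair%%=*}
        # Replace an existing definition instead of appending a duplicate.
        grep -v "^$name=" "$env_file" > "$env_file.tmp" || true
        printf '%s\n' "$pair" >> "$env_file.tmp"
        mv "$env_file.tmp" "$env_file"
    done

    # Ask OpenRC to pass these variables through to the services it starts.
    # Note: this replaces any rc_env_allow line that is already present.
    rc_file="$root/etc/rc.conf"
    touch "$rc_file"
    grep -v '^rc_env_allow=' "$rc_file" > "$rc_file.tmp" || true
    printf 'rc_env_allow="http_proxy https_proxy no_proxy"\n' >> "$rc_file.tmp"
    mv "$rc_file.tmp" "$rc_file"
}
```

For a dry run, `write_proxy_env /tmp/rd-preview http://proxy.example:8080/` writes the two files under /tmp/rd-preview/etc so they can be inspected first; passing `""` as the first argument targets the real /etc.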

@cidus

cidus commented Jan 26, 2022

Thanks @mook-as, I was using /etc/profile to export the http_proxy variables.
So I moved them to /etc/environment, and now I can see the variables are part of the k3s process /environ.
However, they are part of neither the dockerd nor the containerd /environ... is dockerd not started by openrc?
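
One way to run this kind of check is to read each process's /proc/<pid>/environ file, which holds NUL-separated `NAME=value` pairs. The sketch below assumes a Linux /proc filesystem and a `pidof` command (BusyBox provides one); `proxy_vars_in` and `proxy_vars_of` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical, not part of Rancher Desktop.

# Print the proxy-related variables found in a NUL-separated environ file,
# such as /proc/<pid>/environ. Returns non-zero when none are present.
proxy_vars_in() {
    tr '\0' '\n' < "$1" | grep -i '^[a-z_]*proxy='
}

# Print the proxy variables seen by every running process with the given name.
proxy_vars_of() {
    for pid in $(pidof "$1"); do
        printf '%s (pid %s):\n' "$1" "$pid"
        proxy_vars_in "/proc/$pid/environ" || printf '  (no proxy variables)\n'
    done
}
```

Running `proxy_vars_of k3s`, `proxy_vars_of dockerd` and `proxy_vars_of containerd` inside the distro shows which services actually received the variables.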

@mook-as
Contributor

mook-as commented Jan 26, 2022

dockerd is started by OpenRC (via wsl-helper docker-proxy start … as found in /etc/init.d/docker), so that should be there provided rc_env_allow is set appropriately.

Similarly, containerd is started via k3s (unless you're using the docker backend, in which case it's started by dockerd).

@cidus

cidus commented Jan 27, 2022

I don't know if it's something that changed from beta1 to 1.0.0, but I was able to make it work now.
I put the export http_proxy commands in /etc/rc.conf and configured rc_env_allow to pass them.
Thank you @mook-as and @ckihm for your help. I can now pull images from the internet using moby, and docker-cli.

I'm still getting the kubernetes error "http: not supported, Expected https:" though

@lukaszmat

I also encounter the same problem on Windows 10. Thanks @cidus for the idea. I am not sure if /etc/rc.conf was designed for adding such export commands to it.
But it really works for containerd, k3s and dockerd processes even if I leave rc_env_allow unset.

@eligne-pro

Hi

I am seeing another symptom, linked to the http-proxy variables on the W10 host.
If those are not set, Rancher Desktop starts, but then cannot work correctly because it fails to get through the proxy.
A workaround seems to be available through @cidus's answer (I have not tried it yet).
But...
we need to have the http-proxy and https-proxy variables set on the corporate W10 machine, as other tools rely on them to connect to the internet.
In this case, on my side Rancher Desktop does not even start. I suspect this is because everything is now routed to the proxy, including the connection to the Rancher WSL2 instance. no-proxy does not help here, because the IP address of the rancher-desktop WSL instance varies between executions.

Is there a way for rancher-desktop to ignore the host's proxy configuration when contacting its underlying WSL2 instance, e.g. for the connection to dockerd?

@fm-knopfler

As previously mentioned, adding the following lines to /etc/rc.conf did the trick for me. I use dockerd.

#corporate proxy configuration
rc_env_allow="http_proxy https_proxy no_proxy"
export http_proxy="http://mycompany.com:8080/"
export https_proxy="http://mycompany.com:8080/"

Note that you also need to have your DNS configured in /etc/resolv.conf, and make sure WSL doesn't overwrite the DNS settings by adding this option to /etc/wsl.conf.

[network]
generateResolvConf = false
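
The two DNS steps can be scripted together. The following is a sketch under stated assumptions: `pin_dns` and its `ROOT` prefix argument are hypothetical, the nameserver addresses are placeholders for your own DNS servers, and the function overwrites any existing wsl.conf, so merge by hand if that file already carries other settings.

```shell
#!/bin/sh
# Sketch only: pin_dns is a hypothetical helper, not part of Rancher Desktop.
# Usage: pin_dns ROOT NAMESERVER...
#   ROOT is a path prefix: "" for the real filesystem, or a scratch
#   directory to preview the generated files.
pin_dns() {
    root=$1
    shift
    mkdir -p "$root/etc"

    # Stop WSL from regenerating /etc/resolv.conf on startup.
    # Note: this overwrites any existing wsl.conf.
    printf '[network]\ngenerateResolvConf = false\n' > "$root/etc/wsl.conf"

    # Write a static resolver list. The existing file is removed first
    # because it may be a symlink to a generated file.
    rm -f "$root/etc/resolv.conf"
    for ns in "$@"; do
        printf 'nameserver %s\n' "$ns"
    done > "$root/etc/resolv.conf"
}
```

Inside the distro this might be called as `pin_dns "" 192.168.1.10 1.1.1.1` (placeholder addresses). Changes to wsl.conf only take effect after the distro has been restarted.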

@ericpromislow
Contributor

Thanks for that info!

@gaktive gaktive removed this from the Later milestone Feb 22, 2022
@justinp-kr

Hi all, I'd appreciate any help. I'm running Rancher Desktop 1.1.1 on Windows 10 Enterprise 20H2 19042.1526. I edited files /etc/rc.conf, /etc/environment, /etc/profile, /etc/resolv.conf, /etc/wsl.conf using:

wsl -d rancher-desktop -- vi [path]

I restarted the distribution by switching the container runtime in Rancher Desktop between containerd and dockerd. Then I confirmed the restart:

wsl -d rancher-desktop -- uptime

When I look at environment, I don't see any of the variables:

wsl -d rancher-desktop -- env

I tried with and without export. My changes do persist after I restart the distribution, but there are no variables in the environment and Rancher Desktop continues to display:

Error: Could not fetch releases: Proxy Authentication Required

I should see variables in environment, correct? Does some component not yet support format http://[username]:[password]@[hostname][:port]? There are no special chars in my username, password, hostname, or port.

@fm-knopfler

fm-knopfler commented Apr 28, 2022

It seems that neither the solution with rc.conf nor setting the proxy env variables in the provisioning script works in 1.2.1.

My provisioning script, which worked in previous releases:

#!/bin/sh
# Add the DNS servers to the dnsmasq resolver list if they are not already there.
ip="192.168.1.10"
if ! grep -q "$ip" /etc/dnsmasq.d/data-resolv-conf; then
  echo "nameserver $ip" >> /etc/dnsmasq.d/data-resolv-conf
fi

ip="1.1.1.1"
if ! grep -q "$ip" /etc/dnsmasq.d/data-resolv-conf; then
  echo "nameserver $ip" >> /etc/dnsmasq.d/data-resolv-conf
fi

# Only set the proxy variables when the proxy host is reachable.
prox='proxy.mycompany.com'
if ping -c 3 "$prox" >/dev/null 2>&1
then
	echo "rc_env_allow=http_proxy https_proxy no_proxy" >> /etc/environment
	echo "http_proxy=http://proxy.mycompany.com:8080/" >> /etc/environment
	echo "https_proxy=http://proxy.mycompany.com:8080/" >> /etc/environment
fi

@jandubois jandubois modified the milestones: v1.3.0, Later Apr 28, 2022
@Nino-K
Member

Nino-K commented Apr 28, 2022

We have implemented an experimental feature to address this issue; please take a look here for how to enable it.

Feel free to provide us with some feedback as we are planning to expose this feature as non-experimental in our upcoming release.

@jcageman

jcageman commented May 3, 2022

I've installed version 1.3.0 on Windows 10. I hadn't gotten rancher-desktop to work with my current VPN in previous versions, and have set "experimentalHostResolver": true to test whether it works now. As soon as I am connected to the VPN (Checkpoint VPN with "route all traffic through gateway" enabled), the following command stops working:

nerdctl pull nginx
docker.io/library/nginx:latest: resolving      |--------------------------------------|
elapsed: 29.9s                  total:   0.0 B (0.0 B/s)
INFO[0030] trying next host                              error="failed to do request: Head \"https://registry-1.docker.io/v2/library/nginx/manifests/latest\": dial tcp 52.21.28.242:443: i/o timeout" host=registry-1.docker.io
ERRO[0030] active check failed                           error="context canceled"
FATA[0030] failed to resolve reference "docker.io/library/nginx:latest": failed to do request: Head "https://registry-1.docker.io/v2/library/nginx/manifests/latest": dial tcp 52.21.28.242:443: i/o timeout

Also, Kubernetes won't start anymore as soon as I am connected to the VPN.

@Nino-K
Member

Nino-K commented May 3, 2022

@jcageman are you seeing the same results with "experimentalHostResolver" set to false, or are you getting a different set of errors? It looks like FATA[0030] failed to resolve reference comes from containerd simply failing on the name lookup and hitting an I/O timeout.

I suspect all traffic is being routed through the security gateway and nothing is being resolved by the host-resolver itself; the VPN client might treat host-resolver traffic as DNS hijacking. Are you able to verify that you have this configuration for split DNS enabled?

@jcageman

jcageman commented May 3, 2022

> @jcageman are you seeing the same results with "experimentalHostResolver" set to false, or are you getting a different set of errors? It looks like FATA[0030] failed to resolve reference comes from containerd simply failing on the name lookup and hitting an I/O timeout.
>
> I suspect all traffic is being routed through the security gateway and nothing is being resolved by the host-resolver itself; the VPN client might treat host-resolver traffic as DNS hijacking. Are you able to verify that you have this configuration for split DNS enabled?

Regarding "experimentalHostResolver": false, I don't see any difference. I also had the exact same issue in the previous version of rancher-desktop, which is basically the same as this option, right? I have created a ticket with our IT department asking whether they have split DNS enabled; I will come back to you on that. I have also seen this issue with Docker Desktop (which I was trying out previously). I believe it's a general WSL2 issue I am having with the VPN (I found many threads about it, but no working solution). In the meantime I have verified that the "route all traffic through gateway" box is definitely causing the behavior; if I use the VPN without it, it works fine.

@jcageman

jcageman commented May 4, 2022

> @jcageman are you seeing the same results with "experimentalHostResolver" set to false, or are you getting a different set of errors? It looks like FATA[0030] failed to resolve reference comes from containerd simply failing on the name lookup and hitting an I/O timeout.
> I suspect all traffic is being routed through the security gateway and nothing is being resolved by the host-resolver itself; the VPN client might treat host-resolver traffic as DNS hijacking. Are you able to verify that you have this configuration for split DNS enabled?
>
> Regarding "experimentalHostResolver": false, I don't see any difference. I also had the exact same issue in the previous version of rancher-desktop, which is basically the same as this option, right? I have created a ticket with our IT department asking whether they have split DNS enabled; I will come back to you on that. I have also seen this issue with Docker Desktop (which I was trying out previously). I believe it's a general WSL2 issue I am having with the VPN (I found many threads about it, but no working solution). In the meantime I have verified that the "route all traffic through gateway" box is definitely causing the behavior; if I use the VPN without it, it works fine.

I've got a reply from our IT department:

By default our VPN uses split-dns. So only services within our private cloud are routed through the VPN. All other public cloud and the general Internet is not routed through the tunnel. If that public cloud service is whitelisted for internal ip's only, you need to have an internal IP address and therefore enable the route all traffic option. Like you mentioned. But that seems to break the Rancher software.

So we are definitely using split DNS, and I think ticking the "route all traffic through gateway" box just means everything is routed through the VPN connection. I also get an internal IP in this case, which is exactly why I have it enabled, since many places within my company require that internal IP.

@jandubois jandubois modified the milestones: Next, Later May 20, 2022
@gaktive gaktive modified the milestones: Next, Later Jul 19, 2022
@gaktive gaktive modified the milestones: Next, Later Aug 30, 2022
@advanceboy

The cause of the VPN communication failure is not Rancher Desktop, but WSL and the WinNAT on Hyper-V that WSL uses.
It appears that DNS responses on WinNAT do not reach VMs under Hyper-V because of the routing metric configured by the VPN solution.

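Reference: "Fixing the problem where WSL2 and Hyper-V VMs lose network connectivity when connected to a VPN" (original title: VPN に繋ぐと WSL2 や Hyper-V VM でネットワークに繋がらなくなる問題を解消する)

The page above also describes the solution; an excerpt is quoted here.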

  1. Launch Windows PowerShell as administrator and execute the following code. Change the $InterfaceName variable according to the VPN solution you are using.
    If one run does not succeed, run the last line of this code repeatedly until the routing metric is rewritten.

    $InterfaceName = 'Cisco AnyConnect';
    # define function
    function Get-NetworkAddress ([Parameter(Mandatory, ValueFromPipelineByPropertyName)][string]$IPAddress, [Parameter(Mandatory, ValueFromPipelineByPropertyName)][int]$PrefixLength) {
        process {
            [pscustomobject]@{
                Addr = $IPAddress;
                Prfx = $PrefixLength;
                NwAddr = [ipaddress]::Parse($IPAddress).Address -band [uint64][BitConverter]::ToUInt32([System.Linq.Enumerable]::Reverse([BitConverter]::GetBytes([uint32](0xFFFFFFFFL -shl (32 - $PrefixLength) -band 0xFFFFFFFFL))), 0);
            };
        }
    }
    # extend route metric
    $targets = Get-NetAdapter | Where-Object InterfaceDescription -Match 'Hyper-V Virtual Ethernet Adapter' | Get-NetIPAddress -AddressFamily IPv4 | Get-NetworkAddress;
    Get-NetAdapter | Where-Object InterfaceDescription -Match $InterfaceName | Get-NetRoute -AddressFamily IPv4 | Select-Object -PipelineVariable rt | Where-Object { $targets | Where-Object { $_.NwAddr -eq (Get-NetworkAddress $rt.DestinationPrefix.Split('/')[0] $_.Prfx).NwAddr } } | Set-NetRoute -RouteMetric 6000;

Using this method, you should be able to connect to proxies in the intranet from within the container.

@Nino-K
Member

Nino-K commented Oct 27, 2022

@advanceboy thanks for the update. We have a solution here; however, it is a temporary workaround. We are working on a more robust, permanent solution.

@Nino-K
Member

Nino-K commented Apr 26, 2023

We introduced an experimental feature (#3810) in 1.8.1 that should fix your VPN issue. The feature will be fully baked in our next few upcoming releases. As I mentioned, it is experimental, and the downside is that port forwarding for all published ports has to be performed manually, as mentioned here: #4096 (comment)

@Nino-K
Member

Nino-K commented Jun 20, 2023

You can now enable the new network using rdctl:

rdctl set --experimental.virtual-machine.networking-tunnel=true

This should allow Rancher Desktop to function correctly behind a VPN. I'm going to close this issue; feel free to re-open it if this suggestion does not solve the problem.

@Nino-K Nino-K closed this as completed Jun 20, 2023