
Configure a single general purpose and more permissive network policy #741

Closed
5 tasks done
consideRatio opened this issue Oct 6, 2021 · 56 comments · Fixed by #774

@consideRatio
Contributor

consideRatio commented Oct 6, 2021

Issue update

We are currently looking to block outbound (egress) traffic only to local IP ranges and the cloud metadata server, and to allow all other egress.


@sgibson91 and I debugged a failure to make an outbound connection from a user pod to an FTP server exposed on port 21.

We pinpointed the issue to be that we override the z2jh Helm chart's default value for singleuser.networkPolicy.egress. The Z2JH default is very permissive: it allows all outbound traffic except traffic to a specific IP (169.254.169.254) associated with a so-called cloud metadata server.

By overriding singleuser.networkPolicy.egress, we are reducing the egress permissions to exclude outbound FTP access.
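
For reference, a sketch of the z2jh chart default for `singleuser.networkPolicy.egress` that we are overriding (based on the chart's documented behavior at the time, not copied from our config):

```yaml
singleuser:
  networkPolicy:
    egress:
      # Allow all outbound traffic...
      - to:
          - ipBlock:
              cidr: 0.0.0.0/0
              # ...except to the cloud metadata server IP
              except:
                - 169.254.169.254/32
```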

Discussion points

  1. Do we want to retain this opt-in system to allow ports?
    1. If we want to retain this opt-in system to allow more egress ports, how do we make users aware of this, and how do we manage such opt-in requests?

Reference

Tasks

Discussion captured from video conversation

  • @choldgraf mentioned it will be perceived as something being broken by users who don't understand this ahead of time.
@yuvipanda
Member

berkeley-dsep-infra/datahub#2825 has more details on the FTP situation, and in berkeley-dsep-infra/datahub#2825 there's discussion about just removing the outbound port restrictions. I think my inclination is that as long as we don't allow arbitrary access inside the cluster (where the user server can just hit the prometheus server, for example!) port restrictions are pointless. So if we can allow unrestricted access to the internet but not local cluster network we're all good.

@choldgraf
Member

So if we can allow unrestricted access to the internet but not local cluster network we're all good.

This sounds like a good approach to me - most users assume that "everything is possible", so if we restrict things it'll usually be confusing to folks. So I'd be +1 on making most outbound traffic possible by default, and the action would be a community rep asking for something to be restricted for some reason.

@sgibson91
Member

So if we can allow unrestricted access to the internet but not local cluster network we're all good.

If I understand correctly, this is z2jh's default setting. So the implementation of this would look like a PR that removes all the networkPolicy keys from under singleuser in basehub and daskhub, right?

@yuvipanda
Member

@sgibson91 I think so, but I want to see how it interacts with non-JupyterHub things in the cluster. Primarily, I want to verify that we can't hit the Prometheus or Grafana servers from user pods, as that can leak information. There is no password auth or anything for Prometheus, so we have to rely on network policy here.
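
One way to do that lockdown (just a sketch - the Grafana labels here are assumed, and this isn't something we currently deploy) would be an ingress policy on the Prometheus server pods so only Grafana can reach them:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-server-ingress   # hypothetical name
  namespace: support
spec:
  # Select the Prometheus server pods
  podSelector:
    matchLabels:
      app: prometheus
      component: server
  policyTypes:
    - Ingress
  ingress:
    # Only pods with these (assumed) Grafana labels may connect;
    # anything else, including user pods, gets dropped.
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
```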

@sgibson91
Member

One way to test would be to spin up a test cluster with the support chart and a hub with all ports open and see if we can hack it

@consideRatio
Contributor Author

I think the right call may be to have tight allowances on what is allowed to access Prometheus, rather than specific exceptions for what the user pods can't send traffic to - unless those exceptions are general, like no local IP ranges.

So one idea is to set a netpol saying allow 0.0.0.0/0 (all) except: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 (the private IP ranges) and the cloud metadata server also listed in the z2jh defaults (169.254.169.254). That's a simple rule that may work reliably without exceptions, or not.

The z2jh default is to allow all egress except to that cloud metadata server IP. Note that I don't know much about it and am afraid "cloud metadata server" may be an overloaded term used for different things - but I refer to the IP blocked by z2jh by default: 169.254.169.254.
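
A sketch of what that could look like as a `singleuser.networkPolicy.egress` override (untested, and the exact list of excluded ranges is still up for discussion):

```yaml
singleuser:
  networkPolicy:
    egress:
      - to:
          - ipBlock:
              cidr: 0.0.0.0/0
              except:
                # RFC 1918 private ranges
                - 10.0.0.0/8
                - 172.16.0.0/12
                - 192.168.0.0/16
                # cloud metadata server, also blocked by the z2jh default
                - 169.254.169.254/32
```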

@yuvipanda
Member

@sgibson91 I think an easier way is to just test this on the pilot hubs cluster by locally doing a deploy just to the staging hub, starting a server there, and trying to access the prometheus server. Even better would be to run nmap from inside the user server, and scan everywhere to see what it can hit.

@sgibson91
Member

The config I locally deployed to 2i2c-staging to test this is represented in this draft PR: #746

I then spun up my user server, opened a terminal and tried to ssh to the prometheus server IP address (that I got from kubectl -n support get svc). The result was that the connection timed out.


I also tried to run nmap, but it wasn't available and I had trouble installing it.


@damianavila
Contributor

I also tried to run nmap, but it wasn't available and I had trouble installing it.

I guess you may need to use an image with nmap already installed...

@damianavila
Contributor

So one idea is to set a netpol saying allow 0.0.0.0/0 (all) except: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 (the private IP ranges) and the cloud metadata server also listed in the z2jh defaults (169.254.169.254). That's a simple rule that may work reliably without exceptions, or not.

This is an interesting idea...

@yuvipanda
Member

Thanks for testing this, @sgibson91!

I looked at which ports the prometheus service exposes with k -n support get svc:

support-prometheus-server                    ClusterIP      10.0.21.196   <none>         80/TCP                       2y40d

so that's port 80, while ssh only tests port 22. So you can test it with curl, which I think should be installed - curl http://<service-ip>:80.

We should also check if we can access the pod directly! By looking at k -n support get pod -l component=server,app=prometheus -o yaml, I can get its IP (.status.podIP) as well as the container port where the server is exposed (in this case, 9090). So you can test access to it with curl http://<pod-ip>:9090 as well.

Sorry if my use of the word 'access' caused confusion here - I just meant network access, not ssh!

@yuvipanda
Member

NetworkPolicies often operate at the level of pods, not services, hence the need to test both.

@choldgraf
Member

Is this issue also something that we could test as part of CI/CD? (e.g., via #722 + some test suite run on a new hub?) It would be really nice if we could tell our user communities that things like "security policies for the hubs" are also tested as part of our deployments, so we could have tests that do things like "spin up a server, try to do something hub admins generally don't want people to do, confirm that it isn't possible in the staging hub".

@sgibson91
Member

I think we should look into this tool for security testing: https://github.com/armosec/kubescape

@damianavila
Contributor

damianavila commented Oct 9, 2021

I think we should look into this tool for security testing: https://github.com/armosec/kubescape

💯 , super interesting!!

@sgibson91
Member

I looked at which ports the prometheus service exposes with k -n support get svc:

support-prometheus-server                    ClusterIP      10.0.21.196   <none>         80/TCP                       2y40d

so that's port 80, while ssh only tests port 22. So you can test it with curl, which I think should be installed - curl http://<service-ip>:80.

We should also check if we can access the pod directly! By looking at k -n support get pod -l component=server,app=prometheus -o yaml, I can get its IP (.status.podIP) as well as the container port where the server is exposed (in this case, 9090). So you can test access to it with curl http://<pod-ip>:9090 as well.


@yuvipanda yup, both of those are reachable

@choldgraf choldgraf moved this to Needs input 🙌 in Sprint Board Oct 12, 2021
@damianavila
Contributor

mmm... so I guess we would need to go with @consideRatio's idea?

@sgibson91
Member

mmm... so I guess we would need to go with @consideRatio's idea?

Yes, I will try this now.

FYI, I came across this cool resource while thinking about how to write the proposed netpol: https://github.com/ahmetb/kubernetes-network-policy-recipes

@sgibson91
Member

We use the Calico network policy plugin, right? So we could use something like action: Deny and it would be valid for us? It would make the config easier if we could list the IPs we are blocking, instead of everything we're allowing except those we don't want.

(I'm quite new to writing network policies for Kubernetes so I'm not sure if I'm approaching this correctly?)

@consideRatio
Contributor Author

consideRatio commented Oct 13, 2021

We use the Calico network policy plugin, right? So we could use something like action: Deny and it would be valid for us? It would make the config easier if we could list the IPs we are blocking, instead of everything we're allowing except those we don't want.

NetworkPolicy, as an official k8s resource (enforced by Calico), does not support denying traffic, only allowing it. If a pod's egress is targeted by a netpol, no traffic is allowed besides what the netpol explicitly allows.

If a pod's egress isn't targeted by any netpol, it is unrestricted and everything is allowed.

So the act of targeting a pod restricts it to what's explicitly allowed, and you can't subtract/deny permissions, only add them.
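
To illustrate the allow-only semantics with a minimal, hypothetical example: as soon as a policy selects a pod for egress, everything not explicitly listed is dropped - there is no way to write a deny rule in the k8s-native resource:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-user-egress   # hypothetical name
spec:
  # Selecting these pods for Egress means all their outbound traffic
  # is blocked unless an egress rule below allows it.
  podSelector:
    matchLabels:
      component: singleuser-server   # assumed label for user pods
  policyTypes:
    - Egress
  # Empty list = nothing allowed. You can only add allow rules back
  # (e.g. an ipBlock with an `except` list); you cannot subtract.
  egress: []
```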

@choldgraf
Member

I agree with this assessment! It seems like the major next steps on this one are to document the changes implemented in #774 for hub admins and engineers. I've updated the top comment with those remaining tasks. @sgibson91 are you OK to make those docs additions as well?

@choldgraf choldgraf moved this from Needs input 🙌 to Todo 👍 in Sprint Board Oct 25, 2021
@sgibson91 sgibson91 moved this from Todo 👍 to In Progress ⚡ in Sprint Board Oct 26, 2021
Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Oct 26, 2021
@sgibson91
Member

Reopening until I've tackled docs

@sgibson91 sgibson91 reopened this Oct 26, 2021
@sgibson91 sgibson91 moved this from Done 🎉 to Todo 👍 in Sprint Board Oct 26, 2021
@yuvipanda
Member

@sgibson91 also, in https://staging.us-central1-b.gcp.pangeo.io, the Python code can't actually access https://staging.us-central1-b.gcp.pangeo.io itself. So curl https://staging.us-central1-b.gcp.pangeo.io fails by hanging forever.

@consideRatio
Contributor Author

Hmmm? Python code from within the k8s cluster can't curl https://staging.us-central1-b.gcp.pangeo.io - why? Isn't that a request to the public IP it translates to?

@sgibson91
Member

I opened #782

@sgibson91
Member

Documentation PRs:

I wasn't really sure what to write for the engineering docs 🤷🏻‍♀️

@sgibson91 sgibson91 moved this from Todo 👍 to Needs input 🙌 in Sprint Board Oct 27, 2021
@yuvipanda
Member

@sgibson91 yep, that block adds support for talking to the hub public IP when using ingress. We'd probably also need another block for talking to the public IP when it's using autohttps rather than nginx-ingress though.

@consideRatio I think the packets somehow get routed internally and don't go out into the internet and back? Not sure :)

@consideRatio
Contributor Author

Perhaps the logic is along the lines that...

  1. the domain is translated to a public IP
  2. there is some local knowledge that traffic to the public IP from within the cluster can just as well go directly to the local IP instead
  3. that traffic then gets blocked because it is accessing the local IP

@sgibson91
Member

We'd probably also need another block for talking to the public IP when it's using autohttps rather than nginx-ingress though.

IIRC, autohttps is deployed into its own pod in the hub namespace, yes? So we could achieve this with a podSelector rule like the one below:

  - to:
      - podSelector:
          matchLabels:
            name: autohttps

@yuvipanda
Member

@sgibson91 yeah, I eventually ended up with something like that in https://github.com/berkeley-dsep-infra/datahub/blob/6f6506d12ecdd9799ba599d3bf878a00815a6c89/hub/values.yaml#L113. However, I also needed to specify the namespace, which is a bit annoying - ideally I'd want to be able to say 'pods in any namespace with these labels', and that's what the docs suggest we should be able to do, but alas that doesn't seem to work?
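
For reference, the 'pods with these labels in any namespace' form the upstream docs describe is an empty namespaceSelector combined with a podSelector in the same `to` entry (labels assumed here; as noted, it didn't seem to behave as expected when I tried it):

```yaml
egress:
  - to:
      # An empty namespaceSelector matches all namespaces; because it sits
      # in the same list entry as the podSelector (no leading dash), the two
      # are ANDed: "autohttps pods in any namespace".
      - namespaceSelector: {}
        podSelector:
          matchLabels:
            app: jupyterhub        # assumed z2jh labels
            component: autohttps
```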

@sgibson91
Member

However, I also needed to specify the namespace

Huh, even though the autohttps and user pods are in the same namespace? Or am I wrong in that assumption?

@yuvipanda
Member

@sgibson91 ah, in the Berkeley case there are something like ten hubs, so I want each hub to be able to hit the other hubs' endpoints too if necessary. So I think the same-namespace case works, but I need additional namespaces to work too.

@sgibson91
Member

I opened #788. Hopefully this is the last piece! 🤞🏻

@sgibson91
Member

I think I've hit all the major points of this now so I'm going to close this and we can open more issues as things arise 🙌🏻

Repository owner moved this from Needs review 👀 to Done 🎉 in Sprint Board Oct 28, 2021