Hubble on the management cluster #1594

Closed · teemow opened this issue Nov 3, 2022 · 42 comments
Labels: team/turtles Team Turtles

@teemow (Member) commented Nov 3, 2022

As we transition to Cilium on all providers, it would be great to add Hubble on the management cluster to get better visibility into the network of each workload cluster.

Value

  • Faster debugging of network problems
  • Identify misbehaving services more quickly
  • Help the customer to find bottlenecks in their services

See https://github.com/cilium/hubble for more details.

Please also allow customers to access Hubble.

@teemow teemow added the team/cabbage Team Cabbage label Nov 3, 2022
@teemow (Member Author) commented Nov 3, 2022

@weatherhog imo this is getting important. More and more providers are being switched to Cilium, and this would give us much more visibility into the network. Do you think Cabbage is a good fit for this?

@weatherhog

@teemow I think it could be a good fit. The main question is: how far are we with Cilium? Last time we wanted to test, for example, Linkerd on Cilium-based clusters, the release got stopped and reverted. Do we have stable Cilium installations at the moment?

@teemow (Member Author) commented Nov 3, 2022

Afaik CAPO and CAPG are on Cilium and already in production with customers. CAPA is currently being tested and CAPVCD is also being worked on. Vintage AWS has been reverted, but the MCs are already on Cilium afaik. So it should be possible to test this already.

Please confirm @cornelius-keller @alex-dabija

@alex-dabija

> Afaik CAPO and CAPG are on Cilium and already in production with customers. CAPA is currently being tested and CAPVCD is also being worked on. Vintage AWS has been reverted, but the MCs are already on Cilium afaik. So it should be possible to test this already.

CAPG is not yet in production. We are currently deploying the first GA management cluster for our customer.

Yes, you can test Hubble on CAPG and CAPA clusters. Both are using Cilium.

@cornelius-keller (Contributor)

CAPO is in production using Cilium as well; the only difference is that we haven't replaced kube-proxy with Cilium yet.

@mcharriere commented Nov 15, 2022

I've been checking the current status a bit. The question now is what exactly we want to enable and how we would like to expose it to our customers. From what I can see:

  1. Hubble Relay is the API for the UI and the CLI. I'm not sure if we could safely expose it, because the CLI has no auth capabilities.
  2. Hubble UI could be exposed with an ingress (preferably internal), but it has no auth.
    • It might be possible to use oauth-proxy to add auth.
    • I can't find any way to restrict the visibility to a single namespace, although that might not be a problem.
  3. Enabling Hubble metrics and scraping them with Prometheus should be simple, although we might want to go easy on it to avoid overloading our monitoring stack (see the sketch below).
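
For reference, a minimal sketch of what enabling Hubble metrics could look like using the upstream Cilium Helm values; the release name, repo, and value paths are assumptions here, and our cilium-app may wrap or override them:

    # Hedged sketch based on the upstream Cilium chart; cilium-app may expose these values differently.
    helm upgrade cilium cilium/cilium \
      --namespace kube-system \
      --reuse-values \
      --set hubble.enabled=true \
      --set hubble.relay.enabled=true \
      --set hubble.ui.enabled=true \
      --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"

The per-metric list is what keeps this fine-grained: any entry can be dropped again if it generates too much cardinality for the monitoring stack.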

@mcharriere mcharriere added the needs/refinement Needs refinement in order to be actionable label Nov 22, 2022
@weatherhog commented Dec 6, 2022

@giantswarm/team-cabbage please look into Hubble and into the questions raised by @mcharriere.

If there are more questions, please add them to this issue.

@ced0ps (Contributor) commented Dec 6, 2022

  1. The UI could be made available similarly to anything else that's available through opsctl open *. If it can't be safely exposed, the CLI should be accessed through port-forwarding (see the sketch after this list). We should see whether there is an actual use case for either, or if we just want to rely on the metrics for now.
  2. I think we should get some experience with its capabilities to decide on how to expose the UI/CLI. Metrics should be shared as they become available.
  3. This decision should be made by Atlas, as they best understand the capabilities of our observability stack.
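
As a starting point, a hedged sketch of the port-forwarding route for the UI, assuming the default service name and port from the upstream chart:

    # Forward the Hubble UI locally instead of exposing it; service name and port are upstream defaults.
    kubectl -n kube-system port-forward svc/hubble-ui 12000:80
    # then open http://localhost:12000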

@teemow (Member Author) commented Dec 8, 2022

Would it be a lot of work to learn together with the customer what is useful and what isn't? I'd suggest sharing it from the beginning.

@weatherhog let's talk to Atlas then. Maybe there is some low-hanging fruit, e.g. creating two Prometheus instances per cluster or similar, and then using a separate Prometheus just for the network metrics. I'm not sure if this is possible, but somehow Atlas needs to enable teams to add more metrics soon.

@webwurst

  1. It looks like the hubble-relay service is provided via NodePort? To access that API, the upstream documentation shows usage of hubble observe (see the sketch after this list). I couldn't find any details about authentication/authorization, and I'm still wondering if there is maybe some access restriction on the Cilium level? Probably something to ask on their Slack channel; I haven't done that yet.
    hubble.relay.service: {"nodePort":31234,"type":"ClusterIP"}

  2. It's about the same for the UI, I think:
    hubble.ui.service: {"nodePort":31235,"type":"ClusterIP"}

  3. Hubble metrics are disabled by default and can be enabled in a fine-grained way. But we still don't know what to expect from each :) A dedicated Prometheus instance would be helpful here. In case of high resource usage we can scale it down and have time to think about adjustments before trying again.
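
To illustrate the hubble observe path mentioned in point 1, a hedged sketch assuming the default hubble-relay ClusterIP service (port 80) and standard CLI flags:

    # Forward the relay and point the CLI at it; service name and ports are assumed upstream defaults.
    kubectl -n kube-system port-forward svc/hubble-relay 4245:80 &
    hubble observe --server localhost:4245 --follow --verdict DROPPED

Anyone with port-forward access to kube-system gets the full flow data this way, which is why the auth question matters.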

@weatherhog

@TheoBrigitte any chance to have a second Prometheus instance for this Hubble use case? Let's have a chat about this.

@weatherhog weatherhog removed the needs/refinement Needs refinement in order to be actionable label Jan 17, 2023
@weatherhog

@TheoBrigitte just a ping to get a reply on our question.

@T-Kukawka (Contributor)

@paurosello can you provide feedback on your Hubble usage during the Stability Sprint?

@TheoBrigitte (Member)

> @TheoBrigitte any chance to have a second Prometheus instance for this Hubble use case? Let's have a chat about this.

I do not understand why we would need another Prometheus instance dedicated to Hubble when we already have Prometheus servers in place.

@teemow (Member Author) commented Feb 28, 2023

@TheoBrigitte the question was whether our Prometheus setup can handle a lot more networking metrics from the workload clusters. Maybe @paurosello or @whites11 can say a bit more, as they have now also looked into this. Do we enable all the metrics with Cilium already? Is that independent of using the Hubble UI? Did we test whether the metrics have increased with v19, and if yes, by how much? Will our Prometheus setup still work on the big clusters if we upgrade to Cilium?

@paurosello commented Feb 28, 2023

We have been playing with Cilium and Hubble during this stability sprint, and I would say enabling Hubble should not be a problem (we already have a PR to change the values in the app, which should be released shortly) for the following reasons:

  • Hubble can be used without Prometheus scraping. Scraping is only needed to get the Grafana dashboards.
  • We removed the aws-cni metrics from Prometheus, which already frees up some capacity.
  • The biggest metric generators are the agents (which we are scraping already).
  • As we have the ServiceMonitors defined in an app (https://github.com/giantswarm/cilium-servicemonitors-app), we could disable Hubble monitoring if needed (see the metrics spot-check sketch at the end of this comment).
  • If all of the above fails and we need to reduce metrics, Cilium has the nice feature of allowing us to disable anything not needed.

@teemow as for your questions:

  1. We are already scraping the agents, which are the ones that produce the most metrics. We want to scrape the operator and Hubble too (there are not many interesting metrics there, but we need to explore further).
  2. Yes, enabling Hubble and Prometheus scraping are independent of each other.
  3. We did not check whether the MCs are consuming more metrics, but they have been scraping the cilium-agents for months already.

For now the plan is to use port-forward to access it and, if we find it useful, improve the setup with ingress and OAuth. This will be enabled on MCs and WCs.
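
Before wiring Hubble into the ServiceMonitors, a quick hedged way to eyeball what the agents actually expose, assuming the default Hubble metrics port (9965) and the standard k8s-app=cilium agent label:

    # Spot-check the Hubble metrics endpoint on one agent; port and label are assumptions from upstream defaults.
    CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium -o name | head -n 1)
    kubectl -n kube-system port-forward "$CILIUM_POD" 9965:9965 &
    sleep 2
    curl -s http://localhost:9965/metrics | grep '^hubble_' | head

This gives a rough feel for the per-node metric volume before Prometheus starts scraping it fleet-wide.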

@TheoBrigitte (Member)

OK, so if you have already been shipping metrics into Prometheus for months, I guess we are fine, right? Or is there anything you need from Atlas?

@T-Kukawka (Contributor)

PR to enable hubble/monitoring for cilium: giantswarm/cilium-app#56

@teemow (Member Author) commented Mar 7, 2023

@TheoBrigitte Cilium isn't released for vintage yet; it is coming with v19. I think @paurosello is talking about the management clusters only.

@T-Kukawka we need to make sure that Prometheus doesn't break if we add Cilium to the big clusters.

@T-Kukawka (Contributor)

Sure @teemow, let me add an issue so we test it with Atlas when we have the release ready.

@T-Kukawka (Contributor)

general Cilium v1.13 work for Phoenix: #2131

@whites11 whites11 self-assigned this Mar 9, 2023
@architectbot architectbot added the team/phoenix Team Phoenix label Mar 9, 2023
@whites11 commented Mar 9, 2023

This is running on gaia and gremlin now.

@TheoBrigitte (Member)

@T-Kukawka can you please schedule a session, so we can work out together how Prometheus behaves with Cilium?

@T-Kukawka (Contributor)

We will move on with testing this week, so when we are ready we will involve Atlas for WC testing. We can sync here: https://github.com/giantswarm/giantswarm/issues/26139

On the MCs it is already running on Gaia and Gremlin, so you can check this async.

@weatherhog weatherhog removed the team/cabbage Team Cabbage label Mar 14, 2023
@weatherhog

Removing this from Cabbage as this is in Phoenix.

@whites11

Hubble is running in all MCs now.

@whites11

I tried exposing the Hubble UI on the MCs through an ingress; there are a bunch of problems:

  • the Hubble UI runs in kube-system, where we don't have oauth2-proxy, so authentication is not trivial
  • the kube-system namespace is more restricted than others, so the http01 challenge does not work out of the box if we create the ingress there
  • I tried creating the ingress in the monitoring namespace (where oauth2-proxy runs) using an ExternalName service to proxy to the Hubble UI (see the sketch below), but for some reason it does not work
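
For reference, the ExternalName approach sketched out; the names are placeholders and the Ingress plus oauth2-proxy wiring is omitted, so this illustrates the idea rather than the actual manifest used:

    # Hedged sketch: a Service in the monitoring namespace that points at hubble-ui in kube-system,
    # so the Ingress and oauth2-proxy can live outside kube-system. Names and port are assumptions.
    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Service
    metadata:
      name: hubble-ui-proxy
      namespace: monitoring
    spec:
      type: ExternalName
      externalName: hubble-ui.kube-system.svc.cluster.local
      ports:
        - port: 80
    EOF

As the next comments explain, this only works if the ingress controller is allowed to follow ExternalName services.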

@Gacko (Member) commented Apr 12, 2023

> I tried creating the ingress in the monitoring namespace (where oauth2-proxy runs) using an ExternalName service to proxy to the Hubble UI, but for some reason it does not work

Probably because following ExternalName services is disabled by default in nginx-ingress-controller:

https://github.com/giantswarm/nginx-ingress-controller-app/blob/main/helm/nginx-ingress-controller-app/values.yaml#L192
https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/cli-arguments.md (see --disable-svc-external-name)
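
If following ExternalName services is really needed, the flag can be flipped back; a hedged sketch against the upstream ingress-nginx chart (the nginx-ingress-controller-app may expose this value differently), keeping in mind the flag is disabled for a reason, since it lets an Ingress send traffic to arbitrary external endpoints:

    # Hedged sketch using the upstream ingress-nginx chart; the value path in the app chart may differ.
    helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
      --namespace kube-system \
      --reuse-values \
      --set controller.extraArgs.disable-svc-external-name=false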

@whites11

> I tried creating the ingress in the monitoring namespace (where oauth2-proxy runs) using an ExternalName service to proxy to the Hubble UI, but for some reason it does not work

> Probably because following ExternalName services is disabled by default in nginx-ingress-controller.

> https://github.com/giantswarm/nginx-ingress-controller-app/blob/main/helm/nginx-ingress-controller-app/values.yaml#L192 https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/cli-arguments.md (see --disable-svc-external-name)

Oh thanks, I thought it was something along those lines but couldn't track it down.

@whites11

Hubble + auth is available for vintage in giantnetes-terraform 14.13.0. Deployed on gaia, gremlin, and giraffe for now (it will be enabled everywhere with the next round of MC updates).

@whites11

@teemow something I don't get from this ticket: how would it be possible to see Hubble data from the workload clusters in the MC?

@teemow (Member Author) commented Apr 24, 2023

@whites11 if Hubble doesn't support multi-cluster, then we probably have to think about either adding a Hubble instance per WC on the MC or having Hubble on each WC.

FYI @weatherhog (connectivity) @TheoBrigitte (monitoring) what do you think about it?

Imo it would be helpful not only for platform teams but also for development teams to get access and see the connectivity between different services.

[screenshot attachment: screen_2023-04-24-13-33-44]

@T-Kukawka (Contributor)

Afaik @teemow Hubble is enabled by default in our Cilium already: https://github.com/giantswarm/cilium-app/blob/main/CHANGELOG.md#080---2023-03-08. This means there will be Hubble on each WC and MC (this should already be the case).
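
A quick, hedged way to confirm that on a given cluster, assuming the default component names from the upstream chart:

    # Check that the Hubble components exist and that the agent reports Hubble as healthy.
    kubectl -n kube-system get deploy hubble-relay hubble-ui
    kubectl -n kube-system exec ds/cilium -- cilium status | grep -i hubble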

@teemow (Member Author) commented Apr 24, 2023

@T-Kukawka but we need to decide how we want to enable it for customers. I am not against having it on the WCs but I'd like to discuss this a bit. We could also have the UI on the MC. And we're constantly moving more functionality into the MC to have more of a single pane of glass.

@TheoBrigitte (Member)

From what I see here, I do not understand the implications of this solution for our observability stack. I think having a session to explain what this is about would be the best way for me to understand it better and see how we move forward on this.

@T-Kukawka (Contributor)

I think we can have a joint session with Phoenix/Cabbage/Atlas, but then the topic should be continued by the two other teams and should not block the release. This could come next as an improvement, as in theory it is available already.

@teemow (Member Author) commented May 4, 2023

@T-Kukawka will you organize a session then?

@T-Kukawka (Contributor) commented May 4, 2023

Sent out an invite for 11.05, 13:30.

@T-Kukawka (Contributor)

Meeting notes:

  • Hubble and the Hubble UI are enabled by default.
  • The Hubble UI has no authentication by default; we have exposed it via an ingress to enable authentication. We have provided custom config for the Helm chart via a Terraform template on vintage. This way we have OAuth SSO login as for other services. The solution is not perfect.
  • It is not clear where this should go right now. The early conversations initially suggested that Cilium would go to Cabbage. We do not know the constraints with the CAPI releases, where the release process is still undefined.
  • Cilium is a critical part of the architecture and a necessity for any functionality. It might fit better into the team that oversees the common components for all clusters. We have to think of the dependencies on the clusters' functionality. Container runtime or storage is a similar interface. The question is whether we want this in the KaaS or the Platform team.

Next steps:

  • Product will discuss where components such as Cilium and similar belong. We need a definition of the releases and the baseline for running clusters, as well as a general approach for how the KaaS/Platform teams can collaborate effectively to take the decision.

@teemow (Member Author) commented May 12, 2023

@whites11 whites11 removed their assignment May 15, 2023
@T-Kukawka (Contributor)

Removing from the Phoenix board; adding to Turtles as the 'responsible' team.

@T-Kukawka T-Kukawka added team/turtles Team Turtles and removed team/phoenix Team Phoenix labels May 15, 2023
@Rotfuks (Contributor) commented Jun 20, 2023

As Hubble is rolled out, we will close this issue for now and, if needed for any further steps with Hubble, create a follow-up epic on the Turtles board.

@Rotfuks Rotfuks closed this as completed Jun 20, 2023