Hubble on the management cluster #1594

Closed · teemow opened this issue Nov 3, 2022 · 42 comments
Labels: team/turtles Team Turtles

@teemow (Member) commented Nov 3, 2022

As we transition to Cilium on all providers, it would be great to add Hubble on the management cluster to get better visibility into the network of each workload cluster.

Value

  • Faster debugging of network problems
  • Identify misbehaving services more quickly
  • Help the customer to find bottlenecks in their services

See https://github.com/cilium/hubble for more details.

Please also allow customers to access Hubble.

@teemow teemow added the team/cabbage Team Cabbage label Nov 3, 2022
@teemow (Member Author) commented Nov 3, 2022

@weatherhog imo this is getting important. More and more providers are being switched to Cilium, and this would give us much more visibility into the network. Do you think Cabbage is a good fit for this?

@weatherhog

@teemow I think it could be a good fit. The main question is: how far are we with Cilium? Last time we wanted to test, for example, Linkerd on Cilium-based clusters, the release got stopped and reverted. Do we have stable Cilium installations at the moment?

@teemow (Member Author) commented Nov 3, 2022

Afaik CAPO and CAPG are on Cilium and already in production with customers. CAPA is currently being tested and CAPVCD is also being worked on. Vintage AWS has been reverted, but the MCs are already on Cilium afaik. So it should be possible to test this already.

Please confirm @cornelius-keller @alex-dabija

@alex-dabija

> Afaik CAPO and CAPG are on Cilium and already in production with customers. CAPA is currently being tested and CAPVCD is also being worked on. Vintage AWS has been reverted, but the MCs are already on Cilium afaik. So it should be possible to test this already.

CAPG is not yet in production. We are currently deploying the first GA management cluster for our customer.

Yes, you can test Hubble on CAPG and CAPA clusters. Both are using Cilium.

@cornelius-keller (Contributor)

CAPO is in production using Cilium as well; the only difference is that we haven't replaced kube-proxy with Cilium yet.

@mcharriere commented Nov 15, 2022

I've been checking the current status a bit. The question now is what exactly we want to enable and how we would like to expose it to our customers. From what I can see:

  1. Hubble Relay is the API for the UI and the CLI. I'm not sure if we could safely expose it, because the CLI has no auth capabilities.
  2. Hubble UI could be exposed with an ingress (preferably internal), but it has no auth.
    • It might be possible to use oauth-proxy to add auth.
    • I can't find any way to restrict the visibility to a single namespace, although that might not be a problem.
  3. Enabling Hubble metrics and scraping them with Prometheus should be simple, although we might want to go easy on it to avoid overloading our monitoring stack (see the sketch below).
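
For reference, a minimal sketch of what enabling Hubble metrics could look like using the upstream Cilium Helm values; the release name, repo, and value paths are assumptions here, and our cilium-app may wrap or override them:

    # Hedged sketch based on the upstream Cilium chart; cilium-app may expose these values differently.
    helm upgrade cilium cilium/cilium \
      --namespace kube-system \
      --reuse-values \
      --set hubble.enabled=true \
      --set hubble.relay.enabled=true \
      --set hubble.ui.enabled=true \
      --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"

The per-metric list is what keeps this fine-grained: any entry can be dropped again if it generates too much cardinality for the monitoring stack.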

@mcharriere mcharriere added the needs/refinement Needs refinement in order to be actionable label Nov 22, 2022
@weatherhog commented Dec 6, 2022

@giantswarm/team-cabbage please look into Hubble and into the questions raised by @mcharriere.

If there are more questions, please add them to this issue.

@ced0ps (Contributor) commented Dec 6, 2022

  1. The UI could be made available similarly to anything else that's available through opsctl open *. If it can't be safely exposed, the CLI should be accessed through port-forwarding (see the sketch after this list). We should see whether there is an actual use case for either, or if we just want to rely on the metrics for now.
  2. I think we should get some experience with its capabilities to decide on how to expose the UI/CLI. Metrics should be shared as they become available.
  3. This decision should be made by Atlas, as they best understand the capabilities of our observability stack.
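
As a starting point, a hedged sketch of the port-forwarding route for the UI, assuming the default service name and port from the upstream chart:

    # Forward the Hubble UI locally instead of exposing it; service name and port are upstream defaults.
    kubectl -n kube-system port-forward svc/hubble-ui 12000:80
    # then open http://localhost:12000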

@teemow (Member Author) commented Dec 8, 2022

Would it be a lot of work to learn together with the customer what is useful and what isn't? I'd suggest sharing it from the beginning.

@weatherhog let's talk to Atlas then. Maybe there is some low-hanging fruit, e.g. creating two Prometheus instances per cluster or similar, and then using a separate Prometheus just for the network metrics. I'm not sure if this is possible, but somehow Atlas needs to enable teams to add more metrics soon.

@webwurst

  1. It looks like the hubble-relay service is provided via NodePort? To access that API, the upstream documentation shows usage of hubble observe (see the sketch after this list). I couldn't find any details about authentication/authorization, and I'm still wondering if there is maybe some access restriction on the Cilium level? Probably something to ask on their Slack channel; I haven't done that yet.
    hubble.relay.service: {"nodePort":31234,"type":"ClusterIP"}

  2. It's about the same for the UI, I think:
    hubble.ui.service: {"nodePort":31235,"type":"ClusterIP"}

  3. Hubble metrics are disabled by default and can be enabled in a fine-grained way. But we still don't know what to expect from each :) A dedicated Prometheus instance would be helpful here. In case of high resource usage we can scale it down and have time to think about adjustments before trying again.
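
To illustrate the hubble observe path mentioned in point 1, a hedged sketch assuming the default hubble-relay ClusterIP service (port 80) and standard CLI flags:

    # Forward the relay and point the CLI at it; service name and ports are assumed upstream defaults.
    kubectl -n kube-system port-forward svc/hubble-relay 4245:80 &
    hubble observe --server localhost:4245 --follow --verdict DROPPED

Anyone with port-forward access to kube-system gets the full flow data this way, which is why the auth question matters.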

@weatherhog

@TheoBrigitte any chance to have a second Prometheus instance for this Hubble use case? Let's have a chat about this.

@weatherhog weatherhog removed the needs/refinement Needs refinement in order to be actionable label Jan 17, 2023
@weatherhog

@TheoBrigitte just a ping to get a reply on our question.

@T-Kukawka (Contributor)

@paurosello can you provide feedback on your Hubble usage during the Stability Sprint?

@TheoBrigitte (Member)

> @TheoBrigitte any chance to have a second Prometheus instance for this Hubble use case? Let's have a chat about this.

I do not understand why we would need another Prometheus instance dedicated to Hubble when we already have Prometheus servers in place.

@teemow (Member Author) commented Feb 28, 2023

@TheoBrigitte the question was whether our Prometheus setup can handle a lot more networking metrics from the workload clusters. Maybe @paurosello or @whites11 can say a bit more, as they have now also looked into this. Do we enable all the metrics with Cilium already? Is that independent of using the Hubble UI? Did we test whether the metrics have increased with v19, and if yes, by how much? Will our Prometheus setup still work on the big clusters if we upgrade to Cilium?

@paurosello commented Feb 28, 2023

We have been playing with Cilium and Hubble during this stability sprint, and I would say enabling Hubble should not be a problem (we already have a PR to change the values in the app, which should be released shortly) for the following reasons:

  • Hubble can be used without Prometheus scraping. Scraping is only needed to get the Grafana dashboards.
  • We removed the aws-cni metrics from Prometheus, which already frees up some capacity.
  • The biggest metric generators are the agents (which we are scraping already).
  • As we have the ServiceMonitors defined in an app (https://github.com/giantswarm/cilium-servicemonitors-app), we could disable Hubble monitoring if needed (see the metrics spot-check sketch at the end of this comment).
  • If all of the above fails and we need to reduce metrics, Cilium has the nice feature of allowing us to disable anything not needed.

@teemow as for your questions:

  1. We are already scraping the agents, which are the ones that produce the most metrics. We want to scrape the operator and Hubble too (there are not many interesting metrics there, but we need to explore further).
  2. Yes, enabling Hubble and Prometheus scraping are independent of each other.
  3. We did not check whether the MCs are consuming more metrics, but they have been scraping the cilium-agents for months already.

For now the plan is to use port-forward to access it and, if we find it useful, improve the setup with ingress and OAuth. This will be enabled on MCs and WCs.
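
Before wiring Hubble into the ServiceMonitors, a quick hedged way to eyeball what the agents actually expose, assuming the default Hubble metrics port (9965) and the standard k8s-app=cilium agent label:

    # Spot-check the Hubble metrics endpoint on one agent; port and label are assumptions from upstream defaults.
    CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium -o name | head -n 1)
    kubectl -n kube-system port-forward "$CILIUM_POD" 9965:9965 &
    sleep 2
    curl -s http://localhost:9965/metrics | grep '^hubble_' | head

This gives a rough feel for the per-node metric volume before Prometheus starts scraping it fleet-wide.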

@TheoBrigitte (Member)

OK, so if you have already been shipping metrics into Prometheus for months, I guess we are fine, right? Or is there anything you need from Atlas?

@T-Kukawka (Contributor)

PR to enable hubble/monitoring for cilium: giantswarm/cilium-app#56

@teemow (Member Author) commented Mar 7, 2023

@TheoBrigitte Cilium isn't released for vintage yet; it is coming with v19. I think @paurosello is talking about the management clusters only.

@T-Kukawka we need to make sure that Prometheus doesn't break if we add Cilium to the big clusters.

@T-Kukawka (Contributor)

Sure @teemow, let me add an issue so we test it with Atlas when we have the release ready.

@T-Kukawka (Contributor)

general Cilium v1.13 work for Phoenix: #2131

@whites11 whites11 self-assigned this Mar 9, 2023
@architectbot architectbot added the team/phoenix Team Phoenix label Mar 9, 2023
@whites11 commented Mar 9, 2023

This is running on gaia and gremlin now.

@TheoBrigitte (Member)

@T-Kukawka can you please schedule a session, so we can work out together how Prometheus behaves with Cilium?

@T-Kukawka (Contributor)

We will move on with testing this week, so when we are ready we will involve Atlas for WC testing. We can sync here: https://github.com/giantswarm/giantswarm/issues/26139

On the MCs it is already running on Gaia and Gremlin, so you can check this async.

@weatherhog weatherhog removed the team/cabbage Team Cabbage label Mar 14, 2023
@weatherhog

Removing this from Cabbage as this is in Phoenix.

@whites11

Hubble is running in all MCs now.

@whites11

I tried exposing the Hubble UI on the MCs through an ingress; there are a bunch of problems:

  • the Hubble UI runs in kube-system, where we don't have oauth2-proxy, so authentication is not trivial
  • the kube-system namespace is more restricted than others, so the http01 challenge does not work out of the box if we create the ingress there
  • I tried creating the ingress in the monitoring namespace (where oauth2-proxy runs) using an ExternalName service to proxy to the Hubble UI (see the sketch below), but for some reason it does not work
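
For reference, the ExternalName approach sketched out; the names are placeholders and the Ingress plus oauth2-proxy wiring is omitted, so this illustrates the idea rather than the actual manifest used:

    # Hedged sketch: a Service in the monitoring namespace that points at hubble-ui in kube-system,
    # so the Ingress and oauth2-proxy can live outside kube-system. Names and port are assumptions.
    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Service
    metadata:
      name: hubble-ui-proxy
      namespace: monitoring
    spec:
      type: ExternalName
      externalName: hubble-ui.kube-system.svc.cluster.local
      ports:
        - port: 80
    EOF

As the next comments explain, this only works if the ingress controller is allowed to follow ExternalName services.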

@Gacko (Member) commented Apr 12, 2023

> I tried creating the ingress in the monitoring namespace (where oauth2-proxy runs) using an ExternalName service to proxy to the Hubble UI, but for some reason it does not work

Probably because following ExternalName services is disabled by default in nginx-ingress-controller:

https://github.com/giantswarm/nginx-ingress-controller-app/blob/main/helm/nginx-ingress-controller-app/values.yaml#L192
https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/cli-arguments.md (see --disable-svc-external-name)
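
If following ExternalName services is really needed, the flag can be flipped back; a hedged sketch against the upstream ingress-nginx chart (the nginx-ingress-controller-app may expose this value differently), keeping in mind the flag is disabled for a reason, since it lets an Ingress send traffic to arbitrary external endpoints:

    # Hedged sketch using the upstream ingress-nginx chart; the value path in the app chart may differ.
    helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
      --namespace kube-system \
      --reuse-values \
      --set controller.extraArgs.disable-svc-external-name=false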

@whites11

> I tried creating the ingress in the monitoring namespace (where oauth2-proxy runs) using an ExternalName service to proxy to the Hubble UI, but for some reason it does not work

> Probably because following ExternalName services is disabled by default in nginx-ingress-controller.

> https://github.com/giantswarm/nginx-ingress-controller-app/blob/main/helm/nginx-ingress-controller-app/values.yaml#L192 https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/cli-arguments.md (see --disable-svc-external-name)

Oh thanks, I thought it was something along those lines but couldn't track it down.

@whites11

Hubble + auth is available for vintage in giantnetes-terraform 14.13.0. Deployed on gaia, gremlin, and giraffe for now (it will be enabled everywhere with the next round of MC updates).

@whites11

@teemow something I don't get from this ticket: how would it be possible to see Hubble data from the workload clusters in the MC?

@teemow (Member Author) commented Apr 24, 2023

@whites11 if Hubble doesn't support multi-cluster, then we probably have to think about either adding a Hubble instance per WC on the MC or having Hubble on each WC.

FYI @weatherhog (connectivity) @TheoBrigitte (monitoring) what do you think about it?

Imo it would be helpful not only for platform teams but also for development teams to get access and see the connectivity between different services.

[screenshot attachment: screen_2023-04-24-13-33-44]

@T-Kukawka (Contributor)

Afaik @teemow Hubble is enabled by default in our Cilium already: https://github.com/giantswarm/cilium-app/blob/main/CHANGELOG.md#080---2023-03-08. This means there will be Hubble on each WC and MC (this should already be the case).
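
A quick, hedged way to confirm that on a given cluster, assuming the default component names from the upstream chart:

    # Check that the Hubble components exist and that the agent reports Hubble as healthy.
    kubectl -n kube-system get deploy hubble-relay hubble-ui
    kubectl -n kube-system exec ds/cilium -- cilium status | grep -i hubble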

@teemow (Member Author) commented Apr 24, 2023

@T-Kukawka but we need to decide how we want to enable it for customers. I am not against having it on the WCs but I'd like to discuss this a bit. We could also have the UI on the MC. And we're constantly moving more functionality into the MC to have more of a single pane of glass.

@TheoBrigitte (Member)

From what I see here, I do not understand the implications of this solution for our observability stack. I think having a session to explain what this is about would be the best way for me to understand it better and see how we move forward on this.

@T-Kukawka (Contributor)

I think we can have a joint session with Phoenix/Cabbage/Atlas, but then the topic should be continued by the two other teams and should not block the release. This could come next as an improvement, as in theory it is available already.

@teemow (Member Author) commented May 4, 2023

@T-Kukawka will you organize a session then?

@T-Kukawka (Contributor) commented May 4, 2023

Sent out an invite for 11.05, 13:30.

@T-Kukawka (Contributor)

Meeting notes:

  • Hubble and the Hubble UI are enabled by default.
  • The Hubble UI has no authentication by default; we have exposed it via an ingress to enable authentication. We have provided custom config for the Helm chart via a Terraform template on vintage. This way we have OAuth SSO login as for other services. The solution is not perfect.
  • It is not clear where this should go right now. The early conversations initially suggested that Cilium would go to Cabbage. We do not know the constraints with the CAPI releases, where the release process is still undefined.
  • Cilium is a critical part of the architecture and a necessity for any functionality. It might fit better into the team that oversees the common components for all clusters. We have to think of the dependencies on the clusters' functionality. Container runtime or storage is a similar interface. The question is whether we want this in the KaaS or the Platform team.

Next steps:

  • Product will discuss where components such as Cilium and similar belong. We need a definition of the releases and the baseline for running clusters, as well as a general approach for how the KaaS/Platform teams can collaborate effectively to take the decision.

@teemow (Member Author) commented May 12, 2023

@whites11 whites11 removed their assignment May 15, 2023
@T-Kukawka (Contributor)

Removing from the Phoenix board; adding to Turtles as the 'responsible' team.

@T-Kukawka T-Kukawka added team/turtles Team Turtles and removed team/phoenix Team Phoenix labels May 15, 2023
@Rotfuks (Contributor) commented Jun 20, 2023

As Hubble is rolled out, we will close this issue for now and, if needed for any further steps with Hubble, create a follow-up epic on the Turtles board.

@Rotfuks Rotfuks closed this as completed Jun 20, 2023