Healthz bad gateway when pod/service ip take time to propagate #23067

Closed
batleforc opened this issue Jul 31, 2024 · 14 comments
Labels
area/install: Issues related to installation, including offline/air gap and initial setup
kind/bug: Outline of a bug - must adhere to the bug report template.
severity/P2: Has a minor but important impact to the usage or development of the system.
status/analyzing: An issue has been proposed and it is currently being analyzed for effort and implementation approach

Comments

@batleforc

Describe the bug

During the startup of a workspace pod, the IP sometimes takes time to propagate, and the two consecutive calls to the healthz endpoint immediately come back with a bad gateway.

Che version

7.88

Steps to reproduce

  1. Start a DevSpaces/Eclipse Che environment on an OpenShift/Kubernetes cluster that can take some time to propagate the IP address of the service/pod.
  2. Start a workspace.
  3. With luck, the workspace starts within seconds; otherwise it takes approximately 5 to 10 minutes or more to start.

Expected behavior

Don't wait 5 more minutes when the cluster only takes a short time to propagate the corresponding IP (as in most of our cases); wait the 5 extra minutes only when side resources take time to load.

Runtime

Kubernetes (vanilla), OpenShift

Screenshots

No response

Installation method

chectl/latest, chectl/next, OperatorHub

Environment

Linux, Amazon

Eclipse Che Logs

No response

Additional context

No response

@batleforc batleforc added the kind/bug Outline of a bug - must adhere to the bug report template. label Jul 31, 2024
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Jul 31, 2024
@ibuziuk ibuziuk added status/analyzing An issue has been proposed and it is currently being analyzed for effort and implementation approach area/install Issues related to installation, including offline/air gap and initial setup severity/P2 Has a minor but important impact to the usage or development of the system. and removed status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Aug 5, 2024
@ibuziuk
Member

ibuziuk commented Aug 5, 2024

@batleforc exposure of the route should be relatively fast. Could you please clarify when exactly you are facing this issue (5/10 min for the route to be accessible)? The default hard startup timeout is 5 minutes, and at this point we do not plan to change it.

@batleforc
Author

Because of this issue, we raised the timeout to 900s.
In theory it should be fast, but I encountered this both on OpenShift on AWS (with ~7 users) and on Kubernetes on bare metal (1 to 4 users). The initial two calls to the healthz endpoint return immediately with a bad gateway from the main gateway, and the user then has to wait at least 5 minutes (the 10-minute case isn't narrowed down precisely yet, but we need to reduce this one first).

We found that propagating the service's IP to the targeted pod takes some time, and sometimes completes shortly after the pods are up but not soon enough for the backend. That's why I added a small retry in eclipse-che/che-operator#1874 that should cover the propagation time, but I would love to make it a parameter that the end user can tune in case of a particularly slow CNI.
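
To make the retry idea concrete, here is a minimal Go sketch of a health check that retries only on 502 Bad Gateway, with a tunable attempt count and initial interval. It is not the code from the PR: `RetryConfig`, `waitHealthy`, and `runDemo` are hypothetical names, the doubling backoff is an assumption, and the local test server merely stands in for a gateway whose backend IP has not propagated yet.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// RetryConfig holds the hypothetical tunables (not the operator's real fields).
type RetryConfig struct {
	Attempts        int           // maximum number of health checks
	InitialInterval time.Duration // wait before the second check; doubles after each retry
}

// waitHealthy polls url until it answers 200 OK. It retries only on
// 502 Bad Gateway and returns the number of checks it performed.
func waitHealthy(ctx context.Context, client *http.Client, url string, cfg RetryConfig) (int, error) {
	interval := cfg.InitialInterval
	for attempt := 1; attempt <= cfg.Attempts; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return attempt, err
		}
		resp, err := client.Do(req)
		if err != nil {
			return attempt, err
		}
		resp.Body.Close()
		switch resp.StatusCode {
		case http.StatusOK:
			return attempt, nil
		case http.StatusBadGateway:
			// Backend not reachable yet: continue to the backoff below.
		default:
			return attempt, fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
		if attempt == cfg.Attempts {
			break // no point in sleeping after the last check
		}
		select {
		case <-ctx.Done():
			return attempt, ctx.Err()
		case <-time.After(interval):
		}
		interval *= 2
	}
	return cfg.Attempts, fmt.Errorf("still 502 Bad Gateway after %d attempts", cfg.Attempts)
}

// runDemo starts a local stand-in for the gateway that answers 502 for the
// first `failures` requests and 200 afterwards, then polls it.
func runDemo(failures int32) (int, error) {
	var seen int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&seen, 1) <= failures {
			w.WriteHeader(http.StatusBadGateway)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	cfg := RetryConfig{Attempts: 5, InitialInterval: 10 * time.Millisecond}
	return waitHealthy(context.Background(), srv.Client(), srv.URL+"/healthz", cfg)
}

func main() {
	attempts, err := runDemo(2)
	fmt.Println(attempts, err)
}
```

Exposing `Attempts` and `InitialInterval` as settings is what would let a cluster with a slow CNI widen the window without touching the 5-minute hard timeout.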

@batleforc
Author

To debug this, we used the different pods to trace the full chain of acknowledgements that the deployment is ready for the next startup step. We saw that we either need to add a short delay between the two health calls on the backend side, or add a retry directly in the gateway (this still needs testing by replacing the different elements in the cluster).

@batleforc
Author

Would it be possible to get help checking whether the change in eclipse-che/che-operator#1874 fixes the problem we encounter? (Mostly building the image, and possibly advice on how to make the healthz retry configurable: https://github.com/eclipse-che/che-operator/pull/1874/files#diff-ebca2eefe12f7ba4a722c53d574ba1b2adee412909da8cdbc974c8f7fcbfb02fR655)

@tolusha
Contributor

tolusha commented Sep 5, 2024

Hello
Please try this image based on the PR
quay.io/abazko/operator:23067

@tolusha
Contributor

tolusha commented Sep 10, 2024

Hello @batleforc
Does it work for you?

@batleforc
Author

Hello @tolusha ,
I've set it up, but I think I need to fine-tune the initial interval.

@batleforc
Author

Is the provided image (quay.io/abazko/operator:23067) automatically updated?

@tolusha
Contributor

tolusha commented Sep 10, 2024

Unfortunately not.
You can build the image with the following command:
make docker-build docker-push IMG=<IMAGE_NAME> SKIP_TESTS=true

@batleforc
Author

So, the build seems okay, but I encounter a "Client.Timeout exceeded while awaiting headers" error, and I can't find where the devworkspace-controller-manager calls the healthz endpoint.
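
For context, "Client.Timeout exceeded while awaiting headers" is the text Go's `net/http` client adds to its error when `http.Client.Timeout` elapses before the response headers arrive. The sketch below shows one way to tell that timeout apart from a bad gateway response. It is only an illustration under assumptions: `isClientTimeout`, `probe`, and `classifySlowServer` are hypothetical helpers, and nothing here is taken from the devworkspace-controller-manager source.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// isClientTimeout reports whether err is a network timeout, such as the one
// net/http returns when http.Client.Timeout expires.
func isClientTimeout(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

// probe performs one GET and classifies the outcome as a short string.
func probe(client *http.Client, url string) string {
	resp, err := client.Get(url)
	if err != nil {
		if isClientTimeout(err) {
			return "timeout"
		}
		return "error"
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusBadGateway {
		return "bad gateway"
	}
	return fmt.Sprintf("status %d", resp.StatusCode)
}

// classifySlowServer probes a local test server that waits delayMs before
// answering, using a client whose overall timeout is timeoutMs.
func classifySlowServer(delayMs, timeoutMs int) string {
	delay := time.Duration(delayMs) * time.Millisecond
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay): // simulate a slow backend
		case <-r.Context().Done(): // stop early once the client gives up
		}
	}))
	defer srv.Close()
	client := &http.Client{Timeout: time.Duration(timeoutMs) * time.Millisecond}
	return probe(client, srv.URL+"/healthz")
}

func main() {
	fmt.Println(classifySlowServer(500, 50))
	fmt.Println(classifySlowServer(0, 1000))
}
```

Distinguishing the two outcomes matters here because a timeout points at an endpoint that is not answering at all, while a bad gateway points at a gateway that answers before its backend is routable.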

@batleforc
Author

batleforc commented Sep 16, 2024

@tolusha So I connected the dots.
With this set up in my own environment (starting around 10 to 20 workspaces), it works, and I can no longer reproduce the case where I stay stuck for 5 minutes because the endpoint returned two consecutive bad gateways (though I ended up overloading the cluster 🤣). I have both the che-operator and the devworkspace-operator set up from the corresponding branch (Work-on-timeout).

@tolusha
Contributor

tolusha commented Sep 18, 2024

Hello @batleforc
Thank you for the information.
So, does this mean the PR is ready for review and merge?

@batleforc
Author

batleforc commented Sep 18, 2024

Hello @tolusha
For me, yes.
But it will need both PRs.

@AObuchow

devfile/devworkspace-operator#1321 has now been merged, which seems to resolve this issue. This change will appear when DevWorkspace Operator 0.32.0 is released.
