
Eclipse Che - volume mount error while launching more than 5 workspaces at a time #19355

Closed
5 of 7 tasks
andr-azeez opened this issue Mar 22, 2021 · 7 comments
Labels
area/che-server kind/bug Outline of a bug - must adhere to the bug report template. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. severity/P2 Has a minor but important impact to the usage or development of the system.

Comments

andr-azeez commented Mar 22, 2021

Describe the bug

Logged in 10 different users at the same time and launched 10 workspaces (one per user) simultaneously. 3-5 users are able to launch their workspaces successfully; the remaining users get a timeout error during volume mount, and some users' workspaces keep loading with nothing initialized in the log window.

Che version

  • latest

Advanced configuration:

(screenshot attached)

Runtime

  • kubernetes (include output of kubectl version)
    • kubectl version - Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
      Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.12-gke.1210", GitCommit:"199a41188dc0ca5d6d95b1cc7e8ba96e05f9dd0a", GitTreeState:"clean", BuildDate:"2021-02-05T18:03:16Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}

Installation method

  • chectl
    • chectl installation command - sudo chectl server:deploy --installer=helm --platform=k8s --domain=cl-ide.domain.com --multiuser --installer=operator
    • chectl version - chectl/7.27.1 linux-x64 node-v12.21.0

Environment

  • my computer
    • Linux
  • Cloud
    • GCE
      • Google Kubernetes Engine (GKE) - with 2 nodes running
      • Node Configuration - 2-core, 8 GB memory machine with a 30 GB hard disk
      • Auto scaling is enabled

Screenshots

Screenshot from 2021-03-22 16-57-43

Steps to reproduce

  • Log in 10 users at a time.
  • Launch 10 different workspaces at a time.

Expected behavior

It should be able to launch more than 50 workspaces at a time.

@andr-azeez andr-azeez added the kind/bug Outline of a bug - must adhere to the bug report template. label Mar 22, 2021
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Mar 22, 2021
@ericwill ericwill added severity/P2 Has a minor but important impact to the usage or development of the system. area/che-server and removed status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Mar 22, 2021
@vadirajspringpeople

I am facing this issue as well; requesting a fix ASAP.

@andr-azeez (Author)

The same test was repeated with upgraded machines:
Machine size: 8-core vCPU / 16 GB memory
Nodes: 3
Auto Scaling: Enabled

8 users were able to log in successfully; 2 users got the errors shown in the screenshots.

(two screenshots attached)

Is there a concurrency/queuing issue, or does something have to be updated in the advanced configuration?

sleshchenko (Member) commented Mar 26, 2021

I believe there is not much Che can do about it.
I assume K8s behaves badly when one K8s namespace has many configmaps that then have to be mounted into pods.
Che could merge all the data into one configmap, but then you would hit the same issue with more running workspaces.
The only solution is the per-user namespace strategy: workspaceNamespaceDefault: <username>-che.
Note also that running all workspaces in one namespace is going to be fully deprecated soon (#19365)
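For readers following along, the per-user namespace strategy mentioned above might be expressed like the following in a Helm values file (or the equivalent server section of the CheCluster CR). The exact placement of the property depends on the installer and Che version, so treat this as a sketch rather than a verified configuration:

```yaml
# Sketch: per-user workspace namespaces (assumption: property sits under
# the "server" section for this installer/version).
server:
  # <username> is a placeholder Che substitutes with each user's name,
  # so every user's workspaces land in their own namespace.
  workspaceNamespaceDefault: <username>-che
```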

andr-azeez (Author) commented Mar 26, 2021

Tried with the following config:

server:
  workspaceNamespaceDefault: <username>-che
storage:
  pvcClaimSize: 1Gi
  pvcStrategy: common

I am facing the same issue again while launching 10 workspaces at a time: ~6 workspaces launch successfully, and the rest fail.

PVC attachment takes a long time while launching a workspace. Is there any way to pre-attach the PVC for all the workspaces?

sleshchenko (Member) commented Mar 26, 2021

Well, the issue you faced is not about the PVC but about the configmap cache, which is K8s internals.
It's a pity that the namespace strategy did not help.

The last thing you can try to work around the issue is to remove FailedMount from the unrecoverable events; in the CheCluster CR it should look like the following:

spec:
  server:
    customCheProperties:
      CHE_INFRA_KUBERNETES_WORKSPACE__UNRECOVERABLE__EVENTS: FailedScheduling,MountVolume.SetUpfailed,Failed to pull image,FailedCreate,ReplicaSetCreateError

Then the workspace won't fail immediately after the warning happens, and maybe it will get further.

You may find a solution for your cluster if you search for "kubernetes failed to sync configmap cache: timed out waiting for the condition"; I see some issues about it already created on GitHub.

andr-azeez (Author) commented Mar 31, 2021

Changed the configuration as follows:

  server:
    customCheProperties:
      CHE_INFRA_KUBERNETES_PVC_JOBS_MEMORYLIMIT: 756Mi
      CHE_INFRA_KUBERNETES_WORKSPACE__UNRECOVERABLE__EVENTS: FailedScheduling,MountVolume.SetUpfailed,Failed to pull image,FailedCreate,ReplicaSetCreateError

Then tested with 20 users, launching 20 workspaces of random stacks.
Most of the workspaces hit the "kubernetes failed to sync configmap cache: timed out" warning but were still able to launch successfully.
Roughly 15-17 workspaces launched successfully, and about 4-5 workspaces got a workspace launch timeout error.

The next step is to test launching 50 workspaces simultaneously; I will post the results here once the test is done.
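One more knob that might help with the remaining launch-timeout failures (my own suggestion, not something proposed in the thread): Che's workspace start timeout is itself configurable, so raising it can give slow PVC attachment and configmap syncing time to complete instead of failing the launch. The property name below follows the same CHE_INFRA_* double-underscore convention as the properties already used above; verify it against your Che version before applying:

```yaml
# Sketch: raise the workspace start timeout (minutes).
# Assumption: CHE_INFRA_KUBERNETES_WORKSPACE__START__TIMEOUT__MIN maps to
# the che.infra.kubernetes.workspace_start_timeout_min server property.
server:
  customCheProperties:
    CHE_INFRA_KUBERNETES_WORKSPACE__START__TIMEOUT__MIN: "15"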

che-bot (Contributor) commented Oct 12, 2021

Issues go stale after 180 days of inactivity. lifecycle/stale issues rot after an additional 7 days of inactivity and eventually close.

Mark the issue as fresh with /remove-lifecycle stale in a new comment.

If this issue is safe to close now please do so.

Moderators: Add lifecycle/frozen label to avoid stale mode.

@che-bot che-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 12, 2021
@che-bot che-bot closed this as completed Nov 4, 2021