This repository has been archived by the owner on Mar 17, 2021. It is now read-only.

Various bugs have been found in 'OpenShiftConnector' while fixing "Error while pumping stream" issue #198

Closed
5 tasks
ibuziuk opened this issue Jul 17, 2017 · 6 comments

Comments

@ibuziuk
Member

ibuziuk commented Jul 17, 2017

See https://issues.jboss.org/browse/CHE-180
NOTE: The original "Error while pumping stream" issue was fixed, but @snjeza found a few other issues during investigation - https://issues.jboss.org/browse/CHE-180?focusedCommentId=13396715&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13396715

@ibuziuk ibuziuk changed the title Variou improvements in OpenShiftConnector Various bugs / improvements in OpenShiftConnector has been found while fixing "Error while pumping stream" issue Jul 17, 2017
@ibuziuk ibuziuk changed the title Various bugs / improvements in OpenShiftConnector has been found while fixing "Error while pumping stream" issue Various bugs have been found in 'OpenShiftConnector' while fixing "Error while pumping stream" issue Jul 17, 2017
@amisevsk
Collaborator

Regarding

getEvents does nothing. I have replaced it with openShiftClient.events().inNamespace().list().getItems()

OpenShiftConnector#getEvents() is used only by DockerInstanceStopDetector. I've looked into making this work numerous times, and I can't see a good, elegant way to adapt OpenShift events to the Docker events Che expects. There has been talk of improving this via the SPI (e.g. not relying on a getEvents method, and instead letting the implementation side of the SPI decide how workspace health is detected), and I think it is a good idea to postpone this issue until there is a clearer picture of how that will work.

The issues I encountered with getEvents() are

  • It seems that pod health events are only reported when the liveness probe pings the container, which means the relevant events often don't arrive quickly enough to be useful.
  • The OOM event we're trying to detect is only one of several ways Che can crash due to OOM. More often it is the agents that get killed, and getEvents() won't help us there.
  • The way OOM detection is done in DockerInstanceStopDetector is a bit of a workaround specific to Docker. It's hard to replicate this in OpenShift without ugly hacks.
  • The whole question of OOM detection in Che is largely moot at this point. Che currently pings workspaces for health, and even when I managed to hack together a version of getEvents() that does what we expect, the wsagent ping request would often detect issues before the OOM event ever fired. Since the termination grace period on workspace pods was set to 0, the prompt that appears as a result of the wsagent ping (workspace is no longer responding, with close/restart buttons) does work, although it could be communicated better.
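
To illustrate the ping-based approach from the last point, here is a minimal sketch in plain Java. It is not Che's actual implementation: the class name, the `BooleanSupplier` standing in for a wsagent ping request, and the failure threshold are all assumptions made for the example.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: treat a workspace as unhealthy after N consecutive failed pings. */
final class PingHealthSketch {
    private final BooleanSupplier ping;   // assumed stand-in for a wsagent ping request
    private final int failureThreshold;   // assumed tuning knob, not a Che setting
    private int consecutiveFailures = 0;

    PingHealthSketch(BooleanSupplier ping, int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.ping = ping;
        this.failureThreshold = failureThreshold;
    }

    /** Runs one ping and returns true while the workspace is still considered healthy. */
    boolean pollOnce() {
        boolean ok;
        try {
            ok = ping.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // a ping that throws counts as a failed ping
        }
        // A successful ping resets the counter; a failed one increments it.
        consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
        return consecutiveFailures < failureThreshold;
    }

    public static void main(String[] args) {
        // Simulated ping results: one success followed by repeated failures.
        boolean[] responses = {true, false, false, false};
        int[] next = {0};
        PingHealthSketch check = new PingHealthSketch(() -> responses[next[0]++], 2);
        for (int n = 0; n < responses.length; n++) {
            System.out.println("healthy after ping " + (n + 1) + ": " + check.pollOnce());
        }
    }
}
```

A detector of this kind does not depend on any platform event stream, which is why it sidesteps the problem of mapping OpenShift events onto Docker events.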

@l0rd
Contributor

l0rd commented Oct 10, 2017

We won't fix this in the OpenShiftConnector; it will be addressed in the SPI OpenShift implementation instead.

@ibuziuk
Member Author

ibuziuk commented Nov 23, 2017

@l0rd, after talking with @skabashnyuk today, it looks like we might have performance problems with the multi-tenant che-server due to connection leaks [1] (the mt che-server could become unusable even with a few concurrent users)

[1] fabric8io/kubernetes-client#741
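
For readers unfamiliar with this class of problem, here is a generic sketch of how a connection leak accumulates and how try-with-resources avoids it. It is an assumption that the linked issue involves unclosed clients of this kind; `FakeClient` and its methods are hypothetical stand-ins, not the fabric8 kubernetes-client API.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a connection leak versus deterministic cleanup. */
final class ConnectionLeakSketch {
    /** Counts simulated connections that are currently open. */
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Hypothetical stand-in for a client that holds a connection until closed. */
    static final class FakeClient implements AutoCloseable {
        FakeClient() {
            OPEN.incrementAndGet();
        }

        String listPods() {
            return "pods";
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Leaky variant: a client is created per call and never closed. */
    static String leaky() {
        FakeClient client = new FakeClient();
        return client.listPods();
    }

    /** Safe variant: try-with-resources closes the client even if the call throws. */
    static String safe() {
        try (FakeClient client = new FakeClient()) {
            return client.listPods();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            safe();
        }
        System.out.println("open after safe calls: " + OPEN.get());
        for (int i = 0; i < 3; i++) {
            leaky();
        }
        System.out.println("open after leaky calls: " + OPEN.get());
    }
}
```

With one client created per request, the leaky variant grows with every call, which is consistent with a server degrading under only a handful of concurrent users.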

@l0rd
Contributor

l0rd commented Nov 23, 2017

@skabashnyuk can you help reproduce that with rh-che? How many concurrent users are needed? Are there any particular actions (like creating workspaces, executing commands...) that the concurrent users should perform before we observe this behavior?

@l0rd
Contributor

l0rd commented Nov 24, 2017

Looks related to eclipse-che/che#7418

@l0rd l0rd removed this from the Sprint #135 milestone Apr 18, 2018
@garagatyi

Seems that this issue is outdated.
