Various bugs have been found in 'OpenShiftConnector' while fixing "Error while pumping stream" issue #198

ibuziuk · 2017-07-17T14:20:38Z

See https://issues.jboss.org/browse/CHE-180
NOTE: The original "Error while pumping stream" issue was fixed, but @snjeza found a few other issues during the investigation: https://issues.jboss.org/browse/CHE-180?focusedCommentId=13396715&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13396715

getEvents does nothing. I have replaced it with openShiftClient.events().inNamespace().list().getItems()
I have replaced LogWatch with getLog in getContainerLogs. The OpenShift connector now loads the workspace's entire log into memory, but since those logs are currently empty, that isn't a problem. We will be able to use LogWatch again once fabric8io/kubernetes-client#735 (Issue with LogWatch) is fixed.
The Kubernetes client sets the default ping interval to 1s. See https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/Config.java#L97
The ping interval can cause issues when the listener's onMessage method takes longer than the ping interval to return. In that case, InputStreamPumper throws "Pipe not exception".
The solution is to set the ping interval to 0, which disables pinging and, I think, will also improve performance.
See square/okhttp#3297 (Problem with ping interval).
There is a bug in https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/base/OperationSupport.java#L287: if the response is null, the Kubernetes client throws an NPE and ExecWebSocketListener.onFailure doesn't close its listener properly. See fabric8io/kubernetes-client#733 (OperationSupport.createStatus() can throw a NPE occasionally).
Currently, a new OkHttpClient is created every time an OpenShiftClient method is called, which is not recommended. See https://github.com/square/okhttp/blob/master/okhttp/src/main/java/okhttp3/OkHttpClient.java#L57. I think we should create the OpenShiftClient only once.
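Regarding the getContainerLogs point above, the memory trade-off between the two approaches can be sketched with plain java.io. This is a minimal illustration under assumptions, not fabric8 code: LogReadingSketch, readWholeLog and streamLog are hypothetical names, and the InputStream stands in for the container log stream. The getLog-style method buffers the whole log in a String, while the LogWatch-style method hands each line to a callback as it is read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helpers; `in` stands in for the container log stream.
final class LogReadingSketch {
    // getLog-style: the whole log is materialized in memory before returning.
    static String readWholeLog(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // LogWatch-style: each line is passed on as soon as it is read,
    // so only one line is held in memory at a time. Closes the stream when done.
    static void streamLog(InputStream in, Consumer<String> onLine) {
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                onLine.accept(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With empty logs both behave the same; the difference only matters once logs grow large.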
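The ping-interval problem can be illustrated with a toy, stdlib-only model. It assumes, as a simplification of the behavior discussed in square/okhttp#3297, that a single reader thread handles both application messages and pong frames in order, so a pong queued behind a slow onMessage call can miss its deadline. PingIntervalModel and survives are hypothetical names; this is not OkHttp's implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model, not OkHttp code: one reader thread delivers both
// application messages and pong frames, in arrival order.
final class PingIntervalModel {
    // Returns true if the pong queued behind a slow onMessage call is handled
    // before the ping deadline. An interval of 0 means pinging is disabled.
    static boolean survives(long onMessageMillis, long pingIntervalMillis) {
        if (pingIntervalMillis == 0) {
            return true; // no ping is sent, so nothing can time out
        }
        ExecutorService reader = Executors.newSingleThreadExecutor();
        CountDownLatch pongHandled = new CountDownLatch(1);
        try {
            reader.submit(() -> sleep(onMessageMillis)); // slow listener.onMessage
            reader.submit(pongHandled::countDown);       // pong waits its turn
            return pongHandled.await(pingIntervalMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            reader.shutdownNow();
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Under these assumptions, survives(500, 50) returns false, while survives(500, 0) returns true because no ping is ever sent, which mirrors the reasoning behind setting the ping interval to 0.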
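For the NPE in OperationSupport.createStatus(), the defensive pattern can be sketched as follows. This is a hypothetical illustration, not the code from fabric8io/kubernetes-client#733: HttpResponse, Status and ExecListenerSketch are stand-in names. The idea is to handle a null response explicitly and to close the listener in a finally block, so cleanup still happens even if building the status fails.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-ins for an HTTP response and a Kubernetes-style Status.
final class HttpResponse {
    final int code;
    final String message;
    HttpResponse(int code, String message) { this.code = code; this.message = message; }
}

final class Status {
    final int code;
    final String message;
    Status(int code, String message) { this.code = code; this.message = message; }
}

final class ExecListenerSketch {
    final AtomicBoolean closed = new AtomicBoolean(false);
    volatile Status lastStatus;

    // Null-safe: a failure without a response (e.g. a dropped connection)
    // yields a generic status instead of a NullPointerException.
    static Status createStatus(HttpResponse response) {
        if (response == null) {
            return new Status(0, "no response received");
        }
        return new Status(response.code, response.message);
    }

    // `cause` would normally be logged; the listener is closed in `finally`
    // so cleanup runs even if building the status throws.
    void onFailure(Throwable cause, HttpResponse response) {
        try {
            lastStatus = createStatus(response);
        } finally {
            close();
        }
    }

    void close() {
        closed.set(true);
    }
}
```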
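Creating the client only once could look like the following lazily initialized, shared holder. This is a minimal sketch under assumptions: FakeClient stands in for OpenShiftClient and SharedClientProvider is a hypothetical name; in a real connector the factory would build the actual client from its configuration. The shared instance is closed once, for example on server shutdown, rather than a new client being built and discarded on every call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for OpenShiftClient; counts how many instances are built.
final class FakeClient implements AutoCloseable {
    static final AtomicInteger created = new AtomicInteger();
    volatile boolean closed;

    FakeClient() { created.incrementAndGet(); }

    @Override
    public void close() { closed = true; }
}

// Lazily creates one shared client and returns the same instance to every caller.
final class SharedClientProvider<T extends AutoCloseable> implements AutoCloseable {
    private final Supplier<T> factory;
    private T instance;

    SharedClientProvider(Supplier<T> factory) { this.factory = factory; }

    synchronized T get() {
        if (instance == null) {
            instance = factory.get();
        }
        return instance;
    }

    // Close the shared client once, e.g. on shutdown.
    @Override
    public synchronized void close() {
        if (instance == null) {
            return;
        }
        try {
            instance.close();
        } catch (Exception e) {
            throw new IllegalStateException("failed to close shared client", e);
        } finally {
            instance = null;
        }
    }
}
```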

amisevsk · 2017-07-17T18:51:32Z

Regarding

getEvents does nothing. I have replaced it with openShiftClient.events().inNamespace().list().getItems()

OpenShiftConnector#getEvents() is used only by DockerInstanceStopDetector. I've looked into making this work numerous times, and I can't see a good, elegant way to adapt OpenShift events to the Docker events Che expects. There has been talk of improving this via the SPI (e.g. not relying on a getEvents method, and instead letting the implementation side of the SPI decide how workspace health is detected), and I think postponing this issue until there is a clearer picture of how that will work is a good idea.

The issues I encountered with getEvents() are

It seems that pod health events are only reported when the liveness probe pings the container, which means the relevant events often don't arrive quickly enough to be useful.
The OOM event we're trying to detect is only one of several ways Che can crash due to OOM. More often, agents are killed, and getEvents() won't help us there.
The way OOM detection is done in DockerInstanceStopDetector is a bit of a workaround specific to Docker. It's hard to replicate this in OpenShift without ugly hacks.
The entire issue of Che OOM handling is kind of moot at this point. Che currently pings workspaces for health, and even when I managed to hack together a version of getEvents() that does what we expect, the wsagent ping request would often detect issues before the OOM event ever fired. Since the termination grace period on workspace pods was set to 0, the prompt that appears as a result of the wsagent ping (workspace is no longer responding, with close/restart buttons) does work, although it could be communicated better.

l0rd · 2017-10-10T18:06:51Z

We won't fix this in the OpenShiftConnector, but rather in the SPI OpenShift implementation.

ibuziuk · 2017-11-23T12:23:29Z

@l0rd after talking with @skabashnyuk today: we might have performance problems with the multi-tenant che-server due to connection leaks [1] (it looks like the multi-tenant che-server could become unusable even with a few concurrent users).

[1] fabric8io/kubernetes-client#741

l0rd · 2017-11-23T15:21:54Z

@skabashnyuk can you help reproduce that with rh-che? How many concurrent users? Are there any particular actions (like creating workspaces, executing commands...) that the concurrent users should perform before we observe this behavior?

l0rd · 2017-11-24T14:33:25Z

Looks related to eclipse-che/che#7418

garagatyi · 2018-07-11T07:23:16Z

It seems that this issue is outdated.

ibuziuk changed the title ~~Variou improvements in OpenShiftConnector~~ Various bugs / improvements in OpenShiftConnector has been found while fixing "Error while pumping stream" issue Jul 17, 2017

ibuziuk changed the title ~~Various bugs / improvements in OpenShiftConnector has been found while fixing "Error while pumping stream" issue~~ Various bugs have been found in 'OpenShiftConnector' while fixing "Error while pumping stream" issue Jul 17, 2017

amisevsk mentioned this issue Jul 17, 2017

Testing che when OpenShift is loaded or overcommited #194

Closed

ibuziuk added the severity/P2 label Jul 18, 2017

l0rd added the sprint/current-sprint label Jul 19, 2017

l0rd modified the milestone: Sprint #135 Jul 19, 2017

ibuziuk mentioned this issue Jul 19, 2017

Sprint #135 #197

Closed


l0rd mentioned this issue Aug 4, 2017

Fix leak of connection to OpenShift eclipse-che/che#5902

Closed

ibuziuk removed the sprint/current-sprint label Aug 9, 2017

l0rd removed this from the Sprint #135 milestone Apr 18, 2018

garagatyi closed this as completed Jul 11, 2018
