
Workspaces take too long to stop #1410

Closed
2 of 4 tasks
ScrewTSW opened this issue May 15, 2019 · 14 comments

Comments

@ScrewTSW
Member

ScrewTSW commented May 15, 2019

Issue problem:
Recently it very often happens that workspaces take upwards of 200 seconds, and sometimes up to 1000 seconds, to stop.

This issue is visible across all clusters with both ephemeral and PVC workspaces (che7)

Workspace IDs that failed to stop:

These workspace stop failures were collected between 10:00 pm yesterday and 10:00 am this morning.

PVC workspaces

1a: osiotest-workspace-1a total: 17

  • workspacewvafn1awvb8x1oax
  • workspace0azosssqnwnnamrz
  • workspacedkjdne9qb8dmb12r
  • workspacecbkrswrlmka8b3ie
  • workspaceqe47ta7okc0k22bl
  • workspace8mkxnf9r8gw4htt6
  • workspacedkjdne9qb8dmb12r
  • workspace3ijyyynk4bwr6iaw
  • workspaceqnftnnwr4pr1u6qf
  • workspaceva62dih9lwdf7lgi
  • workspacexlzrrp441tsgn2u3
  • workspacevb9nska2aepvl0eh
  • workspacefx6y3b7bvt1g7bol
  • workspace8hydyq200ryuss3v
  • workspacexma5mpptcrndud8j
  • workspace3eg3yrinj90uue88
  • workspacel37oqbcm8o86tj6y

1b: osiotest-workspace-1b total: 14

  • workspaceogj3ha7japkfemx3
  • workspacettkovts9egov9gog
  • workspacehqths5cyy6j2bhal
  • workspacezr4oyzfxgp0bj7sj
  • workspacevgwa7bcfaijwiub8
  • workspacehqths5cyy6j2bhal
  • workspace8fao0hsnk5tj7uli
  • workspacegga733wvpsf7xz7y
  • workspaceadwg3au7qdojhuhk
  • workspacesd5cexl3pcfdjz1k
  • workspacekkwvaf12p13fddks
  • workspacepzk2gmr7wt4mdcwr
  • workspaceqc5uxcwlrj5fk7s7
  • workspacesxjjz7ds74h1v9fi

2: osiotest-workspace-2 total: 17

  • workspacei9e6bx9kx4eityiy
  • workspacevel3v8ult7hwp08h
  • workspace7p75afwbgi60jruq
  • workspace64uzf7zmnt4wb3bc
  • workspacezze7v1iq54xm64af
  • workspace83sdj3hqpkxvgmfa
  • workspaceerl2qg7eatn1r0h8
  • workspace7p75afwbgi60jruq
  • workspace310hqbfb34r8zttz
  • workspacexjxzw8ldxcnx3jju
  • workspace7y0ay94u16f0yfr1
  • workspaceaxr0pi3iv9euk1jk
  • workspacecp0g4tronp98eud3
  • workspaceq0hbv7qessfaikv8
  • workspacenbwiwoafe2hx5iji
  • workspacexd92qfc99i0icmfq
  • workspaceb22uadtawo00mdt4

2a: osiotest-workspace-2a

  • not available - account has stuck running pods, excluded from tests

Ephemeral workspaces

1a: osiotest-workspace-eph-1a total: 19

  • workspaceou8o8uvtz6bo9jkg
  • workspace2uxbkrid6zt4lpjp
  • workspaceqncuy89bfg82q6tl
  • workspaceax7sdu2h3r7ydasy
  • workspacewn5w6nq664bk6kt1
  • workspace2kxh2y40azafvjv8
  • workspace30106z5xev9m3i9g
  • workspacel4u0ravw3u4u8eun
  • workspacew6dzf3yqt8zi70jh
  • workspacevwuer5ekskobrrlg
  • workspacec7fk0z2byzfyveol
  • workspace8sipoo5i7pxs3u8w
  • workspace4zcrn6gziomkf04s
  • workspaceqoh05jhn37dbijsr
  • workspaceej6a8jgo2yhhonfg
  • workspaceukqsxl6b1shd8hal
  • workspacejbtru1kb3q7w0qja
  • workspacek5qg1s9f4o39ssur
  • workspace68e16k41nsbwab3t

1b: osiotest-workspace-eph-1b total: 14

  • workspace6ha3eip3gz57yax2
  • workspacewgptydgtgfu5ffcx
  • workspaceoj25u1kqka53xj18
  • workspacer6cfys4loueixexr
  • workspacedp0i8hlwduusofhh
  • workspacecgtr5sayu56sfqph
  • workspaceub0wq5ncvmn50k7x
  • workspaceley7yoaaxurjmek9
  • workspacecmfm0a1ci9v0ms0q
  • workspace9zjbhuk42qeqd2y0
  • workspacea5h1numb4mhu1amg
  • workspace0zihl6vdo5d71s0h
  • workspaceavolf11tilqt9u37
  • workspaceanbvsauhz463lm6c

2: osiotest-workspace-eph-2 total: 13

  • workspacem6l5clm37ux4azud
  • workspace5r1p0fqyc13wc0w0
  • workspacezzw9j451cgwqjdvq
  • workspacegugzf4tlfmbla7r6
  • workspace1bujuwsbdx4fbbof
  • workspacez9jecg5j0drb7pty
  • workspace4ezlpjlu8yq45vjo
  • workspaceamnv55ydlqa01k06
  • workspace809oiss80n4v4hwu
  • workspacep5rfxkcrhyp5jovg
  • workspacegrb5lhdpr1v2wycs
  • workspace2ymd2zf8heoj1xh3
  • workspacepaslvi2iaebb1plx

2a: osiotest-workspace-eph-2a-new total: 17

  • workspacet5kay8v6h1ibv70f
  • workspaceeaepvep5h5cbe2us
  • workspaceigvn3yboxj2vn4fw
  • workspacednvznsmjxz9x5ttt
  • workspaceh5hsgt7ofn9ltq7m
  • workspace33ctthytiglvo8v3
  • workspaceyl4lcd5r6eih3eca
  • workspace07aslyt84o94d0d5
  • workspacey2eh54yg5crmrfbd
  • workspacey8w9e41j95l6mzaz
  • workspace4aporvtc89j86kj4
  • workspacehp6jxdkj9uwp3sfv
  • workspace734ip7lefagbdti1
  • workspacei66ve9ih1ht4xmfp
  • workspace4r3srlfh3qy7o2ld
  • workspace7dac8pao5lrqozm7
  • workspace5f20vp9xtdtcaie0

Red Hat Che version:

version: @eclipse-che/theia-assembly 0.5.0

  • I can reproduce it on latest official image

Reproduction Steps:

This issue happens at random, with intervals where the workspace stops instantly and others where it takes an extremely long time. The test loop is as follows (a minimal sketch is shown after the list):

  • Create workspace
  • Start workspace
  • Wait for it to be in state READY
  • Stop workspace
  • Delete workspace
  • repeat
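For reference, a minimal sketch of that loop is below. It assumes a Che-style REST API; the base URL, token handling, the `/workspace/{id}/runtime` endpoints, and the status names are illustrative assumptions, not taken from the actual periodic-test code.

```python
# Minimal sketch of the start/stop/delete loop used to measure stop times.
# Base URL, auth, endpoint paths, and status names are assumptions for
# illustration only, not the actual periodic-test implementation.
import time
import requests

CHE_API = "https://che.example.com/api"        # assumed base URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumed auth header

def wait_for_status(ws_id, wanted, timeout=900, poll=5):
    """Poll the workspace until it reports the wanted status, or time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        ws = requests.get(f"{CHE_API}/workspace/{ws_id}", headers=HEADERS).json()
        if ws.get("status") == wanted:
            return time.time()
        time.sleep(poll)
    raise TimeoutError(f"{ws_id} did not reach {wanted} within {timeout}s")

def run_iteration(ws_id):
    # Start the workspace and wait until it is ready.
    requests.post(f"{CHE_API}/workspace/{ws_id}/runtime", headers=HEADERS)
    wait_for_status(ws_id, "RUNNING")

    # Stop it and measure how long the stop takes end to end.
    stop_requested = time.time()
    requests.delete(f"{CHE_API}/workspace/{ws_id}/runtime", headers=HEADERS)
    stopped_at = wait_for_status(ws_id, "STOPPED")
    print(f"{ws_id}: stop took {stopped_at - stop_requested:.1f}s")

    # Delete the workspace before the next iteration.
    requests.delete(f"{CHE_API}/workspace/{ws_id}", headers=HEADERS)
```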

Runtime:

runtime used:

  • minishift (include output of minishift version)
  • OpenShift.io
  • Openshift Container Platform (include output of oc version)
@Katka92
Collaborator

Katka92 commented May 15, 2019

I think something similar is happening with a few workspaces created during the periodic tests.
Affected workspaces (all on production):

workspacer9xqgwoluvvm5s1w 2a 
workspaceo7jttxuwdggqa2ei 1a
workspaceuap750q287ln1epb 1a
workspacexccsqy6sjbj7epr4 1b
workspacetni7njpgioqze2by 2

What I can see in Kibana, e.g. for workspaceo7jttxuwdggqa2ei:

00:10:28.594 Workspace 'osiotest1a/workspace5i4uis' with id 'workspaceo7jttxuwdggqa2ei' created by user 'osiotest1a'
00:10:29.065 Starting workspace 'osiotest1a/workspace5i4uis' with id 'workspaceo7jttxuwdggqa2ei' by user 'osiotest1a'
(workspace is not started in 10 minutes, try to stop and remove it)
00:20:34.699 Workspace 'osiotest1a/workspace5i4uis' with id 'workspaceo7jttxuwdggqa2ei' is stopping by user 'osiotest1a'
00:22:11.050 Workspace 'osiotest1a:workspace5i4uis' with id 'workspaceo7jttxuwdggqa2ei' started by user 'osiotest1a'
00:22:11.079 No workspace start time recorded for workspace workspaceo7jttxuwdggqa2ei
00:27:16.806 Workspace 'osiotest1a/workspace5i4uis' with id 'workspaceo7jttxuwdggqa2ei' is stopped by user 'osiotest1a'
(then four of this:)
00:27:17.007 Could not get runtimeIdentity for workspace 'workspaceo7jttxuwdggqa2ei', and so cannot verify if current subject '593d4c52-dfe0-47cd-9238-6a44b06e4d34' owns workspace
00:27:17.529 Workspace 'workspaceo7jttxuwdggqa2ei' removed by user 'osiotest1a'
(then two of this:)
00:27:21.565 Could not get runtimeIdentity for workspace 'workspaceo7jttxuwdggqa2ei', and so cannot verify if current subject '593d4c52-dfe0-47cd-9238-6a44b06e4d34' owns workspace

So from what I can see here, the events were create - starting - stopping - started - stopped - removed. Please note that from starting to started it took 12 minutes and from started to stopped it took 5 minutes.
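Just to double-check the arithmetic, those durations can be recomputed directly from the quoted Kibana timestamps; nothing in the snippet below is Che-specific.

```python
# Recompute the durations quoted above from the Kibana timestamps.
from datetime import datetime

fmt = "%H:%M:%S.%f"
starting = datetime.strptime("00:10:29.065", fmt)  # "Starting workspace ..."
started  = datetime.strptime("00:22:11.050", fmt)  # "... started by user ..."
stopped  = datetime.strptime("00:27:16.806", fmt)  # "... is stopped by user ..."

print("starting -> started:", started - starting)  # ~11 min 42 s ("12 minutes")
print("started  -> stopped:", stopped - started)   # ~5 min 6 s  ("5 minutes")
```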

@ibuziuk
Member

ibuziuk commented May 15, 2019

@tdancs @ScrewTSW @Katka92 could you please confirm that this is an issue on both prod and prod-preview? Going to bring it to SRE today.

@Katka92
Collaborator

Katka92 commented May 16, 2019

Seen again on production

workspacezjweqmjgo4p19edk 2
workspaceu2geeobgdb6yrmab 1a
workspaceq4jnery4fyew6b5q 1b

@Katka92
Collaborator

Katka92 commented May 16, 2019

@ibuziuk Just a thought - could it be caused by workspaces taking a long time to start? I just realised the implication of what I've posted here #1410 (comment). In the Kibana logs it is shown pretty clearly - the time from when the workspace is started to when it is stopped is just 5 seconds. So the problem is not in the stopping itself.
I think the tests send a request to stop the workspace and wait until it is stopped, but they don't account for the fact that some of the measured time is spent by the pod getting started. The problem is that even when the stop request is sent, the pod somehow ignores it and waits until it has started. Only then does it stop and get deleted.
If I'm correct, this problem is only a consequence of the starting issue.
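In other words, if the stop request is only acted on once the start has finished, the stop duration measured by the test silently absorbs the remaining start time. A toy sketch of that effect (purely illustrative, not Che code):

```python
# Toy model: what the test measures if a stop request is only processed
# after the in-flight start completes.
def measured_stop_seconds(remaining_start_s, actual_stop_s):
    # The test clock starts when the stop request is sent and stops when the
    # workspace reports STOPPED, so any leftover start time is counted too.
    return remaining_start_s + actual_stop_s

# Example: the stop itself takes ~5 s, but the workspace still needs ~5 min
# to finish starting, so the test reports ~305 s.
print(measured_stop_seconds(remaining_start_s=300, actual_stop_s=5))  # 305
```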

@Katka92
Collaborator

Katka92 commented May 16, 2019

If I'm correct, this problem is only a consequence of starting issue.

Ah, from what I can see, e.g. for workspace workspace4zcrn6gziomkf04s it was started in ~3 minutes and stopped after ~4 minutes. So this is still a different issue.

@Katka92
Collaborator

Katka92 commented May 21, 2019

Seen again on cluster 2:

workspaceuj6b71yhiy7xdj9z   4 minutes 
workspacegvsef06s4366xu5z   7 minutes

@Katka92
Collaborator

Katka92 commented May 22, 2019

Seen again:

Cluster 2a:

workspacehe686e199dl4srol   5 minutes

Cluster 1a:

workspaceupqt2m647jcpubfs   10 minutes
workspacebgk5jpg2kech6ix7   10 minutes
workspacec9qxlg9rsgc9v2ly   11 minutes

@ibuziuk ibuziuk self-assigned this May 22, 2019
@ibuziuk
Member

ibuziuk commented May 22, 2019

Prod pods have been restarted and I cannot reproduce the issue anymore.

@ibuziuk ibuziuk added this to the Sprint #167 (Che OSIO) milestone May 22, 2019
@ScrewTSW
Member Author

Since yesterday, the time it takes to stop workspaces has been significantly reduced.
The question is, why does prod-preview show less than half of production's times for both starting and stopping?

PVC stopping times hover around 20-40 seconds on prod and 10 seconds on prod-preview.
PVC starting times are 40-60 seconds on prod and a solid 40 seconds on prod-preview.
Ephemeral stopping times are 15-30 seconds with occasional spikes on prod and 2-10 seconds on prod-preview.
Ephemeral starting times are 40-60 seconds with occasional spikes on prod and a solid 40 seconds on prod-preview.

@ibuziuk
Member

ibuziuk commented May 23, 2019

More than 5 seconds for a workspace stop is an abnormal situation. Are you sure it still persists after a series of deprovisioning actions? For me, workspace stop takes just a couple of seconds.
@tdancs also, could you please compare the startup / stop results on the 2a cluster only (which is used for both prod and prod-preview) to make sure the infrastructure is the same.

@Katka92
Collaborator

Katka92 commented May 24, 2019

@ibuziuk I quickly scanned workspace stop times in the periodic tests. I went through just a few of them, but I can see e.g. this (logs gathered from Kibana):

May 24th 2019, 05:08:57.235 Workspace 'osiotest2/workspaceu35lb6' with id 'workspacesbmbkmtpr51d4sqh' is stopping by user 'osiotest2'
May 24th 2019, 05:09:11.591 Workspace 'osiotest2/workspaceu35lb6' with id 'workspacesbmbkmtpr51d4sqh' is stopped by user 'osiotest2'

This says that stopping this workspace took 14 seconds. The test ran this morning, so I would say the issue is still present.

@ScrewTSW
Member Author

More cases of workspaces taking too long to stop (data from this weekend, 24-27 May):

Ephemeral workspaces

starter-us-east-1a : osio-ci-testcreation

  • workspacevdnzc21vny4es4ar
  • workspace4aiv59mmgtayds12
  • workspacewswquj7uwtb3oh97

starter-us-east-1b : osiotest-workspace-eph-1b

  • workspaceenpoegakfzh8kup2
  • workspace261z3ahrmcsptrcu
  • workspaceyzkzrn9uxg9vnwu4

starter-us-east-2 : osiotest-workspace-eph-2

  • workspacew95id1pdk87rbjq0
  • workspacec52v7wg3yii30dke
  • workspacespbw78nwxr1smpnn
  • workspacevsp0d79n5b9jhren

starter-us-east-2a : osiotest-workspace-eph-2a-new

  • workspacer5himz8cxveyemg6
  • workspacefmq4744402fq6han
  • workspacet19cc5iydbe38ogh

starter-us-east-2a-preview : osiotest-workspace-eph-preview

  • NONE

PVC workspaces:

starter-us-east-1a : osiotest-workspace-1a

  • workspace42etpzz8anwwd0bs
  • workspace1fnywbw4b1iyzkfy

starter-us-east-1b : osiotest-workspace-1b

  • workspacezy5mf8c5y46zj4i7

starter-us-east-2 : kkanova-osiotest1

  • workspacebgblk9ybuc2djn9b
  • workspaceozqhfhszikv3q99l

starter-us-east-2a : che-perf-prod1

  • workspaceasqexxoal9a2h38o
  • workspace4burlydy1epe4ju2
  • workspacepoqvvrezgf9lxya6

starter-us-east-2a-preview : osiotest-workspace-new-preview

  • NONE

@ibuziuk
Member

ibuziuk commented Oct 22, 2019

Fixed by updating the async thread pool settings - the average workspace stop over the last week is 1.6 seconds according to Grafana.
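For context on why a thread pool change helps: if start and stop operations share an undersized async executor, a cheap stop task can sit in the queue behind a long-running start task, which matches the latency pattern reported above. The snippet below is only a minimal Python illustration of that queuing effect; the Che server itself is Java, and the actual property names that were tuned are not shown here.

```python
# Illustration: with a single-worker pool, a quick "stop" task queues behind a
# slow "start" task; a larger pool lets it run immediately.
import time
from concurrent.futures import ThreadPoolExecutor

def start_task():
    time.sleep(3)    # stands in for a slow workspace start

def stop_task():
    time.sleep(0.1)  # the stop itself is cheap

for workers in (1, 4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.submit(start_task)          # occupies a worker
        t0 = time.monotonic()
        pool.submit(stop_task).result()  # wait for the stop to finish
        print(f"{workers} worker(s): stop observed after "
              f"{time.monotonic() - t0:.1f}s")
```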

@ibuziuk ibuziuk closed this as completed Oct 22, 2019