
Seeing persistent/repeated 'persistentvolumeclaim "jenkins-home" not found' errors after OSIO env reset #4121

Open
ldimaggi opened this issue Aug 2, 2018 · 29 comments

Comments

@ldimaggi
Collaborator

ldimaggi commented Aug 2, 2018

For a user (my user account) provisioned on starter-us-east-2, the Jenkins pod fails to start after an OSIO environment reset. The event log includes:

4:48:43 PM | jenkins-1-zvlxn | Pod | Warning | Failed Scheduling | 
persistentvolumeclaim "jenkins-home" not found 152 times in the last 49 minutes
4:03:43 PM | jenkins-1-zvlxn | Pod | Warning | Failed Scheduling | 
persistentvolumeclaim "jenkins-home" is being deleted 5 times in the last 49 minutes
4:03:41 PM | content-repository-1-sf2cr | Pod | Normal | Started | 
Started container

This issue appeared today after the resolution of: #3934

@hrishin

hrishin commented Aug 3, 2018

@ldimaggi we are trying to understand the root cause of this issue. @aslakknutsen, would you like to shed some light here?

@piyush-garg
Collaborator

@ldimaggi Just to correct you, the build team has been facing this issue for the past 5-6 days, and Service Delivery was also informed about it. I discussed this issue with @mmclanerh.

This issue is not happening after or because of #3934. It is a different issue that was also happening before that fix.

The fix we just provided in Jenkins version 4.0.97 is related to #3956 and affects #3934, because Jenkins will now come up faster and the 503 issue will not block the user, since, as you said in #3934 (comment), the build takes 15-20 minutes. We are still working on #3934.

@xyntrix

xyntrix commented Aug 3, 2018

I would suggest looking at the PVC and clearing off any cruft/old files that aren't relevant. I wonder if the number of files present has ballooned due to stale/failed workspace clean-ups?

@aslakknutsen
Collaborator

I have not investigated this specifically, but it looks very similar to another issue from Friday.

Essentially, this can happen due to a Reset; Clean & Apply.

Clean returns 'deleted ok', but when Apply happens multiple seconds later the PVC is still not deleted from OpenShift, so Apply ends up updating the PVC instead of recreating it. OpenShift then later comes around and deletes the PVC.

A second Reset fixes it once the PVC has been deleted.
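Below is a minimal sketch of what a guard against that race could look like, assuming a recent k8s.io/client-go; the function name and namespace are illustrative, not the actual tenant-service code. The idea is that after Clean issues the delete, Apply waits until the PVC is really gone before recreating it, instead of updating an object that is still terminating.

```go
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForPVCGone polls until the PVC no longer exists, or the context expires.
func waitForPVCGone(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		_, err := cs.CoreV1().PersistentVolumeClaims(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return nil // really gone, safe to recreate
		}
		if err != nil {
			return err
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for PVC %s/%s to be deleted", ns, name)
		case <-ticker.C:
		}
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// "Clean": the delete call returns immediately even though the object may
	// linger in Terminating for a while.
	ns, name := "ldimaggi-jenkins", "jenkins-home" // namespace is illustrative
	_ = cs.CoreV1().PersistentVolumeClaims(ns).Delete(ctx, name, metav1.DeleteOptions{})

	// "Apply": only recreate the PVC once the old one is actually gone.
	if err := waitForPVCGone(ctx, cs, ns, name); err != nil {
		panic(err)
	}
	// ... recreate jenkins-home from the Jenkins template here.
}
```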

@ldimaggi
Collaborator Author

ldimaggi commented Aug 9, 2018

Still seeing this:
persistentvolumeclaim "jenkins-home" not found 52 times in the last 16 minutes

It looks like Aslak is correct - a 2nd reset is needed.

@ppitonak
Collaborator

@ldimaggi is this still happening?

@ldimaggi
Collaborator Author

I think we can close this one - it's been almost 2 months and I do not think the issue is happening now.

@ppitonak
Collaborator

ppitonak commented Oct 4, 2018

Happening right now with account osio-ci-e2e-001-preview on prod-preview.

@ppitonak
Collaborator

ppitonak commented Oct 4, 2018

#4378 (comment)

The reason why the jenkins-home PVC was not present is still unknown to me. One theory is that the account may have been reset while the PVC was still mounted, which caused the initialization to fail to create a new PVC because the previous one was still stuck.
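A minimal sketch of how that theory could be checked, assuming a recent k8s.io/client-go; the namespace below is only illustrative, derived from the affected account mentioned above. A PVC that was deleted while still mounted keeps a deletionTimestamp until it is released, and as long as the object exists a create attempt fails with AlreadyExists:

```go
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	ns, name := "osio-ci-e2e-001-preview-jenkins", "jenkins-home" // illustrative namespace

	pvc, err := cs.CoreV1().PersistentVolumeClaims(ns).Get(context.TODO(), name, metav1.GetOptions{})
	switch {
	case apierrors.IsNotFound(err):
		fmt.Println("PVC is gone; a fresh create should succeed")
	case err != nil:
		panic(err)
	case pvc.DeletionTimestamp != nil:
		// The object is still present but marked for deletion: a create will
		// keep failing with AlreadyExists until the volume is unmounted and
		// any finalizers are removed.
		fmt.Printf("PVC %s/%s is stuck in Terminating since %s\n", ns, name, pvc.DeletionTimestamp)
	default:
		fmt.Printf("PVC %s/%s is present, phase %s\n", ns, name, pvc.Status.Phase)
	}
}
```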

@ldimaggi
Collaborator Author

Seeing this problem again on October 17. This is causing automated tests running on the starter-us-east-2 cluster to fail, as the Jenkins pod fails to start after a reset.

@ppitonak
Collaborator

@aslakknutsen regarding your comment from Aug 6, did we try to fix it?

@rupalibehera
Collaborator

This seems like a reset issue; assigning to the platform team.

@hrishin

hrishin commented Nov 23, 2018

Still facing this issue on the us-east-2 cluster.

(screenshot attached)

@aslakknutsen @jmelis @mmclanerh do we have any update on this?

@ppitonak
Collaborator

@alexeykazakov @stevengutz

@chmouel thinks that this is the reason why pipelines fail; can we prioritize this? It causes e2e tests to fail in prod-preview, which in turn blocks deployments to production.

@chmouel

chmouel commented Dec 11, 2018

Related investigation: #4598 (comment)

@alexeykazakov
Member

@MatousJobanek please take a look.

@ppitonak ppitonak added priority/P1 Critical and removed priority/P2 High labels Dec 12, 2018
@ppitonak
Collaborator

I tried to reconstruct what happened; this was useful when we were debugging different Jenkins issues in the past. All times are UTC.

  • Dec 12, 2018 00:10:00 build 4401 starts, creates a space, new app
  • Dec 12, 2018 00:14:32 e2e test creates a workspace in OSIO UI, opens Che and waits for workspace to start
  • Dec 12, 2018 00:20:00 (approximate time) build successfully started (visible in next Jenkins job run's oc-jenkins-logs-before-all.txt)
  • Dec 12, 2018 00:24:40 e2e test fails because Che workspace did not start properly, gathers oc-che-logs.txt
  • Dec 12, 2018 00:24:58 e2e test resets the account successfully
  • Dec 12, 2018 00:25:00 (approximate time) Deleted pod: jenkins-1-z9rvw (visible in next Jenkins job run's oc-jenkins-logs-before-all.txt)
  • Dec 12, 2018 01:10:00 build 4402 starts, creates a space
  • Dec 12, 2018 00:10:00 (approximate time) persistentvolumeclaim "jenkins-home" not found (if I understand it correctly, oc get ev groups similar events; this is the last occurrence, first seen 45 minutes earlier; see the sketch after this list)
  • Dec 12, 2018 01:11:57 e2e test gathers logs oc-jenkins-logs-before-all.txt
    • Jenkins pod in Pending state pod/jenkins-1-h68tr 0/1 Pending 0 47m
  • Dec 12, 2018 01:13:34 application created (new pipeline should be created and started at this point)
  • Dec 12, 2018 01:15:02 e2e test opens pipelines view
  • Dec 12, 2018 01:24:30 (approximate time) Deleted pod: jenkins-1-h68tr
  • Dec 12, 2018 01:25:05 e2e test fails because 'View log' link is not present in UI, saves jenkins-direct-log.png and oc-jenkins-logs.txt
    • in both CICO Jenkins jobs, the OSIO Jenkins version was a0f86aa (latest at that time)
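Since the timeline above leans on how oc get ev groups repeated events, here is a minimal sketch, assuming a recent k8s.io/client-go and an illustrative namespace, of where the "N times in the last M minutes" figures come from: each Event object carries a Count plus first/last timestamps, so repeated warnings collapse into one entry.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Namespace is illustrative; use the affected account's -jenkins namespace.
	events, err := cs.CoreV1().Events("osio-ci-e2e-001-jenkins").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, ev := range events.Items {
		// Repeated "jenkins-home not found" warnings show up as one Event with
		// an incremented Count rather than as separate objects.
		fmt.Printf("%s %s: %q seen %d times between %s and %s\n",
			ev.Type, ev.Reason, ev.Message, ev.Count,
			ev.FirstTimestamp.Time, ev.LastTimestamp.Time)
	}
}
```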

@ppitonak
Collaborator

We have two jobs that failed with a similar error but show different messages in the OpenShift events:
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1802/oc-jenkins-logs.txt

1m          11m          13        jenkins-1-l7pjf.156fa5ad37fdfd2c   Pod                                                Warning   FailedMount                   kubelet, ip-172-21-60-149.ec2.internal   MountVolume.SetUp failed for volume "0c128a67-1b82-4f51-8508-b8b01bf576be-13-14" : stat /var/lib/origin/openshift.local.volumes/pods/7a177a7b-fe32-11e8-827b-12510d4247be/volumes/rht~glfs-subvol/0c128a67-1b82-4f51-8508-b8b01bf576be-13-14: transport endpoint is not connected
35s         9m           5         jenkins-1-l7pjf.156fa5c968ab2215   Pod                                                Warning   FailedMount                   kubelet, ip-172-21-60-149.ec2.internal   Unable to mount volumes for pod "jenkins-1-l7pjf_osio-ci-e2e-002-jenkins(7a177a7b-fe32-11e8-827b-12510d4247be)": timeout expired waiting for volumes to attach or mount for pod "osio-ci-e2e-002-jenkins"/"jenkins-1-l7pjf". list of unmounted volumes=[jenkins-home]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-jv2fp]

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-2-released/1221/oc-jenkins-logs.txt

8m          10m          2         jenkins-1-l8njq.156fcf9529867aa2   Pod                                                Warning   FailedMount                   kubelet, ip-172-31-69-202.us-east-2.compute.internal   Unable to mount volumes for pod "jenkins-1-l8njq_osio-ci-e2e-005-jenkins(856efa70-fe9d-11e8-a5d6-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "osio-ci-e2e-005-jenkins"/"jenkins-1-l8njq". list of unmounted volumes=[jenkins-home]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-9rkb4]

@chmouel

chmouel commented Dec 13, 2018

@ppitonak any chance you can do an oc get pvc before running the tests? Perhaps that would help.

@ppitonak
Collaborator

@chmouel done, it will be available in the following test runs.

@MatousJobanek

Just giving a link to my proposed solution in another issue: #4598 (comment)
May I ask what kind of reset was used by the account? Just a clean, or a complete delete, which is enabled only at the internal feature level?

@ppitonak
Collaborator

Our accounts are set to either beta or released features. The test clicks on Profile -> Edit Profile -> Reset Environment.

@ldimaggi
Collaborator Author

Raising severity to level "2" based on the investigation showing that this issue is the root cause of issue #4598, which results after a user resets their environment.

@MatousJobanek

Just a small update: the fix is done in fabric8-services/fabric8-tenant#714. Now I'm just waiting until the quay database is fixed so I can merge it and deploy it to prod-preview.

@alexeykazakov
Member

It's in prod-preview now. #4598 (comment)

@ppitonak
Collaborator

I haven't seen this issue for a long time. Closing.

@ppitonak ppitonak reopened this Jan 17, 2019
@alexeykazakov
Member

This failure seems to be caused by something else. See #4598 (comment)

I'm assigning it to the build team to investigate the new failures.

@chmouel

chmouel commented Jan 17, 2019

Those are indeed other issues, but they are issues with the OpenShift platform; there is nothing we can 'fix' there, we just need to accept that it is unstable. We could perhaps retry all the time, but that would just amplify the issue.
