
Seeing persistent/repeated 'persistentvolumeclaim "jenkins-home" not found' errors after OSIO env reset #4121

Open
ldimaggi opened this issue Aug 2, 2018 · 29 comments

Comments

@ldimaggi
Collaborator

ldimaggi commented Aug 2, 2018

For a user (my user account) provisioned on starter-us-east-2, the Jenkins pod fails to start after an OSIO environment reset. The event log includes:

4:48:43 PM | jenkins-1-zvlxn | Pod | Warning | Failed Scheduling | 
persistentvolumeclaim "jenkins-home" not found 152 times in the last 49 minutes
4:03:43 PM | jenkins-1-zvlxn | Pod | Warning | Failed Scheduling | 
persistentvolumeclaim "jenkins-home" is being deleted 5 times in the last 49 minutes
4:03:41 PM | content-repository-1-sf2cr | Pod | Normal | Started | 
Started container

This issue appeared today after the resolution of: #3934

@hrishin

hrishin commented Aug 3, 2018

@ldimaggi we are trying to understand the root cause of this issue. @aslakknutsen, would you like to shed some light here?

@piyush-garg
Collaborator

@ldimaggi Just to correct you, the build team has been facing this issue for the past 5-6 days, and Service Delivery was also informed about it. I discussed this issue with @mmclanerh.

This issue is not happening after or because of #3934. It is a different issue that was also happening before that fix.

The fix we just provided in Jenkins version 4.0.97 is related to #3956 and affects #3934, because Jenkins will now come up faster and the 503 issue will not block the user, since, as you said in #3934 (comment), the build takes 15-20 minutes. We are still working on #3934.

@xyntrix

xyntrix commented Aug 3, 2018

I would suggest looking at the PVC and clearing off any cruft/old files that aren't relevant. I wonder if the number of files present has ballooned due to stale/failed workspace clean-ups?

@aslakknutsen
Collaborator

I have not investigated this specifically, but it looks very similar to another issue from Friday.

Essentially, this can happen due to a Reset; Clean & Apply.

Clean returns 'deleted ok', but when Apply happens multiple seconds later the PVC is still not deleted from OpenShift, so Apply ends up updating the PVC instead of recreating it. OpenShift then later comes around and deletes the PVC.

A second Reset fixes it once the PVC has been deleted.
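Below is a minimal sketch of what a guard against that race could look like, assuming a recent k8s.io/client-go; the function name and namespace are illustrative, not the actual tenant-service code. The idea is that after Clean issues the delete, Apply waits until the PVC is really gone before recreating it, instead of updating an object that is still terminating.

```go
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForPVCGone polls until the PVC no longer exists, or the context expires.
func waitForPVCGone(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		_, err := cs.CoreV1().PersistentVolumeClaims(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return nil // really gone, safe to recreate
		}
		if err != nil {
			return err
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for PVC %s/%s to be deleted", ns, name)
		case <-ticker.C:
		}
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// "Clean": the delete call returns immediately even though the object may
	// linger in Terminating for a while.
	ns, name := "ldimaggi-jenkins", "jenkins-home" // namespace is illustrative
	_ = cs.CoreV1().PersistentVolumeClaims(ns).Delete(ctx, name, metav1.DeleteOptions{})

	// "Apply": only recreate the PVC once the old one is actually gone.
	if err := waitForPVCGone(ctx, cs, ns, name); err != nil {
		panic(err)
	}
	// ... recreate jenkins-home from the Jenkins template here.
}
```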

@ldimaggi
Collaborator Author

ldimaggi commented Aug 9, 2018

Still seeing this:
persistentvolumeclaim "jenkins-home" not found 52 times in the last 16 minutes

It looks like Aslak is correct - a 2nd reset is needed.

@ppitonak
Collaborator

@ldimaggi is this still happening?

@ldimaggi
Collaborator Author

I think we can close this one - it's been almost 2 months and I do not think the issue is happening now.

@ppitonak
Collaborator

ppitonak commented Oct 4, 2018

Happening right now with account osio-ci-e2e-001-preview on prod-preview.

@ppitonak
Collaborator

ppitonak commented Oct 4, 2018

#4378 (comment)

The reason why the jenkins-home PVC was not present is still unknown to me. One theory is that the account may have been reset while the PVC was still mounted, which caused the initialization to fail to create a new PVC because the previous one was still stuck.
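A minimal sketch of how that theory could be checked, assuming a recent k8s.io/client-go; the namespace below is only illustrative, derived from the affected account mentioned above. A PVC that was deleted while still mounted keeps a deletionTimestamp until it is released, and as long as the object exists a create attempt fails with AlreadyExists:

```go
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	ns, name := "osio-ci-e2e-001-preview-jenkins", "jenkins-home" // illustrative namespace

	pvc, err := cs.CoreV1().PersistentVolumeClaims(ns).Get(context.TODO(), name, metav1.GetOptions{})
	switch {
	case apierrors.IsNotFound(err):
		fmt.Println("PVC is gone; a fresh create should succeed")
	case err != nil:
		panic(err)
	case pvc.DeletionTimestamp != nil:
		// The object is still present but marked for deletion: a create will
		// keep failing with AlreadyExists until the volume is unmounted and
		// any finalizers are removed.
		fmt.Printf("PVC %s/%s is stuck in Terminating since %s\n", ns, name, pvc.DeletionTimestamp)
	default:
		fmt.Printf("PVC %s/%s is present, phase %s\n", ns, name, pvc.Status.Phase)
	}
}
```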

@ldimaggi
Collaborator Author

Seeing this problem again on October 17. This is causing automated tests running on the starter-us-east-2 cluster to fail, as the Jenkins pod fails to start after a reset.

@ppitonak
Collaborator

@aslakknutsen regarding your comment from Aug 6, did we try to fix it?

@rupalibehera
Collaborator

This seems like a reset issue; assigning to the platform team.

@hrishin

hrishin commented Nov 23, 2018

Still facing this issue on the us-east-2 cluster.

(screenshot attached)

@aslakknutsen @jmelis @mmclanerh do we have any update on this?

@ppitonak
Collaborator

@alexeykazakov @stevengutz

@chmouel thinks that this is the reason why pipelines fail; can we prioritize this? It causes e2e tests to fail in prod-preview, which in turn blocks deployments to production.

@chmouel

chmouel commented Dec 11, 2018

Related investigation: #4598 (comment)

@alexeykazakov
Member

@MatousJobanek please take a look.

@ppitonak ppitonak added priority/P1 Critical and removed priority/P2 High labels Dec 12, 2018
@ppitonak
Collaborator

I tried to reconstruct what happened; this was useful when we were debugging different Jenkins issues in the past. All times are UTC.

  • Dec 12, 2018 00:10:00 build 4401 starts, creates a space, new app
  • Dec 12, 2018 00:14:32 e2e test creates a workspace in OSIO UI, opens Che and waits for workspace to start
  • Dec 12, 2018 00:20:00 (approximate time) build successfully started (visible in next Jenkins job run's oc-jenkins-logs-before-all.txt)
  • Dec 12, 2018 00:24:40 e2e test fails because Che workspace did not start properly, gathers oc-che-logs.txt
  • Dec 12, 2018 00:24:58 e2e test resets the account successfully
  • Dec 12, 2018 00:25:00 (approximate time) Deleted pod: jenkins-1-z9rvw (visible in next Jenkins job run's oc-jenkins-logs-before-all.txt)
  • Dec 12, 2018 01:10:00 build 4402 starts, creates a space
  • Dec 12, 2018 00:10:00 (approximate time) persistentvolumeclaim "jenkins-home" not found (if I understand it correctly, oc get ev groups similar events; this is the last occurrence, first seen 45 minutes earlier; see the sketch after this list)
  • Dec 12, 2018 01:11:57 e2e test gathers logs oc-jenkins-logs-before-all.txt
    • Jenkins pod in Pending state pod/jenkins-1-h68tr 0/1 Pending 0 47m
  • Dec 12, 2018 01:13:34 application created (new pipeline should be created and started at this point)
  • Dec 12, 2018 01:15:02 e2e test opens pipelines view
  • Dec 12, 2018 01:24:30 (approximate time) Deleted pod: jenkins-1-h68tr
  • Dec 12, 2018 01:25:05 e2e test fails because 'View log' link is not present in UI, saves jenkins-direct-log.png and oc-jenkins-logs.txt
    • in both CICO Jenkins jobs, the OSIO Jenkins version was a0f86aa (latest at that time)
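Since the timeline above leans on how oc get ev groups repeated events, here is a minimal sketch, assuming a recent k8s.io/client-go and an illustrative namespace, of where the "N times in the last M minutes" figures come from: each Event object carries a Count plus first/last timestamps, so repeated warnings collapse into one entry.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Namespace is illustrative; use the affected account's -jenkins namespace.
	events, err := cs.CoreV1().Events("osio-ci-e2e-001-jenkins").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, ev := range events.Items {
		// Repeated "jenkins-home not found" warnings show up as one Event with
		// an incremented Count rather than as separate objects.
		fmt.Printf("%s %s: %q seen %d times between %s and %s\n",
			ev.Type, ev.Reason, ev.Message, ev.Count,
			ev.FirstTimestamp.Time, ev.LastTimestamp.Time)
	}
}
```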

@ppitonak
Collaborator

We have two jobs that failed with a similar error but show different messages in the OpenShift events:
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1802/oc-jenkins-logs.txt

1m          11m          13        jenkins-1-l7pjf.156fa5ad37fdfd2c   Pod                                                Warning   FailedMount                   kubelet, ip-172-21-60-149.ec2.internal   MountVolume.SetUp failed for volume "0c128a67-1b82-4f51-8508-b8b01bf576be-13-14" : stat /var/lib/origin/openshift.local.volumes/pods/7a177a7b-fe32-11e8-827b-12510d4247be/volumes/rht~glfs-subvol/0c128a67-1b82-4f51-8508-b8b01bf576be-13-14: transport endpoint is not connected
35s         9m           5         jenkins-1-l7pjf.156fa5c968ab2215   Pod                                                Warning   FailedMount                   kubelet, ip-172-21-60-149.ec2.internal   Unable to mount volumes for pod "jenkins-1-l7pjf_osio-ci-e2e-002-jenkins(7a177a7b-fe32-11e8-827b-12510d4247be)": timeout expired waiting for volumes to attach or mount for pod "osio-ci-e2e-002-jenkins"/"jenkins-1-l7pjf". list of unmounted volumes=[jenkins-home]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-jv2fp]

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-2-released/1221/oc-jenkins-logs.txt

8m          10m          2         jenkins-1-l8njq.156fcf9529867aa2   Pod                                                Warning   FailedMount                   kubelet, ip-172-31-69-202.us-east-2.compute.internal   Unable to mount volumes for pod "jenkins-1-l8njq_osio-ci-e2e-005-jenkins(856efa70-fe9d-11e8-a5d6-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "osio-ci-e2e-005-jenkins"/"jenkins-1-l8njq". list of unmounted volumes=[jenkins-home]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-9rkb4]

@chmouel

chmouel commented Dec 13, 2018

@ppitonak any chance you can do an oc get pvc before running the tests? Perhaps that would help.

@ppitonak
Collaborator

@chmouel done, it will be available in the following test runs.

@MatousJobanek

Just giving a link to my proposed solution in another issue: #4598 (comment)
May I ask what kind of reset was used by the account? Just a clean, or a complete delete, which is enabled only at the internal feature level?

@ppitonak
Collaborator

Our accounts are set to either beta or released features. The test clicks on Profile -> Edit Profile -> Reset Environment.

@ldimaggi
Collaborator Author

Raising severity to level "2" based on the investigation showing that this issue is the root cause of issue #4598, which results after a user resets their environment.

@MatousJobanek

Just a small update: the fix is done in fabric8-services/fabric8-tenant#714. Now I'm just waiting until the quay database is fixed so I can merge it and deploy it to prod-preview.

@alexeykazakov
Member

It's in prod-preview now. #4598 (comment)

@ppitonak
Collaborator

I haven't seen this issue for a long time. Closing.

@ppitonak ppitonak reopened this Jan 17, 2019
@alexeykazakov
Member

This failure seems to be caused by something else. See #4598 (comment)

I'm assigning it to the build team to investigate the new failures.

@chmouel

chmouel commented Jan 17, 2019

Those are indeed other issues, but they are issues with the OpenShift platform; there is nothing we can 'fix' there, we just need to accept that it is unstable. We could perhaps retry all the time, but that would just amplify the issue.
