postgres data gone after minikube node reboot #15065

Closed
16 tasks
gattytto opened this issue Nov 1, 2019 · 13 comments
Labels
area/install Issues related to installation, including offline/air gap and initial setup kind/bug Outline of a bug - must adhere to the bug report template. severity/P1 Has a major impact to usage or development of the system.

Comments

@gattytto

gattytto commented Nov 1, 2019

Describe the bug

rebooting the minikube node hosting a che env, postgres pod's /var/lib/pgsql/data is gone, postgres and keycloak pods go BackOff

Che version

  • [ ] latest
  • [x] nightly
  • [ ] other: please specify

Steps to reproduce

anything that causes the minikube node to reboot (be it gracefully or a hard reset)

Expected behavior

I expect the che context to be brought back up with postgres and keycloak pods loading the pre-existing database until I decide to issue chectl:delete

Runtime

  • [ ] kubernetes (include output of kubectl version)
  • [ ] Openshift (include output of oc version)
  • [x] minikube (include output of minikube version and kubectl version)
  • [ ] minishift (include output of minishift version and oc version)
  • [ ] docker-desktop + K8S (include output of docker version and kubectl version)
  • [ ] other: (please specify)

Screenshots

Installation method

  • [x] chectl
    chectl server:start -m -p minikube
  • [ ] che-operator
  • [ ] minishift-addon
  • [ ] I don't know

Environment

  • [x] my computer
    • [ ] Windows
    • [x] Linux
    • [ ] macOS
  • [ ] Cloud
    • [ ] Amazon
    • [ ] Azure
    • [ ] GCE
    • [x] other (please specify)
      - LXD

Additional context

The PersistentVolume that chectl sets up for postgres should use a path beginning with /data, so that minikube does not erase its content on a node hard reset.

When the hostPath "path:" field is left empty in a PersistentVolume definition, minikube's default StorageClass implementation uses /tmp/hostpath-provisioner/ as the folder, which gets emptied on reboot according to https://minikube.sigs.k8s.io/docs/reference/persistent_volumes/

If this gets sorted out, I could go on and run test scenarios for the workspace pods too.
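
The suggestion above can be sketched roughly as follows. This is only an illustration, assuming a statically defined PersistentVolume whose hostPath sits under /data, which the minikube page linked above describes as surviving reboots (worth re-checking against the minikube version in use). The function name `build_postgres_pv`, the PV name `postgres-data-pv`, the path `/data/minikube/userdata`, and the `Retain` reclaim policy are assumptions made for the example, not what chectl or che-operator actually generates.

```python
import json


def build_postgres_pv(name, host_path, size="1Gi", storage_class=""):
    """Return a PersistentVolume manifest (as a dict) backed by a hostPath."""
    if not host_path.startswith("/data/"):
        # Assumption taken from the minikube docs cited above:
        # /data is kept across reboots, /tmp/hostpath-provisioner is not.
        raise ValueError("host_path should live under /data to survive a reboot")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "capacity": {"storage": size},
            "hostPath": {"path": host_path, "type": "DirectoryOrCreate"},
            # Retain is an illustrative choice so the data is not removed with the claim.
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "volumeMode": "Filesystem",
        },
    }


if __name__ == "__main__":
    pv = build_postgres_pv("postgres-data-pv", "/data/minikube/userdata")
    print(json.dumps(pv, indent=2))
```

Since kubectl also accepts JSON manifests, the printed output could be piped to `kubectl apply -f -` on a test cluster.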

$ kubectl get pv pvc-90a86e5a-a7d8-43b5-9bae-9e1064f9df0b -o yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    hostPathProvisionerIdentity: 47e548c5-fca5-11e9-9417-02427d267bb8
    pv.kubernetes.io/provisioned-by: k8s.io/minikube-hostpath
  creationTimestamp: "2019-11-01T15:56:33Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-90a86e5a-a7d8-43b5-9bae-9e1064f9df0b
  resourceVersion: "175275"
  selfLink: /api/v1/persistentvolumes/pvc-90a86e5a-a7d8-43b5-9bae-9e1064f9df0b
  uid: 071444ea-f8f9-4943-9bd0-c7170b94f995
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: postgres-data
    namespace: che
    resourceVersion: "175266"
    uid: 90a86e5a-a7d8-43b5-9bae-9e1064f9df0b
  hostPath:
    path: /tmp/hostpath-provisioner/pvc-90a86e5a-a7d8-43b5-9bae-9e1064f9df0b
    type: ""
  persistentVolumeReclaimPolicy: Delete
  storageClassName: standard
  volumeMode: Filesystem
status:
  phase: Bound

@gattytto gattytto added the kind/bug Outline of a bug - must adhere to the bug report template. label Nov 1, 2019
@gattytto gattytto changed the title from "postgres data gone after minikube node hard reset" to "postgres data gone after minikube node reboot" Nov 1, 2019
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Nov 1, 2019
@gattytto
Author

gattytto commented Nov 1, 2019

@ibuziuk ibuziuk removed the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Nov 4, 2019
@ibuziuk
Member

ibuziuk commented Nov 4, 2019

This looks related to disaster recovery: #14240
@gattytto thanks for reporting, and it looks like you did a pretty good analysis. Would you be interested in contributing a fix?

The PersistentVolume that chectl sets up for postgres should use a path beginning with /data, so that minikube does not erase its content on a node hard reset.

When the hostPath "path:" field is left empty in a PersistentVolume definition, minikube's default StorageClass implementation uses /tmp/hostpath-provisioner/ as the folder, which gets emptied on reboot according to https://minikube.sigs.k8s.io/docs/reference/persistent_volumes/

If this gets sorted out, I could go on and run test scenarios for the workspace pods too.

@ibuziuk ibuziuk added severity/P1 Has a major impact to usage or development of the system. area/install Issues related to installation, including offline/air gap and initial setup team/platform labels Nov 4, 2019
@gattytto
Author

gattytto commented Nov 5, 2019

@ibuziuk yes, partially. I'm in the testing phase, but it can be done.

@gattytto
Author

gattytto commented Nov 7, 2019

I need some help, please. I will provide reproduction steps. First of all, this is specific to the minikube + chectl deployment of Che.

So far I have made code changes in https://github.com/gattytto/che-operator and started the deployment using:
chectl server:start -m -p minikube --che-operator-image=quay.io/gattytto/che-operator:latest -t /usr/local/lib/chectl/templates

One part of the change is in the controller code, which now adds the PersistentVolume. There is also a StorageClass in https://github.com/gattytto/che-operator/blob/master/deploy/storageclass.yaml, which I had to add to the cluster with kubectl because for some reason the dashboard doesn't accept it (command-line kubectl does). The storage class is hardcoded in both the PersistentVolumeClaim (PVC) and the PersistentVolume (PV), because a PVC created without a specific storage class gets the standard one, while the PV gets none. I see the argument to use a specific storage class, but for the time being I just hardcoded it.
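
To illustrate why the class has to be set on both objects: a claim and a statically created volume are only candidates for binding when their `storageClassName` values are equal. The sketch below assumes a hypothetical class name `che-local` and the `kubernetes.io/no-provisioner` provisioner, which is commonly used for statically provisioned volumes. The contents of the linked storageclass.yaml are not reproduced here and may well differ.

```python
import json

CLASS_NAME = "che-local"  # hypothetical class name, not taken from the fork


def storage_class(name=CLASS_NAME):
    """StorageClass without a dynamic provisioner (an assumption, see above)."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "kubernetes.io/no-provisioner",
    }


def postgres_pvc(class_name=CLASS_NAME, size="1Gi"):
    """Claim named like the one in the PV dump (postgres-data in namespace che)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "postgres-data", "namespace": "che"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
            "storageClassName": class_name,
        },
    }


def classes_match(pv, pvc):
    """Simplified check: ignores the admission step that fills in a default class."""
    pv_class = pv["spec"].get("storageClassName", "")
    pvc_class = pvc["spec"].get("storageClassName", "")
    return pv_class == pvc_class


if __name__ == "__main__":
    # kubectl accepts a v1 List, so both objects can be applied in one go.
    bundle = {"apiVersion": "v1", "kind": "List",
              "items": [storage_class(), postgres_pvc()]}
    print(json.dumps(bundle, indent=2))
```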

The chectl YAML files role.yaml and cluster-role.yaml needed the persistentvolumes resource added. I edited the ones in https://github.com/gattytto/che-operator/blob/master/deploy/role.yaml and /cluster-role.yaml respectively and copied them to:
/usr/local/lib/chectl/templates/che-operator/
so chectl uses them when starting the deployment.
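
As a rough sketch of the kind of rule this involves, the snippet below appends a `persistentvolumes` rule to a role manifest held as a dict. The helper name `with_pv_rule`, the role name `che-operator`, and the verb list are assumptions made for illustration; the rules in the linked files may be different.

```python
import copy
import json

PV_RULE = {
    "apiGroups": [""],  # persistentvolumes belong to the core API group
    "resources": ["persistentvolumes"],
    "verbs": ["get", "list", "watch", "create", "delete"],  # assumed verb set
}


def with_pv_rule(role):
    """Return a copy of a Role/ClusterRole manifest with the PV rule added once."""
    patched = copy.deepcopy(role)
    rules = patched.setdefault("rules", [])
    if PV_RULE not in rules:
        rules.append(copy.deepcopy(PV_RULE))
    return patched


if __name__ == "__main__":
    cluster_role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": "che-operator"},  # hypothetical name
        "rules": [],
    }
    print(json.dumps(with_pv_rule(cluster_role), indent=2))
```

PersistentVolumes are cluster-scoped, so it is most likely the ClusterRole entry that actually grants the access.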

I manually created the /data/minikube folder and set its permissions to 777. The operator startup process does create the subfolder "userdata", which holds the postgres db files and has the expected ownership of UID=26 and GID=26.
THIS PART IS IMPORTANT: the PersistentVolume hostPath type is DirectoryOrCreate, and when minikube uses the vm-driver=none option (here, running inside an LXC container), minikube runs as root, so the minikube directory inside /data would be created with root:root ownership. That's why I pre-created it and set the permissions to 777.
This will be fixable from code when the minikube team implements the "mountoptions" property for persistentVolumes in minikube.
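
The manual preparation step could be scripted along these lines. It is a minimal sketch, assuming it runs on the minikube node with enough privileges; the helper name `prepare_host_dir` is made up for the example, and the demo uses a temporary directory rather than the real /data/minikube.

```python
import os
import stat
import tempfile


def prepare_host_dir(path, mode=0o777):
    """Create the directory if needed, force its mode, and return the resulting mode."""
    os.makedirs(path, exist_ok=True)
    # Explicit chmod: the mode passed to makedirs would be filtered by the umask.
    os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Demo in a scratch directory; on the node the target would be /data/minikube.
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "data", "minikube")
        print(target, oct(prepare_host_dir(target)))
```

Mode 777 mirrors what is described above. Changing ownership to the postgres UID/GID (26:26) might be a tighter alternative, but that has not been tried here.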

Part of the process completes, but it gets stuck before deploying the plugin registry. I don't know why, and I also don't know how to further debug why the operator is stopping the deployment process.
As seen in the screenshots, what I CAN be sure of is that both the keycloak and postgres pods are started and healthy. I have also accessed the keycloak-che URL and successfully logged in as admin:admin.

[three screenshots]

@gattytto
Author

gattytto commented Nov 7, 2019

And it works after a hard reset of the LXC container: at least what was started comes back.
image

@sleshchenko
Member

@gattytto Could you share the che-operator logs? AFAIK che-operator does some exec in keycloak; maybe that failed.

@gattytto
Author

gattytto commented Nov 8, 2019

I have finished the code modifications to persist postgres data and it works.

After a hard reset of the LXC container, postgres, keycloak and che come back.

As for workspaces: they don't, because their storage got deleted by minikube.

image

@gattytto
Author

gattytto commented Dec 1, 2019

It seems that PersistentVolumeClaim provisioning is split in half for the Kubernetes use case: che-operator provisions the postgres-data volume, while che-server follows the config values set in volumeclaimStrategy and uses Java code to create the volumes for the workspaces. Could this be moved to the che-operator Go code instead?

@simha369

simha369 commented Dec 4, 2019

I am still facing the same issue: the Postgres persistent volume data is lost after minikube stop.
Is there a solution for this problem? Please share.
If this works in an earlier minikube version, please share the working minikube version.
I am facing the issue in minikube version v1.5.2.

@gattytto
Author

gattytto commented Dec 6, 2019

@simha369 no, there's no fix yet, but I have filed a feature request, #15157. You can patch the che-operator code to persist your postgres database and general info (like ssh keys?) from your dev env, but after a hard reset you would still need to recreate (delete and create again) the workspaces from your devfile registry or using factories. So depending on what you need to persist, there either is or is not a workaround for the moment.

@AndrienkoAleksandr
Contributor

@gattytto Please join the review of eclipse-che/che-operator#144

@tolusha
Contributor

tolusha commented Jan 23, 2020

@gattytto
Do you think we can close the issue?

@gattytto
Author

I'm very happy to say yes
