Implement recovery for Kubernetes/OpenShift infrastructures #5919

Closed
sleshchenko opened this issue Aug 7, 2017 · 4 comments
Labels
kind/enhancement A feature request - must adhere to the feature request template.

Comments

@sleshchenko

sleshchenko commented Aug 7, 2017

Kubernetes/OpenShift workspaces are considered stopped when the workspace master is restarted.
Recovery needs to be implemented for Kubernetes/OpenShift workspaces, so that workspaces are still considered running after the master restarts.
Recovery should be compatible with a Rolling Update of the workspace master, so the recovery workflow should look like this:

  1. Pod CheServer is running and the Service redirects to it. There may be running, starting, or stopping workspaces.
  2. Pod CheServer* is starting. After the start, it should know about all workspaces, and it should be possible to interact with running workspaces (request their servers, stop them).
  3. Pod CheServer* is running and the Service routes all traffic to it.
  4. CheServer is stopping. It should finish all operations on workspaces (stopping and starting workspaces).
  5. CheServer* should know about the finished operations and pick up RUNNING workspaces.
@sleshchenko sleshchenko added kind/task Internal things, technical debt, and to-do tasks to be performed. team/platform labels Aug 7, 2017
@benoitf benoitf changed the title Implement recovery for OpenShift infrastructure [SPI] Implement recovery for OpenShift infrastructure Sep 15, 2017
@benoitf benoitf added the target/branch Indicates that a PR will be merged into a branch other than master. label Sep 15, 2017
@akorneta akorneta self-assigned this Sep 20, 2017
@garagatyi garagatyi changed the title [SPI] Implement recovery for OpenShift infrastructure Implement recovery for OpenShift infrastructure Nov 21, 2017
@sleshchenko

Depends on #7785

@gorkem gorkem mentioned this issue Feb 12, 2018
19 tasks
@sleshchenko sleshchenko self-assigned this Feb 15, 2018
@sleshchenko sleshchenko added the status/in-progress This issue has been taken by an engineer and is under active development. label Mar 16, 2018
@sleshchenko

Today we have the following status on this issue:

It looks like it is not possible to restore all running workspaces when Tomcat boots by using only the Kubernetes/OpenShift client and checking the objects created on the cluster, which is how recovery is implemented in the Docker infrastructure.

Another proposed approach was lazy recovery, where each workspace is recovered only when a client requests it. In this case, a request for the list of workspaces (GET /api/workspace) would trigger several requests to the K8s/OS cluster and increase the response time. Because of that, it was decided not to implement it.

So the meta information of running workspaces needs to be persisted somewhere (for example, in a database) so that they can be recovered later.

Also, the scope of this issue was extended: the Kubernetes/OpenShift infrastructure must be made ready for Rolling Update (the issue description is updated). In this case, recovery should be implemented in the following way:

  1. Pod CheServer is running and the Service redirects to it. There may be running, starting, or stopping workspaces.
  2. Pod CheServer* is starting. After the start, it should know about all workspaces, and it should be possible to interact with running workspaces (request their servers, stop them).
  3. Pod CheServer* is running and the Service routes all traffic to it.
  4. CheServer is stopping. It should finish all operations on workspaces (stopping and starting workspaces).
  5. CheServer* should know about the finished operations and pick up RUNNING workspaces.

Since two Che Server instances may be running at the same time, reworking the infrastructure alone is not enough, because the Workspace API has its own local cache. So the Workspace API should be reworked to use either a local or a persistent/distributed cache, depending on configuration.

More details about Workspace API and Kubernetes/OpenShift changes will be described soon.

@sleshchenko sleshchenko changed the title Implement recovery for OpenShift infrastructure Implement recovery for Kubernetes/OpenShift infrastructures Mar 16, 2018
@sleshchenko

sleshchenko commented Mar 21, 2018

During a Rolling Update there will be a period of time when two instances of Che Server are running.
So these instances, and the data they hold in memory, need to be synchronized somehow.

Kubernetes/OpenShift infrastructure changes

It is proposed to implement OpenShift Recovery in the following way:

  1. OpenShift infrastructure persists meta information of Runtimes which are active (starting, running, stopping).
    Meta information includes:
    - namespace
    - machines[]
      - machineName
      - podName
      - containerName
      - attributes
      - servers[]
        - url
        - status
        - attributes
  2. OpenShift infrastructure fetches persisted Runtimes while evaluating active runtimes: https://github.com/eclipse/che/blob/master/wsmaster/che-core-api-workspace/src/main/java/org/eclipse/che/api/workspace/server/spi/RuntimeInfrastructure.java#L77
  3. OpenShift context uses persisted Runtimes to recover active ones.
    OpenShiftRuntimes flush their statuses (machines, servers) to the persistent layer.

In this manner, the OpenShift infrastructures on the old Che Server Pod and the updated one will be synchronized.
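
To make the persisted meta information more concrete, here is a minimal sketch of how it might be modelled. The field names follow the list in step 1; the type names, the `workspaceId` key, and the `status` field on the runtime are assumptions added for illustration and are not taken from the Che code base. It assumes Java 16+ for records.

```java
import java.util.List;
import java.util.Map;

// Hypothetical model of the persisted runtime meta information.
// Type names are illustrative only; they are not existing Che classes.
enum RuntimeStatus { STARTING, RUNNING, STOPPING }

// One server exposed by a machine: url, status, attributes.
record ServerState(String url, String status, Map<String, String> attributes) {}

// One machine of a runtime, tied to the pod and container that back it.
record MachineState(
    String machineName,
    String podName,
    String containerName,
    Map<String, String> attributes,
    List<ServerState> servers) {}

// One active runtime. workspaceId and status are assumed fields: some key is
// needed to look a runtime up, and only active runtimes are persisted.
record RuntimeState(
    String workspaceId,
    String namespace,
    RuntimeStatus status,
    List<MachineState> machines) {}
```

A freshly started Che Server could load all `RuntimeState` entries from the persistent layer and rebuild its in-memory runtimes from them, instead of querying the cluster for every workspace.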

One more thing that should be covered properly is server readiness probes. It should not cause any issues if two Che Servers perform server checks on RUNNING runtimes. But only one Che Server should perform the initial server checks on STARTING runtimes. The other Che Server should launch its own server checks only when the runtimes become RUNNING.
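
The probe rule could be sketched roughly as follows. `ProbeCoordinator`, `claimStart`, and `shouldProbe` are hypothetical names, the shared `ConcurrentMap` stands in for whatever storage both instances can see, and skipping checks for STOPPING runtimes is an assumption that the text above does not state.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper that decides whether this Che Server instance may run
// server checks for a runtime. Not an existing Che class.
class ProbeCoordinator {
  enum Status { STARTING, RUNNING, STOPPING }

  private final String instanceId;
  // workspaceId -> id of the instance that started the runtime.
  // Stands in for storage shared by both Che Server instances.
  private final ConcurrentMap<String, String> startOwners;

  ProbeCoordinator(String instanceId, ConcurrentMap<String, String> startOwners) {
    this.instanceId = instanceId;
    this.startOwners = startOwners;
  }

  // Atomically records this instance as the one starting the workspace.
  // Returns true if this instance owns the start.
  boolean claimStart(String workspaceId) {
    String previous = startOwners.putIfAbsent(workspaceId, instanceId);
    return previous == null || previous.equals(instanceId);
  }

  boolean shouldProbe(String workspaceId, Status status) {
    if (status == Status.RUNNING) {
      return true; // both instances may check RUNNING runtimes
    }
    if (status == Status.STARTING) {
      // only the instance that started the runtime does the initial checks
      return instanceId.equals(startOwners.get(workspaceId));
    }
    return false; // assumption: no checks while STOPPING
  }
}
```

With this shape, the second instance simply asks `shouldProbe` again when it observes the runtime switching to RUNNING and launches its own checks at that point.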

Workspace API changes

The Workspace API should also be patched a bit. The workspace statuses cache in WorkspaceRuntimes needs to be synchronized between Che Server instances. It looks like a distributed cache without persistence is enough, because the infrastructure will recover all persisted runtimes after Che Server starts.
This part can be done as a separate issue: #9206.
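
One possible shape for such a cache is a small interface that WorkspaceRuntimes talks to, with the backing map chosen by configuration. This is only a sketch: `WorkspaceStatusCache` and `MapBackedStatusCache` are hypothetical names, and treating an absent entry as STOPPED is an assumption.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

enum WorkspaceStatus { STARTING, RUNNING, STOPPING, STOPPED }

// Hypothetical abstraction over the workspace statuses cache.
interface WorkspaceStatusCache {
  WorkspaceStatus get(String workspaceId);

  // Returns true only if no status was stored for the workspace before.
  boolean putIfAbsent(String workspaceId, WorkspaceStatus status);

  // Compare-and-set: succeeds only if the current status equals 'expected'.
  boolean replace(String workspaceId, WorkspaceStatus expected, WorkspaceStatus next);

  void remove(String workspaceId);
}

// Works with any ConcurrentMap. A single-instance setup could pass a plain
// ConcurrentHashMap; a multi-instance setup could pass a replicated map that
// implements ConcurrentMap (an assumption, no specific library is implied).
class MapBackedStatusCache implements WorkspaceStatusCache {
  private final ConcurrentMap<String, WorkspaceStatus> statuses;

  MapBackedStatusCache(ConcurrentMap<String, WorkspaceStatus> statuses) {
    this.statuses = statuses;
  }

  @Override
  public WorkspaceStatus get(String workspaceId) {
    // Assumption: a workspace without an entry is treated as STOPPED.
    return statuses.getOrDefault(workspaceId, WorkspaceStatus.STOPPED);
  }

  @Override
  public boolean putIfAbsent(String workspaceId, WorkspaceStatus status) {
    return statuses.putIfAbsent(workspaceId, status) == null;
  }

  @Override
  public boolean replace(String workspaceId, WorkspaceStatus expected, WorkspaceStatus next) {
    return statuses.replace(workspaceId, expected, next);
  }

  @Override
  public void remove(String workspaceId) {
    statuses.remove(workspaceId);
  }
}
```

The atomic `putIfAbsent` and `replace` operations matter here: with two instances, they keep both servers from starting the same workspace or overwriting each other's status transitions.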

Also, so that users are not forced to reload the page, JSON RPC subscribers need to be synchronized between instances (maybe persisted).

During shutdown, Che Server should have enough time to finish all workspace-related operations, such as STARTING or STOPPING workspaces.
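
A rough sketch of how a stopping instance might wait for its in-progress operations is shown below. `InProgressOperations` is a hypothetical helper, and refusing new operations once shutdown has begun is an assumption (during a Rolling Update the Service already routes new requests to the updated instance).

```java
import java.util.concurrent.TimeUnit;

// Hypothetical tracker of workspace start/stop operations on one Che Server.
class InProgressOperations {
  private int running;
  private boolean shuttingDown;

  // Called before a workspace start or stop begins.
  // Returns false once shutdown has begun (assumption: new work is refused).
  synchronized boolean tryBegin() {
    if (shuttingDown) {
      return false;
    }
    running++;
    return true;
  }

  // Called when a workspace start or stop has finished.
  synchronized void end() {
    running--;
    if (running == 0) {
      notifyAll();
    }
  }

  // Blocks until all operations have finished or the timeout expires.
  // Returns true if the instance can stop without abandoning operations.
  synchronized boolean awaitCompletion(long timeoutMillis) {
    shuttingDown = true;
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    try {
      while (running > 0) {
        long remaining = deadline - System.nanoTime();
        if (remaining <= 0) {
          return false;
        }
        TimeUnit.NANOSECONDS.timedWait(this, remaining);
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      return false;
    }
  }
}
```

The timeout passed to `awaitCompletion` would have to fit within the pod's termination grace period; otherwise the pod is killed before the operations finish.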

The feature that stops all workspaces before Che Server stops (Workspace service termination) should be disabled.

Some aspects of Rolling Update and OpenShift Runtimes recovery may be missing, but I hope this information shows the plan for how OpenShift recovery is going to be implemented.

@sleshchenko sleshchenko added kind/enhancement A feature request - must adhere to the feature request template. and removed kind/task Internal things, technical debt, and to-do tasks to be performed. labels Mar 23, 2018
@sleshchenko

Created one more separate task that should be done before Kubernetes/OpenShift recovery functionality can be used. It is about adapting WorkspaceServiceTermination: #9317

@sleshchenko sleshchenko removed the target/branch Indicates that a PR will be merged into a branch other than master. label Apr 11, 2018
@sleshchenko sleshchenko removed the status/in-progress This issue has been taken by an engineer and is under active development. label Apr 12, 2018