[CLOUD-2261] handling txn recovery by querying api #230
Conversation
if [ $probeStatus -eq 0 ] && [ "$(type -t probePodLog)" = 'function' ]; then
  # -- checking if server.log is clean from errors (only if function of the particular name exists)
  probePodLog # calling function from partitionPV.sh
  probeStatus=$?
Have you been able to verify that this is useful? By default, logging is sent to stdout. This may also levy requirements on the pod for storage related to logging (e.g. emptyDir volume mount).
Yes, to my knowledge this is useful, and I currently haven't found a better way to determine the outcome of the unavailable resource. I've also created a JIRA asking the QA guys to create a test for it: https://issues.jboss.org/browse/CLOUD-2519
Regarding the server log probing: it's not done by checking the file itself but by querying the OpenShift API. That way the data is taken from stdout, not from a file in particular.
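As a sketch of that approach (the helper names here are illustrative, not from the PR; the endpoint is the standard Kubernetes pod-log API, which serves the container's stdout, i.e. the same stream `oc logs` shows):

```python
# Hypothetical sketch of fetching a pod's log via the API instead of a file.
# The service-account paths match the in-cluster convention used by the PR.
TOKEN_FILE_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token'

def build_pod_log_url(api_host, namespace, pod_name):
    # GET /api/v1/namespaces/{ns}/pods/{name}/log returns the container's stdout
    return 'https://%s/api/v1/namespaces/%s/pods/%s/log' % (api_host, namespace, pod_name)

def auth_headers(token):
    # bearer token read from TOKEN_FILE_PATH inside the pod
    return {'Authorization': 'Bearer ' + token}
```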
if [ $? -eq 0 ] ; then
if [ $probeStatus -eq 0 ] && [ "$(type -t probePodLog)" = 'function' ]; then
Does this check need to be able to discern the difference between an application failure and a probe failure (e.g. probe fails because server's taking too long to start)?
Agreed, the check currently is not capable of discerning a probe failure from an application failure. Is your point that an application failure (i.e. an ERROR during app startup) should not cause the migrate pod to be marked as unable to start?
More that the probe might fail because the management port isn't available yet, e.g. if it were run immediately after startup, i.e. it's still initializing. There is logic in the probe script to accommodate retries, e.g. it retries 30 times with a one second delay after each failure, but we're thinking of moving that configuration into the container's probe configuration directly, which could affect usage here. This is just something to consider.
I was (at least) aware of the fact that there are some retries for the startup initialization. The function which probes the log (i.e. downloads the log via an OpenShift API call) waits until the readiness probe returns success. At that point the script expects the container is started and it checks the content of the log.
I'm not sure what is meant by the "container's probe configuration". I expect such refactoring could affect this PR, but perhaps not much. The script here just waits to get a started app server. If it's possible to find out the status of the server (started OK / error on starting), then the functionality here will be just the same as presented in this PR.
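The wait-then-check flow described above can be sketched like this (a minimal illustration with assumed function names, not the actual script):

```python
import time

def wait_then_check_log(is_ready, fetch_log, attempts=30, delay=0):
    # Poll the readiness check; once it succeeds, pull the server log
    # (e.g. via the pod-log API) and report whether it is free of ERROR lines.
    for _ in range(attempts):
        if is_ready():
            log = fetch_log()
            return not any('ERROR' in line for line in log.splitlines())
        if delay:
            time.sleep(delay)
    # readiness never succeeded within the retry budget
    return False
```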
This is what I was referring to: https://docs.openshift.com/container-platform/3.9/dev_guide/application_health.html
In addition to what's listed there, you can also specify `periodSeconds`, `successThreshold` and `failureThreshold`, the first and last of which could replace the settings in the probe itself. That's what I was talking about.
OK. Do I understand correctly that the probe script will stay the same as it is now (https://github.com/jboss-openshift/application-templates/blob/master/eap/eap71-tx-recovery-s2i.json#L378), but you would use `periodSeconds` instead of the `COUNT` which is currently in use (https://github.com/ochaloup/cct_module/blob/master/os-eap-probes/added/readinessProbe.sh#L9)? How will this impact the `openshift-migrate-common.sh` script? I mean, you still need to somehow determine that the app server is ready to serve the CLI commands that the function `runMigration()` uses. Will the probe script need to be run manually in the script, or how will that work?
Anyway, I'm happy to change the code in whatever way you consider best to follow the new practice.
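For illustration (a hypothetical helper, not part of the PR): the retry budget of the script's internal loop and of a container-level probe configuration is the same piece of arithmetic, so moving the settings keeps behaviour if the product stays constant:

```python
def retry_budget_seconds(attempts, delay_seconds):
    # The script loop spends COUNT * SLEEP seconds retrying before giving up;
    # a container probe effectively allows failureThreshold * periodSeconds.
    return attempts * delay_seconds

# the current script defaults (COUNT=30, SLEEP=5) give a 150 s budget;
# periodSeconds=5 with failureThreshold=30 would preserve that budget
```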
I don't think anything needs to change. I'm just pointing out that if we go this route, we need to ensure we preserve this type of functionality in the probes. There are other products that are looking at switching to configuring some of this on the pods, because it works better for their use case. That said, I think you could hard code those environment variables in your migration script to future proof yourself, should the defaults ever change (e.g. go to 1/1).
I see. I'm just now wondering whether it influences this verification of the server log content. I named it a probe, but it was not meant to be used for checking container readiness; it was meant to check whether the migration action can be processed. The migration pod itself is already marked as live and ready, but this check plays its role while the migration pod starts the app server which then processes the migration action. If the app container is not ready to run, or there is some problem in the server log, it does not mean the migration pod is marked as "not ready". It just means it tries to start the app server again in a while.
@@ -8,7 +8,7 @@ LOG=/tmp/readiness-log
 COUNT=30
 SLEEP=5
-DEBUG=false
+DEBUG=${SCRIPT_DEBUG:-false}
This seems out of scope, but innocuous.
I think this helps quite a bit during investigation. The `cct_module` defines the `SCRIPT_DEBUG` env variable to cause `set -x` to be set up. Here we get debug information from the Python readiness probe as well when the env variable is set. I used that with success while investigating errors.
@@ -0,0 +1,202 @@
source ${JBOSS_HOME}/bin/launch/openshift-node-name.sh
Shouldn't these be rolled into the existing partitionPV.sh script? My concern is that we're diverging and will have to start maintaining these scripts in parallel.
You mean the `os-eap-txn-partition` vs. `os-partition` module, right? That's what I was pretty unsure about how to do. The `os-partition` module is not used only for txn recovery but for message migration too, while my concern and testing was the txn recovery. If you think it's better, I can try to merge those two modules into one, but there will (probably?) still need to be two kinds of algorithms: the `split-lock` approach used in the messaging case, and the other for txn recovery. Or, if you prefer, I can try to remove the `split-lock` approach completely and replace it with the txn API query approach; I'm just not sure what issues could turn up.
What is your idea about the approach? One `os-partition` module based on the txn recovery algorithm?
I thought the two were delegating the recovery work and other app-specific details. This script is sourced by the product scripts, which provide implementations for `runMigration()`. I would expect the same paradigm to be used. I believe the basic script implements a strategy pattern (e.g. split the dir, add markers, cleanup, etc.), where certain steps have product-specific implementations (e.g. `runMigration()`). bash is a bit of a pita for this, but I think that's why we're sourcing the file: so the consumer can provide implementations for the stubs.
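A Python analogue of that source-plus-stubs strategy pattern (illustrative only; the real scripts are bash and all class and method names here are assumptions):

```python
class PartitionMigration:
    """Generic driver: split dir, add markers, cleanup -- steps are fixed,
    but the product-specific step is left as a stub for the consumer."""

    def run_migration(self, pod_dir):
        # equivalent of the runMigration() stub the product script must provide
        raise NotImplementedError('product script must implement this step')

    def migrate(self, pod_dirs):
        migrated = []
        for d in pod_dirs:
            self.run_migration(d)  # delegate the product-specific work
            migrated.append(d)     # then the generic cleanup/marker handling
        return migrated

class EapTxnMigration(PartitionMigration):
    def run_migration(self, pod_dir):
        # the EAP flavour would run transaction recovery against pod_dir here
        pass
```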
I see. I'm sorry, but I don't follow what should be changed in my case. I was thinking along similar lines (even if not specified as precisely as you did). There is a new module `os-eap-txn-partition` which provides the implementation of the migration itself. The root scripts (which call `runMigration()`) are untouched (...OK, that's not quite true, as I needed a way to provide slightly different parameters than originally intended; still, I tried not to touch any original behavior and just added a line to get my implementation working).
@rcernich would you be so kind as to give me a few pointers on what you expect to be done? Many thanks!
TOKEN_FILE_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token'
NAMESPACE_FILE_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/namespace'
CERT_FILE_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
STATUS_LIVING_PODS = ['Pending', 'Running', 'Unknown']
Do we need to be concerned with pods that might be bouncing, e.g. crash loop backoff? (I'm not sure what state those would be in, or if there would be anything to recover if that happened. Just asking.)
Yes, I think we need to be concerned, but I took the reference for the possible pod status values that can be received from the API from the Kubernetes docs: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-and-container-status
I could be wrong, but I hope this is OK.
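A sketch of how the `STATUS_LIVING_PODS` filter above could be applied to the JSON shape returned by `GET /api/v1/namespaces/{ns}/pods` (the helper name is an assumption; `items[].metadata.name` and `items[].status.phase` are the documented fields):

```python
STATUS_LIVING_PODS = ['Pending', 'Running', 'Unknown']

def living_pod_names(pod_list):
    # keep only pods whose phase means they may still be (or become) alive;
    # Succeeded/Failed pods are candidates for recovery of their data instead
    return [p['metadata']['name']
            for p in pod_list.get('items', [])
            if p['status']['phase'] in STATUS_LIVING_PODS]
```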
Sounds good. Just checking.
Force-pushed from 4252db2 to 5e9de06 (compare)
Hi @rcernich. I cooperate with @tremes on testing this feature, see https://issues.jboss.org/browse/CLOUD-2261. It's now approved from the testing point of view. Is there something more that holds this PR from being merged? Do you expect some further work on integrating a similar approach for the messaging?
I was hoping that the new split logic would be done in a generic way, such that it could be used for both EAP and AMQ like the current logic. What's generic here is splitting a "data" directory for a cluster and cleaning up the partitions once its owner has been terminated. Is it no longer possible to use the same mechanism for this?
Force-pushed from 5e9de06 to f30c044 (compare)
Hi @rcernich, I've finally changed my PR for all tenants of the
Structurally, this looks good to me.
Force-pushed from 95ccff6 to 8145f43 (compare)
I haven't done any structural change. I just fixed function
This set of changes doesn't work with AMQ. The build fails because JBOSS_HOME is not set. Also, partitionPV.sh makes use of openshift-node-name.sh, which is placed in JBOSS_HOME/bin/launch, which would not be available in an AMQ image. I'll see if I can rework things to get something built.
@rcernich I see, that's my omission. Please let me know whether I should start digging into this too. Thanks.
I'm working through the builds and the unit tests. I've got amq63 built and am working through test failures with eap64. Haven't tried to build eap71 yet.
Force-pushed from f324ca2 to fa29ac3 (compare)
Force-pushed from a5d7d1b to 6d4dafa (compare)
Force-pushed from 6d4dafa to 909af50 (compare)
As agreed with @rcernich, I updated the PR to use two migration approaches (not a single one merged for amq/jdg and eap). The EAP crash recovery uses the query API approach, where directories are named by pod name. The amq/jdg images are expected to use the flock solution, where directories are named with numbers like split-#.
…plit functionality
Force-pushed from 909af50 to b918476 (compare)
Looks okay to me.
https://issues.jboss.org/browse/CLOUD-2261
Hopefully a non-invasive addition of the txn recovery mechanism, based on the algorithm presented at https://docs.google.com/document/d/1JdOTMP6pdexJ__KYlw9hgwR6DOCHTtYCQy7PqyMk8OU/edit, which addresses the issue mentioned in CLOUD-2261.
/cc @rcernich