HA upgrades failing #1589
Comments
You don't actually have a database pod in the list. Did the operator fail to create it? Please attach proper logs, not snippets that leave out the relevant information.

klogs attached

Looking at database-0's kube.log:

Looking at the events:

```yaml
- apiVersion: v1
  count: 405
  firstTimestamp: "2020-11-13T22:20:56Z"
  lastTimestamp: "2020-11-14T00:30:45Z"
  involvedObject:
    fieldPath: spec.containers{cc-deployment-updater-cc-deployment-updater}
  message: Readiness probe failed: false
- apiVersion: v1
  count: 586
  firstTimestamp: "2020-11-13T22:49:17Z"
  lastTimestamp: "2020-11-14T00:27:44Z"
  involvedObject:
    fieldPath: spec.containers{cc-deployment-updater-cc-deployment-updater}
  message: Readiness probe failed: false
```

So both … The last entry in … is:

```json
{"timestamp":"2020-11-13T22:23:25.734779572Z","message":"database schema is as new or newer than locally available migrations","log_level":"info","source":"cc.db.wait_until_current","data":{},"thread_id":47384045602280,"fiber_id":47384105923720,"process_id":11,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/db_migrator.rb","lineno":43,"method":"wait_for_migrations!"}
```

That was a good two hours earlier (!), reporting that no schema migration was needed and everything was good to go. This was around when the container was restarted, according to its status:

```
cc-deployment-updater-cc-deployment-updater:
    State:          Running
      Started:      Fri, 13 Nov 2020 14:23:06 -0800
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 13 Nov 2020 12:56:45 -0800
      Finished:     Fri, 13 Nov 2020 14:22:21 -0800
    Ready:          False
    Restart Count:  1
    Liveness:       exec […] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      exec […] delay=0s timeout=1s period=10s #success=1 #failure=3
```

Looking at the images being used by the pods: …

So there is no issue with migrations here: even though we're changing image repositories, all images involved use the same BOSH release version. Also, both scheduler pods have already been upgraded. Since the containers haven't been restarted, the liveness probes must be passing; and since it did report that it didn't need migrations, it's outputting to the expected file… Looking at the start of …:

```json
{"timestamp":"2020-11-13T20:59:37.126795950Z","message":"database schema is as new or newer than locally available migrations","log_level":"info","source":"cc.db.wait_until_current","data":{},"thread_id":47405084435960,"fiber_id":47405144960920,"process_id":14,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/db_migrator.rb","lineno":43,"method":"wait_for_migrations!"}
{"timestamp":"2020-11-13T22:22:35.226469718Z","message":"run-deployment-update","log_level":"info","source":"cc.deployment_updater.update","data":{},"thread_id":47405084435960,"fiber_id":47405144960920,"process_id":14,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/dispatcher.rb","lineno":10,"method":"dispatch"}
```

So somehow there was a 1.5 hour delay between migration and running… But …:

```json
{"timestamp":"2020-11-13T22:22:19.326170349Z","message":"cc.deployment_updater","log_level":"error","source":"cc.deployment_updater","data":{"error":"Sequel::DatabaseDisconnectError","error_message":"Mysql2::Error::ConnectionError: MySQL server has gone away","backtrace":"…"},"thread_id":47056957613560,"fiber_id":47057017693360,"process_id":13,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/scheduler.rb","lineno":66,"method":"rescue in with_error_logging"}
{"timestamp":"2020-11-13T22:23:25.734779572Z","message":"database schema is as new or newer than locally available migrations","log_level":"info","source":"cc.db.wait_until_current","data":{},"thread_id":47384045602280,"fiber_id":47384105923720,"process_id":11,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/db_migrator.rb","lineno":43,"method":"wait_for_migrations!"}
```

So it's probably just doing active/passive? That raises questions, like: do HA deployments (without the upgrade) even do the right thing? And how did we get wedged such that both sides ended up unready?

Looks like a bug to investigate. We'll put this at the top of the Icebox for upcoming planning; if more urgency is required, it goes to the top of To Do for someone to work on right away.

Please provide a session transcript, i.e. the commands/steps you ran, for both the base deployment and the upgrade.

@andreas-kupries |
I wish the mail notifications were faster. ...
- yes
- yes, catapult
- yes
- yes, you will have to change them accordingly
- No, I used the GKE DNS service for this. You can use anything.

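On the DNS point: whatever DNS service is used, it just needs a record pointing at the router's load-balancer address. A minimal sketch of looking that address up, assuming the usual `router-public` service in the `kubecf` namespace:

```sh
# Illustrative only: the service and namespace names are assumptions.
kubectl get svc router-public -n kubecf \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```
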
Confirmed with GKE ... The update started at 14:38 and the system settled at 15:10 (0:32 later) with non-rotated pods: …

After using catapult to deploy a GKE cluster, the deployment and upgrade commands were:

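For orientation only, a kubecf install followed by an in-place upgrade generally looks something like the sketch below; the chart archive names, release names, namespaces, and values file are assumptions, not the commands that were actually run here:

```sh
# Illustrative sketch only; none of these paths or names are taken from this report.
helm install cf-operator cf-operator.tgz --namespace cf-operator --create-namespace
helm install kubecf kubecf_release.tgz --namespace kubecf --create-namespace --values values.yaml
# ...wait for the deployment to settle, then upgrade in place with the newer charts...
helm upgrade cf-operator cf-operator-newer.tgz --namespace cf-operator
helm upgrade kubecf kubecf_release-newer.tgz --namespace kubecf --values values.yaml
```
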
Sigh ... After a diversion through … Ah damn ... The two kubecf log pulls have different structures ... I may have used different klog scripts for them (I tried to avoid that, and thought I had 💥). That makes them hard to compare.

Version Changes
Commands for reproduction
Results
Investigation

Looking at the …: see the table below. Note that the Ids were all copy-pasted out of …

The same seems to be true for all the jobs in the pod. The name of the image used is the same as well, without any kind of changes. Note that the names were all copy-pasted out of …

While I see the …

The …

For all other jobs, the …

Now, conversely, looking at one of the pods which were rotated by the operator (…):

So the operator rotated the pod. And I see this switch in … The exceptions are exactly the two …

Conclusions so far

I posit that: …

Actions

Investigate more where/how the …

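As an aside, the image/Id comparison above can also be regenerated straight from a live cluster rather than from klog output; a minimal sketch, assuming the `kubecf` namespace:

```sh
# Illustrative only: the namespace is an assumption. Lists each pod with the images
# its containers were configured with and the image IDs they actually resolved to.
kubectl get pods -n kubecf \
  -o custom-columns='POD:.metadata.name,IMAGES:.spec.containers[*].image,IMAGEIDS:.status.containerStatuses[*].imageID'
```
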
And now I wonder how much of the …

Doing a non-upgrade minikube deployment (everything non-HA, except scaling the scheduler to two instances) shows that the scheduler readiness probe is busted: with an HA scheduler, only one of the two pods has the … I believe we need to, at least, do: …

Alternatively, we drop the readiness probe; however, that means there is a chance we will be unable to detect when all instances go down. Another alternative is to drop the …

@viovanov Removing the …

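A quick way to see which probes are actually configured on the scheduler containers (and hence which exec scripts gate readiness) is to dump them from the live pod spec; a minimal sketch, assuming the `kubecf` namespace:

```sh
# Illustrative only: the namespace is an assumption. Prints each container name
# followed by its configured readiness probe (empty if none is set).
kubectl get pod scheduler-0 -n kubecf \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.readinessProbe}{"\n"}{end}'
```
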
@mook-as yes, please remove. That seems like the safer option.

Describe the bug
HA upgrades from CAP 2.1.0 to kubecf master are failing:
To Reproduce
1. HA deployment of CAP 2.1.0 with diego
2. Upgrade to kubecf-bundle-v0.0.0-20201110150701.2007.gbfe922e3
Expected behavior
scheduler-0 gets stuck at 12/13
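To narrow down which of the 13 containers is the one holding the pod at 12/13, the per-container ready flags can be listed; a minimal sketch, assuming the `kubecf` namespace:

```sh
# Illustrative only: the namespace is an assumption. Prints each container name
# together with its current ready flag.
kubectl get pod scheduler-0 -n kubecf \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'
```
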
status

kubectl logs -n kubecf scheduler-0 logs
pods which don't get rotated:
klogs: https://kubecf-klog.s3-us-west-2.amazonaws.com/klog-kubecf1589.tar.gz

Environment
GKE
Values.yaml