HA upgrades failing #1589

Closed
prabalsharma opened this issue Nov 16, 2020 · 16 comments · Fixed by #1632 or #1637

@prabalsharma
Collaborator

prabalsharma commented Nov 16, 2020

Describe the bug
HA upgrades from CAP 2.1.0 to kubecf master are failing:

  1. Not all pods come back up running.
  2. Not all pods get rotated.

To Reproduce
HA deployment of CAP 2.1.0 with diego
Upgrade to kubecf-bundle-v0.0.0-20201110150701.2007.gbfe922e3

Observed behavior
scheduler-0 gets stuck at 12/13 status

kubectl logs -n kubecf scheduler-0 logs

2020/11/13 22:20:10, /var/vcap/sys/log/cloud_controller_clock/cloud_controller_clock.log, {"timestamp":"2020-11-13T22:20:10.011119363Z","message":"Mysql2::Error::ConnectionError: MySQL server has gone away (Sequel::DatabaseDisconnectError)","log_level":"error","source":"cc.clock","data":{},"thread_id":70075352282900,"fiber_id":70075352279700,"process_id":13,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/clock/scheduler.rb","lineno":43,"method":"block in start"}
2020/11/13 22:22:19, /var/vcap/sys/log/cc_deployment_updater/cc_deployment_updater.log, {"timestamp":"2020-11-13T22:22:19.323129386Z","message":"run-deployment-update","log_level":"info","source":"cc.deployment_updater.update","data":{},"thread_id":47056957613560,"fiber_id":47057017693360,"process_id":13,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/dispatcher.rb","lineno":10,"method":"dispatch"}
2020/11/13 22:22:19, /var/vcap/sys/log/cc_deployment_updater/cc_deployment_updater.log, {"timestamp":"2020-11-13T22:22:19.326170349Z","message":"cc.deployment_updater","log_level":"error","source":"cc.deployment_updater","data":{"error":"Sequel::DatabaseDisconnectError","error_message":"Mysql2::Error::ConnectionError: MySQL server has gone away","backtrace":"/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/mysql2-0.5.3/lib/mysql2/client.rb:131:in `_query'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/mysql2-0.5.3/lib/mysql2/client.rb:131:in `block in query'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/mysql2-0.5.3/lib/mysql2/client.rb:130:in `handle_interrupt'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/mysql2-0.5.3/lib/mysql2/client.rb:130:in `query'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/adapters/mysql2.rb:136:in `block in _execute'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/database/logging.rb:38:in `log_connection_yield'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/adapters/mysql2.rb:131:in `_execute'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/adapters/utils/mysql_mysql2.rb:39:in `block in execute'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/connection_pool/threaded.rb:92:in `hold'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/database/connecting.rb:270:in `synchronize'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/adapters/utils/mysql_mysql2.rb:39:in `execute'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/dataset/actions.rb:1089:in `execute'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/adapters/mysql2.rb:276:in `execute'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/adapters/mysql2.rb:236:in `fetch_rows'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/sequel-5.35.0/lib/sequel/dataset/actions.rb:150:in `each'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/dispatcher.rb:40:in `finalize_degenerate_deployments'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/dispatcher.rb:12:in `dispatch'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/scheduler.rb:47:in `block in update'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/statsd-ruby-1.4.0/lib/statsd.rb:412:in `time'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/scheduler.rb:46:in `update'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/scheduler.rb:14:in `block (2 levels) in start'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/locket/lock_worker.rb:11:in `block in acquire_lock_and_repeatedly_call'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/locket/lock_worker.rb:9:in `loop'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/locket/lock_worker.rb:9:in 
`acquire_lock_and_repeatedly_call'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/scheduler.rb:36:in `block in start'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/scheduler.rb:62:in `with_error_logging'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/scheduler.rb:10:in `start'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/tasks/deployment_updater.rake:9:in `block (2 levels) in <top (required)>'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/task.rb:281:in `block in execute'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/task.rb:281:in `each'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/task.rb:281:in `execute'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/task.rb:219:in `block in invoke_with_call_chain'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/2.5.0/monitor.rb:226:in `mon_synchronize'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/task.rb:199:in `invoke_with_call_chain'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/task.rb:188:in `invoke'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/application.rb:160:in `invoke_task'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/application.rb:116:in `block (2 levels) in top_level'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/application.rb:116:in `each'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/application.rb:116:in `block in top_level'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/application.rb:125:in `run_with_threads'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/application.rb:110:in `top_level'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/application.rb:83:in `block in run'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/application.rb:186:in `standard_exception_handling'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/lib/rake/application.rb:80:in `run'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/gems/rake-13.0.1/exe/rake:27:in `<top (required)>'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/bin/rake:23:in `load'\n/var/vcap/packages/cloud_controller_ng/gem_home/ruby/2.5.0/bin/rake:23:in `<top (required)>'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/cli/exec.rb:74:in `load'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/cli/exec.rb:74:in `kernel_load'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/cli/exec.rb:28:in `run'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/cli.rb:424:in `exec'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/vendor/thor/lib/thor/command.rb:27:in `run'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/vendor/thor/lib/thor/invocation.rb:126:in `invoke_command'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/vendor/thor/lib/thor.rb:387:in 
`dispatch'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/cli.rb:27:in `dispatch'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/vendor/thor/lib/thor/base.rb:466:in `start'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/cli.rb:18:in `start'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/bin/bundle:30:in `block in <main>'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/lib/ruby/site_ruby/2.5.0/bundler/friendly_errors.rb:124:in `with_friendly_errors'\n/var/vcap/packages/ruby-2.5.5-r0.10.0/bin/bundle:22:in `<main>'"},"thread_id":47056957613560,"fiber_id":47057017693360,"process_id":13,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/scheduler.rb","lineno":66,"method":"rescue in with_error_logging"}
2020/11/13 22:23:25, /var/vcap/sys/log/cc_deployment_updater/cc_deployment_updater.log, {"timestamp":"2020-11-13T22:23:25.734779572Z","message":"database schema is as new or newer than locally available migrations","log_level":"info","source":"cc.db.wait_until_current","data":{},"thread_id":47384045602280,"fiber_id":47384105923720,"process_id":11,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/db_migrator.rb","lineno":43,"method":"wait_for_migrations!"}

Pods which don't get rotated:

kubecf        api-0                                                             17/17   Running           2          124m
kubecf        auctioneer-0                                                      6/6     Running           4          124m
kubecf        cc-worker-0                                                       6/6     Running           0          124m
kubecf        credhub-0                                                         8/8     Running           0          124m
kubecf        diego-cell-0                                                      12/12   Running           2          124m
kubecf        doppler-0                                                         6/6     Running           0          124m
kubecf        doppler-1                                                         6/6     Running           0          124m
kubecf        log-api-0                                                         9/9     Running           0          124m
kubecf        router-0                                                          7/7     Running           0          124m
kubecf        routing-api-0                                                     6/6     Running           2          124m
kubecf        scheduler-0                                                       12/13   Running           8          124m
kubecf        scheduler-1                                                       12/13   Running           6          117m
kubecf        tcp-router-0                                                      7/7     Running           0          124m
kubecf        uaa-0                                                             9/9     Running           0          124m

klogs: https://kubecf-klog.s3-us-west-2.amazonaws.com/klog-kubecf1589.tar.gz

Environment
GKE

Values.yaml

system_domain: prabal-gke.kubecf.charmedquarks.me
# install_stacks:
# - sle15
# - cflinuxfs3
features:
  eirini:
    enabled: false
  autoscaler:
    enabled: false
high_availability: true
services:
  router:
    annotations: {"external-dns.alpha.kubernetes.io/hostname": "prabal-gke.kubecf.charmedquarks.me, *.prabal-gke.kubecf.charmedquarks.me"}
    type: LoadBalancer
    externalIPs: []
    clusterIP: ~
    loadBalancerIP: ~
  ssh-proxy:
    annotations: {"external-dns.alpha.kubernetes.io/hostname": "ssh.prabal-gke.kubecf.charmedquarks.me"}
    type: LoadBalancer
    externalIPs: []
    clusterIP: ~
    loadBalancerIP: ~
  tcp-router:
    annotations: {"external-dns.alpha.kubernetes.io/hostname": "*.tcp.prabal-gke.kubecf.charmedquarks.me, tcp.prabal-gke.kubecf.charmedquarks.me"}
    type: LoadBalancer
    externalIPs: []
    clusterIP: ~
    loadBalancerIP: ~
    port_range:
      start: 20000
      end: 20008
testing:
  brain_tests:
    enabled: true
  cf_acceptance_tests:
    enabled: true
  smoke_tests:
    enabled: true
  sync_integration_tests:
    enabled: true
properties:
  acceptance-tests:
    acceptance-tests:
      acceptance_tests:
        timeout_scale: 3
        ginkgo:
          slow_spec_threshold: 300
          extra_flags: ''
          nodes: 10
          flake_attempts: 3
  brain-tests:
    acceptance-tests-brain:
      acceptance_tests_brain:
        verbose: 'false'
        in_order: 'false'
        include: ''
        exclude: ''
credentials:
  cf_admin_password: #####
# sizing:
#   diego_cell:
#     ephemeral_disk:
#       size: 30000

@mook-as
Contributor

mook-as commented Nov 16, 2020

You don't actually have a database pod in the list. Did the operator fail to create it?

Please attach proper logs, and not the snippets that don't include the relevant information.

@prabalsharma
Collaborator Author

You don't actually have a database pod in the list. Did the operator fail to create it?

Please attach proper logs, and not the snippets that don't include the relevant information.

klogs attached

@prabalsharma
Collaborator Author

Looking at database-0 kube.log:

2020-11-13T22:20:59.982623Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2020-11-13T22:20:59.982629Z 0 [Note] InnoDB: Using Linux native AIO
2020-11-13T22:20:59.983035Z 0 [Note] InnoDB: Number of pools: 1
2020-11-13T22:20:59.983213Z 0 [Note] InnoDB: Using CPU crc32 instructions
2020-11-13T22:20:59.987002Z 0 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2020-11-13T22:20:59.998358Z 0 [Note] InnoDB: Completed initialization of buffer pool
2020-11-13T22:21:00.015301Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2020-11-13T22:21:00.033911Z 0 [Note] InnoDB: Crash recovery did not find the parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite
2020-11-13T22:21:00.041002Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2020-11-13T22:21:00.141227Z 0 [Note] InnoDB: Created parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite, size 3932160 bytes
2020-11-13T22:21:00.487870Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2020-11-13T22:21:00.488111Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2020-11-13T22:21:00.513931Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2020-11-13T22:21:00.514871Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2020-11-13T22:21:00.514950Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2020-11-13T22:21:00.515404Z 0 [Note] InnoDB: Waiting for purge to start
2020-11-13T22:21:00.566218Z 0 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.7.30-33 started; log sequence number 65872273
2020-11-13T22:21:00.566538Z 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2020-11-13T22:21:00.566827Z 0 [Note] Plugin 'FEDERATED' is disabled.
2020-11-13T22:21:00.677076Z 0 [Note] Skipping generation of SSL certificates as options related to SSL are specified.
2020-11-13T22:21:00.677950Z 0 [Warning] CA certificate /etc/mysql/tls/certs/ca is self signed.
2020-11-13T22:21:00.678012Z 0 [Note] Skipping generation of RSA key pair as key files are present in data directory.
2020-11-13T22:21:00.679368Z 0 [Note] Server hostname (bind-address): '0.0.0.0'; port: 3306
2020-11-13T22:21:00.679405Z 0 [Note]   - '0.0.0.0' resolves to '0.0.0.0';
2020-11-13T22:21:00.679460Z 0 [Note] Server socket created on IP: '0.0.0.0'.
2020-11-13T22:21:00.743564Z 0 [Note] Failed to start slave threads for channel ''
2020-11-13T22:21:00.811669Z 0 [Note] Event Scheduler: Loaded 0 events
2020-11-13T22:21:00.815963Z 0 [Note] mysqld: ready for connections.
Version: '5.7.30-33-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL), wsrep_31.43
2020-11-13T22:21:00.817531Z 1 [Note] WSREP: Initialized wsrep sidno 2
2020-11-13T22:21:00.817565Z 1 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 1 -> 1)
2020-11-13T22:21:00.817591Z 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-11-13T22:21:00.817626Z 1 [Note] WSREP: Assign initial position for certification: 32215, protocol version: 4
2020-11-13T22:21:00.817689Z 0 [Note] WSREP: Service thread queue flushed.
2020-11-13T22:21:00.817816Z 1 [Note] WSREP: GCache history reset: 8524a914-25f1-11eb-b908-963e052166e3:0 -> 8524a914-25f1-11eb-b908-963e052166e3:32215
2020-11-13T22:21:00.820719Z 1 [Note] WSREP: Synchronized with group, ready for connections
2020-11-13T22:21:00.820748Z 1 [Note] WSREP: Setting wsrep_ready to true
2020-11-13T22:21:00.820757Z 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-11-13T22:21:01.119788Z 0 [Note] InnoDB: Buffer pool(s) load completed at 201113 22:21:01
2020-11-13T22:21:31.087282Z 12 [Warning] IP address '10.8.3.25' could not be resolved: Name or service not known
2020-11-13T22:21:41.423742Z 17 [Warning] IP address '10.8.2.35' could not be resolved: Name or service not known
2020-11-13T22:21:41.797912Z 18 [Warning] IP address '10.8.1.31' could not be resolved: Name or service not known
2020-11-13T22:21:49.848305Z 28 [Warning] IP address '10.8.3.28' could not be resolved: Name or service not known
2020-11-13T22:22:19.767044Z 42 [Warning] IP address '10.8.1.30' could not be resolved: Name or service not known
2020-11-13T22:22:30.147006Z 46 [Warning] IP address '10.8.2.33' could not be resolved: Name or service not known
2020-11-13T22:22:31.775871Z 53 [Warning] IP address '10.8.2.27' could not be resolved: Name or service not known
2020-11-13T22:22:34.253088Z 54 [Warning] IP address '10.8.3.31' could not be resolved: Name or service not known
2020-11-13T22:22:36.331293Z 58 [Warning] IP address '10.8.1.32' could not be resolved: Name or service not known
2020-11-13T22:23:25.609294Z 74 [Warning] IP address '10.8.2.32' could not be resolved: Name or service not known
2020-11-13T22:23:38.050670Z 13 [Note] Aborted connection 13 to db: 'routing-api' user: 'routing-api' host: 'routing-api-0.routing-api.kubecf.svc.cluster.local' (Got an error reading communication packets)
2020-11-13T22:36:48.087337Z 22 [Note] Aborted connection 22 to db: 'credhub' user: 'credhub' host: '10.8.1.31' (Got an error reading communication packets)
2020-11-13T22:36:48.087411Z 27 [Note] Aborted connection 27 to db: 'credhub' user: 'credhub' host: '10.8.1.31' (Got an error reading communication packets)
2020-11-13T22:36:48.087604Z 18 [Note] Aborted connection 18 to db: 'credhub' user: 'credhub' host: '10.8.1.31' (Got an error reading communication packets)
2020-11-13T22:36:48.087686Z 26 [Note] Aborted connection 26 to db: 'credhub' user: 'credhub' host: '10.8.1.31' (Got an error reading communication packets)
2020-11-13T22:36:48.087729Z 24 [Note] Aborted connection 24 to db: 'credhub' user: 'credhub' host: '10.8.1.31' (Got an error reading communication packets)
2020-11-13T22:36:48.087697Z 21 [Note] Aborted connection 21 to db: 'credhub' user: 'credhub' host: '10.8.1.31' (Got an error reading communication packets)
2020-11-13T22:36:48.087828Z 23 [Note] Aborted connection 23 to db: 'credhub' user: 'credhub' host: '10.8.1.31' (Got an error reading communication packets)
2020-11-13T22:36:48.087358Z 20 [Note] Aborted connection 20 to db: 'credhub' user: 'credhub' host: '10.8.1.31' (Got an error reading communication packets)
2020-11-13T22:36:48.088132Z 25 [Note] Aborted connection 25 to db: 'credhub' user: 'credhub' host: '10.8.1.31' (Got an error reading communication packets)
2020-11-13T22:36:48.088328Z 19 [Note] Aborted connection 19 to db: 'credhub' user: 'credhub' host: '10.8.1.31' (Got an error reading communication packets)
2020-11-13T22:36:49.805773Z 59 [Note] Aborted connection 59 to db: 'routing-api' user: 'routing-api' host: '10.8.1.32' (Got an error reading communication packets)
2020-11-13T22:36:49.806405Z 58 [Note] Aborted connection 58 to db: 'routing-api' user: 'routing-api' host: '10.8.1.32' (Got an error reading communication packets)
2020-11-13T22:36:53.802042Z 44 [Note] Aborted connection 44 to db: 'network_connectivity' user: 'network_connectivity' host: '10.8.2.35' (Got an error reading communication packets)
2020-11-13T22:36:58.842646Z 14 [Note] Aborted connection 14 to db: 'uaa' user: 'uaa' host: 'uaa-1.uaa.kubecf.svc.cluster.local' (Got an error reading communication packets)
2020-11-13T22:47:41.836172Z 362 [Warning] IP address '10.8.1.44' could not be resolved: Name or service not known
2020-11-13T22:48:13.214943Z 370 [Warning] IP address '10.8.1.46' could not be resolved: Name or service not known
2020-11-13T22:49:06.131204Z 385 [Warning] IP address '10.8.3.36' could not be resolved: Name or service not known
2020-11-13T22:49:08.441896Z 45 [Note] Aborted connection 45 to db: 'network_connectivity' user: 'network_connectivity' host: '10.8.3.25' (Got an error reading communication packets)
2020-11-13T22:49:33.965300Z 86 [Note] Aborted connection 86 to db: 'routing-api' user: 'routing-api' host: 'routing-api-0.routing-api.kubecf.svc.cluster.local' (Got an error reading communication packets)
2020-11-13T22:49:33.967074Z 226 [Note] Aborted connection 226 to db: 'routing-api' user: 'routing-api' host: 'routing-api-0.routing-api.kubecf.svc.cluster.local' (Got an error reading communication packets)
2020-11-13T22:50:09.844550Z 405 [Warning] IP address '10.8.2.41' could not be resolved: Name or service not known
2020-11-13T22:53:19.879390Z 463 [Warning] IP address '10.8.1.48' could not be resolved: Name or service not known
2020-11-13T22:53:24.622164Z 466 [Warning] IP address '10.8.2.42' could not be resolved: Name or service not known
2020-11-13T22:54:07.739257Z 480 [Warning] IP address '10.8.1.43' could not be resolved: Name or service not known

@mook-as
Contributor

mook-as commented Nov 16, 2020

Looking at events.yaml:

- apiVersion: v1
  count: 405
  firstTimestamp: "2020-11-13T22:20:56Z"
  lastTimestamp: "2020-11-14T00:30:45Z"
  involvedObject:
    fieldPath: spec.containers{cc-deployment-updater-cc-deployment-updater}
  message: Readiness probe failed: false
- apiVersion: v1
  count: 586
  firstTimestamp: "2020-11-13T22:49:17Z"
  lastTimestamp: "2020-11-14T00:27:44Z"
  involvedObject:
    fieldPath: spec.containers{cc-deployment-updater-cc-deployment-updater}
  message: Readiness probe failed: false

So both cc-deployment-updater containers are not ready, and the logs were taken around 00:30 UTC.
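
For reference, the same readiness-probe events can be pulled straight from the cluster instead of out of events.yaml in the klog archive; a minimal sketch, assuming the events belong to the scheduler-0 pod in the kubecf namespace:

    kubectl get events -n kubecf \
      --field-selector involvedObject.name=scheduler-0 \
      --sort-by=.lastTimestamp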

The last entry in scheduler-0/cc-deployment-updater is:

{"timestamp":"2020-11-13T22:23:25.734779572Z","message":"database schema is as new or newer than locally available migrations","log_level":"info","source":"cc.db.wait_until_current","data":{},"thread_id":47384045602280,"fiber_id":47384105923720,"process_id":11,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/db_migrator.rb","lineno":43,"method":"wait_for_migrations!"}

That was a good two hours earlier (!), reporting that no schema migration was needed and everything was good to go. This was around when the container was restarted, according to its describe-pod.txt:

  cc-deployment-updater-cc-deployment-updater:
    State:          Running
      Started:      Fri, 13 Nov 2020 14:23:06 -0800
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 13 Nov 2020 12:56:45 -0800
      Finished:     Fri, 13 Nov 2020 14:22:21 -0800
    Ready:          False
    Restart Count:  1
    Liveness:   exec […] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec […] delay=0s timeout=1s period=10s #success=1 #failure=3

Looking at the images being used by the pods:

pod                    image
api-1                  docker.io/cfcontainerization/capi:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.98.0
cc-worker-1            docker.io/cfcontainerization/capi:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.98.0
singleton-blobstore-0  docker.io/cfcontainerization/capi:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.98.0
api-0                  registry.suse.com/cap/capi:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.98.0
cc-worker-0            registry.suse.com/cap/capi:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.98.0
scheduler-0            registry.suse.com/cap/capi:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.98.0
scheduler-1            registry.suse.com/cap/capi:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.98.0

So there is no issue with migrations here: even though we're changing image repositories, all images involved are using the same BOSH release version. Also, both scheduler pods have already been upgraded.
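
(As an aside: instead of digging through describe-pod.txt in the klog archive, a per-pod image listing can be pulled live from the cluster; a sketch, assuming access to the kubecf namespace:)

    kubectl get pods -n kubecf \
      -o custom-columns='POD:.metadata.name,IMAGES:.spec.containers[*].image'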

Since the containers haven't been restarted, the liveness probes must be passing; and since it did report that it didn't need migrations, it's outputting to the expected file…

Looking at the start of scheduler-1's logs, we see:

{"timestamp":"2020-11-13T20:59:37.126795950Z","message":"database schema is as new or newer than locally available migrations","log_level":"info","source":"cc.db.wait_until_current","data":{},"thread_id
":47405084435960,"fiber_id":47405144960920,"process_id":14,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/db_migrator.rb","lineno":43,"method":"wait_for_migratio
ns!"}
{"timestamp":"2020-11-13T22:22:35.226469718Z","message":"run-deployment-update","log_level":"info","source":"cc.deployment_updater.update","data":{},"thread_id":47405084435960,"fiber_id":47405144960920,
"process_id":14,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/dispatcher.rb","lineno":10,"method":"dispatch"}

So somehow there was a 1.5 hour delay between migration and running… But 2020-11-13T22:22:35.226469718Z is around the time that scheduler-0 got disconnected when it restarted (this is the end of the logs):

{"timestamp":"2020-11-13T22:22:19.326170349Z","message":"cc.deployment_updater","log_level":"error","source":"cc.deployment_updater","data":{"error":"Sequel::DatabaseDisconnectError","error_message":"Mysql2::Error::ConnectionError: MySQL server has gone away","backtrace":""},"thread_id":47056957613560,"fiber_id":47057017693360,"process_id":13,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/deployment_updater/scheduler.rb","lineno":66,"method":"rescue in with_error_logging"}
{"timestamp":"2020-11-13T22:23:25.734779572Z","message":"database schema is as new or newer than locally available migrations","log_level":"info","source":"cc.db.wait_until_current","data":{},"thread_id":47384045602280,"fiber_id":47384105923720,"process_id":11,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/db_migrator.rb","lineno":43,"method":"wait_for_migrations!"}

So it's probably just doing active/passive? That raises questions, like: do HA deployments (without the upgrade) even do the right thing? And how did we get wedged such that both sides ended up unready?

@gaktive
Collaborator

gaktive commented Nov 16, 2020

Looks like a bug to investigate. We'll put this at the top of the Icebox for upcoming planning; if more urgency is required, it goes to the top of To Do for someone to work on right away.

@gaktive gaktive removed their assignment Nov 16, 2020
@andreas-kupries
Contributor

@prabalsharma

Please provide me with a session transcript, i.e. the commands/steps you ran, for both the base deployment and the upgrade.
I will also need the scripts and data files (like charts) you used (links/locations are fine).

@andreas-kupries andreas-kupries self-assigned this Nov 18, 2020

@prabalsharma
Collaborator Author

prabalsharma commented Nov 18, 2020

@andreas-kupries
It's CAP 2.1.0 to master build kubecf-bundle-v0.0.0-20201110150701.2007.gbfe922e3. You can find master builds in the kubecf S3 bucket.
I just did a helm install with the above-mentioned values.yaml for kubecf and then an upgrade using the same values.yaml.

@andreas-kupries
Contributor

I wish the mail notifications were faster. ...

@prabalsharma

  • Ok. helm install of the two things above, to some GKE cluster.
  • What did you use to deploy the GKE cluster? Catapult?
  • I note system_domain: prabal-gke.kubecf.charmedquarks.me in the values.yaml.
    That has to be changed to my own domain, right? Ditto for the related entries (service annotations).
    Is the DNS done via our route53 domains?

@prabalsharma
Collaborator Author

prabalsharma commented Nov 18, 2020

I wish the mail notifications were faster. ...

@prabalsharma

  • Ok. helm install of the two things above, to some GKE cluster.

yes

  • What did you use to deploy the GKE cluster? Catapult?

yes, catapult

  • I note system_domain: prabal-gke.kubecf.charmedquarks.me in the values.yaml

yes

That has to be changed to my own domain, right? Ditto for the related entries (service annotations).

yes, you will have to change them accordingly

Is the DNS done via our route53 domains?

No, I used the GKE DNS service for this. You can use anything.

@andreas-kupries
Contributor

andreas-kupries commented Nov 19, 2020

Confirmed with GKE ...

The update started at 14:38; the system settled at 15:10 (0:32 later) with scheduler-* not rotated and stuck at 12/13.
Contrary to the initial description, I only see three other pods not being rotated (alive for longer than 32 min):
these are diego-cell-0 and the bosh-dns-* pods.
Note that the other two diego-cell pods (1, 2) are rotated and green at 12/12, fully functional on that point.

Non-rotated pods:

NAME                                     READY   STATUS      RESTARTS   AGE
bosh-dns-5bb6c47fb7-8strn                1/1     Running     0          44m
bosh-dns-5bb6c47fb7-qks6s                1/1     Running     0          45m
diego-cell-0                             12/12   Running     2          165m
scheduler-0                              12/13   Running     8          165m
scheduler-1                              12/13   Running     9          156m
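
The listing above looks like plain kubectl get pods output; sorting by start time (a sketch, assuming the kubecf namespace) makes the non-rotated pods stand out immediately by age:

    kubectl get pods -n kubecf --sort-by=.status.startTime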

After using catapult to deploy a GKE cluster, the deployment and upgrade commands were:

   k create namespace cfo
   helm install -n cfo cfo suse/cf-operator --set "global.singleNamespace.name=kubecf"
   helm install -n kubecf kubecf suse/kubecf --version 2.5.8 --values ../../EXT/values.yaml

   ../../kubecf-1/dev/kube/klog.sh
   rm -rf ~/klog/kubecf/
   mv klog.tar.gz klog-pre.tar.gz 

  # The tarballs for the operator and kubecf charts were created in the kubecf-1 directory (a kubecf checkout), via
  # `make kubecf-bundle`. The relevant commit was bfe922e3d29d4119c40e394c0ab66fcb0825c9f0 (matching the original).

   helm upgrade -n cfo cfo ../../kubecf-1/output/cf-operator.tgz --set "global.singleNamespace.name=kubecf"
   helm upgrade -n kubecf  kubecf ../../kubecf-1/output/kubecf_release.tgz  --values ../../EXT/values.yaml

@andreas-kupries
Contributor

Sigh ... After a diversion through cc-clock and tps-watcher, where I thought that these were failing and restarting all the time (they failed and restarted, but only a few times, and recovered), I confirm that the cc-deployment-updater is not ready on my system as well. It is the only job that stays not ready.

Ah damn ... The two kubecf log pulls have different structures ... I may have used different klog scripts for the two (I tried to avoid that, and thought I had 💥). That makes them hard to compare.

@andreas-kupries
Contributor

Version Changes

Chart             Setup                Upgraded to
suse/cf-operator  4.5.13+0.gd4738712   6.1.17+0.gec409fd7
suse/kubecf       2.5.8(-rc3)          0.0.0-20201110150701.2007.gbfe922e3-dirty

Commands for reproduction

  • Catapult, make k8s for GKE cluster.
  • cd into associated build directory.
  • helm repo add stratos http://opensource.suse.com/stratos/
  • helm repo add metrics http://opensource.suse.com/stratos-metrics/
  • helm repo add suse https://kubernetes-charts.suse.com/
  • k create namespace cfo
  • helm install -n cfo cfo suse/cf-operator --set "global.singleNamespace.name=kubecf"
  • wait for completion
  • helm install -n kubecf kubecf suse/kubecf --version 2.5.8 --values ../../EXT/values.yaml
  • wait for completion
  • Note: The directory ../../EXT contains the tarballs from make kubecf-bundle for kubecf commit bfe922e3.
  • helm upgrade -n cfo cfo ../../EXT/cf-operator.tgz --set "global.singleNamespace.name=kubecf"
  • wait for completion
  • helm upgrade -n kubecf kubecf ../../EXT/kubecf_release.tgz --values ../../EXT/values.yaml
  • wait for completion

Results

  • Depending on unknown factors, the diego-cell pods may or may not rotate completely. In the first attempt, pods 2 and 1 rotated and pod 0 did not. In the second attempt, all pods rotated.

  • The scheduler pods do not rotate and fall into a 12/13 red state. IOW the old pods keep running, but are broken. The broken job looks to be cc-deployment-updater.

  • A forced manual restart of the scheduler pods, IOW a manual (forced) rotation, has them coming back 13/13 green (see the sketch below).
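
One way to force such a rotation is simply to delete the scheduler pods and let the StatefulSet recreate them; a minimal sketch, assuming the kubecf namespace:

    kubectl delete pod -n kubecf scheduler-0 scheduler-1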

Investigation

Looking at the Image IDs for the cc-deployment-updater job, as per describe-pod.txt for the scheduler-0 pod, I find that it is completely unchanged in the important part (the sha256 digest identifying the exact version used).

See the table below. Note that the IDs were all copy-pasted out of describe-pod.txt. (I am right-justifying the IDs to line up the sha256 digests.)

When      Id
Before    docker-pullable://registry.suse.com/cap/capi@sha256:f04d035309c3a80856f14532f27a721df1e46553a59c27538396b3f6d0a6e588
After        docker-pullable://cfcontainerization/capi@sha256:f04d035309c3a80856f14532f27a721df1e46553a59c27538396b3f6d0a6e588
Forced       docker-pullable://cfcontainerization/capi@sha256:f04d035309c3a80856f14532f27a721df1e46553a59c27538396b3f6d0a6e588

The same seems to be true for all the jobs in the pod, actually.
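
(For completeness, the same image IDs can be read live from the pod status; a sketch of a jsonpath query, assuming the kubecf namespace:)

    kubectl get pod scheduler-0 -n kubecf \
      -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.imageID}{"\n"}{end}'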

The name of the Image used is the same as well, without any kind of change. Note that the names were all copy-pasted out of describe-pod.txt.

When    Image
Before  registry.suse.com/cap/capi:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.98.0
After   registry.suse.com/cap/capi:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.98.0
Forced  registry.suse.com/cap/capi:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.98.0

While I see the Container ID of the job changing, I ascribe that to

  1. the failure of the job and kubernetes trying to return it to a running state, and
  2. the rotation I forced on the system.

The cloud-controller-clock and tps-watcher jobs had restarts in the upgrade as well, with changing containers; however, they recovered (it seems to be related to a transient loss of DNS in the cluster).

For all other jobs, the Container IDs did not change in the upgrade, indicating no restarts at all. They did, of course, change in the forced rotation.

Now, conversely, looking at one of the pods which were rotated by the operator (diego-cell-0) I notice that while the Image version is the same, the Image reference as a whole did change:

When    Image
Before  registry.suse.com/cap/garden-runc:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.19.16
After   docker.io/cfcontainerization/garden-runc:SLE_15_SP1-27.5-7.0.0_374.gb8e8e6af-1.19.16

So the operator rotated the pod.

And I see this switch in Image references, from registry.suse.com/cap/ to docker.io/cfcontainerization/ (or even just cfcontainerization/ in some cases) for most pods.

The exceptions are exactly the two scheduler pods.

Conclusions so far

I posit that:

  • There is something special about the scheduler pods which causes their Image references to be left unchanged, compared to every other pod (outside of database seeder and the various dns pods).

    This triggers the chain below:

  • The operator does not rotate the scheduler pod because it sees that all the Image (references) are the same, making it assume that, with nothing changed, nothing is to be done.

  • The cc-deployment-updater job has some persistent state (config on the ephemeral disk (ED) of the job?) which is out of date after rotating (most/all of) the other pods.

    Restarting just the job container continues to pick up this state and thus continues to fail, leaving the pod broken.

  • Killing the pod and having it start from scratch squashes the bad state (as part of getting a fresh ED). The remade state/config then has the proper information to talk to the other pods and jobs, resulting in a good pod.

Actions

Investigate further where/how the Image information in the deployed helm chart is generated/configured by our tower of babel^Wtemplating.

@andreas-kupries
Contributor

And now I wonder how much of the Image change is due to using a released chart first and then upgrading to a development bundle. I suspect I will have to run a third repro and use dev bundles for both the initial and the upgraded deployment.

@mook-as
Contributor

mook-as commented Nov 30, 2020

Doing a non-upgrade minikube deployment (everything non-HA, except scaling the scheduler to two instances) shows that the scheduler readiness probe is busted: with an HA scheduler, only one of the two pods has the Sleeping… lines, since they appear to be working as active/passive (via locket); the probe only seems to work because, when the grep Sleeping returns nothing, jq -e … actually returns success on no input.
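
(I don't know the exact jq filter the probe uses, since it is elided above, but as a general illustration of how a jq -e pipeline can succeed with no input at all: a filter that slurps its input, as in the hypothetical sketch below, sees an empty array and can still evaluate to true, and the pipeline's exit status is jq's, not grep's.)

    # Hypothetical illustration only; not the actual probe filter.
    printf '' | jq -e -s 'length >= 0'   # slurped empty input becomes [], the filter is true, exit status 0
    echo $?                              # prints 0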

I believe we need to, at least, do:

  • Split cc-deployment-updater into its own instance group.
  • Mark the new thing active/passive.

Alternatively, we drop the readiness probe; however, that leads to a chance that we will be unable to detect when all instances go down.

Another alternative is to drop the cc_deployment_updater job entirely; we don't seem to actually need it, and removing it still passes 🐱. That seems like a better solution, at least in the short term; it reduces resource usage, and has no known downsides. I'll work on a PR to do so properly (the test run is very much improper).

@viovanov Removing the cc_deployment_updater seems sane to you?

@viovanov
Member

viovanov commented Dec 1, 2020

@mook-as yes, please remove. That seems like the safer option.
