unable to bosh cck an unresponsive vm (very high cpu load) #2531

Open
poblin-orange opened this issue Jun 17, 2024 · 2 comments
poblin-orange commented Jun 17, 2024

Describe the bug

On an overloaded VM, we hit the following issue:

  • the VM is reported as unresponsive by the director:
Task 5813416. Done
Deployment '00-shared-services-r2'
Instance                                                          Process State       AZ     IPs            Deployment  
services-agents-r2-z1/48bd83b9-22d8-469a-9425-2ca16412a79a        running             r2-z1  xx.xx.xx.6   00-shared-services-r2  
                                                                                             192.168.64.67    
services-agents-r2-z1/5082819e-3ddc-493e-a38c-3894a81f668e        unresponsive agent  r2-z1  192.168.64.68  00-shared-services-r2  
                                                                                             xx.xx.xx.7     
services-agents-r2-z1/f0764857-34a1-403c-8302-1b89671133b0        unresponsive agent  r2-z1  xx.xx.xx.5   00-shared-services-r2  
                                                                                             192.168.64.66    
services-agents-r2-z2/59810cf3-f7ab-46d7-bbff-1cefc536cfe3        running             r2-z2  192.168.64.74  00-shared-services-r2  
                                                                                             xx.xx.xx.9     
services-proxy-agents-r2-z1/fe212593-e542-4f0a-bcff-5e38485a6c73  unresponsive agent  r2-z1  192.168.64.73  00-shared-services-r2  
                                                                                             xx.xx.xx.8     
5 instances
Succeeded
  • technically, the VM is up (ping and nc -vz succeed, and the monit processes are up when inspected)

  • however, bosh cck fails with the following error:

$ bosh cck
Using environment '192.168.99.152' as user 'yyyyy'
Using deployment '00-shared-services-r2'
Task 5813417
Task 5813417 | 18:27:48 | Scanning 5 VMs: Checking VM states (00:00:17)
                        L Error: Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: unexpected EOF
Task 5813417 | 18:28:05 | Error: Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: unexpected EOF
Task 5813417 Started  Wed Jun 12 18:27:48 UTC 2024
Task 5813417 Finished Wed Jun 12 18:28:05 UTC 2024
Task 5813417 Duration 00:00:17
Task 5813417 error
Performing a scan on deployment '00-shared-services-r2':
  Expected task '5813417' to succeed but state is 'error'
  • the scan fails before the operator can choose any resolution, so bosh cck is not usable
  • workaround:
    • bosh deploy <manifest.yml> --fix
    • bosh is then able to repair the deployment by recreating the unresponsive VM
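
To find the instances that will need the workaround, the instance listing shown above can be filtered for rows in the "unresponsive agent" state. This is only a minimal sketch: the helper name `unresponsive_instances` is hypothetical, and it assumes the default table layout shown in this issue, where the instance name is the first column.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the bosh CLI): reads instance-listing
# table output on stdin and prints the first column (instance name) of
# every row whose process state reads "unresponsive agent".
unresponsive_instances() {
  awk '/unresponsive agent/ { print $1 }'
}

# Example usage, assuming a targeted environment and the deployment
# name from this issue:
#   bosh -d 00-shared-services-r2 vms | unresponsive_instances
```

Continuation rows that only carry a second IP address do not contain the state text, so they are skipped.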

Versions (please complete the following information):

  • Infrastructure: vsphere
  • BOSH version bosh/277.4.3
  • BOSH CLI version 7.5.6
  • Stemcell version bosh-vsphere-esxi-ubuntu-jammy-go_agent 1.465* ubuntu-jammy


beyhan commented Jun 20, 2024

Could you please try with the bosh recovery option documented in https://bosh.io/docs/recover/?

@beyhan beyhan moved this from Inbox to Pending Review | Discussion in Foundational Infrastructure Working Group Jun 20, 2024
@gberche-orange

Thanks @beyhan, we'll test.

BTW, the same symptom was previously fixed for bosh vms and bosh deploy in c9cc3ff, but presumably not yet for bosh cck; see #1754, associated with https://www.pivotaltracker.com/n/projects/1456570/stories/151434806 (screenshot attached in the original thread).

@jpalermo jpalermo moved this from Pending Review | Discussion to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Jun 27, 2024