Production - [Alerting] Android devices disconnected #9186

Closed
dotnet-eng-status bot opened this issue Apr 26, 2022 · 7 comments
Labels: Critical; Grafana Alert (issues opened by Grafana); Inactive Alert (issues from Grafana alerts that are now "OK"); Ops - First Responder; Production (tied to the Production environment, as opposed to Staging)

Comments

@dotnet-eng-status

💔 Metric state changed to alerting

We detected Android devices that have disconnected from their hosts. The most common cause is a device getting stuck in the bootloader after a reboot, but there can be other reasons too (for example, a dying battery).

The Helix machines listed below have failed most or all of their mobile operations. They need to be taken out of rotation, and DDFUN needs to send someone to check on them.

Action points:

  1. Offline the affected machines. You can use the OSOB CLI, for instance. Example:
    dotnet run change-machine-state --queue windows.10.amd64.android.open -r "Android device disconnected - https://github.com/dotnet/core-eng/issues/[FILL IN ID]" --production -d -m [FILL IN MACHINE NAME]
    
  2. Open a DDFUN IcM request to have someone look at the machines: https://aka.ms/ddfunicm - see example ticket
  3. Link the IcM ticket in this issue
  4. Wait for ticket resolution
  5. Return the machines to the online state. You can use the OSOB CLI for that too. Example:
    dotnet run change-machine-state --queue windows.10.amd64.android.open -r "Android device disconnected" --production -e -m DNCENGWIN-123
    

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.
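
When several machines are listed, it may help to generate the offline and online commands from the action points rather than editing them by hand. The following is a minimal sketch, not part of the alert itself: it only builds and prints command lines and never runs them. The helper names `osob_command` and `render` are hypothetical, and the reading of `-d` as "take offline" and `-e` as "bring online" is an assumption based on the two examples in the action points.

```python
import shlex

# Queue name copied from the examples in the action points.
QUEUE = "windows.10.amd64.android.open"


def osob_command(machine: str, online: bool, reason: str) -> list[str]:
    """Build the argv for one change-machine-state call (does not run it)."""
    # Assumption: -d takes a machine offline and -e brings it back online,
    # as the step 1 and step 5 examples suggest.
    state_flag = "-e" if online else "-d"
    return [
        "dotnet", "run", "change-machine-state",
        "--queue", QUEUE,
        "-r", reason,
        "--production",
        state_flag,
        "-m", machine,
    ]


def render(machines: list[str], online: bool, reason: str) -> list[str]:
    """Return one shell-quoted command line per machine."""
    return [shlex.join(osob_command(m, online, reason)) for m in machines]


if __name__ == "__main__":
    # The machine name comes from the alert; the reason is a placeholder.
    for line in render(["DNCENGWIN-112"], online=False,
                       reason="Android device disconnected"):
        print(line)
```

Printing the commands first leaves room to review the machine list against the latest alert comment before anything is taken out of rotation.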

More information about mobile device investigations here.

  • FailureRate {Machine=DNCENGWIN-112} 83
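
Because the list of machines can change between firings, pulling the machine names out of each alert comment can make it easier to compare one firing with the next. This is a rough sketch that assumes every metric line follows the `FailureRate {Machine=NAME} VALUE` shape shown above. The function name `parse_failure_rates` is hypothetical, and the unit of the value is not stated in the alert, so it is kept as a plain number.

```python
import re

# Matches lines shaped like: FailureRate {Machine=DNCENGWIN-112} 83
METRIC_LINE = re.compile(
    r"FailureRate\s*\{Machine=(?P<machine>[^}]+)\}\s*(?P<rate>\d+(?:\.\d+)?)"
)


def parse_failure_rates(comment_text: str) -> dict[str, float]:
    """Map each machine named in an alert comment to its reported value."""
    return {
        match["machine"]: float(match["rate"])
        for match in METRIC_LINE.finditer(comment_text)
    }


if __name__ == "__main__":
    sample = "  • FailureRate {Machine=DNCENGWIN-112} 83"
    print(parse_failure_rates(sample))
```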

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-35f560112f7a4bfabf9fd69bc1bd76fa

@dotnet-eng-status dotnet-eng-status bot added the Active Alert, Critical, Ops - First Responder, Grafana Alert, and Production labels Apr 26, 2022
@premun premun self-assigned this Apr 26, 2022
@premun
Member

premun commented Apr 26, 2022

@dotnet-eng-status
Author

💔 Metric state changed to alerting

  • FailureRate {Machine=DNCENGWIN-112} 83

Go to rule

@MattGal
Member

MattGal commented Apr 26, 2022

I checked, and this machine was already offline by the time I got around to poking it. (Edit: it's just a weird double alert.)

@premun
Member

premun commented Apr 27, 2022

@MattGal unfortunately, this is by design. The alert says:

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

A proposed improvement, with details on why it works this way, is in #9092.

@premun
Member

premun commented Apr 27, 2022

Putting the phone back online, as DDFUN has fixed it. Waiting for the alert to clear.

@dotnet-eng-status
Author

💔 Metric state changed to alerting

  • FailureRate {Machine=DNCENGWIN-112} 83

Go to rule

@dotnet-eng-status dotnet-eng-status bot added the Inactive Alert label and removed the Active Alert label Apr 27, 2022
@dotnet-eng-status
Author

💚 Metric state changed to ok

Go to rule

@premun premun closed this as completed Apr 27, 2022