[Bug] [Datasophon-service] When the alarm is restored in AlertActor, the state modification logic is abnormal #402

Closed
3 tasks done
thomasg19930417 opened this issue Sep 1, 2023 · 8 comments
Labels: bug (Something isn't working)

thomasg19930417 (Contributor) commented Sep 1, 2023

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

Should the state here be updated only when the role is not running, i.e. with the condition if (roleInstance.getServiceRoleState() != ServiceRoleState.RUNNING)? The current code is:
ClusterServiceRoleInstanceEntity roleInstance = roleInstanceService.getOneServiceRole(labels.getServiceRoleName(), hostname, clusterId);
if (roleInstance.getServiceRoleState() == ServiceRoleState.RUNNING) {
    roleInstance.setServiceRoleState(ServiceRoleState.RUNNING);
    if (nodeHasWarnAlertList) {
        roleInstance.setServiceRoleState(ServiceRoleState.EXISTS_ALARM);
    }
    roleInstanceService.updateById(roleInstance);
}

Version

dev

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

thomasg19930417 added the bug label Sep 1, 2023
@datasophon (Member)

I'm not sure what issue you're trying to clarify. Can you elaborate on it?

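@thomasg19930417 (Contributor, Author)

On the dev branch, when a service-down alert is resolved, the logic that modifies the state appears to be wrong. In the code below, the state should be changed to RUNNING only when the current state is not RUNNING:
if (roleInstance.getServiceRoleState() == ServiceRoleState.RUNNING) {
    roleInstance.setServiceRoleState(ServiceRoleState.RUNNING);
}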

@thomasg19930417 (Contributor, Author)

As a result, when the service sends a resolved alert, the abnormal state cannot be restored to the normal state. I compared this with the code from the previous version, and it looks like this condition was written incorrectly during the refactoring.
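To make the reported behaviour concrete, here is a minimal, self-contained sketch that compares the condition quoted in this issue with the condition the reporter proposes. It is only an illustration: the class `AlertStateSketch`, both method names, and the `STOP` state are hypothetical stand-ins, not DataSophon code, and the actual change in the linked PR may differ.

```java
public class AlertStateSketch {

    // RUNNING and EXISTS_ALARM appear in the issue; STOP is a hypothetical
    // stand-in for "the service role is down".
    enum ServiceRoleState { RUNNING, STOP, EXISTS_ALARM }

    // Condition as quoted in the issue: the state is only rewritten when the
    // role is already RUNNING, so a role that is down is left untouched.
    static ServiceRoleState onResolvedAsQuoted(ServiceRoleState current,
                                               boolean nodeHasWarnAlertList) {
        if (current == ServiceRoleState.RUNNING) {
            return nodeHasWarnAlertList
                    ? ServiceRoleState.EXISTS_ALARM
                    : ServiceRoleState.RUNNING;
        }
        return current;
    }

    // Condition as proposed by the reporter: rewrite the state only when the
    // role is currently NOT RUNNING.
    static ServiceRoleState onResolvedAsProposed(ServiceRoleState current,
                                                 boolean nodeHasWarnAlertList) {
        if (current != ServiceRoleState.RUNNING) {
            return nodeHasWarnAlertList
                    ? ServiceRoleState.EXISTS_ALARM
                    : ServiceRoleState.RUNNING;
        }
        return current;
    }

    public static void main(String[] args) {
        // A resolved alert arrives while the role is marked as down.
        System.out.println(onResolvedAsQuoted(ServiceRoleState.STOP, false));   // prints STOP
        System.out.println(onResolvedAsProposed(ServiceRoleState.STOP, false)); // prints RUNNING
    }
}
```

With the quoted condition, a role that is marked as down stays down after the resolved alert, which matches the symptom described above. With the proposed condition, it returns to RUNNING, or to EXISTS_ALARM if the node still has warning alerts.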

@datasophon (Member)

We tested this and were able to recover from an abnormal state to a normal state. How did the situation you mentioned occur?

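@thomasg19930417 (Contributor, Author)

Starting and stopping the service directly from the web page probably will not reproduce the problem. Stopping the service and then starting it from the backend should reproduce it. (The likely scenario is that high machine load causes Prometheus scraping to fail, and when scraping later returns to normal, the alert-resolved message that is sent back cannot reset the state to normal.)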

@datasophon (Member)

According to your description, we have reproduced this problem. Can you help us solve it?

@thomasg19930417 (Contributor, Author)

I will submit a PR later.

datasophon pushed a commit that referenced this issue Sep 1, 2023
* remove  redundant initializer

* Fix issues  #402
@thomasg19930417 (Contributor, Author)

PR link: #404
