-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: Health checks windows service #13388
Feature: Health checks windows service #13388
Conversation
Hey @deblasis Super excited to see this PR come through, thank you! After looking ( and asking ) around it doesn't look like we have any established pattern for mocking syscalls. You're free to add the tests in whenever you get a chance 👍 😄 |
Hi @Amier3 ! I have added syscall mocking and testing, I guess this is now finally ready for review! 😁
|
agent/checks/os_service_windows.go
Outdated
defer func() { | ||
errDisconnect := m.Disconnect() | ||
if isHealthy || errDisconnect == nil || err != nil { | ||
return | ||
} | ||
//unreachable at the moment but we might want to log this error. leaving here for code-review | ||
err = errDisconnect | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am doing this because I think it might be important to at least log errors if there are any at this point (after changing the if condition of course) and I wanted to raise the question with you.
If not, the code can be simplified with a simple
defer func () { m.Disconnect() }()
without capturing the error.
Same applies below for service.Close()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a quick glance I think the fail conditions for Disconnect and Close are unlikely and can be ignored.
In the worst case scenario the process holding the ServiceManager open doesn't get cleaned up but if the OS is degraded to the point where it can't close there may be worse problems 😅
I'd vote for a simple defer but it's not a strong opinion by any means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here as the one about getWindowsSvc
. What would happen if OpenService
fail and we execute Disconnect anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the documentation updates and made a few suggestions to improve clarity.
@deblasis : for context, our education/docs team has performed a review (above). The engineering review still needs to be conducted! I'd recommend waiting for an engineering review before making changes so you can address all the feedback in one go. |
This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions. |
…cks_windows_service Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
…cks_windows_service Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this large chunk of design and implementation! I'm not sure if we'll be able to implement Windows machines for CI to test this but I'll try it out locally on my desktop and ask the team how we want to handle this windows-specific feature.
I left a few minor comments that are non-blocking.
agent/checks/os_service_windows.go
Outdated
defer func() { | ||
errDisconnect := m.Disconnect() | ||
if isHealthy || errDisconnect == nil || err != nil { | ||
return | ||
} | ||
//unreachable at the moment but we might want to log this error. leaving here for code-review | ||
err = errDisconnect | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a quick glance I think the fail conditions for Disconnect and Close are unlikely and can be ignored.
In the worst case scenario the process holding the ServiceManager open doesn't get cleaned up but if the OS is degraded to the point where it can't close there may be worse problems 😅
I'd vote for a simple defer but it's not a strong opinion by any means.
ServiceID structs.ServiceID | ||
OSService string | ||
Interval time.Duration | ||
Timeout time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see Timeout
being used anywhere. Is it worth implementing a timeout error? Unlike gRPC or script checks I would assume OS service polling shouldn't take too long and it's not worth enforcing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agent/checks/os_service_windows.go
Outdated
err = nil | ||
isHealthy = true | ||
case svc.Paused, svc.Stopped: | ||
err = ErrOSServiceStatusCritical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can wrap the service state inside ErrOSServiceStatusCritical
to help user have what state is failing the check.
agent/checks/os_service_windows.go
Outdated
service := win.getWindowsSvc(serviceName, svcHandle) | ||
defer func() { | ||
errClose := service.Close() | ||
if isHealthy || errClose == nil || err != nil { | ||
return | ||
} | ||
//unreachable at the moment but we might want to log this error. leaving here for code-review | ||
err = errClose | ||
}() | ||
status, err := service.Query() | ||
if err != nil { | ||
return fmt.Errorf("error querying service status: %w", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing for me. I don't know how in details the windows library work, but looking to this, getWindowsSvc
will only allocate the struct and don't do any access, right? in that case what will happen if Query fail for some reason but we still call service.Close
in the defer? will service.Close
error?
agent/checks/os_service_windows.go
Outdated
defer func() { | ||
errDisconnect := m.Disconnect() | ||
if isHealthy || errDisconnect == nil || err != nil { | ||
return | ||
} | ||
//unreachable at the moment but we might want to log this error. leaving here for code-review | ||
err = errDisconnect | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here as the one about getWindowsSvc
. What would happen if OpenService
fail and we execute Disconnect anyway?
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
…cks_windows_service Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
@deblasis would you be able to rebase or merge from |
…cks_windows_service
Thanks! Tests are all passing. Will make some time this week to test on Windows |
Apologies for the delay; I didn't have time this week but will make sure to test next week. I will also post a comment on the original github issue to see if there is anyone who would like to test out the feature. |
I discovered a few issues while testing:
|
Reproduction steps: // config/nodecheck.hcl
check = {
name = "node-check"
os_service = "AdobeARMService"
interval = "10s"
}
// config/svccheck.hcl
check = {
name = "service-check"
os_service = "AdobeARMService"
service_id = "hello"
interval = "10s"
} # powershell as admin
.\consul.exe agent -dev -config-dir=".\config" Confirming that AdodeARMService is running PS C:\Users\Chris> Get-Service -Name "AdobeARMService"
Status Name DisplayName
------ ---- -----------
Running AdobeARMservice Adobe Acrobat Update Service |
Hi @kisunji! Sorry about that. But I have good news 👇
Addressed in fc0dd92
I have noticed that I missed some properties and the corresponding generated mappings (fixed in 461b42e) but however I believe there's a bug elsewhere, I have added another check to... check 🙂 (repro config below) and I am seeing Since I was looking at the UI, I noticed the lack of message so I have added a "Healthy" message with 5719fd6
Addressed in f440966 I have tested the above with: // config/nodecheck.hcl
check = {
name = "node-check"
os_service = "Dhcp"
interval = "10s"
}
// config/h2ping.hcl
check = {
id = "h2ping-check"
name = "h2ping"
h2ping = "localhost:22222"
interval = "10s"
h2ping_use_tls = false
} |
Ah it must be a UI bug then. Thanks for the quick updates, will try to test again soon |
@johncowen: any ideas on why newer health check types ( |
If the I would guess that the new healthchecks are coming out with empty Umm separate thing but if we have new healthcheck types then the UI code might need to be aware of those for menus etc. not totally sure if we inspect those or if they are hardcoded. At the very least UI should add these possible values to our mock API I think |
Hi @johncowen ! Nice seeing you again around here (you reviewed a PR of mine back in the day) :)
I can confirm that this is the case, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a few suggestions and questions w/r/t the docs. You did a really good job on the main description of the OS service check, BTW.
@@ -245,6 +245,8 @@ The check sends datagrams to the value specified at the interval specified in th | |||
If the datagram is sent successfully or a timeout is returned, the check is set to the `passing` state. | |||
The check is logged as `critical` if the datagram is sent unsuccessfully. | |||
|
|||
- `OSService` `(string: "")` - Specifies the name of a service to check that is running as an OS-level service; either Windows Services on Windows or SystemD services on Unix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `OSService` `(string: "")` - Specifies the name of a service to check that is running as an OS-level service; either Windows Services on Windows or SystemD services on Unix. | |
- `OSService` `(string: "")` - Specifies an OS-level service to check. You can specify either `Windows Services` on Windows or `SystemD` services on Unix. |
The current wording reads as if we are checking the actual Windows Service or SystemD application, but stated in a convoluted way. I think we might mean, though, that this option tells Consul that the service it is configured to check runs as a Windows Service or SystemD service. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Currently, this feature works only on windows but the main pieces for making it work with SystemD are in place. The design was intended to offer the same look-n-feel to the user independently of the implementation details/OS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that I accepted this by mistake and I feel like it should be reverted and/or rephrased to indicate that it's the name
of the OS-specific service that goes into that.
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
Suggestions accepted! Thanks but @jkirschner-hashicorp and yourself helped a lot already giving it the first pass a few weeks ago. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more tweak
@@ -245,6 +245,8 @@ The check sends datagrams to the value specified at the interval specified in th | |||
If the datagram is sent successfully or a timeout is returned, the check is set to the `passing` state. | |||
The check is logged as `critical` if the datagram is sent unsuccessfully. | |||
|
|||
- `OSService` `(string: "")` - Specifies an OS-level service to check. You can specify either `Windows Services` on Windows or `SystemD` services on Unix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `OSService` `(string: "")` - Specifies an OS-level service to check. You can specify either `Windows Services` on Windows or `SystemD` services on Unix. | |
- `OSService` `(string: "")` - Specifies the name of an OS-level service. Configure this option if the service you intend to check runs as an OS-level service. You can specify either `Windows Services` on Windows or `SystemD` services on Unix. |
I think trying to combine the action to specify the name and the action of performing the check (if that makes sense) is what's causing confusion for me. We should state them separately. First, what the param is for. Second, what the param means in the context of the overall API call. What do you think?
👍
…On Mon, 29 Aug 2022 at 18:05, Chris S. Kim ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In agent/agent.go
<#13388 (comment)>:
> + if prev := a.checkOSServices[cid]; prev != nil {
+ prev.Stop()
+ }
I wouldn't mind if you just removed the line from the docker case as well
—
Reply to this email directly, view it on GitHub
<#13388 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AHAEQNW57SXDCXXOIMI2FPLV3TUVJANCNFSM5YDYBKAQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I think it should be OK now. Shall we🚢 it ? 😄 |
@deblasis I've tested this successfully on Windows so this is good to 🚢 ! |
Description
To provide windows service health check functionality addressing the issue #10637
Testing & Reproduction steps
The PR is Draft because I would like you guys to point me to your preferred way of mocking syscalls and I will update the feature branch with the relevant tests.
Links
PR Checklist