Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Health checks windows service #13388

Merged
merged 25 commits into from
Oct 17, 2022

Conversation

deblasis
Copy link
Contributor

@deblasis deblasis commented Jun 7, 2022

Description

To provide windows service health check functionality addressing the issue #10637

Testing & Reproduction steps

The PR is Draft because I would like you guys to point me to your preferred way of mocking syscalls and I will update the feature branch with the relevant tests.

Links

PR Checklist

  • updated test coverage
  • external facing docs updated
  • not a security concern

@github-actions github-actions bot added pr/dependencies PR specifically updates dependencies of project theme/config Relating to Consul Agent configuration, including reloading type/docs Documentation needs to be created/updated/clarified labels Jun 7, 2022
@Amier3
Copy link
Contributor

Amier3 commented Jun 8, 2022

Hey @deblasis

Super excited to see this PR come through, thank you! After looking ( and asking ) around it doesn't look like we have any established pattern for mocking syscalls. You're free to add the tests in whenever you get a chance 👍 😄

@deblasis deblasis marked this pull request as ready for review June 9, 2022 14:51
@deblasis deblasis requested a review from a team as a code owner June 9, 2022 14:51
@deblasis
Copy link
Contributor Author

deblasis commented Jun 9, 2022

Hey @deblasis

Super excited to see this PR come through, thank you! After looking ( and asking ) around it doesn't look like we have any established pattern for mocking syscalls. You're free to add the tests in whenever you get a chance 👍 😄

Hi @Amier3 !

I have added syscall mocking and testing, I guess this is now finally ready for review! 😁

GOOS=windows go test -tags=windows ./agent/checks -run TestCheck_OSService -v

image

Comment on lines 40 to 47
defer func() {
errDisconnect := m.Disconnect()
if isHealthy || errDisconnect == nil || err != nil {
return
}
//unreachable at the moment but we might want to log this error. leaving here for code-review
err = errDisconnect
}()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am doing this because I think it might be important to at least log errors if there are any at this point (after changing the if condition of course) and I wanted to raise the question with you.

If not, the code can be simplified with a simple

defer func () { m.Disconnect() }()

without capturing the error.
Same applies below for service.Close().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a quick glance I think the fail conditions for Disconnect and Close are unlikely and can be ignored.

In the worst case scenario the process holding the ServiceManager open doesn't get cleaned up but if the OS is degraded to the point where it can't close there may be worse problems 😅

I'd vote for a simple defer but it's not a strong opinion by any means.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as the one about getWindowsSvc. What would happen if OpenService fail and we execute Disconnect anyway?

@deblasis deblasis changed the title [WIP] Feature: Health checks windows service Feature: Health checks windows service Jun 9, 2022
Copy link
Contributor

@trujillo-adam trujillo-adam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the documentation updates and made a few suggestions to improve clarity.

website/content/docs/discovery/checks.mdx Outdated Show resolved Hide resolved
website/content/api-docs/agent/check.mdx Outdated Show resolved Hide resolved
@jkirschner-hashicorp
Copy link
Contributor

@deblasis : for context, our education/docs team has performed a review (above). The engineering review still needs to be conducted! I'd recommend waiting for an engineering review before making changes so you can address all the feedback in one go.

@github-actions
Copy link

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

@github-actions github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label Aug 15, 2022
…cks_windows_service

Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
…cks_windows_service

Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
Copy link
Contributor

@kisunji kisunji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this large chunk of design and implementation! I'm not sure if we'll be able to implement Windows machines for CI to test this but I'll try it out locally on my desktop and ask the team how we want to handle this windows-specific feature.

I left a few minor comments that are non-blocking.

agent/checks/check.go Outdated Show resolved Hide resolved
agent/checks/os_service_windows.go Outdated Show resolved Hide resolved
Comment on lines 40 to 47
defer func() {
errDisconnect := m.Disconnect()
if isHealthy || errDisconnect == nil || err != nil {
return
}
//unreachable at the moment but we might want to log this error. leaving here for code-review
err = errDisconnect
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a quick glance I think the fail conditions for Disconnect and Close are unlikely and can be ignored.

In the worst case scenario the process holding the ServiceManager open doesn't get cleaned up but if the OS is degraded to the point where it can't close there may be worse problems 😅

I'd vote for a simple defer but it's not a strong opinion by any means.

agent/checks/os_service_windows.go Outdated Show resolved Hide resolved
agent/checks/os_service_windows.go Outdated Show resolved Hide resolved
ServiceID structs.ServiceID
OSService string
Interval time.Duration
Timeout time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see Timeout being used anywhere. Is it worth implementing a timeout error? Unlike gRPC or script checks I would assume OS service polling shouldn't take too long and it's not worth enforcing.

Copy link
Contributor

@dhiaayachi dhiaayachi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @deblasis!
I left few questions and an improvement suggestion but nothing substantial to add to @kisunji review.

err = nil
isHealthy = true
case svc.Paused, svc.Stopped:
err = ErrOSServiceStatusCritical
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can wrap the service state inside ErrOSServiceStatusCritical to help user have what state is failing the check.

Comment on lines 53 to 65
service := win.getWindowsSvc(serviceName, svcHandle)
defer func() {
errClose := service.Close()
if isHealthy || errClose == nil || err != nil {
return
}
//unreachable at the moment but we might want to log this error. leaving here for code-review
err = errClose
}()
status, err := service.Query()
if err != nil {
return fmt.Errorf("error querying service status: %w", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing for me. I don't know how in details the windows library work, but looking to this, getWindowsSvc will only allocate the struct and don't do any access, right? in that case what will happen if Query fail for some reason but we still call service.Close in the defer? will service.Close error?

Comment on lines 40 to 47
defer func() {
errDisconnect := m.Disconnect()
if isHealthy || errDisconnect == nil || err != nil {
return
}
//unreachable at the moment but we might want to log this error. leaving here for code-review
err = errDisconnect
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as the one about getWindowsSvc. What would happen if OpenService fail and we execute Disconnect anyway?

deblasis and others added 4 commits August 28, 2022 17:26
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
…cks_windows_service

Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
@kisunji
Copy link
Contributor

kisunji commented Aug 30, 2022

@deblasis would you be able to rebase or merge from main? It should have fixes for the failing integration tests

@kisunji
Copy link
Contributor

kisunji commented Sep 6, 2022

Thanks! Tests are all passing. Will make some time this week to test on Windows

@kisunji
Copy link
Contributor

kisunji commented Sep 16, 2022

Apologies for the delay; I didn't have time this week but will make sure to test next week. I will also post a comment on the original github issue to see if there is anyone who would like to test out the feature.

@kisunji
Copy link
Contributor

kisunji commented Sep 20, 2022

I discovered a few issues while testing:

  1. We need to initialize checkOSServices otherwise there is a panic due to assigning to a nil map
  2. (not a blocker) UI does not recognize OS as a type of check so it appears as a "Serf Check" in the UI.
  3. I am seeing the following in the logs:
2022-09-20T17:29:34.057-0400 [DEBUG] agent: Check failed: check=service-check error="error accessing service: The handle is invalid."

@kisunji
Copy link
Contributor

kisunji commented Sep 20, 2022

Reproduction steps:

// config/nodecheck.hcl
check = {
  name = "node-check"
  os_service = "AdobeARMService"
  interval = "10s"
}

// config/svccheck.hcl
check = {
  name = "service-check"
  os_service = "AdobeARMService"
  service_id = "hello"
  interval = "10s"
}
# powershell as admin
.\consul.exe agent -dev -config-dir=".\config"

Confirming that AdodeARMService is running

PS C:\Users\Chris> Get-Service -Name "AdobeARMService"

Status   Name               DisplayName
------   ----               -----------
Running  AdobeARMservice    Adobe Acrobat Update Service

@deblasis
Copy link
Contributor Author

Hi @kisunji! Sorry about that. But I have good news 👇

  • We need to initialize checkOSServices otherwise there is a panic due to assigning to a nil map

Addressed in fc0dd92

  • (not a blocker) UI does not recognize OS as a type of check so it appears as a "Serf Check" in the UI.

I have noticed that I missed some properties and the corresponding generated mappings (fixed in 461b42e) but however I believe there's a bug elsewhere, I have added another check to... check 🙂 (repro config below) and I am seeing serf in there as well:
image

Since I was looking at the UI, I noticed the lack of message so I have added a "Healthy" message with 5719fd6

  • I am seeing the following in the logs:
2022-09-20T17:29:34.057-0400 [DEBUG] agent: Check failed: check=service-check error="error accessing service: The handle is invalid."

Addressed in f440966

I have tested the above with:

// config/nodecheck.hcl
check = {
  name = "node-check"
  os_service = "Dhcp"
  interval = "10s"
}

// config/h2ping.hcl
check = {
  id = "h2ping-check"
  name = "h2ping"
  h2ping = "localhost:22222"
  interval = "10s"
  h2ping_use_tls = false
}

@kisunji
Copy link
Contributor

kisunji commented Sep 21, 2022

Ah it must be a UI bug then. Thanks for the quick updates, will try to test again soon

@jkirschner-hashicorp
Copy link
Contributor

jkirschner-hashicorp commented Sep 21, 2022

@johncowen: any ideas on why newer health check types (os_service from this PR, h2ping introduced a few months ago) would show up as a "Serf" health check on the UI?

@johncowen
Copy link
Contributor

If the Type field in the HTTP response that the UI uses is empty then we assume that is a serf check. Possibly or possibly not useful UI PR #10194

I would guess that the new healthchecks are coming out with empty Types in the HTTP responses.

Umm separate thing but if we have new healthcheck types then the UI code might need to be aware of those for menus etc. not totally sure if we inspect those or if they are hardcoded.

At the very least UI should add these possible values to our mock API I think

@deblasis
Copy link
Contributor Author

Hi @johncowen ! Nice seeing you again around here (you reviewed a PR of mine back in the day) :)

If the Type field in the HTTP response that the UI uses is empty then we assume that is a serf check. Possibly or possibly not useful UI PR #10194

I would guess that the new healthchecks are coming out with empty Types in the HTTP responses.

I can confirm that this is the case, the Type is empty in the HTTP responses even for "older" checks like h2ping.

Copy link
Contributor

@trujillo-adam trujillo-adam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a few suggestions and questions w/r/t the docs. You did a really good job on the main description of the OS service check, BTW.

@@ -245,6 +245,8 @@ The check sends datagrams to the value specified at the interval specified in th
If the datagram is sent successfully or a timeout is returned, the check is set to the `passing` state.
The check is logged as `critical` if the datagram is sent unsuccessfully.

- `OSService` `(string: "")` - Specifies the name of a service to check that is running as an OS-level service; either Windows Services on Windows or SystemD services on Unix.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `OSService` `(string: "")` - Specifies the name of a service to check that is running as an OS-level service; either Windows Services on Windows or SystemD services on Unix.
- `OSService` `(string: "")` - Specifies an OS-level service to check. You can specify either `Windows Services` on Windows or `SystemD` services on Unix.

The current wording reads as if we are checking the actual Windows Service or SystemD application, but stated in a convoluted way. I think we might mean, though, that this option tells Consul that the service it is configured to check runs as a Windows Service or SystemD service. Is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Currently, this feature works only on windows but the main pieces for making it work with SystemD are in place. The design was intended to offer the same look-n-feel to the user independently of the implementation details/OS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that I accepted this by mistake and I feel like it should be reverted and/or rephrased to indicate that it's the name of the OS-specific service that goes into that.

website/content/docs/ecs/configuration-reference.mdx Outdated Show resolved Hide resolved
deblasis and others added 2 commits September 23, 2022 07:43
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
@deblasis
Copy link
Contributor Author

deblasis commented Sep 23, 2022

@trujillo-adam:

Had a few suggestions and questions w/r/t the docs. You did a really good job on the main description of the OS service check, BTW.

Suggestions accepted! Thanks but @jkirschner-hashicorp and yourself helped a lot already giving it the first pass a few weeks ago.
My initial version was not that clear. Now it looks pretty good. 👍

Copy link
Contributor

@trujillo-adam trujillo-adam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more tweak

website/content/api-docs/agent/check.mdx Outdated Show resolved Hide resolved
@@ -245,6 +245,8 @@ The check sends datagrams to the value specified at the interval specified in th
If the datagram is sent successfully or a timeout is returned, the check is set to the `passing` state.
The check is logged as `critical` if the datagram is sent unsuccessfully.

- `OSService` `(string: "")` - Specifies an OS-level service to check. You can specify either `Windows Services` on Windows or `SystemD` services on Unix.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `OSService` `(string: "")` - Specifies an OS-level service to check. You can specify either `Windows Services` on Windows or `SystemD` services on Unix.
- `OSService` `(string: "")` - Specifies the name of an OS-level service. Configure this option if the service you intend to check runs as an OS-level service. You can specify either `Windows Services` on Windows or `SystemD` services on Unix.

I think trying to combine the action to specify the name and the action of performing the check (if that makes sense) is what's causing confusion for me. We should state them separately. First, what the param is for. Second, what the param means in the context of the overall API call. What do you think?

@deblasis
Copy link
Contributor Author

deblasis commented Oct 11, 2022 via email

@deblasis
Copy link
Contributor Author

I think it should be OK now. Shall we🚢 it ? 😄

@kisunji
Copy link
Contributor

kisunji commented Oct 17, 2022

@deblasis I've tested this successfully on Windows so this is good to 🚢 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/dependencies PR specifically updates dependencies of project theme/config Relating to Consul Agent configuration, including reloading type/docs Documentation needs to be created/updated/clarified
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants