windows service health check #10637

rismoney · 2021-07-19T02:47:41Z

Agents should have a native health check to determine if a service is running. Many services dont have network listeners and handle either outbound, or local service capability. Avoiding process creation/kill of calling a cmd sc/pwsh cmdlet to check the existence of a service would be great to just handle in native go.

I cant imagine this to be a big ask, as other health checks are seemingly simplistic functions.

jkirschner-hashicorp · 2021-07-30T01:00:15Z

Hi @rismoney,

It sounds like there's a service with no network listeners that you want to have health checks on. Rather than using a Script + Interval check, you'd prefer a native check to see if the service/process is running.

If I've understood correctly (let me know), I have a few follow-up questions:

Is a Script + Interval check what you're using instead today? If so, what are you hoping would improve/change for your use case as a result of switching to a native check?
What configuration do you imagine passing in? (e.g., service name, interval, anything else?)
What information do you imagine the health status (passing, warning, critical) being based on? (e.g., based on ServiceController.Status, or something else)

rismoney · 2021-07-30T17:50:01Z

It sounds like there's a service with no network listeners that you want to have health checks on. Rather than using a Script Interval check, you'd prefer a native check to see if the service/process is running.

Yes.

Is a Script + Interval check what you're using instead today? If so, what are you hoping would improve/change for your use case as a result of switching to a native check?

This is what we would use. I am running very latency sensitive financial transaction applications. Spawning to cmd/powershell would get expensive as the service count were to rise invoking a get-service/sc or similar operation. We are trying to do service inventory on 500 nodes with 25-50 custom services per node, and would not want the health check to be shell/script based with another process invocation. Also my understanding is any approach that uses wmi calls isn't going to scale well (the prometheus exporter kind of taught that lesson)

What configuration do you imagine passing in? (e.g., service name, interval, anything else?)

Yes, unfortunately the only thing we will know is it's name and that it's running so those two attributes are nearly all we have. That itself would be the health check. I know it's kind of primitive, The applications with listeners are also fairly sensitive, and we don't want to interrogate them by port. They have their own built in health capabilities, that essentially will down themselves if unhealthy.

What information do you imagine the health status

I think it's fairly boolean. running or not. I suppose there are some edge cases around "starting, stopping, etc) but I'd punt on those, and assume anything not running is unhealthy. Those are typically backoff timer based, and will timeout/succeed eventually.

While my initial use case is windows, I would imagine similar capabilities for systemd services would be a feature on parity with this.

I am relatively new to exploring consul, and thought this capability might open up a huge monitoring integration for us, where we can dynamically determine how to monitor things based on what services are active where in the environment. If we had to powershell/sc to get it for each service it would potentially be super expensive as the service count rises.

jkirschner-hashicorp · 2021-08-04T20:53:29Z

@rismoney :

We are trying to do service inventory on 500 nodes with 25-50 custom services per node

Are all of the 25-50 services per node unique services? Or do you have multiple instances of some services on a single node?

I ask because if service name is the identifying information used for the health check, I imagine that would be problematic with multiple instances of a service on a single node.

rismoney · 2021-08-04T21:44:57Z

The services are uniquely named and require a directory with the associated app/config in each dir. They could be the same app, but their config/name is unique. I am not sure windows even allows duplicate service names on the same node.

To clarify 2 services named:
myco-svctype-svcname-001 would reside in C:\myservice\myco-svctype-svcname-001
myco-svctype-svcname-002 would reside in C:\myservice\myco-svctype-svcname-002

The underlying appcode might be the same, but are redundantly stored on the filesystem and have different associated conf files. At no point would the service names not be unique. Also I don't think you can have 2 windows services that point to the same path\to\filename.exe ... So there is that too.

So duplication of services wouldn't occur on the same node, but could occur across nodes (parallel stuff). So myco-svctype-svcname-001 could be running on 3 nodes.

I hope that clarifies.

jkirschner-hashicorp · 2021-08-09T22:32:29Z

Hello Consul community members,

We would welcome a PR contributed by the community for this enhancement!

If you're interested, please comment here so anyone interested can stay informed. We also recommend sharing a written design approach here before proceeding with implementation to ensure we're aligned from the start.

The approach should ensure the following:

The configuration entries for defining the health check must be OS agnostic (e.g., could work for a Windows Service or systemd on Linux), so that this enhancement could be extended to other OSs in the future.
The implementation need only support Windows for now.
- Why? The overhead to create a new process on Windows seems to be much, much higher than it is on Linux (via fork), so the need is greatest on Windows.
- The documentation should explain the workaround to get the same functionality on Linux (i.e., write a script that does the check and use local script checks)

Implementing automated tests for this may not be straightforward, as (to the best of my knowledge) we don't have unit tests that run specifically on Windows within Circle CI right now. The HashiCorp team can discuss that with whomever takes on this enhancement.

deblasis · 2021-12-01T13:52:32Z

Hi there, I bumped into this issue and I would like to contribute.
I am a huge Hashicorp fan and shamelessly wannabe Hashicorpian too 😄

That said, I wanted to try something "odd", considering that this issue requires sharing a design beforehand.
I tried following the RFC template and I came up with this public gist:

https://gist.github.com/deblasis/b6dd645a8bd5e2b3c9b95bb30bbb238c

Any feedback is more than welcome.

Cheers

jkirschner-hashicorp · 2021-12-07T04:54:05Z

Hi @deblasis,

Thank you so much for writing this up - and for taking the initiative to find and apply our RFC template!

The engineering team will take a look soon and provide feedback directly on the Gist (likely in the next 1-2 weeks). In the meantime, I might provide some feedback on UX considerations (also on the Gist).

deblasis · 2021-12-07T14:37:07Z

Hi @jkirschner-hashicorp,
you are welcome! I tried, hopefully, it's easier this way.

Thank you for the update. It sounds fantastic! Take your time.
Best

kisunji · 2021-12-15T18:42:03Z

The team has blocked off some time next week to go through your RFC! We appreciate this initiative and will get back to you with valuable feedback

rismoney · 2022-05-29T22:41:51Z

any update or blockers on moving this forward?

deblasis · 2022-05-31T08:28:31Z

Hi @rismoney, it is completely my fault I guess. The team provided feedback, we are aligned with the design and there are no blockers.
I just had to write the relevant code but in the meantime, I started working on a big project and I didn't manage to follow up with my OSS commitments.
Sorry about that.
I booked some time to look at this over the weekend. Stay tuned!

deblasis · 2022-06-07T17:52:34Z

Hi folks,
I have some news, I have updated the Gist (PRD) for the issue and you can see the diff here.

I have also written some code in #13388 and I would like some feedback and some direction regarding your preferred way of mocking syscalls. I could attempt doing it "my own way", but I'd rather stay consistent if you guys have a pattern that's already used elsewhere and would like to be replicated here.

kisunji · 2022-09-16T19:35:03Z

Hey everyone, @deblasis has a PR implementing Windows service health checks: #13388

If anyone would like to try it out and give feedback, that would be much appreciated.

jkirschner-hashicorp · 2022-10-17T22:24:06Z

@rismoney : This has been closed by #13388 thanks to the hard work of @deblasis!

The functionality will first be available in the 1.14.0 release.

rismoney · 2022-10-18T02:57:52Z

amazing work, from original documentation to merging!!! kudos and @deblasis would be an asset to any org!

deblasis · 2022-10-18T07:26:58Z

Thank you @rismoney ! In hindsight, it took me way too long to stay on top of this given work commitments and I hate unfinished business but I am very happy to see my code finally merged upstream again. Always nice to be working with you guys, I always learn something new🙏

Farewell, until next time :)

jkirschner-hashicorp added theme/health-checks Health Check functionality type/enhancement Proposed improvement or new feature labels Jul 19, 2021

jkirschner-hashicorp added the theme/windows Anything related to Windows label Jul 29, 2021

jkirschner-hashicorp added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jul 30, 2021

jkirschner-hashicorp removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Aug 9, 2021

jkirschner-hashicorp added the help-wanted We encourage community PRs for these issues! label Aug 9, 2021

kisunji self-assigned this Dec 14, 2021

deblasis mentioned this issue Jun 7, 2022

Feature: Health checks windows service #13388

Merged

3 tasks

jkirschner-hashicorp closed this as completed Oct 17, 2022

jkirschner-hashicorp mentioned this issue Oct 19, 2023

Docs bug - os_service is not implemented on linux. #19037

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

windows service health check #10637

windows service health check #10637

rismoney commented Jul 19, 2021

jkirschner-hashicorp commented Jul 30, 2021

rismoney commented Jul 30, 2021 •

edited

Loading

jkirschner-hashicorp commented Aug 4, 2021

rismoney commented Aug 4, 2021

jkirschner-hashicorp commented Aug 9, 2021 •

edited

Loading

deblasis commented Dec 1, 2021

jkirschner-hashicorp commented Dec 7, 2021

deblasis commented Dec 7, 2021

kisunji commented Dec 15, 2021

rismoney commented May 29, 2022

deblasis commented May 31, 2022

deblasis commented Jun 7, 2022

kisunji commented Sep 16, 2022

jkirschner-hashicorp commented Oct 17, 2022

rismoney commented Oct 18, 2022

deblasis commented Oct 18, 2022

windows service health check #10637

windows service health check #10637

Comments

rismoney commented Jul 19, 2021

jkirschner-hashicorp commented Jul 30, 2021

rismoney commented Jul 30, 2021 • edited Loading

jkirschner-hashicorp commented Aug 4, 2021

rismoney commented Aug 4, 2021

jkirschner-hashicorp commented Aug 9, 2021 • edited Loading

deblasis commented Dec 1, 2021

jkirschner-hashicorp commented Dec 7, 2021

deblasis commented Dec 7, 2021

kisunji commented Dec 15, 2021

rismoney commented May 29, 2022

deblasis commented May 31, 2022

deblasis commented Jun 7, 2022

kisunji commented Sep 16, 2022

jkirschner-hashicorp commented Oct 17, 2022

rismoney commented Oct 18, 2022

deblasis commented Oct 18, 2022

rismoney commented Jul 30, 2021 •

edited

Loading

jkirschner-hashicorp commented Aug 9, 2021 •

edited

Loading