Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

windows service health check #10637

Closed
rismoney opened this issue Jul 19, 2021 · 16 comments
Closed

windows service health check #10637

rismoney opened this issue Jul 19, 2021 · 16 comments
Assignees
Labels
help-wanted We encourage community PRs for these issues! theme/health-checks Health Check functionality theme/windows Anything related to Windows type/enhancement Proposed improvement or new feature

Comments

@rismoney
Copy link

Agents should have a native health check to determine if a service is running. Many services dont have network listeners and handle either outbound, or local service capability. Avoiding process creation/kill of calling a cmd sc/pwsh cmdlet to check the existence of a service would be great to just handle in native go.

I cant imagine this to be a big ask, as other health checks are seemingly simplistic functions.

@jkirschner-hashicorp jkirschner-hashicorp added theme/health-checks Health Check functionality type/enhancement Proposed improvement or new feature labels Jul 19, 2021
@jkirschner-hashicorp jkirschner-hashicorp added the theme/windows Anything related to Windows label Jul 29, 2021
@jkirschner-hashicorp
Copy link
Contributor

Hi @rismoney,

It sounds like there's a service with no network listeners that you want to have health checks on. Rather than using a Script + Interval check, you'd prefer a native check to see if the service/process is running.

If I've understood correctly (let me know), I have a few follow-up questions:

  • Is a Script + Interval check what you're using instead today? If so, what are you hoping would improve/change for your use case as a result of switching to a native check?
  • What configuration do you imagine passing in? (e.g., service name, interval, anything else?)
  • What information do you imagine the health status (passing, warning, critical) being based on? (e.g., based on ServiceController.Status, or something else)

@jkirschner-hashicorp jkirschner-hashicorp added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jul 30, 2021
@rismoney
Copy link
Author

rismoney commented Jul 30, 2021

It sounds like there's a service with no network listeners that you want to have health checks on. Rather than using a Script Interval check, you'd prefer a native check to see if the service/process is running.

Yes.

Is a Script + Interval check what you're using instead today? If so, what are you hoping would improve/change for your use case as a result of switching to a native check?

This is what we would use. I am running very latency sensitive financial transaction applications. Spawning to cmd/powershell would get expensive as the service count were to rise invoking a get-service/sc or similar operation. We are trying to do service inventory on 500 nodes with 25-50 custom services per node, and would not want the health check to be shell/script based with another process invocation. Also my understanding is any approach that uses wmi calls isn't going to scale well (the prometheus exporter kind of taught that lesson)

What configuration do you imagine passing in? (e.g., service name, interval, anything else?)

Yes, unfortunately the only thing we will know is it's name and that it's running so those two attributes are nearly all we have. That itself would be the health check. I know it's kind of primitive, The applications with listeners are also fairly sensitive, and we don't want to interrogate them by port. They have their own built in health capabilities, that essentially will down themselves if unhealthy.

What information do you imagine the health status

I think it's fairly boolean. running or not. I suppose there are some edge cases around "starting, stopping, etc) but I'd punt on those, and assume anything not running is unhealthy. Those are typically backoff timer based, and will timeout/succeed eventually.

While my initial use case is windows, I would imagine similar capabilities for systemd services would be a feature on parity with this.

I am relatively new to exploring consul, and thought this capability might open up a huge monitoring integration for us, where we can dynamically determine how to monitor things based on what services are active where in the environment. If we had to powershell/sc to get it for each service it would potentially be super expensive as the service count rises.

@jkirschner-hashicorp
Copy link
Contributor

@rismoney :

We are trying to do service inventory on 500 nodes with 25-50 custom services per node

Are all of the 25-50 services per node unique services? Or do you have multiple instances of some services on a single node?

I ask because if service name is the identifying information used for the health check, I imagine that would be problematic with multiple instances of a service on a single node.

@rismoney
Copy link
Author

rismoney commented Aug 4, 2021

The services are uniquely named and require a directory with the associated app/config in each dir. They could be the same app, but their config/name is unique. I am not sure windows even allows duplicate service names on the same node.

To clarify 2 services named:
myco-svctype-svcname-001 would reside in C:\myservice\myco-svctype-svcname-001
myco-svctype-svcname-002 would reside in C:\myservice\myco-svctype-svcname-002

The underlying appcode might be the same, but are redundantly stored on the filesystem and have different associated conf files. At no point would the service names not be unique. Also I don't think you can have 2 windows services that point to the same path\to\filename.exe ... So there is that too.

So duplication of services wouldn't occur on the same node, but could occur across nodes (parallel stuff). So myco-svctype-svcname-001 could be running on 3 nodes.

I hope that clarifies.

@jkirschner-hashicorp jkirschner-hashicorp removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Aug 9, 2021
@jkirschner-hashicorp
Copy link
Contributor

jkirschner-hashicorp commented Aug 9, 2021

Hello Consul community members,

We would welcome a PR contributed by the community for this enhancement!

If you're interested, please comment here so anyone interested can stay informed. We also recommend sharing a written design approach here before proceeding with implementation to ensure we're aligned from the start.

The approach should ensure the following:

  • The configuration entries for defining the health check must be OS agnostic (e.g., could work for a Windows Service or systemd on Linux), so that this enhancement could be extended to other OSs in the future.
  • The implementation need only support Windows for now.
    • Why? The overhead to create a new process on Windows seems to be much, much higher than it is on Linux (via fork), so the need is greatest on Windows.
    • The documentation should explain the workaround to get the same functionality on Linux (i.e., write a script that does the check and use local script checks)

Implementing automated tests for this may not be straightforward, as (to the best of my knowledge) we don't have unit tests that run specifically on Windows within Circle CI right now. The HashiCorp team can discuss that with whomever takes on this enhancement.

@jkirschner-hashicorp jkirschner-hashicorp added the help-wanted We encourage community PRs for these issues! label Aug 9, 2021
@deblasis
Copy link
Contributor

deblasis commented Dec 1, 2021

Hi there, I bumped into this issue and I would like to contribute.
I am a huge Hashicorp fan and shamelessly wannabe Hashicorpian too 😄

That said, I wanted to try something "odd", considering that this issue requires sharing a design beforehand.
I tried following the RFC template and I came up with this public gist:

https://gist.github.com/deblasis/b6dd645a8bd5e2b3c9b95bb30bbb238c

Any feedback is more than welcome.

Cheers

@jkirschner-hashicorp
Copy link
Contributor

Hi @deblasis,

Thank you so much for writing this up - and for taking the initiative to find and apply our RFC template!

The engineering team will take a look soon and provide feedback directly on the Gist (likely in the next 1-2 weeks). In the meantime, I might provide some feedback on UX considerations (also on the Gist).

@deblasis
Copy link
Contributor

deblasis commented Dec 7, 2021

Hi @jkirschner-hashicorp,
you are welcome! I tried, hopefully, it's easier this way.

Thank you for the update. It sounds fantastic! Take your time.
Best

@kisunji kisunji self-assigned this Dec 14, 2021
@kisunji
Copy link
Contributor

kisunji commented Dec 15, 2021

The team has blocked off some time next week to go through your RFC! We appreciate this initiative and will get back to you with valuable feedback

@rismoney
Copy link
Author

any update or blockers on moving this forward?

@deblasis
Copy link
Contributor

Hi @rismoney, it is completely my fault I guess. The team provided feedback, we are aligned with the design and there are no blockers.
I just had to write the relevant code but in the meantime, I started working on a big project and I didn't manage to follow up with my OSS commitments.
Sorry about that.
I booked some time to look at this over the weekend. Stay tuned!

@deblasis
Copy link
Contributor

deblasis commented Jun 7, 2022

Hi folks,
I have some news, I have updated the Gist (PRD) for the issue and you can see the diff here.

I have also written some code in #13388 and I would like some feedback and some direction regarding your preferred way of mocking syscalls. I could attempt doing it "my own way", but I'd rather stay consistent if you guys have a pattern that's already used elsewhere and would like to be replicated here.

@kisunji
Copy link
Contributor

kisunji commented Sep 16, 2022

Hey everyone, @deblasis has a PR implementing Windows service health checks: #13388

If anyone would like to try it out and give feedback, that would be much appreciated.

@jkirschner-hashicorp
Copy link
Contributor

@rismoney : This has been closed by #13388 thanks to the hard work of @deblasis!

The functionality will first be available in the 1.14.0 release.

@rismoney
Copy link
Author

amazing work, from original documentation to merging!!! kudos and @deblasis would be an asset to any org!

@deblasis
Copy link
Contributor

Thank you @rismoney ! In hindsight, it took me way too long to stay on top of this given work commitments and I hate unfinished business but I am very happy to see my code finally merged upstream again. Always nice to be working with you guys, I always learn something new🙏

Farewell, until next time :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help-wanted We encourage community PRs for these issues! theme/health-checks Health Check functionality theme/windows Anything related to Windows type/enhancement Proposed improvement or new feature
Projects
None yet
Development

No branches or pull requests

4 participants