Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add -recheck-with-time-limit support #223

Merged
merged 1 commit into from
Apr 16, 2024

Conversation

hoffie
Copy link
Contributor

@hoffie hoffie commented Feb 18, 2022

process-exporter already supports the -recheck flag which makes it run the whole matching logic on each scrape. This is very useful when trying to monitor processes which change their names shortly after start.

Sadly, -recheck carries a rather high performance penalty. At the same time, process name changes are very common directly after start, are seldomly expected during usage.

This commit introduces -recheck-with-time-limit which rechecks processes N seconds after their start and stops doing so afterwards. This combines the accuracy benefits of -recheck with the performance gains of not using -recheck.

I'd be glad if this could be accepted. Let me know if you want any changes. Thanks for working on process-exporter! :)

@flixr
Copy link
Contributor

flixr commented Feb 18, 2022

@hoffie this sounds great! I initially added the -recheck option and later also wanted to do something like this to limit the performance hits, but never got around to it...
I was more thinking of an exponential backoff to skip full rechecks for scrapes, but I guess this would do the job just as well for our usecases. Just like you said the name usually only changes shortly after start of the process.

One question though: we are also using this on an embedded device which does not have it's own RTC battery and is usually time-synced via PTP to an external source. Meaning that system time can jump forwards or backwards at any time.
I guess in this case this might not work, right?
For that we probably would need to store monotonic time and compare against that, since we only care about duration for this limit.

@hoffie
Copy link
Contributor Author

hoffie commented Feb 18, 2022

@flixr Thanks for the feedback!

I was more thinking of an exponential backoff to skip full rechecks for scrapes

This sounds like an interesting approach as well. It would probably require storing that iteration count per-process though, wouldn't it?

One question though: we are also using this on an embedded device which does not have it's own RTC battery and is usually time-synced via PTP to an external source. Meaning that system time can jump forwards or backwards at any time.
I guess in this case this might not work, right?

Probably depends on the size of the jumps and the exact -recheck-with-time-limit value.

For that we probably would need to store monotonic time and compare against that, since we only care about duration for this limit.

Some ideas:

  • time.Time seems to support monotonic timestamps. The main issue will be getting a monotonic-time-enabled time.Time for the StartTime. Because watched processes may have been started before process-exporter, we can't use Now(). I'm not sure if there is a way to convert a wall clock Unix timestamp into a (guessed!) monotonic timestamp.
  • There's StartTimeRel which seems to be a Kernel-provided (/proc/PID/stat) monotonic timestamp measured in ticks after boot. However, I have no idea how to get the current value of ticks after boot in a sane way. In other words, I don't know what to compare with. It might be that Go's time.Now() includes exactly this value, but I've found no clear evidence that it does and -- more importantly -- that one should rely on that.
  • We could likely run our own monotonic clock in the process via time.Ticker. This would imply storing another int per-process though.

In the end, I'm currently unsure whether to go this path. All of it would introduce complexity and while it might solve one type of problem (jump in time for embedded devices), it might cause other inaccuracies (e.g. suspended systems). I'll leave it up to @ncabatoff to decide if and how this should be done. Maybe it would be an option to merge the current, simple approach and try refining later?

Ressources:
https://man7.org/linux/man-pages/man2/clock_gettime.2.html
https://blog.cloudflare.com/its-go-time-on-linux/

@flixr
Copy link
Contributor

flixr commented Feb 19, 2022

Agreed, this is clearly a very useful option and should cover the vast majority of use cases.

In my case the time jump can potentially even be years (no RTC, no battery, so no useful time after cold reboot).
But this is a very special case and this PR should improve for 99% of cases and the option to always recheck is still there.

Probably the question is rather if there should be an additional option as you implemented it (which is great for backwards compatibility) or if this should be "merged" into one single, easier to understand option....

@hoffie
Copy link
Contributor Author

hoffie commented Feb 19, 2022

In my case the time jump can potentially even be years (no RTC, no battery, so no useful time after cold reboot). But this is a very special case

Heh, yeah, especially when considering common server use cases.

Probably the question is rather if there should be an additional option as you implemented it (which is great for backwards compatibility) or if this should be "merged" into one single, easier to understand option....

Yes. In a green field implementation there should only be one option. For compatibility I chose to add a new flag and keep the existing. I don't see a way to deprecate or hide a flag using Go's standard flag package, which is used here (I guess kingpin's can do it).

Anyway, probably up to @ncabatoff as well. :)

As it stands, the following things should be decided:

  • Should it be merged?
  • Does it need support for monotonic time tracking? (My tendency: Not right now)
  • Should we keep -recheck working for compatibility? (My tendency: Yes. process-exporter is not 1.0, so breakage would be ok, but I guess many people rely on the existing flags)

@hoffie
Copy link
Contributor Author

hoffie commented Mar 26, 2022

@ncabatoff Friendly ping -- is there anything I can do/improve to get this into a mergable state?
(I'm fully aware that reviewing pull requests does need time and energy and priorities are different oftentimes!)

process-exporter already supports the -recheck flag which makes it run
the whole matching logic on each scrape. This is very useful when trying
to monitor processes which change their names shortly after start.

Sadly, -recheck carries a rather high performance penalty. At the same
time, process name changes are very common directly after start, are
seldomly expected during usage.

This commit introduces -recheck-with-time-limit which rechecks processes
N seconds after their start and stops doing so afterwards. This combines
the accuracy benefits of -recheck with the performance gains of not
using -recheck.
@hoffie hoffie force-pushed the recheck-with-time-limit branch from fe30e27 to 4835e1f Compare March 27, 2022 19:18
@mika
Copy link

mika commented Nov 9, 2022

Gentle ping @ncabatoff, any chance to get this merged? Would be nice to get it into the prometheus-process-exporter package for the upcoming Debian/bookworm stable release. :)

@steinbrueckri
Copy link

Gentle ping @ncabatoff again ✌️

@steinbrueckri
Copy link

@ncabatoff Friendly ping -- is there anything I can do/improve to get this into a mergable state?

@alexmv
Copy link

alexmv commented Apr 12, 2024

Just another friendly vote in favor of getting this PR merged.

@steinbrueckri
Copy link

It appears that there hasn't been a new release since 2021, which is disappointing given the excellent pice of this software. @alexmv @mika, is there an active branch where we can integrate this pull request? If not, I'm willing to create a fork and help maintain it.

@guillemj
Copy link

@steinbrueckri My impression is also that @ncabatoff might have lost interest in maintaining this (no commits either on git HEAD)? So perhaps forking and collecting interesting patches would be the way to go, and if this repo gets activity again the fork could be folded back in I guess.

For the Debian packages (https://tracker.debian.org/prometheus-process-exporter) I'd need to discuss with the other members of the Prometheus Packaging Team there, but I think it we could consider switching to another upstream repo.

@mika
Copy link

mika commented Apr 16, 2024

@steinbrueckri FTR, I'm in the same boat as @guillemj and if anyone is willing to maintain a fork, this sounds like the way to go for me. :) Thx! 🙏

@ncabatoff
Copy link
Owner

Folks are correct that I haven't much energy for maintaining OSS projects lately. Probably I should find a co-maintainer, or hand the project over to someone else entirely. In the meantime, I can certainly merge this change, which looks perfectly fine. Very sorry for having ignored it for so long.

@ncabatoff ncabatoff merged commit 0bcf42e into ncabatoff:master Apr 16, 2024
@mika
Copy link

mika commented Apr 17, 2024

@ncabatoff no worries at all, I'm sure all of us can relate :) Thanks for accepting the MR, appreciated!

zhangguanzhang pushed a commit to zhangguanzhang/process-exporter that referenced this pull request Apr 28, 2024
process-exporter already supports the -recheck flag which makes it run
the whole matching logic on each scrape. This is very useful when trying
to monitor processes which change their names shortly after start.

Sadly, -recheck carries a rather high performance penalty. At the same
time, process name changes are very common directly after start, are
seldomly expected during usage.

This commit introduces -recheck-with-time-limit which rechecks processes
N seconds after their start and stops doing so afterwards. This combines
the accuracy benefits of -recheck with the performance gains of not
using -recheck.
zviRosenfeldRedis added a commit to RedisLabs/process-exporter that referenced this pull request Jun 16, 2024
* update deps

* Finish updating to Go 1.22

* Update Dockerfile for Go 1.22

* Add -recheck-with-time-limit support (ncabatoff#223)

process-exporter already supports the -recheck flag which makes it run
the whole matching logic on each scrape. This is very useful when trying
to monitor processes which change their names shortly after start.

Sadly, -recheck carries a rather high performance penalty. At the same
time, process name changes are very common directly after start, are
seldomly expected during usage.

This commit introduces -recheck-with-time-limit which rechecks processes
N seconds after their start and stops doing so afterwards. This combines
the accuracy benefits of -recheck with the performance gains of not
using -recheck.

* Fix server start error (ncabatoff#295)

* Update release action to use newer actions.  Use latest goreleaser version.  Use CGO_ENABLED in its config instead of tags.

* Upgrade prometheus/procfs dependency, use Go1.21 (ncabatoff#296)

* Upgrade prometheus/procfs dependency and therefore set Go1.21 in go.mod

* Fix broken compatibility with prometheus/procfs

* Update README for Go1.21

---------

Co-authored-by: Thomas Delbende <thomas.delbende@bleemeo.com>

* split image workflow from build (ncabatoff#299)

Various GHA improvements.

* Bump golang.org/x/net from 0.22.0 to 0.23.0 (ncabatoff#298)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.22.0 to 0.23.0.
- [Commits](golang/net@v0.22.0...v0.23.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nick Cabatoff <nick.cabatoff@gmail.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: ncabatoff <ncabatoff@hashicorp.com>
Co-authored-by: Christian Hoffmann <christian@hoffie.info>
Co-authored-by: AiDaiP <43956964+AiDaiP@users.noreply.github.com>
Co-authored-by: Maxi_Mega <52792549+Maxi-Mega@users.noreply.github.com>
Co-authored-by: Thomas Delbende <thomas.delbende@bleemeo.com>
Co-authored-by: Nick Cabatoff <nick.cabatoff@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants