Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

promtail: fix high CPU usage on large kubernetes clusters. #1118

Merged
merged 3 commits into from
Jul 3, 2020

Conversation

shuttie
Copy link
Contributor

@shuttie shuttie commented Oct 4, 2019

What this PR does / why we need it:
This PR is a set of multiple optimizations to make promtail usable on a large kubernetes clusters. The original problem is discussed here: #1102

The fix is doing three things:

  1. Host filtering now is happening before the pipeline is build.
  2. Prometheus Kubernetes discovery is now able to do a server-side filtering.
  3. Promtool now leverages this feature to push explicit spec.hostName=$HOSTNAME field selector to kubernetes API, so in a large Kubernetes cluster there is no need to pull thousands of pod specs not belonging to a current node.

After this fix applied, we observed a significant drop of promtail CPU usage from 100% to 3% without any spikes.

Current issues:

  • I'm not sure about early host filtering. Is it legal to override __host__ label in relabel config? As filtering happening before there processed labels are built, then it's possible that we may mess the target discovery.

Which issue(s) this PR fixes:
Fixes #1102

Checklist

  • Documentation added
  • Tests updated

@CLAassistant
Copy link

CLAassistant commented Oct 4, 2019

CLA assistant check
All committers have signed the CLA.

@cyriltovena
Copy link
Contributor

Looks very good ! I don't think we can accept that without the upstream to be merged first as we don't want to fork Prometheus. So let's wait for that.

In the meantime, I think we do have a vendor folder in Loki if you could update that with the related gomod changes you've made. This kinda helps us to see the blast radius of updates made.

Again thank you for this !

@shuttie
Copy link
Contributor Author

shuttie commented Oct 21, 2019

@cyriltovena thanks for the feedback, I've updated the PR according to your comments. And the upstream Prometheus PR is still being reviewed. But I guess (and hope) that it should be merged soon.

@cyriltovena cyriltovena added the keepalive An issue or PR that will be kept alive and never marked as stale. label Oct 29, 2019
@cyriltovena
Copy link
Contributor

I wish we could have this move faster.

@dtennander
Copy link
Contributor

This change looks really promising, so I just want to do a poke here and tell you that the upstream PR is merged now 😄

@posix4e
Copy link

posix4e commented Jun 11, 2020

How's it going?

@cyriltovena
Copy link
Contributor

cyriltovena commented Jun 26, 2020

This up for grab. I think the original author is busy.

Let me know if you need help.

@shuttie
Copy link
Contributor Author

shuttie commented Jul 3, 2020

Not clear why drone-ci complains on non-gofmted cmd/fluent-bit/loki.go which was never touched in this PR. But anyway, this PR is ready for review.

cc @cyriltovena

@cyriltovena
Copy link
Contributor

Not clear why drone-ci complains on non-gofmted cmd/fluent-bit/loki.go which was never touched in this PR. But anyway, this PR is ready for review.

cc @cyriltovena

Looking.

@cyriltovena
Copy link
Contributor

Alright just merge master in and you should be good.

@shuttie
Copy link
Contributor Author

shuttie commented Jul 3, 2020

Seems to be in sync with master, and no complains from linter.

@codecov-commenter
Copy link

Codecov Report

Merging #1118 into master will decrease coverage by 0.04%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1118      +/-   ##
==========================================
- Coverage   62.45%   62.41%   -0.05%     
==========================================
  Files         158      158              
  Lines       12761    12766       +5     
==========================================
- Hits         7970     7968       -2     
- Misses       4177     4183       +6     
- Partials      614      615       +1     
Impacted Files Coverage Δ
pkg/promtail/targets/filetargetmanager.go 0.00% <0.00%> (ø)
pkg/promtail/targets/tailer.go 76.13% <0.00%> (-2.28%) ⬇️

Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really happy about this. Thank you so much for all the effort !

LGTM

@cyriltovena cyriltovena merged commit de644b5 into grafana:master Jul 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
keepalive An issue or PR that will be kept alive and never marked as stale. size/S
Projects
None yet
Development

Successfully merging this pull request may close these issues.

promtail high CPU usage on default helm install
6 participants