
[new] promtail: add readline rate limit #5031

Merged
7 commits merged on Jan 6, 2022

Conversation

liguozhong
Contributor

liguozhong commented Jan 4, 2022

Refers to issue #2664.
This is an important feature.
In our production environment, promtail's CPU usage drops from 3.5 cores to 0.8 cores, and our business monitoring stabilizes.


liguozhong requested a review from a team as a code owner, January 4, 2022 09:22
@liguozhong
Contributor Author

@Whyeasy

@liguozhong
Contributor Author

@cyriltovena please review this PR; it is an important feature.

@Whyeasy
Contributor

Whyeasy commented Jan 4, 2022

Thanks @liguozhong. Will it also be possible to limit based on tags? So we only limit one service and not all services within a promtail instance?

@liguozhong
Contributor Author

> Thanks @liguozhong. Will it also be possible to limit based on tags? So we only limit one service and not all services within a promtail instance?

Done.

@cyriltovena
Contributor

It feels like @Whyeasy and @liguozhong have different use cases for this?

@Whyeasy wanted to drop logs above a certain rate; @liguozhong, you seem to use it to avoid CPU spikes, but you keep the logs, trading lag for CPU.

@cyriltovena
Contributor

cyriltovena commented Jan 4, 2022

> Thanks @liguozhong. Will it also be possible to limit based on tags? So we only limit one service and not all services within a promtail instance?
>
> done.

I think @Whyeasy wants a rate-limiter stage, not a global stage.

@cyriltovena
Contributor

@liguozhong, can you expand a bit more on your usage, please?

@liguozhong
Contributor Author

> It feels like @Whyeasy and @liguozhong have different use cases for this?
>
> @Whyeasy wanted to drop logs above a certain rate; @liguozhong, you seem to use it to avoid CPU spikes, but you keep the logs, trading lag for CPU.

Logs are very important and should not be discarded. Local files can be buffered for a long time. The code in this PR works well in our monitoring.

@liguozhong
Contributor Author

> @liguozhong, can you expand a bit more on your usage, please?

We collect the Envoy access logs of the Istio components in Kubernetes to draw the golden metrics of service traffic. We deploy promtail in Kubernetes as a DaemonSet, and later found that the monitoring curves showed sharp spikes.

@liguozhong
Contributor Author

After investigating, we found that the Vector agent has a rate-limiting capability, so we tried to add rate limiting to promtail to solve the problem of inaccurate monitoring.

@liguozhong
Contributor Author

Because the default Kubernetes promtail config file has several scrape jobs, it is difficult for us to configure a different rate limit for each job. But our business side knows exactly how many access log lines are generated per second, so a global rate limiter is the better fit for promtail running as a Kubernetes DaemonSet:
```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  log_level: debug
positions:
  filename: /run/promtail/positions.yaml
clients:
  - url: http://1.1.1.1:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods-name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_label_name
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: host
      - action: drop
        regex: ''
        source_labels:
          - service
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
          - __meta_kubernetes_namespace
          - service
        target_label: job
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
  - job_name: kubernetes-pods-app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: drop
        regex: .+
        source_labels:
          - __meta_kubernetes_pod_label_name
      - source_labels:
          - __meta_kubernetes_pod_label_app
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: host
      - action: drop
        regex: ''
        source_labels:
          - service
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
          - __meta_kubernetes_namespace
          - service
        target_label: job
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
  - job_name: kubernetes-pods-direct-controllers
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: drop
        regex: .+
        separator: ''
        source_labels:
          - __meta_kubernetes_pod_label_name
          - __meta_kubernetes_pod_label_app
      - action: drop
        regex: '[0-9a-z-.]+-[0-9a-f]{8,10}'
        source_labels:
          - __meta_kubernetes_pod_controller_name
      - source_labels:
          - __meta_kubernetes_pod_controller_name
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: host
      - action: drop
        regex: ''
        source_labels:
          - service
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
          - __meta_kubernetes_namespace
          - service
        target_label: job
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
  - job_name: kubernetes-pods-indirect-controller
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: drop
        regex: .+
        separator: ''
        source_labels:
          - __meta_kubernetes_pod_label_name
          - __meta_kubernetes_pod_label_app
      - action: keep
        regex: '[0-9a-z-.]+-[0-9a-f]{8,10}'
        source_labels:
          - __meta_kubernetes_pod_controller_name
      - action: replace
        regex: '([0-9a-z-.]+)-[0-9a-f]{8,10}'
        source_labels:
          - __meta_kubernetes_pod_controller_name
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: host
      - action: drop
        regex: ''
        source_labels:
          - service
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
          - __meta_kubernetes_namespace
          - service
        target_label: job
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
  - job_name: kubernetes-pods-static
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: drop
        regex: ''
        source_labels:
          - __meta_kubernetes_pod_annotation_kubernetes_io_config_mirror
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_label_component
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: host
      - action: drop
        regex: ''
        source_labels:
          - service
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
          - __meta_kubernetes_namespace
          - service
        target_label: job
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_annotation_kubernetes_io_config_mirror
          - __meta_kubernetes_pod_container_name
        target_label: __path__
```
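For illustration, this is roughly where a global limit would plug into a config like the one above. The key names come from the struct tags in the diff reviewed further down; the enclosing `limits_config` block name is an assumption, not confirmed by this PR (the option was also renamed later, see #5707):

```yaml
# Sketch only: field names taken from this PR's Config struct tags;
# the top-level block name is assumed, not confirmed by this PR.
limits_config:
  readline_rate_enabled: true
  readline_rate: 10000    # lines per second across the whole promtail instance
  readline_burst: 10000   # short spikes above the rate that are still allowed
```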

@Whyeasy
Contributor

Whyeasy commented Jan 4, 2022

> It feels like @Whyeasy and @liguozhong have different use cases for this?
> @Whyeasy wanted to drop logs above a certain rate; @liguozhong, you seem to use it to avoid CPU spikes, but you keep the logs, trading lag for CPU.
>
> Logs are very important and should not be discarded. Local files can be buffered for a long time. The code in this PR works well in our monitoring.

@liguozhong I understand logs are important and we can indeed buffer a lot. But what if a service keeps spamming and the buffer only grows? At some point we would like to drop those logs for that particular service. That's why it would be nice to do it based on tags: we could drop logs for everything non-prod if it spams too much, but buffer them for prod.

@liguozhong
Contributor Author

liguozhong commented Jan 4, 2022

> It feels like @Whyeasy and @liguozhong have different use cases for this?
> @Whyeasy wanted to drop logs above a certain rate; @liguozhong, you seem to use it to avoid CPU spikes, but you keep the logs, trading lag for CPU.
>
> Logs are very important and should not be discarded. Local files can be buffered for a long time. The code in this PR works well in our monitoring.
>
> @liguozhong I understand logs are important and we can indeed buffer a lot. But what if a service keeps spamming and the buffer only grows? At some point we would like to drop those logs for that particular service. That's why it would be nice to do it based on tags: we could drop logs for everything non-prod if it spams too much, but buffer them for prod.

Your use case probably needs a pipeline stage operator like the sketch below; I will open another PR to handle it. But the global rate limit is the more important feature.

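For comparison, roughly what such a per-job stage could look like; the stage and field names (`limit`, `rate`, `burst`, `drop`) are assumptions anticipating the follow-up PR, not part of this PR:

```yaml
# Hypothetical per-job rate-limit stage (names are assumptions):
scrape_configs:
  - job_name: kubernetes-pods-app
    pipeline_stages:
      - limit:
          rate: 100    # lines per second for this job only
          burst: 200
          drop: true   # discard lines over the limit instead of buffering
```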

@cyriltovena
Contributor

That makes sense, then: two different feature sets.

Still, I'm not understanding what problem the global limit solves for you.

@liguozhong
Contributor Author

liguozhong commented Jan 4, 2022

> That makes sense, then: two different feature sets.
>
> Still, I'm not understanding what problem the global limit solves for you.

Because of the CPU limit set on the pod in Kubernetes:

1. Without a rate limit, the pod's CPU usage can exceed its CPU limit, causing the pod to enter a frozen (throttled) state.

2. With a rate limit, the pod's CPU usage stays below the pod's CPU limit, so the promtail pod never enters the Docker freeze state and keeps working stably and continuously, with no data loss.

Without a global rate limit, our production environment keeps firing the following pod CPU alert, and the promtail pod remains unstable:

```yaml
- alert: CPUThrottlingHigh
  annotations:
    description: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
    runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/cputhrottlinghigh
    summary: Processes experience elevated CPU throttling.
  expr: |
    sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
      /
    sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
      > (75 / 100)
  for: 15m
  labels:
    severity: info
```

@cyriltovena
Contributor

Right, OK, so it's for throttling to avoid hitting the CPU limit.

```go
type Config struct {
	ReadlineRate        float64 `yaml:"readline_rate" json:"readline_rate"`
	ReadlineBurst       int     `yaml:"readline_burst" json:"readline_burst"`
	ReadlineRateEnabled bool    `yaml:"readline_rate_enabled,omitempty" json:"readline_rate_enabled"`
	// ...
```
Contributor


I think we should add a config option to decide whether or not to drop the log line. That makes it usable in multiple use cases.

A bit like the stage one, but global.
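For intuition, a minimal, self-contained sketch of how a config like the struct above could drive a token-bucket limiter with golang.org/x/time/rate, blocking by default (trading lag for CPU) or dropping when configured. This is illustrative only, not the PR's actual implementation; the `ReadlineRateDrop` field name anticipates the rename discussed below:

```go
// Illustrative sketch only (not the PR's actual code): a global token-bucket
// limiter over read lines, with a switch between blocking and dropping.
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// limitConfig mirrors the shape of the fields under review; names are assumed.
type limitConfig struct {
	ReadlineRate        float64
	ReadlineBurst       int
	ReadlineRateEnabled bool
	ReadlineRateDrop    bool // true: discard over-limit lines; false: block instead
}

func main() {
	cfg := limitConfig{ReadlineRate: 10000, ReadlineBurst: 10000, ReadlineRateEnabled: true}
	limiter := rate.NewLimiter(rate.Limit(cfg.ReadlineRate), cfg.ReadlineBurst)

	handle := func(line string) {
		if cfg.ReadlineRateEnabled {
			if cfg.ReadlineRateDrop {
				if !limiter.Allow() {
					return // over the limit: drop the line
				}
			} else if err := limiter.Wait(context.Background()); err != nil {
				return // context cancelled while waiting for a token
			}
		}
		fmt.Println("ship:", line) // hand the line on to the pipeline/client
	}

	for i := 0; i < 3; i++ {
		handle(fmt.Sprintf("log line %d", i))
	}
}
```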

Contributor Author


Done, thanks.

```go
f.Float64Var(&cfg.ReadlineRate, prefix+"limit.readline-rate", 10000, "promtail readline Rate.")
f.IntVar(&cfg.ReadlineBurst, prefix+"limit.readline-burst", 10000, "promtail readline Burst.")
f.BoolVar(&cfg.ReadlineRateEnabled, prefix+"limit.readline-rate-enabled", true, "Set to false to disable readline rate limit.")
f.BoolVar(&cfg.ReadlineRateAsyn, prefix+"limit.readline-rate-asyn", false, "Set to true to drop log when rate limit.")
```
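Based on these registrations, and assuming an empty `prefix`, enabling and tuning the limiter from the command line would look roughly like this (note that `-limit.readline-rate-asyn` was renamed to `-limit.readline-rate-drop` after the review below):

```
promtail \
  -config.file=promtail.yaml \
  -limit.readline-rate-enabled=true \
  -limit.readline-rate=10000 \
  -limit.readline-burst=10000 \
  -limit.readline-rate-drop=false
```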
Contributor


ReadlineRateDrop? What does "Asyn" stand for here?

Contributor Author


Done. "Drop" is better than "Asyn".

liguozhong and others added 2 commits January 5, 2022 23:26
Co-authored-by: Cyril Tovena <cyril.tovena@gmail.com>
Co-authored-by: Cyril Tovena <cyril.tovena@gmail.com>
@liguozhong
Contributor Author

> It feels like @Whyeasy and @liguozhong have different use cases for this?
> @Whyeasy wanted to drop logs above a certain rate; @liguozhong, you seem to use it to avoid CPU spikes, but you keep the logs, trading lag for CPU.
>
> Logs are very important and should not be discarded. Local files can be buffered for a long time. The code in this PR works well in our monitoring.
>
> @liguozhong I understand logs are important and we can indeed buffer a lot. But what if a service keeps spamming and the buffer only grows? At some point we would like to drop those logs for that particular service. That's why it would be nice to do it based on tags: we could drop logs for everything non-prod if it spams too much, but buffer them for prod.

Done in this PR: #5051

Contributor

cyriltovena left a comment


LGTM

Contributor

dannykopping left a comment


Great contribution, thanks @liguozhong 🎉

@tablatronix

FYI for anyone finding this: the config name changed later, see #5707.

This was driving me nuts.

@liguozhong
Contributor Author

Thanks.
