
[new] promtail: add readline rate limit #5031

Merged
7 commits merged on Jan 6, 2022

Conversation

liguozhong
Contributor

liguozhong commented Jan 4, 2022

Refers to issue #2664.
This is an important feature.
In our production environment, promtail's CPU usage drops from 3.5 cores to 0.8 cores, and our business monitoring stabilizes.


liguozhong requested a review from a team as a code owner, January 4, 2022 09:22
@liguozhong
Contributor Author

@Whyeasy

@liguozhong
Contributor Author

@cyriltovena please review this PR; it is an important feature.

@Whyeasy
Contributor

Whyeasy commented Jan 4, 2022

Thanks @liguozhong. Will it also be possible to limit based on tags? So we only limit one service and not all services within a promtail instance?

@liguozhong
Contributor Author

> Thanks @liguozhong. Will it also be possible to limit based on tags? So we only limit one service and not all services within a promtail instance?

Done.

@cyriltovena
Contributor

It feels like @Whyeasy and @liguozhong have different use cases for this?

@Whyeasy wanted to drop logs above a certain rate; @liguozhong, you seem to use it to avoid CPU spikes, but you keep the logs, trading lag for CPU.

@cyriltovena
Contributor

cyriltovena commented Jan 4, 2022

> Thanks @liguozhong. Will it also be possible to limit based on tags? So we only limit one service and not all services within a promtail instance?
>
> done.

I think @Whyeasy wants a rate-limiter stage, not a global stage.

@cyriltovena
Contributor

@liguozhong, can you expand a bit more on your usage, please?

@liguozhong
Contributor Author

> It feels like @Whyeasy and @liguozhong have different use cases for this?
>
> @Whyeasy wanted to drop logs above a certain rate; @liguozhong, you seem to use it to avoid CPU spikes, but you keep the logs, trading lag for CPU.

Logs are very important and should not be discarded. Local files can be buffered for a long time. The code in this PR works well in our monitoring.

@liguozhong
Contributor Author

> @liguozhong, can you expand a bit more on your usage, please?

We collect the Envoy access logs of the Istio components in Kubernetes to draw the golden metrics of service traffic. We deploy promtail in Kubernetes as a DaemonSet, and later found that the monitoring curves showed sharp spikes.

@liguozhong
Contributor Author

After investigating, we found that the Vector agent has a rate-limiting capability, so we tried to add rate limiting to promtail to solve the problem of inaccurate monitoring.

@liguozhong
Contributor Author

Because the default Kubernetes promtail config file has several scrape jobs, it is difficult for us to configure a different rate limit for each job. But our business side knows exactly how many access log lines are generated per second, so a global rate limiter is the better fit for promtail running as a Kubernetes DaemonSet:
```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  log_level: debug
positions:
  filename: /run/promtail/positions.yaml
clients:
  - url: http://1.1.1.1:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods-name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_label_name
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: host
      - action: drop
        regex: ''
        source_labels:
          - service
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
          - __meta_kubernetes_namespace
          - service
        target_label: job
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
  - job_name: kubernetes-pods-app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: drop
        regex: .+
        source_labels:
          - __meta_kubernetes_pod_label_name
      - source_labels:
          - __meta_kubernetes_pod_label_app
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: host
      - action: drop
        regex: ''
        source_labels:
          - service
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
          - __meta_kubernetes_namespace
          - service
        target_label: job
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
  - job_name: kubernetes-pods-direct-controllers
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: drop
        regex: .+
        separator: ''
        source_labels:
          - __meta_kubernetes_pod_label_name
          - __meta_kubernetes_pod_label_app
      - action: drop
        regex: '[0-9a-z-.]+-[0-9a-f]{8,10}'
        source_labels:
          - __meta_kubernetes_pod_controller_name
      - source_labels:
          - __meta_kubernetes_pod_controller_name
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: host
      - action: drop
        regex: ''
        source_labels:
          - service
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
          - __meta_kubernetes_namespace
          - service
        target_label: job
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
  - job_name: kubernetes-pods-indirect-controller
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: drop
        regex: .+
        separator: ''
        source_labels:
          - __meta_kubernetes_pod_label_name
          - __meta_kubernetes_pod_label_app
      - action: keep
        regex: '[0-9a-z-.]+-[0-9a-f]{8,10}'
        source_labels:
          - __meta_kubernetes_pod_controller_name
      - action: replace
        regex: '([0-9a-z-.]+)-[0-9a-f]{8,10}'
        source_labels:
          - __meta_kubernetes_pod_controller_name
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: host
      - action: drop
        regex: ''
        source_labels:
          - service
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
          - __meta_kubernetes_namespace
          - service
        target_label: job
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
  - job_name: kubernetes-pods-static
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: drop
        regex: ''
        source_labels:
          - __meta_kubernetes_pod_annotation_kubernetes_io_config_mirror
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_label_component
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: host
      - action: drop
        regex: ''
        source_labels:
          - service
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        replacement: $1
        separator: /
        source_labels:
          - __meta_kubernetes_namespace
          - service
        target_label: job
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_annotation_kubernetes_io_config_mirror
          - __meta_kubernetes_pod_container_name
        target_label: __path__
```
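For illustration, this is roughly where a global limit would plug into a config like the one above. The key names come from the struct tags in the diff reviewed further down; the enclosing `limits_config` block name is an assumption, not confirmed by this PR (the option was also renamed later, see #5707):

```yaml
# Sketch only: field names taken from this PR's Config struct tags;
# the top-level block name is assumed, not confirmed by this PR.
limits_config:
  readline_rate_enabled: true
  readline_rate: 10000    # lines per second across the whole promtail instance
  readline_burst: 10000   # short spikes above the rate that are still allowed
```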

@Whyeasy
Contributor

Whyeasy commented Jan 4, 2022

> It feels like @Whyeasy and @liguozhong have different use cases for this?
> @Whyeasy wanted to drop logs above a certain rate; @liguozhong, you seem to use it to avoid CPU spikes, but you keep the logs, trading lag for CPU.
>
> Logs are very important and should not be discarded. Local files can be buffered for a long time. The code in this PR works well in our monitoring.

@liguozhong I understand logs are important and we can indeed buffer a lot. But what if a service keeps spamming and the buffer only grows? At some point we would like to drop those logs for that particular service. That's why it would be nice to do it based on tags: we could drop logs for everything non-prod if it spams too much, but buffer them for prod.

@liguozhong
Contributor Author

liguozhong commented Jan 4, 2022

> It feels like @Whyeasy and @liguozhong have different use cases for this?
> @Whyeasy wanted to drop logs above a certain rate; @liguozhong, you seem to use it to avoid CPU spikes, but you keep the logs, trading lag for CPU.
>
> Logs are very important and should not be discarded. Local files can be buffered for a long time. The code in this PR works well in our monitoring.
>
> @liguozhong I understand logs are important and we can indeed buffer a lot. But what if a service keeps spamming and the buffer only grows? At some point we would like to drop those logs for that particular service. That's why it would be nice to do it based on tags: we could drop logs for everything non-prod if it spams too much, but buffer them for prod.

Your use case probably needs a pipeline stage operator like the sketch below; I will open another PR to handle it. But the global rate limit is the more important feature.

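For comparison, roughly what such a per-job stage could look like; the stage and field names (`limit`, `rate`, `burst`, `drop`) are assumptions anticipating the follow-up PR, not part of this PR:

```yaml
# Hypothetical per-job rate-limit stage (names are assumptions):
scrape_configs:
  - job_name: kubernetes-pods-app
    pipeline_stages:
      - limit:
          rate: 100    # lines per second for this job only
          burst: 200
          drop: true   # discard lines over the limit instead of buffering
```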

@cyriltovena
Contributor

That makes sense, then: two different feature sets.

Still, I'm not understanding what problem the global limit solves for you.

@liguozhong
Contributor Author

liguozhong commented Jan 4, 2022

> That makes sense, then: two different feature sets.
>
> Still, I'm not understanding what problem the global limit solves for you.

Because of the CPU limit set on the pod in Kubernetes:

1. Without a rate limit, the pod's CPU usage can exceed its CPU limit, causing the pod to enter a frozen (throttled) state.

2. With a rate limit, the pod's CPU usage stays below the pod's CPU limit, so the promtail pod never enters the Docker freeze state and keeps working stably and continuously, with no data loss.

Without a global rate limit, our production environment keeps firing the following pod CPU alert, and the promtail pod remains unstable:

```yaml
- alert: CPUThrottlingHigh
  annotations:
    description: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
    runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/cputhrottlinghigh
    summary: Processes experience elevated CPU throttling.
  expr: |
    sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
      /
    sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
      > (75 / 100)
  for: 15m
  labels:
    severity: info
```

@cyriltovena
Contributor

Right, OK, so it's for throttling to avoid hitting the CPU limit.

```go
type Config struct {
	ReadlineRate        float64 `yaml:"readline_rate" json:"readline_rate"`
	ReadlineBurst       int     `yaml:"readline_burst" json:"readline_burst"`
	ReadlineRateEnabled bool    `yaml:"readline_rate_enabled,omitempty" json:"readline_rate_enabled"`
	// ...
```
Contributor


I think we should add a config option to decide whether or not to drop the log line. That makes it usable in multiple use cases.

A bit like the stage one, but global.
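For intuition, a minimal, self-contained sketch of how a config like the struct above could drive a token-bucket limiter with golang.org/x/time/rate, blocking by default (trading lag for CPU) or dropping when configured. This is illustrative only, not the PR's actual implementation; the `ReadlineRateDrop` field name anticipates the rename discussed below:

```go
// Illustrative sketch only (not the PR's actual code): a global token-bucket
// limiter over read lines, with a switch between blocking and dropping.
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// limitConfig mirrors the shape of the fields under review; names are assumed.
type limitConfig struct {
	ReadlineRate        float64
	ReadlineBurst       int
	ReadlineRateEnabled bool
	ReadlineRateDrop    bool // true: discard over-limit lines; false: block instead
}

func main() {
	cfg := limitConfig{ReadlineRate: 10000, ReadlineBurst: 10000, ReadlineRateEnabled: true}
	limiter := rate.NewLimiter(rate.Limit(cfg.ReadlineRate), cfg.ReadlineBurst)

	handle := func(line string) {
		if cfg.ReadlineRateEnabled {
			if cfg.ReadlineRateDrop {
				if !limiter.Allow() {
					return // over the limit: drop the line
				}
			} else if err := limiter.Wait(context.Background()); err != nil {
				return // context cancelled while waiting for a token
			}
		}
		fmt.Println("ship:", line) // hand the line on to the pipeline/client
	}

	for i := 0; i < 3; i++ {
		handle(fmt.Sprintf("log line %d", i))
	}
}
```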

Contributor Author


Done, thanks.

```go
f.Float64Var(&cfg.ReadlineRate, prefix+"limit.readline-rate", 10000, "promtail readline Rate.")
f.IntVar(&cfg.ReadlineBurst, prefix+"limit.readline-burst", 10000, "promtail readline Burst.")
f.BoolVar(&cfg.ReadlineRateEnabled, prefix+"limit.readline-rate-enabled", true, "Set to false to disable readline rate limit.")
f.BoolVar(&cfg.ReadlineRateAsyn, prefix+"limit.readline-rate-asyn", false, "Set to true to drop log when rate limit.")
```
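Based on these registrations, and assuming an empty `prefix`, enabling and tuning the limiter from the command line would look roughly like this (note that `-limit.readline-rate-asyn` was renamed to `-limit.readline-rate-drop` after the review below):

```
promtail \
  -config.file=promtail.yaml \
  -limit.readline-rate-enabled=true \
  -limit.readline-rate=10000 \
  -limit.readline-burst=10000 \
  -limit.readline-rate-drop=false
```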
Contributor


ReadlineRateDrop? What does "Asyn" stand for here?

Contributor Author


Done. "Drop" is better than "Asyn".

liguozhong and others added 2 commits January 5, 2022 23:26
Co-authored-by: Cyril Tovena <cyril.tovena@gmail.com>
Co-authored-by: Cyril Tovena <cyril.tovena@gmail.com>
@liguozhong
Contributor Author

> It feels like @Whyeasy and @liguozhong have different use cases for this?
> @Whyeasy wanted to drop logs above a certain rate; @liguozhong, you seem to use it to avoid CPU spikes, but you keep the logs, trading lag for CPU.
>
> Logs are very important and should not be discarded. Local files can be buffered for a long time. The code in this PR works well in our monitoring.
>
> @liguozhong I understand logs are important and we can indeed buffer a lot. But what if a service keeps spamming and the buffer only grows? At some point we would like to drop those logs for that particular service. That's why it would be nice to do it based on tags: we could drop logs for everything non-prod if it spams too much, but buffer them for prod.

Done in this PR: #5051

Contributor

cyriltovena left a comment


LGTM

Contributor

dannykopping left a comment


Great contribution, thanks @liguozhong 🎉

@tablatronix

FYI for anyone finding this: the config name changed later, see #5707.

This was driving me nuts.

@liguozhong
Contributor Author

Thanks.
