Improve InvalidSequenceTokenException Being Logged Frequently #227

Open

ingshtrom opened this issue Feb 25, 2021 · 0 comments
Problem

I know there are a lot of other tickets talking about this, but they either accept the log messages/metrics as-is or are very old.

We are getting a lot of InvalidSequenceTokenException errors in our logs, and our Prometheus metrics are filled with these errors as well. I believe the writes recover fine with the retries that are in place, but I'm curious whether there is a better way to work around this.

We use flush_thread_count 4 in our buffer configuration and run two replicas of this Fluentd configuration. It seems that the only way to fix this is to run a single replica with a single flush thread, but then it wouldn't be redundant 😞

Lastly, I am curious what the difference is between the plugin's concurrency parameter and the buffer's flush_thread_count parameter. They sound like they do the same thing, but maybe I should be using one over the other?
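
For reference, here is how I currently read the two settings (my interpretation, not confirmed in the docs): flush_thread_count is a generic buffer option that controls how many threads flush chunks in parallel, while concurrency is a cloudwatch_logs plugin option that limits how many PutLogEvents calls are in flight at once. If that reading is right, a sketch like the one below should at least serialize the sequence-token updates coming from a single Fluentd process:

# Sketch only (assumes my reading of concurrency / flush_thread_count is correct);
# everything else in the store is the same as the full config further down.
<store>
  @type cloudwatch_logs
  log_group_name /infra/logs/eks/pods/stage
  log_stream_name %Y-%m-%d-%H-${tag}
  auto_create_stream true
  region us-east-1
  # Plugin-level: allow only one PutLogEvents call at a time.
  concurrency 1
  <buffer tag, time>
    timekey 1m
    @type memory
    flush_mode interval
    # Buffer-level: a single flush thread for this output.
    flush_thread_count 1
    flush_interval 5s
    retry_type exponential_backoff
    retry_forever false
    retry_max_interval 30
    chunk_limit_size 8MB
    chunk_full_threshold 0.90
    overflow_action throw_exception
    compress gzip
  </buffer>
</store>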

Steps to replicate


# The input is Kubernetes Pod logs.
# These are aggregated by a fluent-bit running on each
# Kubernetes node, and then forwarded to central processing,
# which includes this configuration snippet.

# NOTE: I have excluded prometheus and other non-essential pieces of the config

<source>
  @type forward
  port 24284
  bind 0.0.0.0
  tag pod.source
  @label @POD_SOURCE
</source>

<label @POD_SOURCE>
  <filter **>
    @type record_transformer
    enable_ruby true
    <record>
      namespace ${record["kubernetes"]["namespace_name"]}
      pod ${record["kubernetes"]["pod_name"]}
    </record>
  </filter>
  <match **>
    @type rewrite_tag_filter
    <rule>
      key     namespace
      pattern /(.+)/
      tag     $1
    </rule>
    @label @POD_STEP2
  </match>
</label>

<label @POD_STEP2>
  <match **>
    @type rewrite_tag_filter
    <rule>
      key     pod
      pattern /(.+)/
      tag     ${tag}_$1
    </rule>
    @label @POD_OUTPUT
  </match>
</label>

<label @POD_OUTPUT>
  <match **>
    @type copy
    <store>
      @type s3
      s3_bucket foobar
      s3_region us-east-1
      s3_object_key_format "#{ENV['ENVIRONMENT']}/eks-pod-logs/%Y-%m-%d/${tag}/%H_%{index}_%{uuid_flush}.%{file_extension}"
      <format>
        @type json
      </format>
      <buffer tag,time>
        timekey 1h
        @type memory
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 4
        flush_interval 5s
        retry_forever false
        retry_max_interval 30
        chunk_limit_size 8MB
        chunk_full_threshold 0.90
        overflow_action throw_exception
        compress gzip
      </buffer>
    </store>
    <store>
      @type cloudwatch_logs
      log_group_name /infra/logs/eks/pods/stage
      log_stream_name %Y-%m-%d-%H-${tag}
      auto_create_stream true
      region us-east-1
      <buffer tag, time>
        timekey 1m
        @type memory
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 4
        flush_interval 5s
        retry_forever false
        retry_max_interval 30
        chunk_limit_size 8MB
        chunk_full_threshold 0.90
        overflow_action throw_exception
        compress gzip
      </buffer>
    </store>
  </match>
</label>

Expected Behavior or What you need to ask

The S3 output works just fine, but CloudWatch produces a lot of errors about the token being out of sequence. This might come down to how AWS implements the service, but I feel like there must be a better way than retrying repeatedly and filling logs and metrics with errors. Maybe that would require more work than it is worth? 🤷 Any suggestions are welcome, too.
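
One workaround I am considering (untested) is to give each replica its own log stream, so the two writers never compete over the same stream's sequence token. The assumption here is that Kubernetes sets the HOSTNAME environment variable inside the pod, and that the embedded Ruby in a double-quoted value is expanded the same way as the "#{ENV['ENVIRONMENT']}" already used in s3_object_key_format above:

<store>
  @type cloudwatch_logs
  log_group_name /infra/logs/eks/pods/stage
  # Hypothetical: append the pod's HOSTNAME so each replica writes to its own
  # stream; the time/tag placeholders are still resolved per chunk at flush time.
  log_stream_name "%Y-%m-%d-%H-${tag}-#{ENV['HOSTNAME']}"
  auto_create_stream true
  region us-east-1
  # buffer section unchanged from the config above
</store>

The trade-off is more log streams in the group, but it would avoid the cross-replica sequence-token races entirely.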

Using Fluentd and CloudWatchLogs plugin versions

  • OS version: Docker image fluentd:v1.9.1-1.0
  • Bare Metal or within Docker or Kubernetes or others? within Kubernetes in AWS EKS
  • Fluentd v0.12 or v0.14/v1.0
    • paste result of fluentd --version or td-agent --version => fluentd 1.9.1
  • Dependent gem versions
    • paste boot log of fluentd or td-agent
    • paste result of fluent-gem list, td-agent-gem list or your Gemfile.lock
/ $ fluent-gem list

*** LOCAL GEMS ***

async (1.25.0)
async-http (0.50.0)
async-io (1.27.7)
async-pool (0.3.0)
aws-eventstream (1.1.0)
aws-partitions (1.366.0)
aws-sdk-cloudwatchlogs (1.36.0)
aws-sdk-core (3.105.0)
aws-sdk-kms (1.37.0)
aws-sdk-s3 (1.79.1)
aws-sdk-sqs (1.32.0)
aws-sigv4 (1.2.2)
bigdecimal (1.4.4)
cmath (default: 1.0.0)
concurrent-ruby (1.1.6)
console (1.8.2)
cool.io (1.6.0)
csv (default: 1.0.0)
date (default: 1.0.0)
etc (default: 1.0.0)
ext_monitor (0.1.2)
fcntl (default: 1.0.0)
fileutils (default: 1.0.2)
fluent-config-regexp-type (1.0.0)
fluent-plugin-cloudwatch-logs (0.10.2)
fluent-plugin-prometheus (1.8.3)
fluent-plugin-rewrite-tag-filter (2.3.0)
fluent-plugin-s3 (1.4.0)
fluentd (1.9.1)
http_parser.rb (0.6.0)
ipaddr (default: 1.2.0)
jmespath (1.4.0)
json (2.3.0)
msgpack (1.3.3)
nio4r (2.5.2)
oj (3.8.1)
openssl (default: 2.1.2)
prometheus-client (0.9.0)
protocol-hpack (1.4.2)
protocol-http (0.13.1)
protocol-http1 (0.10.2)
protocol-http2 (0.10.4)
psych (default: 3.0.2)
quantile (0.2.1)
scanf (default: 1.0.0)
serverengine (2.2.1)
sigdump (0.2.4)
stringio (default: 0.0.1)
strptime (0.2.3)
strscan (default: 1.0.0)
timers (4.3.0)
tzinfo (2.0.2)
tzinfo-data (1.2019.3)
webrick (default: 1.4.2)
yajl-ruby (1.4.1)
zlib (default: 1.0.0)