Trace cannot be fetched after flush to storage #408

Closed
zhaoyao opened this issue Dec 13, 2020 · 3 comments
Labels
stale: Used for stale issues / PRs

Comments


zhaoyao commented Dec 13, 2020

Describe the bug
A trace can be queried correctly for the first few minutes after it is generated. After about 5 minutes, tempo-query returns 404 Not Found.

Tracing the Tempo read path shows that queries succeed as long as the trace can still be served from the ingester; once the request falls through to the storage read path, it returns 404.

The window during which a trace remains queryable matches ingester.complete_block_timeout.
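
For reference, a minimal sketch of the ingester option referred to above; it is not set explicitly in the config below, so the default applies (per the maintainer reply further down, that default is 5m):

ingester:
  complete_block_timeout: 5m           # default; how long the ingester keeps a completed block queryable after flushing it to the backend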

Can you suggest some follow-up troubleshooting steps?

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo (tempo, version e9892bd (branch: master, revision: e9892bd))
  2. Perform Operations (Read/Write/Others)

Expected behavior
Traces can be fetched after the ingester flushes blocks to storage.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: manually

Additional Context
I use Ceph as the S3 storage backend.

I deployed Tempo in microservices mode:

  • tempo-distributor: stateless Deployment
  • tempo-ingestor: StatefulSet
  • tempo-querier: StatefulSet
  • tempo-query: stateless Deployment

The full tempo.yml below is shared between all components:

auth_enabled: false

server:
  http_listen_port: 3100
  log_level: debug

distributor:
  receivers:                           # this configuration will listen on all ports and protocols that tempo is capable of.
    jaeger:                            # the receivers all come from the OpenTelemetry collector.  more configuration information can
      protocols:                       # be found there: https://github.com/open-telemetry/opentelemetry-collector/tree/master/receiver
        thrift_http:                   #
        grpc:                          # for a production deployment you should only enable the receivers you need!
        thrift_binary:
        thrift_compact:
    zipkin:
    otlp:
      protocols:
        http:
        grpc:
    opencensus:

ingester:
  trace_idle_period: 10s               # the length of time after a trace has not received spans to consider it complete and flush it
  traces_per_block: 100                # cut the head block when it hits this number of traces or ...
  max_block_duration: 5m               #   this much time passes
  lifecycler:
    ring:
      kvstore:
        store: etcd
        etcd:
          endpoints: 
            - http://****:2379


compactor:
  compaction:
    compaction_window: 1h              # blocks in this time window will be compacted together
    max_compaction_objects: 1000000    # maximum number of objects in a compacted block
    block_retention: 1h
    compacted_block_retention: 10m

storage:
  trace:
    backend: s3                     # backend configuration to use
    wal:
      path: /data/tempo/wal             # where to store the WAL locally
      bloom_filter_false_positive: .05 # bloom filter false positive rate.  lower values create larger filters but fewer false positives
      index_downsample: 10             # number of traces per index record
    s3:
      bucket: ***
      endpoint: ***
      access_key: ****
      secret_key: ****
      insecure: true
    pool:
      max_workers: 100                 # the worker pool mainly drives querying, but is also used for polling the blocklist
      queue_depth: 10000

zhaoyao commented Dec 13, 2020

Attached are two Tempo traces:

  • trace-ok.json: a trace that can still be fetched through the ingester.
  • trace-not-found.json: a trace that goes missing after being flushed and cleared by the ingester.

tempo-trace.zip

joe-elliott (Member) commented

Try setting your complete_block_timeout to 10m. That way we should be able to guarantee that the querier is aware of a block by the time it is flushed from the ingester.

Currently the poll cycle and the complete block timeout both default to 5m, which should probably be changed.
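
A minimal sketch of that change in the shared tempo.yml above; the poll interval shown for context is assumed to be the storage.trace.blocklist_poll option at its default:

ingester:
  complete_block_timeout: 10m          # keep completed blocks queryable in the ingester longer than the querier's poll cycle

storage:
  trace:
    blocklist_poll: 5m                 # assumed default interval at which queriers poll the backend blocklist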

If that doesn't work, can you share the querier logs from the time you are executing the query? It is possible that the list operation behaves differently than expected on Ceph, which would leave the querier unaware of the backend blocks.

Please also check the tempodb_blocklist_length metric exposed by the querier and make sure it matches the number of blocks in your backend.


github-actions bot commented Dec 5, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply the keepalive label to exempt this issue.

github-actions bot added the stale label on Dec 5, 2022
github-actions bot closed this as not planned on Dec 21, 2022