Trace cannot be fetched after flush to storage #408

Closed
zhaoyao opened this issue Dec 13, 2020 · 3 comments
Labels
stale: Used for stale issues / PRs

Comments


zhaoyao commented Dec 13, 2020

Describe the bug
A trace can be queried correctly for the first few minutes after it is generated. After about 5 minutes, tempo-query returns 404 Not Found.

Tracing the Tempo read path shows that queries succeed as long as the trace can still be served from the ingester; once the request falls through to the storage read path, it returns 404.

The window during which a trace remains queryable matches ingester.complete_block_timeout.
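
For reference, a minimal sketch of the ingester option referred to above; it is not set explicitly in the config below, so the default applies (per the maintainer reply further down, that default is 5m):

ingester:
  complete_block_timeout: 5m           # default; how long the ingester keeps a completed block queryable after flushing it to the backend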

Can you suggest some follow-up troubleshooting steps?

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo (tempo, version e9892bd (branch: master, revision: e9892bd))
  2. Perform Operations (Read/Write/Others)

Expected behavior
Traces can be fetched after the ingester flushes blocks to storage.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: manually

Additional Context
I use Ceph as the S3 storage backend.

I deployed Tempo in microservices mode:

  • tempo-distributor: stateless Deployment
  • tempo-ingestor: StatefulSet
  • tempo-querier: StatefulSet
  • tempo-query: stateless Deployment

The full tempo.yml below is shared between all components:

auth_enabled: false

server:
  http_listen_port: 3100
  log_level: debug

distributor:
  receivers:                           # this configuration will listen on all ports and protocols that tempo is capable of.
    jaeger:                            # the receivers all come from the OpenTelemetry collector.  more configuration information can
      protocols:                       # be found there: https://github.com/open-telemetry/opentelemetry-collector/tree/master/receiver
        thrift_http:                   #
        grpc:                          # for a production deployment you should only enable the receivers you need!
        thrift_binary:
        thrift_compact:
    zipkin:
    otlp:
      protocols:
        http:
        grpc:
    opencensus:

ingester:
  trace_idle_period: 10s               # the length of time after a trace has not received spans to consider it complete and flush it
  traces_per_block: 100                # cut the head block when it hits this number of traces or ...
  max_block_duration: 5m               #   this much time passes
  lifecycler:
    ring:
      kvstore:
        store: etcd
        etcd:
          endpoints: 
            - http://****:2379


compactor:
  compaction:
    compaction_window: 1h              # blocks in this time window will be compacted together
    max_compaction_objects: 1000000    # maximum number of objects in a compacted block
    block_retention: 1h
    compacted_block_retention: 10m

storage:
  trace:
    backend: s3                     # backend configuration to use
    wal:
      path: /data/tempo/wal             # where to store the WAL locally
      bloom_filter_false_positive: .05 # bloom filter false positive rate.  lower values create larger filters but fewer false positives
      index_downsample: 10             # number of traces per index record
    s3:
      bucket: ***
      endpoint: ***
      access_key: ****
      secret_key: ****
      insecure: true
    pool:
      max_workers: 100                 # the worker pool mainly drives querying, but is also used for polling the blocklist
      queue_depth: 10000

zhaoyao commented Dec 13, 2020

Attached are two Tempo traces:

  • trace-ok.json: a trace that can still be fetched through the ingester.
  • trace-not-found.json: a trace that goes missing after being flushed and cleared by the ingester.

tempo-trace.zip

joe-elliott (Member) commented

Try setting your complete_block_timeout to 10m. That way we should be able to guarantee that the querier is aware of a block by the time it is flushed from the ingester.

Currently the poll cycle and the complete block timeout both default to 5m, which should probably be changed.
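
A minimal sketch of that change in the shared tempo.yml above; the poll interval shown for context is assumed to be the storage.trace.blocklist_poll option at its default:

ingester:
  complete_block_timeout: 10m          # keep completed blocks queryable in the ingester longer than the querier's poll cycle

storage:
  trace:
    blocklist_poll: 5m                 # assumed default interval at which queriers poll the backend blocklist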

If that doesn't work, can you share the querier logs from the time you are executing the query? It is possible that the list operation behaves differently than expected on Ceph, which would leave the querier unaware of the backend blocks.

Please also check the tempodb_blocklist_length metric exposed by the querier and make sure it matches the number of blocks in your backend.


github-actions bot commented Dec 5, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply the keepalive label to exempt this issue.

github-actions bot added the stale label on Dec 5, 2022
github-actions bot closed this as not planned on Dec 21, 2022