Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error polling blocklist, fails to list blocks on Dell ECS #558

Closed
kradalby opened this issue Feb 26, 2021 · 11 comments · Fixed by #561
Closed

error polling blocklist, fails to list blocks on Dell ECS #558

kradalby opened this issue Feb 26, 2021 · 11 comments · Fixed by #561

Comments

@kradalby
Copy link
Contributor

Describe the bug

I am playing around with Tempo for a proof of concept were I use ECS (from Dell) as my S3 compatible storage.
Currently I am experiencing the same issue as described #408 were traces are not available after being flushed to object store, however, as per suggestion, I checked the tempodb_blocklist_length, which is at 0. This was curious and I then realised that I see a lot of:

level=error ts=2021-02-26T11:29:12.002142062Z caller=tempodb.go:353 msg="error polling blocklist" tenantID=single-tenant err="error parsing uuid of obj, objectName: single-tenant/0023b193-abff-4127-b9e1-c5acd3936416/: invalid UUID length: 13"

every 5 minutes.

I queried the S3 storage to see how the layout was and the UUID exist and I think they are valid.

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo 0.6.0 (revision=a8bf22014)
  2. Wait for periodic check

Expected behavior

I expect tempo to be able to read back from the bucket after writing traces to blob storage.

Environment:

  • Infrastructure: Kubernetes + ECS for blob
  • Deployment tool: Helm 3

Additional Context

level=error ts=2021-02-26T11:24:11.316995438Z caller=tempodb.go:353 msg="error polling blocklist" tenantID=single-tenant err="error parsing uuid of obj, objectName: single-tenant/0023b193-abff-4127-b9e1-c5acd3936416/: invalid UUID length: 13"
level=info ts=2021-02-26T11:24:35.31391784Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
level=info ts=2021-02-26T11:24:35.314041368Z caller=compactor.go:92 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
level=info ts=2021-02-26T11:25:05.313898155Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
level=info ts=2021-02-26T11:25:05.313959923Z caller=compactor.go:92 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
level=info ts=2021-02-26T11:25:35.313851404Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
level=info ts=2021-02-26T11:25:35.313903952Z caller=compactor.go:92 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
level=info ts=2021-02-26T11:26:05.313877031Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
level=info ts=2021-02-26T11:26:05.313939847Z caller=compactor.go:92 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
level=info ts=2021-02-26T11:26:35.313842908Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
level=info ts=2021-02-26T11:26:35.313909537Z caller=compactor.go:92 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
level=info ts=2021-02-26T11:27:05.313825937Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
level=info ts=2021-02-26T11:27:05.313882433Z caller=compactor.go:92 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
level=info ts=2021-02-26T11:27:35.313898734Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
level=info ts=2021-02-26T11:27:35.313955449Z caller=compactor.go:92 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
level=info ts=2021-02-26T11:28:05.313829471Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
level=info ts=2021-02-26T11:28:05.313901838Z caller=compactor.go:92 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
level=info ts=2021-02-26T11:28:35.313849769Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
level=info ts=2021-02-26T11:28:35.313908214Z caller=compactor.go:92 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
level=info ts=2021-02-26T11:29:05.314177869Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
level=info ts=2021-02-26T11:29:05.314287994Z caller=compactor.go:92 msg="compaction cycle complete. No more blocks to compact" tenantID=single-tenant
level=error ts=2021-02-26T11:29:12.002142062Z caller=tempodb.go:353 msg="error polling blocklist" tenantID=single-tenant err="error parsing uuid of obj, objectName: single-tenant/0023b193-abff-4127-b9e1-c5acd3936416/: invalid UUID length: 13"
level=info ts=2021-02-26T11:29:35.313898863Z caller=compactor.go:88 msg="starting compaction cycle" tenantID=single-tenant offset=0
@annanay25
Copy link
Contributor

@kradalby 👋

Could you please paste the storage section of your config? Along with the directory nesting of files in your object storage?

Coincidentally, single-tenant has 13 letters, and I think Tempo is looking for blocks and finding tenants at that level in the bucket. I suspect a small configuration error.

@kradalby
Copy link
Contributor Author

Oh, that is quite embarresing, should learn to count:

    storage:
      trace:
        backend: s3                        # backend configuration to use
        wal:
          path: /tmp/tempo/wal             # where to store the the wal locally
          # bloom_filter_false_positive: .05 # bloom filter false positive rate.  lower values create larger filters but fewer false positives
          # index_downsample: 10             # number of traces per index record
        s3:
          bucket: tracing-staging                  # how to store data in s3
          endpoint: internal.example.com:9020
          # access_key: tempo
          # secret_key: supersecret
          insecure: true
          # For using AWS, select the appropriate regional endpoint and region
          # endpoint: s3.dualstack.us-west-2.amazonaws.com
          # region: us-west-2
        pool:
          max_workers: 100                 # the worker pool mainly drives querying, but is also used for polling the blocklist
          queue_depth: 10000```

Access and key comes from Minio env vars:

    MINIO_ACCESS_KEY:
      value: ecs-tracing-staging
    MINIO_SECRET_KEY:
      value: <redacted>

I could not really see single-tenant anywhere in my configuration, so I must have assumed it was a internal.

@annanay25
Copy link
Contributor

Haha :)

Yeah so the single-tenant is a dummy tenant name used when multi-tenancy is disabled on Tempo. This particular error seems to be coming from

return nil, errors.Wrapf(err, "error parsing uuid of obj, objectName: %s", cp.Prefix)

This line

res, err := rw.core.ListObjects(rw.cfg.Bucket, tenantID+"/", "", "/", 0)

should list files in with the "single-tenant/" prefix. I now suspect that the ECS API (for Listing Objects) deviates from the AWS S3 API in that it returns the full path of the object..

Spec for reference - http://doc.isilon.com/ECS/3.3/DataAccessGuide/vipr_r_oapi_s3_supported_features_3_3.html

@kradalby
Copy link
Contributor Author

kradalby commented Feb 26, 2021

@annanay25 Thanks for the thorough linking, I just mocked up some go to verify this, and if the expected behaviour of that S3/Minio query would be UUID only:

minio.CommonPrefix{Prefix:"3b35c585-32ab-43e2-ac7d-efee0500f7fb/"}

Then indeed, ECS isnt compliant:

minio.CommonPrefix{Prefix:"single-tenant/3b35c585-32ab-43e2-ac7d-efee0500f7fb/"}

I will dig around a bit, see if there is a convenient workaround.

@annanay25
Copy link
Contributor

Nice find! And sure, would love to hear how it worked for you.

@mdisibio
Copy link
Contributor

Thanks for digging into it. I'm not sure the expected behavior is the query returns UUID only, tempo is doing some work to extract the UUID from the response: https://github.com/grafana/tempo/blob/master/tempodb/backend/s3/s3.go#L283

Would it be possible to query your ECS install using the aws cli, and compare results with mine below?

$ aws s3api list-objects --bucket <mybucket> --prefix single-tenant/ --delimiter /
{
    "CommonPrefixes": [
        {
            "Prefix": "single-tenant/000dd31a-96d0-44aa-a5a1-8d41ff350686/"

@mdisibio
Copy link
Contributor

Adding another thought: Tempo is trimming the result using res.Prefix which is the same tenantID+"/" that was sent in the request. Could you check if ECS is populating that field in the response? If not then it could be as simple as not using that.

@mdisibio mdisibio changed the title error polling blocklist, fails to list blocks on ECS error polling blocklist, fails to list blocks on Dell ECS Feb 26, 2021
@kradalby
Copy link
Contributor Author

kradalby commented Feb 26, 2021

@mdisibio Getting the aws tooling to check.

res.Prefix is set to "single-tenant%2F" (assume %2F is / without checking) in the response object.

@kradalby
Copy link
Contributor Author

Follow up, here is a segment from aws cli:

--no-verify-ssl s3api list-objects --bucket tracing-staging --prefix single-tenant/ --delimiter / --endpoint http://test
{
    "CommonPrefixes": [
        {
            "Prefix": "single-tenant/0023b193-abff-4127-b9e1-c5acd3936416/"
        },

Am I to understand that in this case it gives the correct output @mdisibio ?

@mdisibio
Copy link
Contributor

res.Prefix is set to "single-tenant%2F" (assume %2F is / without checking) in the response object.

If this value really is url-encoded it is likely the cause, which prevents TrimPrefix from working as expected. From S3 testing I get single-tenant/. Dell ECS might be non-compliant here, as the AWS examples don't show the delimiter being url-encoded in the response (and shouldn't need to be in a response body). See last example here: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html

A change in Tempo could be to ignore res.Prefix and instead use the same tenantID+"/" from above.

@kradalby
Copy link
Contributor Author

@mdisibio Sounds like a reasonable diagnose and solution, I made a Draft PR if you want to have a look. I'm going get it in and test it now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants